Developing a Natural Language Processing tool to identify perinatal self-harm in electronic healthcare records

Background: Self-harm occurring within pregnancy and the postnatal year ("perinatal self-harm") is a clinically important yet under-researched topic. Current research likely under-estimates prevalence due to methodological limitations. Electronic healthcare records (EHRs) provide a source of clinically rich data on perinatal self-harm.

Aims: (1) To create a Natural Language Processing (NLP) tool that can, with acceptable precision and recall, identify mentions of acts of perinatal self-harm within EHRs. (2) To use this tool to identify service-users who have self-harmed perinatally, based on their EHRs.

Methods: We used the Clinical Record Interactive Search system to extract de-identified EHRs of secondary mental healthcare service-users at South London and Maudsley NHS Foundation Trust. We developed a tool that applied several layers of linguistic processing based on the spaCy NLP library for Python. We evaluated mention-level performance in the following domains: span, status, temporality and polarity. Evaluation was done against a manually coded reference standard. Mention-level performance was reported as precision, recall, F-score and Cohen's kappa for each domain. We also assessed performance at service-user level and explored whether a heuristic rule improved it. We report per-class statistics for service-user performance, as well as likelihood ratios and post-test probabilities.

Results: Mention-level performance: micro-averaged F-score, precision and recall for span, polarity and temporality >0.8. Kappa for status 0.68, temporality 0.62, polarity 0.91. Service-user-level performance with the heuristic: F-score, precision and recall of the minority class 0.69, macro-averaged F-score 0.81, positive LR 9.4 (4.8–19), post-test probability 69.0% (53–82%). Considering the task difficulty, the tool performs well, although temporality was the attribute with the lowest level of annotator agreement.

Conclusions: It is feasible to develop an NLP tool that identifies, with acceptable validity, mentions of perinatal self-harm within EHRs, although with limitations regarding temporality. Using a heuristic rule, it can also function at a service-user level.


Introduction
Self-harm is defined by the National Institute for Health and Care Excellence (NICE) as an "act of self-poisoning or self-injury carried out by a person, irrespective of their motivation" [1]. Data from several high-income countries indicate that self-harm is increasingly common, particularly in young women [2,3]. During pregnancy and the postnatal year, a time known as "the perinatal period", around 5-14% of women are estimated to experience thoughts of self-harm [4]. Yet there remains an evidence gap around acts of perinatal self-harm [5].
Given that self-harm is strongly associated with mental disorder [6], the same is likely to be true of perinatal self-harm, which may therefore be a marker of unmet treatment need. Suicide is a leading cause of maternal death, and such suicides are frequently preceded by acts of perinatal self-harm [7,8].
Current evidence regarding the prevalence of perinatal self-harm is mainly derived from studies using administrative hospital discharge datasets which may under-represent the true prevalence [5]. Evidence suggests perinatal self-harm is more common in women with serious mental illness (SMI) [5], meaning this population should be a focus of research.
The widespread use of electronic healthcare records (EHRs) means that large amounts of nuanced clinical information can be centrally stored for large cohorts of service-users. However, free-text documentation means many clinical variables are not readily extractable. A free-text search strategy could identify self-harm synonyms but would lack the contextual "awareness" required to distinguish relevant from non-relevant mentions, such as thoughts of self-harm or statements negating it.
Natural Language Processing (NLP) can recognise relevant linguistic context (e.g. lexical variation, grammatical structure, negation) and is increasingly used in clinical research to extract information from EHRs [9,10]. The use of NLP to investigate suicidality is relatively new and the literature is small [11]. However, NLP has been used to identify suicidality in EHRs [12-14], including the records of adolescents with autism spectrum disorders [15], general hospital attenders [16] and primary care attenders [17].
To our knowledge, only one other group has used NLP to identify perinatal self-harm. In that work, self-harm was identified as part of a composite measure of both thoughts of suicide and acts of self-harm [18,19], and not specifically among women with SMI.
In this study, we aimed to develop an NLP tool for the purpose of identifying acts of perinatal self-harm at a mention and service-user level, within de-identified EHRs of women with SMI.

Data sources
The South London and Maudsley National Institute for Health Research (NIHR) Biomedical Research Centre (BRC) Clinical Record Interactive Search (CRIS) system [20] provides regulated access to a database of de-identified EHRs of all service-users accessing South London and Maudsley NHS Foundation Trust (SLaM), the largest secondary mental healthcare provider in the United Kingdom. In this context, "EHR" refers to a single clinical document within one universal electronic healthcare recording system, the "Electronic Patient Journey System".
CRIS is linked with Hospital Episode Statistics (HES) [21], a database of anonymised clinical, administrative and demographic details of NHS hospital admissions of service-users over the age of 18. Searching HES for codes indicating delivery and linking these records with CRIS has been demonstrated to be a valid way of generating a cohort of women accessing secondary mental healthcare during the perinatal period [22].

Development of coding rules
Self-harm is a complex concept and may be defined in different ways. The clinical validity of describing self-harm based on suicidal intent (e.g. "suicide attempts" versus "non-suicidal self-injury") has been questioned [23]. The NICE definition of self-harm does not incorporate intent [1]. Therefore, when creating a list of synonyms or "keywords" for self-harm, we conceptualised self-harm broadly and drew on several sources: the secondary mental healthcare clinical expertise of the first author, the general literature on self-harm [24,25] and terms used in other studies of self-harm in EHRs [22,26-28]. See S1 File for a full list of keywords.
Mentions of these keywords within the EHRs were annotated in a sample of 131 EHRs pre-selected from previous research by Taylor et al into self-harm in pregnant women with affective and non-affective psychotic disorders [22,28]. We devised rules regarding the span of text to annotate as a mention and how to annotate mention attributes (see S2 File).
Span. Only the keyword within the mention, not the surrounding text or whole sentence, was annotated. The keyword was usually a noun and direct synonym of self-harm, e.g. overdose. Occasionally, the keyword was a noun, but not a direct synonym, e.g. in the phrase "she had scratches on her arm" only the indicative noun (i.e. "scratches") was annotated. If the keyword was an adjective that modified a noun, it was annotated along with the noun it described, e.g. in the phrase "she had a self-harming impulse", both "self-harming" and "impulse" were annotated. Where the keyword was a verb, the direct object noun/pronoun that it related to was also annotated, e.g. in the phrase "she cut herself", both "cut" and "herself" were annotated. Occasionally, the verb implied a passive or non-deliberate action. For example: "she climbed out a window and fell off". Falling is a passive or unintentional event, as opposed to jumping. However, in this case the prior act of climbing indicates an active element. Although falling is passive, it was the fall that caused harm. Therefore, the verb, the pronoun it related to and the intervening words "fell off" were annotated.
Attributes and coding rules. We identified three main attributes of mentions of self-harm: status, temporality and polarity. Status specified whether a self-harm event occurred or not. For example, if a mention described thoughts of self-harm, rather than an act of self-harm, it was annotated as non-relevant. Mentions of third-party self-harm (e.g. "her mother took an overdose") were annotated as non-relevant. We included an "uncertain" category because, in a very small number of cases, it was not possible, even with whole-document context, to determine whether a mention was referring to an act of self-harm or not.
Temporality specified whether an act was current or historical. We were interested in self-harm occurring during pregnancy and/or the postpartum year. As only EHRs created within the service-user's perinatal period were being annotated, non-perinatal temporality was sometimes obvious, e.g. "took an overdose ten years ago". Events which occurred within one month prior to the EHR were coded as current. This time frame is the same as that used in previous work investigating the prevalence of self-harm in the EHRs of a cohort of pregnant women in CRIS [28] and reflects the standard time period often used in clinical interviews that ask about self-harm, such as the Mini-International Neuropsychiatric Interview [29]. Ambiguous references to chronic events were problematic, e.g. "chronic history of self-harm": although such a mention describes events happening in the past, it also suggests that they are potentially ongoing. We decided to code such mentions as current. We initially included an "uncertain" category in order to flag complex cases during manual annotation, although not as an attribute option for the final tool.
Polarity specified whether or not the mention expressed a negation of self-harm (e.g. "she denied self-harm"). The purpose of this attribute was to allow the algorithm to filter out negations. Occasionally negation was written using symbols e.g. "Suicide attempts: X". Here, the meaning of the mention was annotated i.e. polarity negative.

Manual annotation of a reference standard
For the purposes of developing and evaluating the tool's performance, we created a manually annotated reference standard corpus of EHRs. First, we randomly sampled 400 EHRs from Taylor's study of self-harm in pregnant SLaM service-users with affective and non-affective psychotic disorders [22,28]. All EHRs were independently double-annotated by three annotators (KA, JK, SV) according to the coding rules, using the Extensible Human Oracle Suite of Tools (eHOST) software [30]. We measured pairwise inter-annotator agreement in terms of precision (positive predictive value), recall (sensitivity) and F-score (harmonic mean of precision and recall), as well as kappa [31] (agreement adjusted for chance) for attributes. Agreement scores were calculated using the scikit-learn (version 0.21.3) machine learning library for Python [32]. The final reference standard was created by adjudication of disagreements by KA. This was split into development (N = 320 EHRs, 152 service-users) and test (N = 80 EHRs, 59 service-users) sets.
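As an illustration, the sketch below computes agreement statistics of this kind with scikit-learn; the two annotators' label lists are hypothetical and are not study data.

```python
# Minimal sketch: pairwise agreement statistics with scikit-learn.
# The two label lists are hypothetical attribute values assigned by two
# annotators to the same five mentions; they are not study data.
from sklearn.metrics import precision_score, recall_score, f1_score, cohen_kappa_score

annotator_a = ["relevant", "non-relevant", "relevant", "uncertain", "relevant"]
annotator_b = ["relevant", "non-relevant", "non-relevant", "uncertain", "relevant"]

# Treat one annotator as the "reference" and the other as the "prediction";
# micro-averaging pools all classes before computing each statistic.
precision = precision_score(annotator_a, annotator_b, average="micro")
recall = recall_score(annotator_a, annotator_b, average="micro")
f_score = f1_score(annotator_a, annotator_b, average="micro")
kappa = cohen_kappa_score(annotator_a, annotator_b)  # agreement adjusted for chance

print(f"P={precision:.2f} R={recall:.2f} F={f_score:.2f} kappa={kappa:.2f}")
```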

NLP development
System description. We developed a rule-based tool around spaCy (version 2.1.3), an NLP library for Python. Code for the tool is available online [33]. The tool takes a text as input and applies five processing layers in sequential order, outputting an XML file in which all detected self-harm mentions and their attributes are annotated with XML tags. Each layer of processing adds annotations that are available in subsequent layers. The five processing layers are as follows:

1. Linguistic pre-processing. Sentence detection, tokenisation (segmentation of the text into word tokens), part-of-speech tagging (determining the grammatical category of words), lemmatisation (finding the "root" form of inflected words) and dependency parsing (determining the grammatical relations between words). The tokenisation step includes a set of custom tokenisation rules to deal with errors made by spaCy's default tokeniser (e.g. "self-harm", "self-injury" and "fh/o", which are otherwise incorrectly split into several word tokens). The dependency parsing step identifies syntactic relations such as subject, direct object, modifier and negation. Dependency parsing has been used in prior work on the analysis of clinical texts, for tasks such as relation extraction [34,35], identifying family history [36] and negation detection [37].
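The custom tokenisation rules themselves are not reproduced in the text; the following is a minimal sketch, assuming the spaCy 2.x API and an off-the-shelf English model, of how special cases of this kind can be registered so that such terms are kept as single tokens.

```python
# Minimal sketch: custom tokenisation special cases in spaCy 2.x, so that terms
# such as "self-harm" are kept as single tokens. The terms are examples only.
import spacy
from spacy.symbols import ORTH

nlp = spacy.load("en_core_web_sm")  # provides the tagger, lemmatiser and dependency parser

for term in ["self-harm", "self-injury", "fh/o"]:
    # Emit the whole term as one token instead of splitting on "-" or "/".
    nlp.tokenizer.add_special_case(term, [{ORTH: term}])

doc = nlp("She denied any self-harm.")
print([token.text for token in doc])                  # tokenisation with special cases
print([(t.text, t.dep_, t.head.text) for t in doc])   # dependency relations
```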
2. Lexical rules. This step consists of tagging of words with a given semantic category according to a set of 13 manually created lexicons. These lexicons include terms for self-harm, body parts, as well as relevant negation and temporal markers. A full list of these lexicons and example content is shown in Table 1.
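A minimal sketch of lexicon-based semantic tagging of this kind is given below, using spaCy's PhraseMatcher; the lexicon entries and the "sem_category" token extension are illustrative assumptions rather than the tool's actual 13 lexicons.

```python
# Minimal sketch: lexicon-based semantic tagging with spaCy's PhraseMatcher.
# The lexicons and the "sem_category" extension are illustrative only.
import spacy
from spacy.matcher import PhraseMatcher
from spacy.tokens import Token

nlp = spacy.load("en_core_web_sm")
Token.set_extension("sem_category", default=None)  # holds the semantic tag for later layers

lexicons = {
    "SELF_HARM": ["overdose", "cut herself", "self-harm"],
    "NEGATION": ["denied", "denies", "no evidence of"],
    "PAST": ["years ago", "previously", "history of"],
}

matcher = PhraseMatcher(nlp.vocab, attr="LOWER")  # case-insensitive matching
for category, terms in lexicons.items():
    matcher.add(category, None, *[nlp.make_doc(term) for term in terms])  # spaCy 2.x signature

doc = nlp("She denied taking an overdose two years ago.")
for match_id, start, end in matcher(doc):
    for token in doc[start:end]:
        token._.sem_category = nlp.vocab.strings[match_id]

print([(t.text, t._.sem_category) for t in doc])
```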
3. Token sequence rules. This layer consists of a sequence of regular token-based grammars that take into account the context in which words appear. Grammar rules have access to all linguistic features added during pre-processing, as well as semantic categories added during lexical tagging. These rules are applied to detect self-harm expressions in context and to correct and update the annotations added by previous processing layers. They are used both to detect or exclude mention spans and to assign attribute values. A specific set of token sequence rules is used to identify history sections in EHRs. Each rule has a 'name' (a unique rule name used for development purposes), a 'pattern' (the token sequence to match in the text) and an 'annotation' (the attribute and value marked on the recognised token sequence).
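A minimal sketch of one such rule, written with spaCy's rule-based Matcher (spaCy 2.x signature), is shown below; the rule name, pattern and annotation are illustrative and are not taken from the tool's rule set.

```python
# Minimal sketch of one token sequence rule using spaCy's Matcher (spaCy 2.x).
#   name:       SH_VERB_REFLEXIVE (illustrative)
#   pattern:    a verb lemma such as "cut"/"harm"/"hurt", at most one intervening
#               token, then "herself"
#   annotation: treat the matched span as a self-harm mention
import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")
matcher = Matcher(nlp.vocab)
pattern = [
    {"LEMMA": {"IN": ["cut", "harm", "hurt"]}},
    {"OP": "?"},
    {"LOWER": "herself"},
]
matcher.add("SH_VERB_REFLEXIVE", None, pattern)

doc = nlp("She reported that she cut herself last night.")
for match_id, start, end in matcher(doc):
    print(nlp.vocab.strings[match_id], "->", doc[start:end].text)  # SH_VERB_REFLEXIVE -> cut herself
```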
4. Negation detection. Negation is detected using the syntactic dependency tree for each sentence. Any mention that heads a 'neg' grammatical dependency is annotated as negative (e.g. "she did not cut herself"). If a mention's governor is a negated reported speech verb (R_SPEECH), the mention is also assigned negative polarity (e.g. "she did not report harming herself"). Finally, any mention governed by a word annotated as NEGATION is also annotated as having negative polarity (e.g. "she denies any self-harm").
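The sketch below illustrates these three negation rules, assuming mentions have already been identified as single tokens and using illustrative lexicons for the reported-speech and negation markers.

```python
# Minimal sketch of the three negation rules described above. Mentions are
# assumed to be single tokens identified by earlier layers; the R_SPEECH and
# NEGATION lexicons are illustrative.
import spacy

nlp = spacy.load("en_core_web_sm")
R_SPEECH = {"report", "state", "describe"}
NEGATION = {"deny"}

def is_negated(mention):
    # Rule 1: the mention heads a 'neg' dependency, e.g. "she did not cut herself".
    if any(child.dep_ == "neg" for child in mention.children):
        return True
    governor = mention.head
    # Rule 2: the governor is a negated reported-speech verb,
    # e.g. "she did not report harming herself".
    if governor.lemma_ in R_SPEECH and any(c.dep_ == "neg" for c in governor.children):
        return True
    # Rule 3: the governor is itself a negation marker, e.g. "she denies any self-harm".
    if governor.lemma_ in NEGATION:
        return True
    return False

doc = nlp("She did not cut herself.")
mention = next(token for token in doc if token.lemma_ == "cut")
print(is_negated(mention))  # expected: True
```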

5. Contextual search. To assign values to the remaining attributes of identified mentions, a contextual search is used to detect markers of temporality and status. A window of ten tokens to the left and right of a mention is used as context. If a token labelled 'past' is found within this window, the mention is labelled as historical. Similarly, if a token labelled 'hedging' or 'modality' is found within the window, the mention is annotated as non-relevant.
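A minimal sketch of such a window-based contextual search is given below; the token tags and the function signature are simplified assumptions rather than the tool's implementation.

```python
# Minimal sketch of the contextual search: a ten-token window either side of a
# mention is scanned for tokens previously tagged as 'past', 'hedging' or 'modality'.
def contextual_attributes(tokens, tags, mention_start, mention_end, window=10):
    """tokens: list of word strings; tags: parallel list of semantic tags or None."""
    temporality, status = "current", "relevant"        # defaults
    left = max(0, mention_start - window)
    right = min(len(tokens), mention_end + window)
    for i in list(range(left, mention_start)) + list(range(mention_end, right)):
        if tags[i] == "past":
            temporality = "historical"
        elif tags[i] in ("hedging", "modality"):
            status = "non-relevant"
    return temporality, status

tokens = "she took an overdose ten years ago".split()
tags = [None, None, None, "self-harm", None, None, "past"]
print(contextual_attributes(tokens, tags, 3, 4))  # -> ('historical', 'relevant')
```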

Unusual linguistic cases
During development of the coding rules, we identified unusual examples that did not fit with our pre-defined strategy. This led to the refinement of the tool's processing layers. Some examples are detailed in Fig 1, with the relevant keyword highlighted in italics.

Further development: Service-user selection heuristic
In our reference standard, the majority of service-users who had self-harmed perinatally had more than one mention in their EHRs (see S1 Table). Based on this, we explored the use of a service-user selection heuristic, whereby we restricted flagging of service-users as true positive cases to only those who had two or more mentions of perinatal self-harm in their EHRs.

Inter-annotator agreement

Table 2 presents micro-averaged pairwise inter-annotator agreement on mention spans and attributes, reported as precision, recall, F-score and Cohen's kappa [39], within the development set of EHRs (N = 320 documents). Due to the very small number of cases of "uncertain" status and temporality, the high degree of class imbalance meant that macro-averaged figures were not a fair representation of performance (see S2 Table). All figures are rounded to 2 decimal places.

Evaluation of the tool
We evaluated the tool on two levels: mention and service-user. Table 3 shows the micro-averaged mention-level evaluation statistics from both the development (N = 320 documents) and test (N = 80 documents) datasets. Again, class imbalance for status meant macro-averaging was not appropriate (only 9 mentions of "uncertain" status in the reference standard; see S3 Table for macro-averaged results), so micro-averaged results are presented.

Service-user-level performance indicates how well the tool identifies service-users who have at least one recorded "true" self-harm mention in any of their EHRs. A "true" mention has the attribute values status = relevant, polarity = positive, temporality = current. We present results with and without the heuristic rule of at least two positive mentions, derived from the test set (Table 4). When the tool was run with the heuristic, there were no false positives, meaning there were issues associated with perfect prediction. A total absence of false positives is unlikely to occur in a very large sample and, in this case, most likely indicates that the test set (N = 59 service-users) is too small for service-user-level analysis. We therefore present service-user results for the development set (N = 152 service-users, Table 5).
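For illustration, the sketch below aggregates mention-level output to the service-user level and applies the heuristic threshold of two or more "true" mentions; the mention records are invented for the example.

```python
# Minimal sketch: flag a service-user only if their EHRs contain at least two
# "true" mentions (status=relevant, polarity=positive, temporality=current).
# The mention records below are illustrative, not study data.
from collections import Counter

mentions = [
    {"user": "A", "status": "relevant", "polarity": "positive", "temporality": "current"},
    {"user": "A", "status": "relevant", "polarity": "positive", "temporality": "current"},
    {"user": "B", "status": "relevant", "polarity": "negative", "temporality": "current"},
    {"user": "C", "status": "relevant", "polarity": "positive", "temporality": "current"},
]

def is_true_mention(m):
    return (m["status"] == "relevant" and m["polarity"] == "positive"
            and m["temporality"] == "current")

counts = Counter(m["user"] for m in mentions if is_true_mention(m))
flagged = {user for user, n in counts.items() if n >= 2}  # heuristic threshold of two
print(flagged)  # -> {'A'}  (C has only one true mention; B's mention is negated)
```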
Due to class imbalance, we report per-class precision, recall and F-score for the majority and minority classes (e.g. precision_MAJ, precision_MIN), as well as the macro-averaged value (e.g. precision_MACRO). The ultimate purpose of this tool is to identify service-users who have self-harmed perinatally within a cohort. For this reason, we also present positive and negative likelihood ratios (LR_POS, LR_NEG) and post-test probabilities.
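The relationship between a 2x2 confusion matrix, the likelihood ratios and the post-test probability can be sketched as follows; the counts are illustrative and are not the study's results.

```python
# Minimal sketch relating a 2x2 confusion matrix to the reported statistics.
# The counts are illustrative, not the study's results.
def diagnostic_stats(tp, fp, fn, tn):
    sensitivity = tp / (tp + fn)                  # recall of the positive (minority) class
    specificity = tn / (tn + fp)
    lr_pos = sensitivity / (1 - specificity)      # positive likelihood ratio
    lr_neg = (1 - sensitivity) / specificity      # negative likelihood ratio
    prevalence = (tp + fn) / (tp + fp + fn + tn)  # pre-test probability in the sample
    pre_test_odds = prevalence / (1 - prevalence)
    post_test_odds = pre_test_odds * lr_pos
    post_test_prob = post_test_odds / (1 + post_test_odds)  # probability of a true case given a positive flag
    return lr_pos, lr_neg, post_test_prob

print(diagnostic_stats(tp=15, fp=5, fn=10, tn=70))
```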

Error analysis
To identify remaining weaknesses in the tool, we performed error analysis on the mention-level evaluation of the test set.

Span errors. The most common recurrent span error was the tool missing mentions of "suicide" that had been annotated in the reference standard test set. Whilst death by suicide is not the same as self-harm (which is, by definition, non-fatal), the conceptual line between suicide and self-harm is, in terms of clinician documentation, often blurred. For example, we found clinicians would document: "no history of suicide". Clearly, in a clinical entry on a living service-user, a history of death by suicide is impossible. However, this phrase most likely reflects a clinician's attempt to express that the service-user has no history of attempted suicide, i.e. non-fatal self-harm. There were a small number of instances where the tool erroneously identified phrases not annotated in the test set. This was largely for two reasons. Firstly, there were unusual examples of clinician documentation style that referred to things that were not self-harm, e.g. "OD AM" to indicate "once daily in the morning"; we had included "OD" in the coding structure as a synonym for "overdose". Secondly, there were some specific and uncommon examples of self-harm that were not included in the coding structure, e.g. "drinking" specific poisonous substances. Finally, the grammatical context of the verb "jump" also proved difficult to capture reliably, as this verb can be used with a variety of prepositions that do not always indicate attempted self-harm: "jump to kill herself" and "jump through a window" are valid mentions of self-harm, while "jump down the stairs" and "jump to conclusions" are not.

Attribute errors. Regarding errors on the status attribute, we assumed the modal auxiliary "would" to be a marker of non-relevance in the tool's contextual search, as it would usually indicate a future conditional event that had not yet happened. However, the tool sometimes erroneously considered modals appearing after mentions, for example, "she thought the [self-harm] would kill her". A further recurring status error was that we assumed "risk to self" headings in EHRs indicated the part of clinical assessment known as "risk assessment", which is a discussion of the service-user's future risk to self and would therefore contain non-relevant mentions. However, error analysis revealed this phrase was occasionally used as a section header detailing past self-harm events.
Regarding temporality, our default approach was to mark events as current unless there was a clear historical marker. However, we found that temporality indicators became unclear in cases involving coordination, e.g. "no current or past suicide attempts". The tool annotated this mention as current, whilst in the reference standard it was annotated as historical. Assessing temporality was also problematic where there were no contextual markers, due to the shorthand note-taking style of the clinician, for example "2 x OD".

Discussion
Self-harm is a conceptually and clinically complex area. Framing it temporally within the narrow time-period of pregnancy and the postnatal year increases the complexity. However, we have shown that it is possible to develop an NLP tool that, with acceptable precision and recall, can identify perinatal self-harm within electronic healthcare records at both a mention and patient level. Given the limitations in existing data on the prevalence of perinatal self-harm [5], this is a significant step forward.
The pair-wise inter-rater agreement suggests that temporality was the hardest attribute for annotators to agree on. This may reflect the high degree of complexity and ambiguity in the ways that self-harm is documented. Micro-averaged mention-level evaluation figures reflect this pattern, although precision, recall and F-score for all attributes were still >0.8. After adjustment for chance, temporality remains the weakest attribute, although kappa is still almost 0.8.
We felt that the test set (Table 4) was too small to evaluate service-user-level performance. Using the much larger development set (Table 5), we showed that, by using a heuristic rule of two, we could generate a tool with a macro-averaged F-score of 0.81 and a high positive likelihood ratio of 9.4 (95% CI 4.8–19). Overall, scores for kappa were lower than precision/recall/F-score (patient-level kappa 0.62), suggesting some agreement may have been due to chance. However, the limitations of using kappa to analyse the performance of dichotomous classification systems on unbalanced datasets should be noted [38], particularly where the sample size is small [39]. How the tool performs in a much larger sample would be an interesting area of further study.
The use of heuristic rules is commonplace in the NLP literature [40-42] and it is well recognised that, in clinical contexts, moving from mention-level to person-level performance often requires "post-processing" [43]. We believe the use of a heuristic in this case has face validity: if a service-user has self-harmed perinatally, this is a significant clinical event and is likely to be followed up at subsequent visits or by other clinicians, generating further mentions within the service-user's body of EHRs.
We believe this tool could potentially be adapted to ascertain self-harm in other contexts. Work is currently underway to investigate whether it can be adapted to ascertain self-harm in adolescent populations and among women with eating disorders.

Strengths
We believe this is a novel development in the field of using NLP to investigate self-harm, as it focusses specifically on acts of perinatal self-harm among women with SMI. We used a bespoke NLP strategy developed using both clinical and NLP expertise. Our iterative approach meant that we could use unusual examples encountered in development phases of annotation to refine the tool.

Limitations
Our corpus was relatively small and generalisability to EHRs from other populations and mental healthcare providers is uncertain. The main finding of the error analysis is that it is often hard to find reliable contextual markers for ambiguous mentions; the use of syntactic coordination ("and", "or", etc.) often makes this even more problematic. Temporality is notoriously difficult to analyse with NLP and is a field of research in its own right [44,45]. The analysis also reveals something about how clinical note-taking is done, e.g. the high variability in the words and formulations used by clinicians.

Conclusions
We have shown, using novel methods and a combination of clinical and linguistic processing expertise, that it is possible to develop an NLP tool that will, with acceptable precision and recall, identify perinatal self-harm in electronic healthcare records, albeit with limitations, particularly in terms of defining temporality.