
Exploring optimal granularity for extractive summarization of unstructured health records: Analysis of the largest multi-institutional archive of health records in Japan

  • Kenichiro Ando,

    Roles Conceptualization, Data curation, Formal analysis, Investigation, Methodology, Software, Validation, Visualization, Writing – original draft

    Affiliations Graduate School of Systems Design, Tokyo Metropolitan University, Tokyo, Japan, Center for Advanced Intelligence Project, RIKEN, Tokyo, Japan, National Hospital Organization, Tokyo, Japan

  • Takashi Okumura,

    Roles Conceptualization, Data curation, Investigation, Project administration, Resources, Writing – original draft, Writing – review & editing

    Affiliation School of Regional Innovation and Social Design Engineering, Kitami Institute of Technology, Hokkaido, Japan

  • Mamoru Komachi,

    Roles Resources, Supervision, Writing – review & editing

    Affiliation Graduate School of Systems Design, Tokyo Metropolitan University, Tokyo, Japan

  • Hiromasa Horiguchi,

    Roles Data curation, Investigation, Resources, Supervision

    Affiliation National Hospital Organization, Tokyo, Japan

  • Yuji Matsumoto

    Roles Funding acquisition, Project administration, Supervision, Writing – review & editing

    Affiliation Center for Advanced Intelligence Project, RIKEN, Tokyo, Japan


Automated summarization of clinical texts can reduce the burden of medical professionals. “Discharge summaries” are a promising application of summarization, because they can be generated from daily inpatient records. Our preliminary experiment suggests that 20–31% of the descriptions in discharge summaries overlap with the content of the inpatient records. However, it remains unclear how the summaries should be generated from the unstructured source. To decompose the physician’s summarization process, this study aimed to identify the optimal granularity in summarization. We first defined three types of summarization units with different granularities to compare the performance of discharge summary generation: whole sentences, clinical segments, and clauses. We defined clinical segments in this study, aiming to express the smallest medically meaningful concepts. To obtain the clinical segments, it was necessary to automatically split the texts in the first stage of the pipeline. Accordingly, we compared rule-based methods with a machine learning method, and the latter outperformed the former with an F1 score of 0.846 in the splitting task. Next, we experimentally measured the accuracy of extractive summarization using the three types of units, based on the ROUGE-1 metric, on a multi-institutional national archive of health records in Japan. The measured accuracies of extractive summarization using whole sentences, clinical segments, and clauses were 31.91, 36.15, and 25.18, respectively. We found that clinical segments yielded higher accuracy than sentences and clauses. This result indicates that summarization of inpatient records demands finer granularity than sentence-oriented processing.
Although we used only Japanese health records, the result can be interpreted as follows: when summarizing chronological clinical records, physicians extract “concepts of medical significance” from patient records and recombine them in new contexts, rather than simply copying and pasting topic sentences. This observation suggests that a discharge summary is created by higher-order information processing over concepts at the sub-sentence level, which may guide future research in this field.

Author summary

Medical practice includes significant paperwork, and therefore automated processing of clinical texts can reduce medical professionals’ burden. Accordingly, we focused on generating hospital discharge summaries from the daily inpatient records stored in electronic health records. By applying summarization technologies, which are well studied in natural language processing, discharge summaries could be generated automatically from the source texts. However, automated summarization of daily inpatient records involves various technical topics and challenges, and the generation of discharge summaries is a complex process mixing extractive and abstractive summarization. Thus, in this study, we explored the optimal granularity for extractive summarization, attempting to decompose actual physicians’ processing. In the experiments, we used three types of summarization units with different granularities to compare the performance of discharge summary generation: whole sentences, clinical segments, and clauses. We newly defined clinical segments, aiming to express the smallest medically meaningful concepts. The results indicated that sub-sentence processing, with units larger than clauses, improves the quality of the summaries. This finding can guide the future development of automated summarization of medical documents.

1 Introduction

Automated summarization of clinical texts can reduce the burden of medical professionals because their practice includes significant paperwork. A recent study found that family physicians spent 5.9h of an 11.4h workday on electronic health records (EHRs) [1]. In 2019, 74% of physicians spent more than 10h per week on EHRs [2]. Another study reported that physicians spent 26.6% of their daily working time on documentation [3].

Compilation of hospital discharge summaries is an onerous task for physicians. Because daily inpatient records are already filed in the systems, computers might efficiently support physicians by generating summaries of clinical records. Although research has been conducted to identify certain classes of clinical information in clinical texts [4–8], there has been limited research on acquiring expressions that can be used to write discharge summaries [9–14]. Because many summarization techniques have been developed in natural language processing (NLP), the generation of discharge summaries can be a promising application of the technology.

However, automated summarization of daily inpatient records involves various technical topics and challenges. For example, descriptions of important findings related to a patient’s diagnosis call for an extractive summary. Our preliminary experiments revealed that 20–31% of the sentences in discharge summaries were created by copying and pasting. This result indicates that a certain amount of content can be generated automatically by extractive summarization. Meanwhile, when a patient is discharged from the hospital after surgery without any major problems, it is necessary to summarize the clinical record as the patient “recovered well after the surgery,” even if the records describe the postoperative process in more detail. Such descriptions cannot be created by copy and paste and need to be abstracted. These observations suggest that the generation of discharge summaries is a complex process mixing extractive and abstractive summarization, and it remains unclear how to process the unstructured source texts, i.e., free texts. To advance this research field, it is desirable to properly decompose these summarization processes and clarify their interactions.

To this end, this study focuses on the extractive summarization process performed by physicians. Some recent studies investigated the best granularity units for this type of summarization [15, 16]. However, the granularity of extraction has not been explored for the summarization of medical documents. Thus, we attempted to identify the optimal granularity in this context by defining three units with different granularities and comparing their summarization performance: whole sentences, clinical segments, and clauses. Clinical segments are a unit we newly define to express the smallest medically meaningful concepts; they are detailed in the methodology section (Section 3).

This paper is organized as follows. In Section 2, we survey related work. Section 3 describes the materials and methods. Section 4 presents the experiment and its results, and Section 5 discusses the experiment. Finally, Section 6 concludes the paper.

2 Related work

Automated summarization is an actively studied field [15–19] with two main approaches: extractive and abstractive summarization. The former extracts content from source texts, whereas the latter creates new content. Generally, the abstractive approach provides more flexibility in summarization but often produces fabricated content that does not match the reference summary, which is referred to as “hallucination” [19–21]. Thus, in the medical field, extractive summarization has mainly been used for knowledge acquisition of clinical features such as diseases, prescriptions, and examinations. Determining the optimal granularity would lead to more reliable information. Moreover, precise spanning of the extraction would avoid extracting unnecessary information, keeping the precision of the processing high.

Meanwhile, natural language processing on unstructured medical text has focused on normalization and prediction, such as ICD codes, mortality, or readmission risk [22–27]. However, these are not summarization in the narrow sense of distilling important information from the input. Several works targeted acquiring key information such as diseases, examination results, or medications from EHRs [6, 8, 28, 29], but these studies collected fragmented information and did not attempt to generate contextualized passages. A line of research has aimed to help physicians grasp the point quickly by generating a few key sentences [7, 30–32]. However, most studies that produce discharge summaries used structured data as input [33–35]. Some other studies attempted to generate discharge summaries from free-form inpatient records, as we do [9–14]. In part, an encoder-decoder model was used to generate sentences for abstractive summarization [9–11]. These studies can create a whole discharge summary document. However, this approach may result in hallucinations, which limits its clinical use, although the output can be corrected manually by physicians before filing. The other studies summarized sentences using extractive summarization [11–14], and unsupervised generation using prompt engineering [36, 37] could further contribute to the performance, although it cannot generate entire texts.

To advance research on the summarization of clinical texts, appropriate language resources are indispensable. In English, public corpora of medical records are available, such as MIMIC-III [38, 39] and others [40]. However, the number of resources available in Japanese is highly limited. The largest publicly available corpus is the one used for a shared task at an international conference, NTCIR [41]. A non-profit organization for language resources maintains another corpus, GSK2012-D [42]. However, their data volumes are small, and their statistics differ significantly from those of large-scale data, as illustrated in Table 1. This low-resource situation makes the processing of Japanese medical documents more challenging. First, Japanese medical texts often contain excessively shortened sentences and orthographic variants of terms originating from foreign languages. In addition, Japanese requires word segmentation. Most importantly, there is no Japanese parallel corpus of inpatient records and discharge summaries. Therefore, we built a new corpus, as detailed in the next section.

3 Materials, method, and preprocessing

3.1 Target text

Clinical records can be expressed in various dialects and jargon. Accordingly, a study on a single institution would lead to highly biased results in medical NLP tasks because of local and hospital-specific dialects. To explore the optimal granularity for clinical document summarization, it is necessary to conduct a multi-institutional study to mitigate the potential bias caused by medical records stored in a single EHR source. For this purpose, we designed an experiment using the largest multi-institutional health records archive in Japan, the National Hospital Organization Clinical Data Archives (NCDA) [43]. The NCDA is a data archive operated by the National Hospital Organization (NHO), which stores replicated EHR data for 66 national hospitals owned by this organization. Thus, the archive has become a valuable data source for multi-institutional studies that span the country.

On this research infrastructure, informed consent and patient privacy are ensured in the following manner. The national hospitals post notices about their policy and EHR data usage in their facilities. Patients who disagree with the policies can notify the hospital with an opt-out form to be excluded from the archive. Likewise, minors and their parents can submit the opt-out form at will. To conduct a study on the archive, researchers must submit their research proposals to the institutional review board. Once a study is approved, the data are extracted from the NCDA and anonymized to construct a dataset for further analysis. The data are accessible only in a secured room at the NHO headquarters, and only statistics may be carried out of the secured room, to protect patients’ privacy.

In the present research, the analysis was conducted under IRB approval (IRB Approval No.: Wako3 2019-22) of the Institute of Physical and Chemical Research (RIKEN), Japan, which has a collaboration agreement with the National Hospital Organization. The dataset we used for the study, referred to as the NHO data hereafter, is an anonymized subset of the archive that includes 24,641 cases collected from five hospitals belonging to the NHO. Each case includes inpatient records and a discharge summary for a patient of an internal medicine department. The statistics of the target data are summarized in Table 1. As shown, the scale of the NHO data is much larger than that of GSK2012-D and MedNLP, which were used in previous studies [41]. Accordingly, the results obtained using the NHO dataset are expected to be more general.

3.2 Design of the analysis

There are two approaches to identifying the optimal granularity of extractive summarization. One is to take word sequences of arbitrary length and compare them as the units for summarization. The other is to use predefined linguistic units. Previous studies in this domain have used the latter approach and found that a sentence is a longer-than-optimal granularity unit for extractive summarization, as mentioned in Section 1. Another study adopted the clause, a shorter self-contained linguistic unit [44], instead of the sentence [15]. However, it remains unclear whether the clause performs best in the summarization of clinical records or whether there are further possibilities. In this study, we adopt both methods. However, the examination using linguistic units in Japanese differs somewhat from that in English. In particular, clauses in Japanese have significantly different characteristics from clauses in English because they can be formed by simply adding a particle to a noun. Owing to this characteristic, Japanese clauses are often very short, at the phrase level. Accordingly, they cannot constitute a meaningful unit that carries concepts of medical significance. Therefore, we need a self-contained linguistic unit that has a longer span than the Japanese clause and expresses the smallest medically meaningful concept.

For this reason, we defined the clinical segment that spans several clauses but is shorter than a sentence. As exemplified in Table 2, segments may comprise clauses connected by a conjunction to form a medically meaningful unit; alternatively, they may be identical to clauses. For the statistical analysis, the clinical segment must be defined formally so that a splitter can automatically divide sentences into segments. We also need a corpus to train the splitter and evaluate its performance.

When designing the clinical segment, we attempted to extract the atomic events related to medical care as a single unit. For example, statements such as “jaundice was observed in the patient’s conjunctiva,” “the patient was diagnosed with hepatitis,” and “a CT scan was performed” would lose their medical meaning if they are further split. In addition, medical events are the central statements in medical documents, whereas non-medical events play a relatively small role. Therefore, in this study, we considered only medical events as a component of self-contained units, and non-medical events were interpreted as noise. In previous studies, a self-contained unit was defined with respect to semantics. In our study, it was extended to a pragmatic unit based on domain knowledge. The details of the six segmentation rules are listed in Table 3.

Based on this definition, we built a small corpus for the segmentation task. We used an independent dataset that included inpatient records and their discharge summaries for 108 cases. This corpus was built because annotation over the NHO data was restricted due to privacy concerns. The statistics of the resulting corpus are given in Table 1 (Our corpus). With respect to the inpatient records, the corpus is closer to real data than those of previous studies, except for the number of sentences per document. For the discharge summaries, there are no publicly available Japanese corpora besides the one we built. Because of the summarization process, the sentences contain more words and characters than the source inpatient records. The total number of segments in the corpus was 3,816, the average number of segments per sentence was 2.18, and the average number of segment boundaries per sentence was 1.18. The agreement rate, defined as the accuracy of the workers’ labels against the correct boundaries annotated by an author, was 0.82, which is sufficiently high for further study. Throughout this task, we adopted the labels annotated by one of the authors.

3.3 Preprocessing

Table 4 shows a discharge summary—a type of medical record written by a Japanese physician. As illustrated, it is a noisy document: punctuation marks are missing, and line breaks appear in the middle of a sentence. Sentence boundaries may be denoted by spaces instead of punctuation marks. Therefore, for the further analysis of the three types of extraction units, we first need preprocessing for sentence splitting and segment splitting, which are shown in the upper part of Fig 1.

Fig 1. Outline of our pipeline.

The top block is an example of the inpatient record, and the subsequent blocks indicate the chain of processes up to adding summarization labels.

For sentence splitting, we adopt two naive rules below to define the boundaries of a sentence:

  1. A statement that ends with a full-stop mark.
  2. A statement that ends with a new line and has no full-stop mark.
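The two rules above can be sketched as follows. This is a minimal, hypothetical implementation assuming the Japanese full stop “。” marks sentence ends; the function name is ours, not from the paper.

```python
import re

def split_sentences(text: str) -> list[str]:
    sentences = []
    for line in text.splitlines():
        # Rule 1: a statement that ends with a full-stop mark ("。").
        # Splitting on a zero-width lookbehind keeps the mark attached.
        parts = [p for p in re.split(r"(?<=。)", line) if p.strip()]
        # Rule 2: a trailing statement with no full stop ends at the newline,
        # which is handled implicitly by iterating line by line.
        sentences.extend(p.strip() for p in parts)
    return sentences

record = "発熱あり。解熱剤を投与\nCT施行。"
print(split_sentences(record))  # → ['発熱あり。', '解熱剤を投与', 'CT施行。']
```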

This is an oversimplification compared with the sentence-splitting tasks studied in medical NLP [45, 46]. However, since sentence splitting is not the focus of this study, we adopted this naive approach for its simplicity. In this process, we also used MeCab [47] as a tokenizer, with the dictionaries mecab-ipadic-NEologd [48] and J-MeDic [49] (MANBYO 201905).

Next, sentences must be automatically split into clinical segments to efficiently analyze the huge dataset, NHO data. We compared several approaches to achieve the best splitting performance. In this study, we used 3,816 annotated segments in the corpus and applied six-fold cross-validation.

We used three rule-based splitters as baselines: a simple rule-based model that splits at full-stop marks (Full-stop), another simple rule-based model that splits at full-stop marks and verbs (Full-stop & Verb), and a complex rule-based model that splits at clauses (CBAP) [50]. More precisely, the Full-stop & Verb model starts at a verb and splits in front of the next occurring noun, except for non-independent words. The last model, which includes 332 rules manually set up based on morphemes, was used to confirm that clinical segments have different boundaries than traditional clauses.

We used SEGBOT [51], a machine learning method based on a pointer network architecture [52], for the splitting task. The method includes three phases: encoding, decoding, and pointing. An overview is shown in Fig 2. Medical records may include local dialects and technical terms that are not listed in public language resources. Accordingly, the splitter must handle even unknown words. In our approach, each input word is first represented by a distributed representation using fastText [53, 54]. fastText is a model that learns vector representations of words using subword (character n-gram) information. Notably, fastText can obtain vectors for unknown words by decomposing them into character n-grams. These vectors capture hidden information about a language, such as word analogies and semantics.
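The character n-gram decomposition that lets fastText embed unknown words can be illustrated with a short sketch. An out-of-vocabulary token is represented by combining the vectors of its subword n-grams; “&lt;” and “&gt;” are the boundary markers used by fastText. The n-gram range here is shortened for illustration (the fastText default is 3–6), and the function name is ours.

```python
def char_ngrams(word, n_min=2, n_max=3):
    # Wrap the word in boundary markers, then enumerate all character
    # n-grams of length n_min..n_max, as in the fastText subword model.
    w = f"<{word}>"
    return [w[i:i + n] for n in range(n_min, n_max + 1)
            for i in range(len(w) - n + 1)]

print(char_ngrams("肝炎", n_min=2, n_max=2))  # → ['<肝', '肝炎', '炎>']
```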

The performance of the splitter methods is summarized in Table 5. The machine-learning-based SEGBOT outperformed the others, with an F1 score 0.257 points higher than that of the Full-stop & Verb model, the second best. Since its precision of 0.864 is higher than the inter-annotator agreement, it can be considered close to the upper bound. In addition, CBAP, a clause segmentation model, has a low F1 score of 0.411, suggesting that the definitions of the clause and the clinical segment are inherently different. The precision of the model that splits at full-stop marks (Full-stop) is only 0.521, indicating that clinical segments are not always split at full-stop marks and that context must be considered for splitting. Overall, the results suggest that machine learning is the best fit for the segmentation task. Thus, the data preprocessed by this method are used in the main experiment of this study.

4 Main experiment

In this section, we describe our experimental settings and results of automatic summarization. First, we present the performance metric of the experiments; specifically, the ROUGE score is used as a quality measure for a summary. Next, we describe a summarization model used in the experiments, followed by the datasets used to train the model. Finally, we present the experiments and their results.

4.1 Evaluation metric

Measurement of the summarization quality must be automated to avoid costly manual evaluation. ROUGE [55] has been used as a standardized metric to measure summarization quality in NLP tasks. Formally, ROUGE-N is an n-gram recall between a candidate summary and the reference summaries. When we have only one reference document, ROUGE-N is computed as follows:

\[ \mathrm{ROUGE\text{-}N} = \frac{\sum_{gram_n \in R} \mathrm{Count}_{match}(gram_n)}{\sum_{gram_n \in R} \mathrm{Count}(gram_n)} \tag{1} \]

where R is the reference summary and Count_match(gram_n) is the maximum number of n-grams that co-occur in a candidate summary and a reference summary.
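The ROUGE-N recall of Eq (1) can be sketched in a few lines. This is a simplified illustration over pre-tokenized word lists, not the official ROUGE implementation; function names are ours.

```python
from collections import Counter

def ngrams(tokens, n):
    # Multiset of n-grams in a token sequence.
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n(candidate, reference, n=1):
    ref, cand = ngrams(reference, n), ngrams(candidate, n)
    # Clipped co-occurrence: each reference n-gram counts at most as often
    # as it appears in the candidate (Count_match in Eq 1).
    match = sum(min(count, cand[g]) for g, count in ref.items())
    total = sum(ref.values())
    return match / total if total else 0.0

print(round(rouge_n(["the", "cat", "sat"], ["the", "cat", "ran"], n=1), 3))  # → 0.667
```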

When we have several references, ROUGE-L is the longest common subsequence (LCS) score between a candidate summary and the reference summaries. As it can assess word relationships, it is generally considered a more context-aware evaluation measure than ROUGE-N. Specifically, ROUGE-L is computed as follows:

\[ R_{lcs} = \frac{\sum_{i=1}^{u} LCS_{\cup}(r_i, C)}{m} \tag{2} \]

\[ P_{lcs} = \frac{\sum_{i=1}^{u} LCS_{\cup}(r_i, C)}{n} \tag{3} \]

\[ F_{lcs} = \frac{(1 + \beta^2)\, R_{lcs}\, P_{lcs}}{R_{lcs} + \beta^2 P_{lcs}} \tag{4} \]

where u is the number of reference sentences, m and n are the total numbers of words in the reference and candidate summaries, β weights recall relative to precision, and LCS∪(r_i, C) is the LCS score of the union of the longest common subsequences between the reference sentence r_i and C, where C is the sequence of candidate summary sentences. For example, if r_i = (w1, w2, w3, w4) and C contains two sentences c1 = (w1, w2, w6, w7) and c2 = (w1, w8, w4, w9), the longest common subsequence of r_i and c1 is (w1, w2), and the longest common subsequence of r_i and c2 is (w1, w4). The union of the longest common subsequences of r_i, c1, and c2 is (w1, w2, w4), and LCS∪(r_i, C) = 3/4.
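The union-LCS computation can be sketched as follows, reproducing the worked example above. This is an illustrative implementation with our own function names, not the official ROUGE code.

```python
def lcs_positions(ref, cand):
    # Standard LCS dynamic program, then a backtrack to recover which
    # reference positions were matched.
    m, n = len(ref), len(cand)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m):
        for j in range(n):
            dp[i + 1][j + 1] = (dp[i][j] + 1 if ref[i] == cand[j]
                                else max(dp[i][j + 1], dp[i + 1][j]))
    pos, i, j = set(), m, n
    while i and j:
        if ref[i - 1] == cand[j - 1]:
            pos.add(i - 1); i -= 1; j -= 1
        elif dp[i - 1][j] >= dp[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return pos

def union_lcs(ref, cands):
    # LCS∪: union of per-sentence LCS matches, normalized by |ref|.
    matched = set().union(*(lcs_positions(ref, c) for c in cands))
    return len(matched) / len(ref)

ri = ["w1", "w2", "w3", "w4"]
c1 = ["w1", "w2", "w6", "w7"]
c2 = ["w1", "w8", "w4", "w9"]
print(union_lcs(ri, [c1, c2]))  # → 0.75, i.e. the 3/4 of the example
```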

4.2 Summarization model

In an extractive summarization task, the goal is to automatically assign a binary label to each unit of the input to indicate whether this unit should be included in the summary. Therefore, we adopted a single classification model to cover the three types of units.

Following Zhou et al. [15], we used a model based on BERT [56], as shown in Fig 3. BERT is a pretrained neural network whose parameters are learned from a large number of documents in advance. BERT is known to achieve good accuracy even with few training samples. Unlike the original work, which adopted a general-domain BERT as the encoder for extractive summarization, we adopted UTH-BERT [57]. In contrast to previous Japanese BERT models [58–60], which were pretrained mainly on web data such as Wikipedia, UTH-BERT was pretrained on a large number of Japanese health records and is expected to perform better on documents in the target domain.

Fig 3. Overview of classification model for clinical segments.

Formally, let the i-th sentence contain l segments, S_i = (s_{i,1}, s_{i,2}, …, s_{i,l}). The j-th segment with k words in S_i is denoted by s_{i,j} = (w_{i,j,1}, w_{i,j,2}, …, w_{i,j,k}). We add [CLS] and [SEP] tokens to the boundaries between sentences. After applying the UTH-BERT encoder, the token vectors are represented as h_{i,j,1}, h_{i,j,2}, …, h_{i,j,k}. Next, we apply average pooling at the segment level. The pooled representation is formulated as follows:

\[ \bar{s}_{i,j} = \frac{1}{k} \sum_{t=1}^{k} h_{i,j,t} \tag{5} \]

Note that segments and clauses do not include the [CLS] and [SEP] tokens in average pooling. Subsequently, we apply a segment-level transformer [61] to capture their relationships for extracting summaries. The model predicts the summarization probability from those outputs as follows:

\[ (o_{i,1}, \ldots, o_{i,l}) = \mathrm{Transformer}(\bar{s}_{i,1}, \ldots, \bar{s}_{i,l}) \tag{6} \]

\[ \hat{y}_{i,j} = \sigma(W o_{i,j} + b) \tag{7} \]

where (s̄_{i,1}, …, s̄_{i,l}) is the sequence of segments input to the transformer, and (o_{i,1}, …, o_{i,l}) is the sequence output by the transformer. The training objective of the model is the binary cross-entropy loss given the gold label y_{i,j} and the predicted probability ŷ_{i,j}.

This model does not need to change its structure depending on the input units. For clauses, the span of the segments is replaced by that of the clauses. In the case of sentences, the average pooling is not performed; instead, we input the [CLS] token into the transformer.
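The segment-level average pooling of Eq (5) can be sketched in numpy. This is a toy illustration: the (start, end) span format, array shapes, and function name are our assumptions, not the paper's implementation.

```python
import numpy as np

def pool_segments(hidden, spans):
    # hidden: (num_tokens, dim) encoder outputs for one sentence.
    # spans: [(start, end), ...] token index ranges, one per segment,
    # chosen so that the [CLS]/[SEP] positions are excluded (as in Eq 5).
    return np.stack([hidden[s:e].mean(axis=0) for s, e in spans])

hidden = np.arange(12, dtype=float).reshape(6, 2)   # 6 tokens, dim 2
segs = pool_segments(hidden, [(1, 3), (3, 5)])      # skip [CLS]=0, [SEP]=5
print(segs)  # two pooled segment vectors: [3., 4.] and [7., 8.]
```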

4.3 Training data

Our model requires entire documents for training. However, our corpus could be too small for training the model and would compromise its robustness. Accordingly, we used the NHO data as training data by assigning pseudo labels. Following previous studies [15, 16], we used ROUGE scores to automatically assign gold labels to the three units. We thus used the ROUGE score both to create the gold labels and to evaluate the model. This may seem unusual, but it is a commonly used approach in previous studies. As ROUGE correlates with human scores [62], the best summary can be obtained by building a system that maximizes this score at evaluation time, regardless of whether the score was used during training. The labeling steps were as follows.

First, we applied the splitter created in Section 3.3 to the NHO dataset and split it into clauses and clinical segments. In this manner, we easily obtained a larger dataset. We used CBAP as a splitter for clauses and SEGBOT as a splitter for clinical segments.

Second, we measured ROUGE-2 F1 for each unit of the source documents (against the discharge summaries), which were then sorted in descending order of their scores. Thus, we obtained a list of units that were important for our summary.

Third, we selected the units from the topmost part of the list. At this stage, we stopped selecting units when the result exceeded 1,200 characters, which was the average length of the summaries in the NHO data.

Finally, we assigned positive labels to the selected units. The entire process yielded the gold standard for the training and evaluation without manual annotation. We randomly selected 1,000 documents each for the development and test sets, and we used the remaining 22,641 documents for the training data.
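The labeling steps above can be sketched as follows. This is a hedged illustration: `scorer` stands in for a ROUGE-2 F1 implementation, and the function and variable names are ours, not from the paper.

```python
def assign_pseudo_labels(units, summary, scorer, budget=1200):
    # Rank units by their ROUGE-2 F1 against the discharge summary.
    ranked = sorted(units, key=lambda u: scorer(u, summary), reverse=True)
    selected, length = set(), 0
    for u in ranked:
        if length > budget:      # stop once the selection exceeds the budget
            break                # (1,200 chars, the average summary length)
        selected.add(u)
        length += len(u)
    # Positive label = unit belongs to the pseudo-gold summary.
    return [1 if u in selected else 0 for u in units]

units = ["abdominal pain noted", "lunch served", "CT performed"]
toy_scorer = lambda u, s: 1.0 if u in s else 0.0  # toy stand-in, not ROUGE
print(assign_pseudo_labels(units, "abdominal pain noted; CT performed",
                           toy_scorer, budget=30))  # → [1, 0, 1]
```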

4.4 Experiments and results

In this experiment, we used the three contextual units, instead of the n-gram units, and evaluated their impact on the summarization performance to determine which unit performs the best. The results of summarization, using the three types of units, are shown in Table 6. Comparing the three types of units in granularity, the model with clinical segments scored the highest in ROUGE-1, ROUGE-2, and ROUGE-L. The model with clinical segments outperformed sentences and clauses in summarizing inpatient records.

Table 2 shows that a sentence can contain multiple events and has room for further segmentation. Sentences are certainly longer than clinical segments and clauses, but the relation between clinical segments and clauses is less clear. Because ROUGE-1 and ROUGE-2 are measured on the basis of 1-grams and 2-grams, respectively, smaller units are more advantageous in the ROUGE evaluation. Table 7 shows the statistical relation among the three types of units. The first column shows how many units a sentence contains on average. The second and third columns show the average numbers of tokens and characters in each type of unit. The result suggests that segments are longer than clauses on average. Nevertheless, the difference between a clause and a segment is small, at least in the average number of characters. Accordingly, the relationship between clause and clinical segment granularity is worth a more detailed analysis.

We verified the ordering of the three types of linguistic units through an additional experiment on the word-wise relation between clauses and clinical segments. For any two linguistic units in a sentence, there are four possible relationships (Fig 4): “Equal,” where the two match exactly; “Inclusive,” where a segment completely includes a clause; “Included,” where a clause completely includes a segment; and “Overlap,” where the two partially overlap.
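The four relationships can be classified mechanically when each unit is represented as a character span. This sketch is ours: the (start, end) span representation is an assumption for illustration.

```python
def relation(seg, cl):
    # seg, cl: (start, end) character spans of a clinical segment and a
    # clause within the same sentence (end is exclusive).
    if seg == cl:
        return "Equal"
    s0, s1 = seg
    c0, c1 = cl
    if s0 <= c0 and c1 <= s1:
        return "Inclusive"   # segment completely includes the clause
    if c0 <= s0 and s1 <= c1:
        return "Included"    # clause completely includes the segment
    return "Overlap"         # the two partially overlap

print(relation((0, 10), (0, 10)))  # → Equal
print(relation((0, 10), (0, 4)))   # → Inclusive
print(relation((2, 6), (0, 10)))   # → Included
print(relation((0, 6), (4, 10)))   # → Overlap
```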

Fig 4. The four types of relationship between clause and clinical segment.

We obtained statistics for the four relationships from all inpatient records and discharge summaries in the NHO data. The results are shown in Table 8. We found that 59.6% of the pairs have the same boundaries, a figure influenced by the many short sentences that have no internal boundaries. “Inclusive” accounted for 20.0% of the relations. The sum of “Equal” and “Inclusive” is thus 79.6%, six times the share of “Included,” which accounted for only 13.1%. These figures give a more detailed picture of the relation between segments and clauses than the bare averages of 11.83 and 10.74 characters per unit in Table 7. Although the difference in average length between segments and clauses is small, there is a significant difference in their relative sizes when each corresponding pair of actual units is compared.

Table 8. The Relationships between clauses and clinical segments.

In sum, clinical segments exhibited the best ROUGE performance, and their size lies between that of sentences and clauses. Combining the results in this section, we conclude that the segment unit introduced in this paper is the optimal unit, lying between the sentence and the clause.

5 Discussion

The result that extractive summarization with sentences is less effective than with finer granularities is consistent with previous studies [15, 16]. Given this consistency, it could be a universal property worth exploiting in further summarization tasks in NLP research.

In the summarization of medical documents, the experimental results of using linguistic units suggest that physicians create discharge summaries by capturing clinical concepts from the inpatient records. On the other hand, sentences and clauses performed poorly, probably because they were chunked only with syntactic information and did not deal with medical concepts. Accordingly, automatic summarization in the medical field requires not only syntactic information but also high-level semantic and pragmatic information related to domain knowledge. Clinical segments are reasonable candidates as atomic units that carry medical information. Therefore, clinical segments can potentially be used to quantify the quality of medical documentation and to acquire more detailed medical knowledge expressed in texts.

The limitations of the current study and analysis are twofold: language dependency and cultural dependency. First, Japanese grammar and Japanese medical practice differ considerably from those associated with European languages, and there may be differences in the description, summarization, and evaluation processes. Accordingly, this extractive pipeline might be applicable only to Japanese clinical settings. In particular, the clinical segment was defined for Japanese, and the only labeled corpus available is in Japanese, so the unit is not directly applicable to other languages. However, the idea of capturing medical concepts may be useful for other languages. In addition, although our study used the largest multi-institutional health records archive in Japan, further studies at various institutions would be preferable to confirm the generalizability of our results. Second, in some countries with different cultural backgrounds, dictation is used to produce clinical records and their summaries [63]. Japanese hospitals, by contrast, do not use dictation to produce discharge summaries, which may encourage frequent copying and pasting from sources to summaries. This custom could have contributed to the extractive nature of discharge summaries in Japan. Analysis of the influence of this cultural difference is left for future work.

6 Conclusion

In this study, we explored the optimal granularity for the automatic summarization of medical documents. The results indicated that clinically motivated semantic units, larger than clauses, are the best granularity for extractive summarization.

Other contributions of this study are summarized as follows. First, we defined clinical segments that capture clinical concepts and showed that they can be reliably split automatically by a machine learning-based method. Second, we identified the optimal granularity of extractive summarization, which can be used for automated summarization of medical documents. Third, we built a Japanese parallel corpus of medical records comprising inpatient records and discharge summaries.

The results of this study suggest that the clinical segments we have introduced are useful for automated summarization in the medical domain. This provides an important insight into how physicians write discharge summaries. Previous studies have used other entities to analyze medical documents [64–66]. Our results will help to provide more effective assistance in the writing process and in the automated acquisition of clinical knowledge.


Acknowledgments

The authors would like to thank Dr. Yoshinobu Kano and Dr. Mizuki Morita for their cooperation in our previous research that served as the foundation for this study. We also thank Ms. Mai Tagusari, Ms. Nobuko Nakagomi, and Dr. Hiroko Miyamoto, who served as annotators.


References

  1. Arndt BG, Beasley JW, Watkinson MD, Temte JL, Tuan WJ, Sinsky CA, et al. Tethered to the EHR: Primary Care Physician Workload Assessment Using EHR Event Log Data and Time-Motion Observations. The Annals of Family Medicine. 2017;15(5):419–426. pmid:28893811
  2. Leslie Kane MA. Medscape Physician Compensation Report 2019; 2019 [cited 2021 Aug 6]. Available from:
  3. Ammenwerth E, Spötl HP. The Time Needed for Clinical Documentation versus Direct Patient Care. A Work-sampling Analysis of Physicians’ Activities. Methods of Information in Medicine. 2009;48(01):84–91. pmid:19151888
  4. Hirsch JS, Tanenbaum JS, Lipsky Gorman S, Liu C, Schmitz E, Hashorva D, et al. HARVEST, a Longitudinal Patient Record Summarizer. Journal of the American Medical Informatics Association. 2014;22(2):263–274. pmid:25352564
  5. Feblowitz JC, Wright A, Singh H, Samal L, Sittig DF. Summarization of Clinical Information: A Conceptual Model. Journal of Biomedical Informatics. 2011;44(4):688–699. pmid:21440086
  6. Aramaki E, Miura Y, Tonoike M, Ohkuma T, Masuichi H, Ohe K. TEXT2TABLE: Medical Text Summarization System Based on Named Entity Recognition and Modality Identification. Proceedings of the BioNLP 2009 Workshop. 2009; p. 185–192.
  7. Liang J, Tsou CH, Poddar A. A Novel System for Extractive Clinical Note Summarization using EHR Data. Proceedings of the 2nd Clinical Natural Language Processing Workshop. 2019; p. 46–54.
  8. Reeve LH, Han H, Brooks AD. The Use of Domain-Specific Concepts in Biomedical Text Summarization. Information Processing & Management. 2007;43(6):1765–1776.
  9. Diaz D, Cintas C, Ogallo W, Walcott-Bryant A. Towards Automatic Generation of Context-Based Abstractive Discharge Summaries for Supporting Transition of Care. AAAI Fall Symposium 2020 on AI for Social Good. 2020;.
  10. Shing HC, Shivade C, Pourdamghani N, Nan F, Resnik P, Oard D, et al. Towards Clinical Encounter Summarization: Learning to Compose Discharge Summaries from Prior Notes. ArXiv. 2021;abs/2104.13498.
  11. Adams G, Alsentzer E, Ketenci M, Zucker J, Elhadad N. What’s in a Summary? Laying the Groundwork for Advances in Hospital-Course Summarization. Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. 2021; p. 4794–4811.
  12. Moen H, Heimonen J, Murtola LM, Airola A, Pahikkala T, Terävä V, et al. On Evaluation of Automatically Generated Clinical Discharge Summaries. Proceedings of the 2nd European Workshop on Practical Aspects of Health Informatics. 2014;1251:101–114.
  13. Moen H, Peltonen LM, Heimonen J, Airola A, Pahikkala T, Salakoski T, et al. Comparison of Automatic Summarisation Methods for Clinical Free Text Notes. Artificial Intelligence in Medicine. 2016;67:25–37. pmid:26900011
  14. Alsentzer E, Kim A. Extractive Summarization of EHR Discharge Notes. ArXiv. 2018;abs/1810.12085.
  15. Zhou Q, Wei F, Zhou M. At Which Level Should We Extract? An Empirical Analysis on Extractive Document Summarization. Proceedings of the 28th International Conference on Computational Linguistics. 2020; p. 5617–5628.
  16. Cho S, Song K, Li C, Yu D, Foroosh H, Liu F. Better Highlighting: Creating Sub-Sentence Summary Highlights. Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing. 2020; p. 6282–6300.
  17. Erkan G, Radev DR. LexRank: Graph-Based Lexical Centrality as Salience in Text Summarization. Journal of Artificial Intelligence Research. 2004;22(1):457–479.
  18. Mihalcea R, Tarau P. TextRank: Bringing Order into Text. Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing. 2004; p. 404–411.
  19. Haonan W, Yang G, Yu B, Lapata M, Heyan H. Exploring Explainable Selection to Control Abstractive Summarization. Proceedings of the Thirty-Fifth AAAI Conference on Artificial Intelligence. 2021;(15):13933–13941.
  20. Dong Y, Wang S, Gan Z, Cheng Y, Cheung JCK, Liu J. Multi-Fact Correction in Abstractive Text Summarization. Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing. 2020; p. 9320–9331.
  21. Cao M, Dong Y, Wu J, Cheung JCK. Factual Error Correction for Abstractive Summarization Models. Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing. 2020; p. 6251–6258.
  22. Sakishita M, Kano Y. Inference of ICD Codes from Japanese Medical Records by Searching Disease Names. Proceedings of the Clinical Natural Language Processing Workshop (ClinicalNLP). 2016; p. 64–68.
  23. Lee HG, Sholle E, Beecy A, Al’Aref S, Peng Y. Leveraging Deep Representations of Radiology Reports in Survival Analysis for Predicting Heart Failure Patient Mortality. Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. 2021; p. 4533–4538.
  24. Lu Q, Nguyen TH, Dou D. Predicting Patient Readmission Risk from Medical Text via Knowledge Graph Enhanced Multiview Graph Convolution. Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval. 2021; p. 1990–1994.
  25. Komaki S, Muranaga F, Uto Y, Iwaanakuchi T, Kumamoto I. Supporting the Early Detection of Disease Onset and Change Using Document Vector Analysis of Nursing Observation Records. Evaluation & the Health Professions. 2021;44(4):436–442. pmid:33938254
  26. Nakatani H, Nakao M, Uchiyama H, Toyoshiba H, Ochiai C. Predicting Inpatient Falls Using Natural Language Processing of Nursing Records Obtained From Japanese Electronic Medical Records: Case-Control Study. JMIR Medical Informatics. 2020;8(4):e16970. pmid:32319959
  27. Katsuki M, Narita N, Matsumori Y, Ishida N, Watanabe O, Cai S, et al. Preliminary Development of a Deep Learning-based Automated Primary Headache Diagnosis Model Using Japanese Natural Language Processing of Medical Questionnaire. Surgical Neurology International. 2020;11. pmid:33500813
  28. Gurulingappa H, Mateen-Rajpu A, Toldo L. Extraction of Potential Adverse Drug Events from Medical Case Reports. Journal of Biomedical Semantics. 2012;3(1):1–10.
  29. Mashima Y, Tamura T, Kunikata J, Tada S, Yamada A, Tanigawa M, et al. Using Natural Language Processing Techniques to Detect Adverse Events from Progress Notes due to Chemotherapy. Cancer Informatics. 2022;21. pmid:35342285
  30. Lee SH. Natural Language Generation for Electronic Health Records. NPJ Digital Medicine. 2018;1(1):1–7. pmid:30687797
  31. MacAvaney S, Sotudeh S, Cohan A, Goharian N, Talati I, Filice RW. Ontology-Aware Clinical Abstractive Summarization. Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval. 2019; p. 1013–1016.
  32. Liu X, Xu K, Xie P, Xing E. Unsupervised Pseudo-labeling for Extractive Summarization on Electronic Health Records. Machine Learning for Health (ML4H) Workshop at NeurIPS 2018. 2018;.
  33. Hunter J, Freer Y, Gatt A, Logie R, McIntosh N, Van Der Meulen M, et al. Summarising Complex ICU Data in Natural Language. AMIA Annual Symposium Proceedings. 2008;2008:323.
  34. Portet F, Reiter E, Gatt A, Hunter J, Sripada S, Freer Y, et al. Automatic Generation of Textual Summaries from Neonatal Intensive Care Data. Artificial Intelligence. 2009;173(7):789–816.
  35. Goldstein A, Shahar Y. An Automated Knowledge-based Textual Summarization System for Longitudinal, Multivariate Clinical Data. Journal of Biomedical Informatics. 2016;61:159–175. pmid:27039119
  36. Brown TB, Mann B, Ryder N, Subbiah M, Kaplan J, Dhariwal P, et al. Language Models are Few-Shot Learners. ArXiv. 2020;abs/2005.14165.
  37. Goodwin T, Savery M, Demner-Fushman D. Towards Zero-Shot Conditional Summarization with Adaptive Multi-Task Fine-Tuning. Findings of the Association for Computational Linguistics: EMNLP 2020. 2020; p. 3215–3226.
  38. Johnson AE, Pollard TJ, Shen L, Li-Wei HL, Feng M, Ghassemi M, et al. MIMIC-III, a Freely Accessible Critical Care Database. Scientific Data. 2016;3(1):1–9. pmid:27219127
  39. Voorhees EM, Hersh WR. Overview of the TREC 2012 Medical Records Track. Proceedings of the Twentieth Text REtrieval Conference. 2012;.
  40. Uzuner Ö, Goldstein I, Luo Y, Kohane I. Identifying Patient Smoking Status from Medical Discharge Records. Journal of the American Medical Informatics Association. 2008;15(1):14–24.
  41. Aramaki E, Morita M, Kano Y, Ohkuma T. Overview of the NTCIR-12 MedNLPDoc Task. In Proceedings of NTCIR-12. 2016;.
  42. Aramaki E. GSK2012-D Dummy Electronic Health Record Text Data [Internet]. Gengo-Shigen-Kyokai; 2013 Feb [cited 2021 Aug 6]. Available from:
  43. National Hospital Organization [Internet]. Clinical Data Archive (診療情報集積基盤, in Japanese); 2015 Aug 5- [cited 2021 Aug 6]. Available from:
  44. Vladutz G. Natural Language Text Segmentation Techniques Applied to the Automatic Compilation of Printed Subject Indexes and for Online Database Access. Proceedings of the First Conference on Applied Natural Language Processing. 1983; p. 136–142.
  45. Kreuzthaler M, Schulz S. Detection of Sentence Boundaries and Abbreviations in Clinical Narratives. BMC Medical Informatics and Decision Making. 2015;15(2):1–13. pmid:26099994
  46. Griffis D, Shivade C, Fosler-Lussier E, Lai AM. A Quantitative and Qualitative Evaluation of Sentence Boundary Detection for the Clinical Domain. AMIA Joint Summits on Translational Science Proceedings. 2016; p. 88–97.
  47. Kudo T. MeCab: Yet Another Part-of-Speech and Morphological Analyzer. Version 0.996 [software]; 2006 Mar 26 [cited 2021 Aug 6]. Available from:
  48. Sato T, Hashimoto T, Okumura M. Implementation of a Word Segmentation Dictionary Called Mecab-ipadic-NEologd and Study on How to Use It Effectively for Information Retrieval. Proceedings of the Twenty-third Annual Meeting of the Association for Natural Language Processing. 2017; p. NLP2017–B6–1.
  49. Ito K, Nagai H, Okahisa T, Wakamiya S, Iwao T, Aramaki E. J-MeDic: A Japanese Disease Name Dictionary based on Real Clinical Usage. Proceedings of the Eleventh International Conference on Language Resources and Evaluation. 2018;.
  50. Maruyama T, Kashioka H, Kumano T, Tanaka H. Development and Evaluation of Japanese Clause Boundaries Annotation Program. Journal of Natural Language Processing. 2004;11(3):39–68.
  51. Li J, Sun A, Joty SR. SegBot: A Generic Neural Text Segmentation Model with Pointer Network. Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence. 2018; p. 4166–4172.
  52. Vinyals O, Fortunato M, Jaitly N. Pointer Networks. Advances in Neural Information Processing Systems 28. 2015; p. 2692–2700.
  53. Bojanowski P, Grave E, Joulin A, Mikolov T. Enriching Word Vectors with Subword Information. Transactions of the Association for Computational Linguistics. 2017;5:135–146.
  54. Grave E, Bojanowski P, Gupta P, Joulin A, Mikolov T. Learning Word Vectors for 157 Languages. Proceedings of the Eleventh International Conference on Language Resources and Evaluation. 2018;.
  55. Lin CY. ROUGE: A Package for Automatic Evaluation of Summaries. Proceedings of the Workshop on Text Summarization Branches Out. 2004; p. 74–81.
  56. Devlin J, Chang MW, Lee K, Toutanova K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. 2019; p. 4171–4186.
  57. Kawazoe Y, Shibata D, Shinohara E, Aramaki E, Ohe K. A Clinical Specific BERT Developed Using a Huge Japanese Clinical Text Corpus. PLOS ONE. 2021;16(11):1–11. pmid:34752490
  58. Kurohashi-Kawahara Laboratory. ku_bert_japanese [software]; 2019 [cited 2021 Aug 6]. Available from:
  59. Inui Laboratory. BERT models for Japanese text [software]; 2019 [cited 2021 Aug 6]. Available from:
  60. National Institute of Information and Communications Technology. NICT BERT Japanese Pre-trained Models (NICT BERT 日本語 Pre-trained モデル) [software]; 2020 [cited 2021 Aug 6]. Available from:
  61. Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, et al. Attention is All you Need. Advances in Neural Information Processing Systems 31. 2017; p. 6000–6010.
  62. Liu F, Liu Y. Correlation between ROUGE and Human Evaluation of Extractive Meeting Summaries. Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies. 2008; p. 201–204.
  63. Cannon J, Lucci S. Transcription and EHRs: Benefits of a Blended Approach. Journal of American Health Information Management Association. 2010;81(2):36–40. pmid:20218195
  64. Skeppstedt M, Kvist M, Nilsson GH, Dalianis H. Automatic Recognition of Disorders, Findings, Pharmaceuticals and Body Structures from Clinical Text: An Annotation and Machine Learning Study. Journal of Biomedical Informatics. 2014;49:148–158. pmid:24508177
  65. Wu Y, Lei J, Wei WQ, Tang B, Denny JC, Rosenbloom ST, et al. Analyzing Differences between Chinese and English Clinical Text: A Cross-Institution Comparison of Discharge Summaries in Two Languages. Studies in Health Technology and Informatics. 2013;192:662–666. pmid:23920639
  66. Pradhan S, Elhadad N, South BR, Martinez D, Christensen L, Vogel A, et al. Evaluating the State of the Art in Disorder Recognition and Normalization of the Clinical Narrative. Journal of the American Medical Informatics Association. 2015;22(1):143–154. pmid:25147248