Learning to diagnose common thorax diseases on chest radiographs from radiology reports in Vietnamese

Deep learning has made remarkable strides in recent years, achieving impressive performance on many tasks, including medical image processing. One of the contributing factors to these advancements is the emergence of large medical image datasets. However, constructing a large and trustworthy medical dataset is exceedingly expensive and time-consuming; hence, multiple research efforts have leveraged medical reports to automatically extract labels for data. The majority of this work, however, has been performed in English. In this work, we propose a data collection and annotation pipeline that extracts information from Vietnamese radiology reports to provide accurate labels for chest X-ray (CXR) images. This can benefit Vietnamese radiologists and clinicians by annotating data that closely match their endemic diagnosis categories, which may vary from country to country. To assess the efficacy of the proposed labeling technique, we built a CXR dataset containing 9,752 studies and evaluated our pipeline using a subset of this dataset. With an F1-score of at least 0.9923, the evaluation demonstrates that our labeling tool performs precisely and consistently across all classes. After building the dataset, we train deep learning models that leverage knowledge transferred from large public CXR datasets. We employ a variety of loss functions to overcome the challenges of imbalanced multi-label datasets and conduct experiments with various model architectures to select the one that delivers the best performance. Our best model (CheXpert-pretrained EfficientNet-B2) yields an F1-score of 0.6989 (95% CI 0.6740, 0.7240), an AUC of 0.7912, a sensitivity of 0.7064 and a specificity of 0.8760 for the abnormal diagnosis in general.
Finally, we demonstrate that our coarse classification (based on the location of abnormalities in four anatomical regions, plus an overall abnormality class) yields results comparable to fine classification (fourteen pathologies) on the benchmark CheXpert dataset for general anomaly detection, while delivering better average performance across all classes.


Introduction
Radiography has long been one of the most ubiquitous diagnostic imaging modalities, and chest X-ray (CXR) is the most commonly performed diagnostic X-ray examination [1]. CXRs play an important role in clinical practice, effectively assisting radiologists in detecting pathologies related to the airways, pulmonary parenchyma, vessels, mediastinum, heart, pleura and chest wall [2]. In recent years, great advances in GPU computing have enabled approaches to medical image report labeling that exploit both the scale of available rule-based systems and the quality of expert annotations.
Dictionary-based heuristics are another popular way of creating structured labels from free-text data. For instance, MedLEE [14] utilizes a pre-defined lexicon to convert radiology reports into a structured format. Mayo Clinic's Text Analysis and Knowledge Extraction System (cTAKES) [15] combines dictionary and machine learning methods, and uses the Unified Medical Language System (UMLS) for dictionary queries. Dictionary-based NLP systems have a key flaw: they do not always achieve high performance when handling in-house raw clinical texts, especially those with misspellings, abbreviations, and non-standard terminology. On top of that, the systems mentioned above only cover English and cannot handle non-English clinical texts. Languages other than English, including Vietnamese, do not have sufficient clinical materials to build a medical lexicon. In nations where English is not the official language, this has been a huge obstacle to building clinical NLP systems. In the current work, our data pipeline can be applied to the data available in PACS and HIS, which can help minimize data labeling costs, time, and effort while reducing radiologists' involvement in the workflow. We propose a set of matching rules to convert a typical radiology report to the normal/abnormal status of our classes.
Beyond the above-mentioned differences in labeling methods, our label selection also differs from previous studies. So far, most studies have been developed for classifying common thoracic pathologies or localizing multiple classes of lesions. For instance, most deep learning models in recent years were developed on the MIMIC-CXR [16] and CheXpert [17][18][19] datasets for classifying 14 common thoracic pathologies on CXRs. The earlier dataset ChestX-ray14 [20], an expansion of ChestX-ray8 [21] with the same set of 14 findings, has been used to develop deep learning models [22,23]. Nevertheless, these approaches differ considerably from how Vietnamese radiologists work. In clinical practice, a CXR radiology report always includes four descriptions corresponding to four fixed anatomical regions of the thorax: chest wall, pleura, pulmonary parenchyma and cardiac. Therefore, it is not practical for Vietnamese radiologists to utilize a CAD system that provides suggestions for the presence of 14 diseases. Typically, when examining a CXR image, radiologists analyze the image by region; consequently, it is more convenient for the system to indicate the abnormality of each area, eliminating the need to match the lesion type with the region being viewed. To address the realistic demands of Vietnamese radiologists, we developed a system to classify CXRs into 5 classes depending on the position of pathologies: chest wall, pleura, parenchyma, cardiac abnormality, and the existence of any abnormality in the CXR. When tested on the benchmark CheXpert dataset, we found that this coarse classification produces results comparable to the detailed classification of 14 findings in terms of the abnormal class, and gives better results in terms of the macro-average F1 score over all classes.
Our work was developed on a dataset collected at Phu Tho General Hospital, a Vietnamese provincial hospital. To produce trainable images with corresponding labels, DICOM files in PACS are matched with radiology reports retrieved from HIS. By extracting data from radiology reports, generating the normal/abnormal status of 5 classes and treating it as the ground-truth reference, we obtained positive results when classifying CXRs according to 5 groups of pathologies, which are modeled after the radiologists' descriptions in their medical reports. Unlike the automatic data labeling methods mentioned above, our proposed method is simple yet accurate: it first filters the descriptions alluding to no findings, then searches for phrases implying abnormalities in each position. The labeling process is therefore strictly controlled through stages, making it easy to detect and correct errors. In addition, adding a manual step to the labeling process helps us deal with misspellings, which were neglected by previous methods. In this step, we also find infrequent phrases and add them to our list of phrases indicating abnormality to make it more complete. Furthermore, a report always includes descriptions corresponding to four fixed anatomical regions of the thorax; thus, by generating a set of labels matching these regions, we minimize the chance that a label is uncertain.

Dataset building pipeline
Our proposed pipeline consists of five steps: (1) data collection, (2) PA-view filtering, (3) XML parsing, (4) data matching and (5) data annotation. Fig 1 illustrates these five steps in detail. First, DICOM files stored in PACS are acquired and filtered by the PA classifier application programming interface (API) to retain only posterior-anterior (PA) view CXRs. Meanwhile, radiology reports stored in HIS as XML files are parsed to extract specific information. Afterward, DICOM files and radiology reports belonging to the same patient are matched to generate pairs of DICOM-XML files from the same examination. Once a DICOM file has been matched with an XML file, the DICOM file is converted to JPG format and the XML file is passed to a labeling tool that generates a set of corresponding labels. At the end of the procedure, we obtain a trainable dataset consisting of JPG images and their corresponding labels.

Data collection
We retrospectively collected chest radiography studies performed at Phu Tho General Hospital over five months, from November 2020 to March 2021, along with their associated radiology reports. Ethical clearance for these studies was approved by the Institutional Review Board (IRB) of Phu Tho General Hospital. With this approval, the IRB allows us to access the hospital's data and analyze raw chest X-ray scans using our VinDr platform, which is used for data filtering. The need for informed patient consent was waived because this retrospective study did not impact clinical care or workflow at the hospital, and all patient-identifiable information in the data was removed.
We selected four types of pathologies because of their prevalence in the medical reports and in clinical practice. An example of a typical description extracted from a radiology report is shown in Fig 2. Most Vietnamese radiologists divide the description into four main categories: lungs, cardiac, pleura and chest wall. From these four groups of pathology, we create an annotation set consisting of five classes: the first four correspond to these groups and the fifth indicates the presence of any abnormality on the CXR.

PA-view filtering
The collected data consisted mostly of posterior-anterior (PA) view CXRs, but also included a large number of outliers such as images of body parts other than the chest, low-quality images, and images with views other than PA. To guarantee that only PA-view CXRs are retained, we ran an API powered by the VinDr platform. The API takes a DICOM file as input and returns the probability that the image saved in that file is a PA-view CXR. The DICOM file proceeds to the next stage of data pre-processing if this probability exceeds 0.5, a normalized threshold; otherwise, the file is marked as ignored.
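The filtering step above can be sketched as follows. This is a minimal illustration: `classify_pa` stands in for the VinDr platform API (whose real interface is not public), and only the 0.5 decision threshold is taken from the text.

```python
PA_THRESHOLD = 0.5  # normalized decision threshold stated in the paper


def filter_pa_views(dicom_paths, classify_pa):
    """Keep only files the PA classifier scores above the threshold.

    `classify_pa` is a hypothetical stand-in for the VinDr API: it takes a
    DICOM file path and returns the probability of a PA-view CXR.
    Returns (kept, ignored) lists of paths.
    """
    kept, ignored = [], []
    for path in dicom_paths:
        if classify_pa(path) > PA_THRESHOLD:
            kept.append(path)       # proceeds to the next pre-processing stage
        else:
            ignored.append(path)    # marked as ignored
    return kept, ignored
```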

XML parser
We use the same procedure for the XML parsing and data matching process as in our previous study [31], shown in Fig 3. The figure illustrates the procedure of extracting radiology reports from HIS. Each assessment and treatment session is saved by HIS in the Extensible Markup Language (XML) file format. A session includes all information about the patient between check-in and check-out time. The XML parser reads the header of a session, which includes SESSION ID, PATIENT ID, CHECK IN TIME, and CHECK OUT TIME. These attributes are shared among all radiology reports belonging to the same session and are used to link to the corresponding DICOM file. All reports are also interpreted using the XML parser to obtain the SERVICE ID, REPORT TIME, and DESCRIPTION properties. To exclude extraneous reports, only those with a SERVICE ID matching the values expressly assigned for chest radiography by the Vietnamese Ministry of Health were preserved.
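The parsing step might look like the sketch below. The XML layout, attribute names (with underscores), and the service ID are our assumptions for illustration; the real HIS schema and the Ministry of Health service codes are not public.

```python
import xml.etree.ElementTree as ET

# Hypothetical placeholder; the real IDs are assigned by the Ministry of Health.
CXR_SERVICE_IDS = {"XQ-CHEST"}


def parse_session(xml_text):
    """Parse one HIS session: read the shared header attributes, then keep
    only the chest-radiography reports (assumed simplified schema)."""
    root = ET.fromstring(xml_text)
    header = {k: root.get(k) for k in
              ("SESSION_ID", "PATIENT_ID", "CHECK_IN_TIME", "CHECK_OUT_TIME")}
    reports = []
    for rep in root.iter("REPORT"):
        if rep.get("SERVICE_ID") in CXR_SERVICE_IDS:  # exclude extraneous reports
            reports.append({**header,
                            "REPORT_TIME": rep.get("REPORT_TIME"),
                            "DESCRIPTION": rep.findtext("DESCRIPTION", "")})
    return reports
```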

Data matching
To match a DICOM file with the corresponding XML file, we implemented the algorithm in [31], which is depicted in Fig 4. One problem we encountered here is that one DICOM file could match multiple reports and vice versa, because their STUDY TIME attributes were separated by less than 24 hours. In such a short period of time, the examination results are often the same; the reason for taking additional radiographs may be the poor quality of the first image. In these cases, the descriptions in the reports are usually identical, and the DICOM file is assigned to one of the matched reports. In the few cases where the descriptions differ, the DICOM file is given to a radiologist to review and match with the correct report.
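The three matching rules (same PATIENT ID; REPORT TIME within 24 hours of STUDY TIME; STUDY TIME between check-in and check-out) can be sketched as a predicate. Field names and the timestamp format are illustrative assumptions, not the authors' actual implementation.

```python
from datetime import datetime, timedelta


def matches(dicom, report, fmt="%Y-%m-%d %H:%M"):
    """Return True when a DICOM study and a radiology report satisfy the
    three matching conditions described in the paper (sketch)."""
    def t(s):
        return datetime.strptime(s, fmt)

    # Condition 1: HIS and PACS are linked by the patient identifier.
    if dicom["patient_id"] != report["patient_id"]:
        return False
    study, rep = t(dicom["study_time"]), t(report["report_time"])
    # Condition 2: report time within 24 hours of study time.
    if abs(rep - study) > timedelta(hours=24):
        return False
    # Condition 3: study time between check-in and check-out.
    return t(report["check_in"]) <= study <= t(report["check_out"])
```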

Data annotation
After extracting the descriptions that match the DICOM files, we developed a simple labeling algorithm that takes the radiologists' description as input and returns a list of five binary elements, corresponding to the presence or absence of abnormalities belonging to the 5 classes. Fig 5 illustrates the major steps of data annotation, which is implemented in a semi-automated manner: (1) pattern filtering, (2) keyword detection, (3) abnormality interpolation and (4) manual labeling.
Pattern filtering. The dataset we obtained from Phu Tho General Hospital is imbalanced, with the majority of the images exhibiting no pathology. We obtained 1,568 different templates from all the descriptions. Filtering descriptions that are elements of a predetermined set of templates (specifically, 11 templates implying no findings) saves a significant amount of time in data labeling. A CXR is considered normal if one of the 11 templates appears verbatim in the DESCRIPTION of the corresponding radiology report.
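The pattern-filtering rule reduces to an exact set-membership test. The template strings below are hypothetical English stand-ins for the 11 Vietnamese no-finding templates, which the paper does not list.

```python
# Illustrative stand-ins for the 11 Vietnamese no-finding templates.
NORMAL_TEMPLATES = {
    "No abnormality detected",
    "Heart and lungs are normal",
}


def is_normal(description):
    """Pattern-filtering step: a study is labeled normal when its
    description exactly matches one of the predefined templates."""
    return description.strip() in NORMAL_TEMPLATES
```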
Keyword detection. After pattern filtering, most of the instances without pathologies have already been labeled. In this step, we handle most of the abnormality descriptions and some remaining normal ones. Keyword detection is divided into four sub-stages, which can be performed simultaneously, to detect keywords indicating abnormalities in the chest wall, pleura, parenchyma, and cardiac regions. To find keywords for each class, e.g. chest wall, we break the radiologist's description down into 4 categories (categories are separated by a dash, "-", in the radiology descriptions). From the sentences in the chest wall category, we gather keywords indicating abnormalities, such as "fracture", "osteoporosis" and "bone fusion surgery", to create a fixed set of keywords. Descriptions containing keywords from the chest wall set are annotated as 1 for the corresponding class, and similarly for the pleura, parenchyma, and cardiac classes. Some common keywords for the four classes are listed in Table 1.
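The keyword-detection step can be sketched as below. The keyword sets are hypothetical English stand-ins for the Vietnamese keywords of Table 1, and the category order is assumed to follow the report layout (chest wall, pleura, lungs, cardiac).

```python
# Illustrative English stand-ins for the Vietnamese keyword sets (Table 1).
KEYWORDS = {
    "chest_wall": {"fracture", "osteoporosis", "bone fusion surgery"},
    "pleura": {"pleural effusion", "pneumothorax"},
    "parenchyma": {"consolidation", "nodule"},
    "cardiac": {"cardiomegaly"},
}


def detect_keywords(description):
    """Keyword-detection step: split the description into the four fixed
    categories (separated by dashes) and scan each category for its
    class's keywords; returns a 0/1 label per class."""
    categories = [c.strip().lower() for c in description.split("-")]
    categories += [""] * (4 - len(categories))  # pad short descriptions
    return {cls: int(any(k in cat for k in KEYWORDS[cls]))
            for cls, cat in zip(KEYWORDS, categories)}
```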
Abnormality interpolation. The first four classes were annotated at the keyword detection stage; here, the abnormality class label is inferred from them. The abnormality value is set to 1 (positive) if any of the other classes is noted as abnormal, or if the description notes any other anomaly that does not belong to the four groups above.
Manual labeling. Descriptions that neither match the 11 normality templates nor contain any of the keywords in the four fixed sets have a high probability of being misspelled, describing rare pathologies, or including pathologies that cannot be assigned to one of the four main regions. To handle such cases, we inspected them to manually correct spelling mistakes, then forwarded confusing descriptions to a radiologist of Phu Tho General Hospital for annotation. These cases account for less than 0.5% of the total descriptions, so labeling the remainder is not time-consuming, which minimizes the doctors' involvement in data labeling.
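The abnormality-interpolation rule described above is a simple logical OR over the region labels, with an extra flag for anomalies outside the four regions:

```python
def interpolate_abnormal(region_labels, has_other_anomaly=False):
    """Abnormality-interpolation step: the fifth (abnormal) label is 1 if any
    of the four region labels is positive, or if the report notes an anomaly
    outside the four regions."""
    return int(any(region_labels) or has_other_anomaly)
```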
Over five months, we obtained a total of 12,367 XML files and 12,376 DICOM files corresponding to 11,088 studies. 10,847 DICOM files were PA chest radiographs, and 10,002 of them matched information extracted from XML files. Table 2 details the number of positive and negative samples of the five classes in the collected dataset. For model development, we split the dataset into training and validation sets at a ratio of 7/3, with the constraint that the distribution of each class in the training and validation sets approximates the distribution of the original dataset.
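A simple way to approximate the distribution-preserving 7/3 split is to group samples by their full label vector and split each group proportionally. This is only a sketch under that assumption; the paper does not specify its exact stratification algorithm.

```python
import random
from collections import defaultdict


def stratified_split(samples, labels, train_frac=0.7, seed=0):
    """Approximate multi-label stratification: group samples sharing the
    same label vector, then split each group train_frac / (1 - train_frac)."""
    groups = defaultdict(list)
    for sample, label in zip(samples, labels):
        groups[tuple(label)].append(sample)
    rng = random.Random(seed)
    train, val = [], []
    for members in groups.values():
        rng.shuffle(members)
        cut = round(len(members) * train_frac)
        train += members[:cut]
        val += members[cut:]
    return train, val
```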

Quality control
To ensure the quality of the dataset, we randomly inspect 5% of the data for inappropriate images or labels that do not match the corresponding report. Whenever an error is found, we trace and correct it, then repeat the 5% sampling until no more errors are detected. The inspection was carried out by a medical student majoring in radiology and double-checked by a radiologist of Phu Tho General Hospital.

Labeler results
We evaluate the effectiveness of the proposed labeling procedure by manually labeling the samples and treating the result as the ground truth. The F1-score is used as the main metric to evaluate the quality of our labeling tool.
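For one class, the precision, recall and F1 reported in Table 3 reduce to the usual counts of true/false positives against the manual ground truth:

```python
def precision_recall_f1(y_true, y_pred):
    """Per-class precision, recall and F1 for binary labels
    (1 = abnormal), with zero-division handled as 0.0."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
```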

Evaluation Set
The evaluation set consists of 3,001 radiology reports from 3,001 instances, which fully overlap with the reports in the validation set. We manually annotated these radiology reports without access to additional patient information, labeling whether there is any abnormality in the chest wall, pleura, pulmonary parenchyma and cardiac regions following a list of labeling conventions agreed upon among ourselves. After we independently labeled each of the 3,001 reports, disagreements were resolved by consensus discussion or consultation with a radiologist. The resulting annotations serve as the ground truth for the reports in the evaluation set.

Evaluation results
Using the radiologists' annotations as ground truth, combined with the set of labels generated by our method, the evaluation results for each class are listed in Table 3, with the metrics of precision, recall and F1 score. Overall, our labeling pipeline delivers high F1 scores in all classes, with the lowest figures of 0.9926 and 0.9985 recorded in the pleura and parenchyma classes, respectively. In the chest wall, cardiac and abnormal classes, our tool delivers perfect performance, without any mislabeled samples.

Model development
Chest X-ray interpretation with deep learning methods usually relies on models pre-trained on ImageNet. Nevertheless, it has been shown that architectures achieving remarkable accuracy on ImageNet are unlikely to give the same performance when evaluated on the CheXpert dataset, and that for medical imaging tasks the choice of model family delivers greater improvement than image resizing within a family [24].
We chose model families that have proved highly efficient for CXR interpretation: ResNet50 [32], DenseNet121 [33], Inception-V3 [34] and EfficientNet-B2 [35]. We also leverage large public CXR datasets such as CheXpert to develop pre-trained models, and compare transfer learning from these benchmark chest X-ray datasets against ImageNet pre-trained models. Furthermore, class imbalance has a negative impact on our dataset; for example, the chest wall class has a positive/negative ratio of 0.003. To address this problem, alongside the conventional Binary Cross Entropy (BCE) loss, we used and assessed other loss functions designed for multi-label imbalanced datasets: Asymmetric Loss (ASL) [25] and Distribution-Balanced Loss (DBL) [26].
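For intuition, a per-element sketch of the asymmetric loss of Ben-Baruch et al. [25] follows. It uses the paper's two mechanisms, asymmetric focusing (a larger exponent for negatives) and probability shifting (a margin `clip` subtracted from negative probabilities); the default hyperparameters shown are those of the original ASL paper, not values confirmed by this work.

```python
import math


def asymmetric_loss(probs, targets, gamma_pos=0.0, gamma_neg=4.0,
                    clip=0.05, eps=1e-8):
    """Asymmetric Loss sketch over sigmoid probabilities and 0/1 targets.

    Positives keep a strong gradient (small gamma_pos); easy negatives are
    down-weighted by a large gamma_neg and by shifting p_m = max(p - clip, 0).
    """
    total = 0.0
    for p, t in zip(probs, targets):
        p = min(max(p, eps), 1 - eps)
        if t == 1:
            # positive term: (1 - p)^gamma_pos * log(p)
            total += (1 - p) ** gamma_pos * math.log(p)
        else:
            # negative term with probability shifting: p_m^gamma_neg * log(1 - p_m)
            p_m = min(max(p - clip, eps), 1 - eps)
            total += p_m ** gamma_neg * math.log(1 - p_m)
    return -total / len(probs)
```

A misclassified positive (p = 0.1) keeps nearly the full cross-entropy penalty, while an easy negative (p = 0.01) contributes almost nothing, which matches the behavior discussed in the results below.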
For each model architecture, we use the Adam optimizer (beta1 = 0.9, beta2 = 0.999, learning rate = 1e-3) with a cosine annealing learning-rate schedule with gradual warm-up, a batch size of 16, three different loss functions (cross-entropy, distribution-balanced and asymmetric loss), and image sizes of 768 and 1024.
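The schedule can be written as a pure function of the step index, as sketched below. The linear warm-up shape and warm-up length are our assumptions; the paper only names the schedule.

```python
import math


def lr_at(step, total_steps, warmup_steps, base_lr=1e-3):
    """Cosine-annealing learning rate with gradual (linear) warm-up."""
    if step < warmup_steps:
        # warm up linearly toward base_lr
        return base_lr * (step + 1) / warmup_steps
    # then decay along half a cosine from base_lr to ~0
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * base_lr * (1 + math.cos(math.pi * progress))
```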
Training was conducted on an Nvidia GTX 1080 with CUDA 10.1 and an Intel Xeon E5-2609 CPU. For each run of a specific model, we train for 160 epochs and evaluate the model every 413 gradient steps. Finally, the checkpoint with the highest F1-score is considered the best model for each training procedure.
We also used the nonparametric bootstrap [27] to estimate 95% confidence intervals for each statistic. 3,000 replicates are drawn from the validation set, and the statistic is calculated for each replicate. This procedure generates a distribution for each statistic; by reporting the 2.5th and 97.5th percentiles, the confidence intervals are obtained, and significance is assessed at the p = 0.05 level.
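The procedure above can be sketched as follows (resampling with replacement, then reading off the 2.5th and 97.5th percentiles of the replicate statistics):

```python
import random


def bootstrap_ci(values, statistic, n_replicates=3000, alpha=0.05, seed=0):
    """Nonparametric bootstrap CI: resample `values` with replacement,
    recompute `statistic` per replicate, and return the (alpha/2,
    1 - alpha/2) percentile interval."""
    rng = random.Random(seed)
    stats = sorted(
        statistic([rng.choice(values) for _ in range(len(values))])
        for _ in range(n_replicates)
    )
    lo = stats[int(alpha / 2 * n_replicates)]
    hi = stats[int((1 - alpha / 2) * n_replicates) - 1]
    return lo, hi
```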

Experimental result
In this work, chest X-ray classification models were trained on the training set detailed in Table 2. The models are distinguished from each other by four attributes: (1) model architecture, (2) pre-training dataset, (3) loss function and (4) image size, while sharing a common training procedure. First, we compare the effect of the pre-training dataset and the impact of several loss functions on the multi-label problem. We choose ImageNet and CheXpert for transferring knowledge to our target data. BCE, a common loss function, and ASL and DBL, two loss functions designed for multi-label problems, were used in our experiments. The reported metrics are the macro-average (Av.) F1-score, AUC, sensitivity and specificity over the five classes. We use only the ResNet50 architecture to compare these aspects, with the same hyperparameter setup.
As shown in Table 4, the model using ASL and CheXpert pre-trained initial parameters gives the best result. All its metrics are higher than those of the others, especially when using ASL. This loss function yields large loss values but is very effective because it heavily penalizes misclassified positive samples while barely penalizing easy negative ones. CheXpert pre-training is also useful because the dataset contains patterns similar to our target data. We therefore use the CheXpert pre-trained model and ASL in subsequent experiments.
To discover which family of architectures best fits our dataset, we conducted further experiments with Inception-V3, DenseNet121 and EfficientNet-B2, which are reported to perform well with radiographic images, and with two image sizes, 768 and 1024. The results, shown in Table 5, indicate that larger image sizes do not yield better results but do increase training time. In terms of model architecture, EfficientNet-B2 outperforms the others. In conclusion, the model with the EfficientNet-B2 architecture and an input size of 768 delivers the best performance.
Detailed results of our best model are presented in Table 6. With ASL, the chest wall class improved significantly, increasing by nearly 32% compared to the model using BCE without CheXpert pre-training. The pleura class has fewer samples than the chest wall class, but its results do not improve much after using ASL, possibly because the chest wall class has a more diverse range of abnormal manifestations in our data, so the model focused more on that class.
The same procedure was also applied to build two models for fine classification (detection of 14 pathologies) and coarse classification (detection of abnormalities in 4 locations in CXR images), in order to evaluate the effectiveness of coarse classification compared to fine classification. We use the CheXpert benchmark dataset to build and evaluate the two models, which share the same configuration to remain objective. The data in the CheXpert dataset are labeled with 14 classes, corresponding to 13 abnormalities in the chest radiograph and an indication of no findings. We infer where a lesion lies among the 4 considered positions based on the lesion type indicated in the CheXpert dataset. Table 7 shows the mappings between the CheXpert labels (14 classes) and the proposed set of labels (5 classes). A comparison of coarse and fine classification is shown in Table 8. Based on these results, the coarse classification method gives a higher F1 score for both the abnormal class and the macro-average over all classes.
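Such a label mapping can be sketched as a lookup plus the abnormality rule. The region assignments below are our plausible guesses; the paper's actual Table 7 mapping may differ.

```python
# Illustrative mapping from CheXpert findings to the four coarse regions
# (our assumption; the paper's exact Table 7 mapping may differ).
CHEXPERT_TO_COARSE = {
    "Fracture": "chest_wall",
    "Pleural Effusion": "pleura",
    "Pneumothorax": "pleura",
    "Pleural Other": "pleura",
    "Lung Opacity": "parenchyma",
    "Consolidation": "parenchyma",
    "Pneumonia": "parenchyma",
    "Atelectasis": "parenchyma",
    "Edema": "parenchyma",
    "Lung Lesion": "parenchyma",
    "Cardiomegaly": "cardiac",
    "Enlarged Cardiomediastinum": "cardiac",
}

COARSE_CLASSES = ["chest_wall", "pleura", "parenchyma", "cardiac", "abnormal"]


def to_coarse(chexpert_labels):
    """Map a CheXpert label dict (1 = positive) to the 5 coarse labels;
    any positive finding other than 'No Finding' also sets the abnormal
    class, even when it falls outside the four regions."""
    coarse = dict.fromkeys(COARSE_CLASSES, 0)
    for finding, positive in chexpert_labels.items():
        if positive == 1 and finding != "No Finding":
            coarse["abnormal"] = 1
            region = CHEXPERT_TO_COARSE.get(finding)
            if region:
                coarse[region] = 1
    return coarse
```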
We also plot Grad-CAMs [28] to give visual explanations of how the model makes its predictions; regions corresponding to the lesions (pleural effusion) were correctly highlighted. These results were obtained with the EfficientNet-B2 architecture, an input size of 768x768, the CheXpert pre-trained model, and the asymmetric loss function.

Conclusion
In the current work, we propose a semi-automatic process for building an accurate CXR dataset, which can take advantage of the resources stored in PACS and HIS systems while minimizing the intervention of radiologists. We also propose a coarse classification method based on the location of abnormalities in radiographs, which addresses the realistic demands of Vietnamese radiologists and is more efficient than classification based on pathology types. Finally, we demonstrate that building pre-trained models using large CXR datasets can significantly improve performance compared to using ImageNet. The models fine-tuned from CheXpert pre-trained models with the asymmetric loss function achieve significant gains over ImageNet pre-trained models, which we believe will serve as a strong baseline for future research. We also believe that this method can be applied to other languages that, like Vietnamese, lack sufficient clinical resources for building medical NLP systems.

Figure 1 .
Figure 1. Overview diagram of the process of collecting and building the medical image dataset. The process consists of five steps: data collection from PACS and HIS, PA-view filtering, XML parsing, data matching and data annotation.
(a) Vietnamese radiology description (b) Translation of Vietnamese radiology description

Figure 2 .
Figure 2. The description in a typical radiology report in Vietnam. The description is divided into four main categories: chest wall, pleura, lungs (parenchyma) and cardiac.

Fig 4 .
Since the HIS and PACS are linked by PATIENT ID, the matching algorithm uses this key to determine whether a DICOM file and a radiography report belong to the same patient. Moreover, REPORT TIME must be within 24 hours of STUDY TIME, which is a regulated protocol of the hospital. Finally, STUDY TIME has to be between CHECK IN TIME and CHECK OUT TIME. If all of these conditions are fulfilled, the DICOM file and the radiology report are matched.

Figure 3 .
Figure 3. Radiology report extraction process for CXR examinations collected from HIS [31]. The original Vietnamese counterparts are put inside square brackets.

Figure 4 .
Figure 4. Algorithm for matching a DICOM file obtained from PACS with a radiology report collected from HIS.

Figure 5 .
Figure 5. Semi-automated data annotation pipeline. The system consists of 4 steps; the first 3 are automatic and the last is carried out manually.

Table 1 .
Examples of Vietnamese keywords indicating abnormalities in the chest wall, pleura, parenchyma and cardiac classes, and abnormalities outside these four groups. English translations are enclosed in square brackets.

Table 2 .
Number of instances containing the five labeled observations in the training set, validation set and the whole dataset.

Table 3 .
Evaluation results of the proposed labeling tool. Evaluation was performed on 3,001 samples of the validation set.

Table 4 .
Experimental results with different pre-training datasets and loss functions. The model pre-trained on the CheXpert dataset and using the asymmetric loss function yields the best performance.
Fig 6 illustrates the plots for all tasks. The model achieves the best AUC on the pleura class (0.96) and the worst on the chest wall class (0.81). The abnormal class recorded an AUC of 0.87, while the parenchyma and cardiac classes reached 0.86 and 0.92, respectively.

Table 5 .
Experimental results with different backbones and input sizes. The model with the EfficientNet-B2 architecture and an input size of 768 delivers the best performance.