Development and validation of artificial intelligence to detect and diagnose liver lesions from ultrasound images

Artificial intelligence (AI) using a convolutional neural network (CNN) has demonstrated promising performance in radiological analysis. We aimed to develop and validate a CNN for the detection and diagnosis of focal liver lesions (FLLs) from ultrasonography (USG) still images. The CNN was developed with a supervised training method using 40,397 retrospectively collected images from 3,487 patients, including 20,432 FLLs (hepatocellular carcinomas (HCCs), cysts, hemangiomas, focal fatty sparing, and focal fatty infiltration). AI performance was evaluated using an internal test set of 6,191 images with 845 FLLs, then externally validated using 18,922 images with 1,195 FLLs from two additional hospitals. The internal evaluation yielded an overall detection rate, diagnostic sensitivity and specificity of 87.0% (95%CI: 84.3–89.6), 83.9% (95%CI: 80.3–87.4), and 97.1% (95%CI: 96.5–97.7), respectively. The CNN also performed consistently well on external validation cohorts, with a detection rate, diagnostic sensitivity and specificity of 75.0% (95%CI: 71.7–78.3), 84.9% (95%CI: 81.6–88.2), and 97.1% (95%CI: 96.5–97.6), respectively. For diagnosis of HCC, the CNN yielded sensitivity, specificity, and negative predictive value (NPV) of 73.6% (95%CI: 64.3–82.8), 97.8% (95%CI: 96.7–98.9), and 96.5% (95%CI: 95.0–97.9) on the internal test set; and 81.5% (95%CI: 74.2–88.8), 94.4% (95%CI: 92.8–96.0), and 97.4% (95%CI: 96.2–98.5) on the external validation set, respectively. CNN detected and diagnosed common FLLs in USG images with excellent specificity and NPV for HCC. Further development of an AI system for real-time detection and characterization of FLLs in USG is warranted.

Introduction
Hepatocellular carcinoma (HCC) is the fourth leading cause of cancer death worldwide [1]. Screening abdominal ultrasonography (USG) has been shown to be cost-effective in reducing mortality from HCC by 37% [2][3][4][5]. However, worldwide surveillance rates remain low, ranging from 6.7% to 28.0% [5][6][7][8][9]. One significant barrier to timely HCC screening is limited access to high-quality ultrasound with interpreting radiologists, particularly in rural areas [10]. Developing an artificial intelligence (AI)-assisted USG image analysis system may facilitate USG screening programs, increase the surveillance rate and improve the survival of HCC patients.
AI systems have shown potential in facilitating radiologic image interpretation [11]. Abdominal USG is one of the most challenging imaging modalities in the field of AI-based medical image analysis for several reasons. First, the quality of USG images varies among different devices and operators [12]. Second, USG images have a low signal-to-noise ratio, making the identification of small lesions from the background difficult. Third, a single abdominal USG image often contains several organ structures, including the liver, gallbladder, kidney, bile duct and blood vessels. The position and orientation of these structures in USG images are not as consistent and standardized as in CT or MRI images; therefore, differentiating true lesions from normal structures and pseudo-lesions can be challenging. Although previous studies on AI neural networks demonstrated 88-96% accuracy in the diagnosis of focal liver lesions (FLLs) in still USG images, the training datasets were small and only internal tests were performed [13][14][15]. Whether these AI systems can be applied in other clinical settings has yet to be investigated.
In the current study, we used a large number of off-line USG images to develop an AI-assisted USG image analysis system for detection and diagnosis of various FLLs including HCC, cyst, hemangioma, focal fatty sparing (FFS), and focal fatty infiltration (FFI). To strengthen generalizability of our AI system, we evaluated its performance on images from both an internal test set and external validation datasets (i.e. images from different hospitals using different machines and different sonographers).

Dataset
This retrospective study was approved by the Institutional Review Board of the Faculty of Medicine, Chulalongkorn University (IRB No. 423/61 and 646/62). Data were collected upon approval from the director and/or ethics committee of King Chulalongkorn Memorial Hospital, Bangkok, Thailand; Mahachai Hospital, Samut Sakhon, Thailand; and Queen Savang Vadhana Memorial Hospital, Chonburi, Thailand. The requirement for informed consent was waived due to the retrospective nature of this study. All ultrasound examinations were de-identified and analyzed anonymously. Images from upper abdominal USG performed between 2010 and 2019 were retrospectively retrieved from the Picture Archiving and Communication System (PACS) of the three hospitals. All data were still images taken as snapshots during ultrasound and had been stored in Digital Imaging and Communications in Medicine (DICOM) format. All images were acquired using curvilinear transducers and allocated into 3 datasets: training set, internal test set and external validation set. The training set and the internal test set were retrieved from the same patient batch at the main study site, King Chulalongkorn Memorial Hospital, Bangkok, Thailand. All images from this batch were randomly allocated to the training set and the internal test set in a 9:1 ratio. The allocation design ensured that all images from the same patient were assigned to the same set, making the image sets independent of each other without any duplicated patients. The external validation set was acquired from Mahachai Hospital, Samut Sakhon, Thailand and Queen Savang Vadhana Memorial Hospital, Chonburi, Thailand to further validate the performance of the AI system. The external validation images were acquired by different sonographers using a variety of USG machine models. We included USG studies with all ranges of image quality from new and older machines to ensure that the AI system can be generalized to other datasets.
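The patient-level allocation described above can be illustrated with a minimal sketch. The exact procedure used in the study is not described, so this is only one plausible approach: it splits patients (not individual images) roughly 9:1, which yields an approximately 9:1 image ratio. The function name `split_by_patient`, the `image_to_patient` mapping and the seed are hypothetical.

```python
import random


def split_by_patient(image_to_patient, test_fraction=0.1, seed=0):
    """Assign whole patients to 'train' or 'test' so no patient appears in both.

    image_to_patient: dict mapping image ID -> patient ID (hypothetical input).
    """
    patients = sorted(set(image_to_patient.values()))
    rng = random.Random(seed)
    rng.shuffle(patients)

    # At least one patient goes to the test set.
    n_test = max(1, round(len(patients) * test_fraction))
    test_patients = set(patients[:n_test])

    split = {"train": [], "test": []}
    for image_id, patient_id in sorted(image_to_patient.items()):
        key = "test" if patient_id in test_patients else "train"
        split[key].append(image_id)
    return split


# Toy data: 20 patients with 3 images each (not study data).
toy = {f"img_{p}_{i}": f"patient_{p}" for p in range(20) for i in range(3)}
toy_split = split_by_patient(toy)
```

Because the shuffle operates on patient IDs, every image of a given patient lands in the same set, which is the property the allocation design is meant to guarantee.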
A total of 17 different ultrasound machine models were included in this study (S1 Table in S1 File).
Five of the most commonly encountered liver lesions, including HCCs, cysts, hemangiomas, FFSs and FFIs were selected for this study (Fig 1) [16,17]. The definitive diagnoses of FLLs were verified using pathology and/or MRI/CT reports. Pathology reports were reviewed first. If not available, MRI and CT reports were then considered. Exclusion criteria were USG studies without further investigation for definitive diagnoses of FLLs and USG studies in which the lesion characteristics were altered by prior treatments. It is noted that in each USG study, there were images with and without FLLs. The normal images without FLLs, which were randomly selected in a 1:1 ratio, were used as negative controls for training the AI system to learn to distinguish FLLs from normal organ structures. An equal number of both types of images facilitated the training process for AI to correctly detect FLLs while minimizing false positivity. In contrast, for the internal test set and external validation set, all negative control images were included in order to replicate the real-world situation in which rare instances of FLLs emerge among a vast number of images showing normal liver and other normal organs.
Since some patients had more than 1 USG study and some studies had more than one image containing FLLs, the following protocol was used to select and include images in the dataset. For the training set, we included all FLL images of all USG studies of each patient in order to diversify images for the AI training. By contrast, in the internal test set and external validation set, we included up to 2 images with FLLs per study and up to 2 studies per patient. For the USG study having >1 image with FLLs, 2 images containing different FLLs were randomly selected. If there were >1 image containing the same FLL, 2 images taken at different probe positions were randomly selected. If there were >1 image with the same FLL taken at identical probe position, only 1 image was randomly selected.

AI system architecture
The AI framework used in this study was a convolutional neural network (CNN) [18]. CNNs are currently the preferred technique for several types of image analysis because their layered structure can detect complex features of the input images: the shallow layers detect simple features such as dots and lines, and the deeper layers detect more complex features such as curves and loops [18]. In the present study, we adopted a CNN architecture called RetinaNet [19], which takes an image as input and outputs a set of bounding boxes surrounding the FLLs, each with its class (predicted diagnosis) and its confidence in predicting that particular diagnosis. Confidence values range from 0 to 1, with a value of 1 being the most confident. The confidence threshold can be adjusted according to clinical relevance; for example, the confidence threshold may be lowered to increase the detection rate for HCC if needed in a certain patient population. The overall performance of the CNN, therefore, varies with the confidence threshold. In this study, the confidence threshold was selected such that the F2 score was optimized on a tuning set, which was a subset of the training set (details in S3 Appendix and S8 Fig in S1 File). The selected confidence threshold was then used in both the internal test set and external validation cohorts. Since each diagnosis was independent, it was possible for RetinaNet to output multiple diagnoses for a single lesion. This approach resembled the usual practice of reporting differential diagnoses of FLLs by radiologists.
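As a rough illustration of threshold selection by F2 score, the sketch below sweeps a list of candidate thresholds and keeps the one with the highest F2 on a tuning set. The actual procedure is given in S3 Appendix and may differ; the function names, the candidate thresholds and the toy detections are assumptions, and each true lesion is assumed to be matched by at most one detection.

```python
def f_beta(tp, fp, fn, beta=2.0):
    """F-beta score from counts; beta=2 weights recall more than precision."""
    b2 = beta * beta
    denom = (1 + b2) * tp + b2 * fn + fp
    return (1 + b2) * tp / denom if denom else 0.0


def best_threshold(detections, n_lesions, thresholds):
    """Return (threshold, F2) maximizing F2 on the tuning set.

    detections: list of (confidence, is_true_positive) pairs (hypothetical format).
    n_lesions: number of ground-truth lesions in the tuning set.
    """
    best = None
    for t in thresholds:
        kept = [hit for conf, hit in detections if conf >= t]
        tp = sum(kept)
        fp = len(kept) - tp
        fn = n_lesions - tp
        score = f_beta(tp, fp, fn)
        if best is None or score > best[1]:
            best = (t, score)
    return best


# Toy detections (not study data): 4 true lesions, 6 predicted boxes.
toy_detections = [(0.9, True), (0.8, True), (0.6, False),
                  (0.4, True), (0.3, False), (0.2, False)]
chosen = best_threshold(toy_detections, n_lesions=4,
                        thresholds=[0.1, 0.35, 0.5, 0.7])
```

With beta = 2, a missed lesion costs more than a false alarm, which fits a screening setting where sensitivity is prioritized.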

Ultrasound image preprocessing
During image preprocessing, all patient identification information and the peripheral areas in the USG images were cropped out. We identified the coordinates of the fan-shaped USG region using the 'Sequence of Ultrasound Regions' DICOM header, so that the cropped image contained only the fan-shaped USG region, with annotations and dimension measurements removed. We also removed markers that had been placed by sonographers in some images (S1 Appendix in S1 File). The images were then resized to 1333 pixels wide and 800 pixels tall before being input into the CNN.

AI system development process
Training phase. A supervised training method was implemented to train the AI system. To generate the training dataset, pathology and/or MRI/CT reports were reviewed by experienced sonographers to identify labels, which were the locations and definitive diagnoses of FLLs in each USG image [20]. A hepatologist (R.C.) subsequently verified the labels to ensure their accuracy. Images in the training set were fed into the AI system to train it to predict the location and diagnosis of the FLLs (Fig 2). The RetinaNet code was adopted from an open-source repository [21,22] and then modified and optimized for analyzing USG images. In this work, RetinaNet was composed of a ResNet50 backbone and the detection and diagnosis heads. The ResNet50 backbone extracted the hierarchy of features, and the detection and diagnosis heads subsequently processed these features and output the locations and diagnoses of FLLs [23].
The training was done in two main steps. First, the backbone ResNet50 was trained on a publicly-available image dataset called Microsoft Common Objects in Context (MS-COCO), which comprises 330,000 images of 1.5 million object instances [24]. Subsequently, the whole CNN, both backbone and heads, was fine-tuned on our USG images in the training set. The CNN was trained for 500,000 iterations (25 epochs × 20,000 steps per epoch) on USG images. The initial learning rate was 0.0001. To enable the CNN to recognize diverse configurations of images and to maximize the number of training images, image augmentation was performed by horizontal translation, vertical translation, rotation, scaling, horizontal flipping, motion blur, contrast, brightness, hue and saturation adjustment at each iteration [25]. The training hyperparameters are shown in the S8 Table in S1 File. During training, a tuning set was used to monitor performance of the CNN. We selected an epoch that optimized mean average precision [26] on the tuning set for final evaluation on the internal test set and the external validation set.
Evaluation phase. The performance of the developed AI system was evaluated first on the internal test set, then on the external validation set.

Performance evaluation metrics
We separately evaluated detection and diagnosis, the two primary tasks of the CNN. Detection rates and diagnostic performance were evaluated on a per-lesion basis. The definitions of the evaluation metrics are described below.
Detection task. An FLL was counted as correctly detected if the CNN generated a bounding box around it and the box overlapped with the true location of FLL, which was assessed using Intersection-over-Union (IoU). In this study, an IoU of greater than 0.2 was a cut-off for a correct detection by the CNN (S2 Appendix and S1 Fig in S1 File). We opted to use this cutoff because FLLs in USG images often have indistinct boundaries, especially for FFSs and FFIs (S2 Fig in S1 File). The detection rate was calculated by dividing the number of FLLs correctly detected by the number of total FLLs. Detection rates stratified by ground truth diagnoses were also evaluated. In contrast, a false positive detection was counted when the AI system outputted a bounding box on an area that did not contain FLLs (e.g. liver parenchyma, normal organ structures, etc.). Evaluation of false positive detections was performed on a per-image basis.
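The IoU criterion can be sketched as follows. The box format `(x_min, y_min, x_max, y_max)` and the function names are assumptions for illustration; only the cut-off of 0.2 comes from the text.

```python
def iou(box_a, box_b):
    """Intersection-over-Union of two axis-aligned boxes.

    Boxes are (x_min, y_min, x_max, y_max) in pixels (assumed format).
    """
    ax0, ay0, ax1, ay1 = box_a
    bx0, by0, bx1, by1 = box_b
    inter_w = max(0.0, min(ax1, bx1) - max(ax0, bx0))
    inter_h = max(0.0, min(ay1, by1) - max(ay0, by0))
    inter = inter_w * inter_h
    union = (ax1 - ax0) * (ay1 - ay0) + (bx1 - bx0) * (by1 - by0) - inter
    return inter / union if union > 0 else 0.0


def is_detected(true_box, predicted_boxes, cutoff=0.2):
    """A lesion counts as detected if any predicted box has IoU > cutoff."""
    return any(iou(true_box, p) > cutoff for p in predicted_boxes)


def detection_rate(true_boxes, predictions_per_lesion, cutoff=0.2):
    """Fraction of lesions with at least one sufficiently overlapping box."""
    hits = sum(is_detected(t, preds, cutoff)
               for t, preds in zip(true_boxes, predictions_per_lesion))
    return hits / len(true_boxes)
```

For example, two 10 x 10 boxes offset by 5 pixels in both directions share a 5 x 5 intersection, giving an IoU of 25/175, which falls below the 0.2 cut-off.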
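Diagnosis task. We used the following metrics to evaluate AI diagnostic performance:
$$\text{sensitivity} = \frac{TP}{TP + FN}, \quad \text{specificity} = \frac{TN}{TN + FP}, \quad \text{accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$
$$\text{PPV} = \frac{TP}{TP + FP}, \quad \text{NPV} = \frac{TN}{TN + FN}$$
where TP, TN, FP and FN are the numbers of true positive, true negative, false positive and false negative diagnoses, respectively.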
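The standard definitions of these metrics in terms of TP, TN, FP and FN can be written as a short helper. This is a generic sketch rather than the study's code; the function name and the toy counts are hypothetical.

```python
def diagnostic_metrics(tp, tn, fp, fn):
    """Standard diagnostic metrics from confusion-matrix counts."""
    def ratio(num, den):
        # Undefined ratios (empty denominator) are reported as NaN.
        return num / den if den else float("nan")

    return {
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "accuracy": ratio(tp + tn, tp + tn + fp + fn),
        "ppv": ratio(tp, tp + fp),
        "npv": ratio(tn, tn + fn),
    }


# Toy counts (not study data).
toy_metrics = diagnostic_metrics(tp=8, tn=85, fp=5, fn=2)
```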
We used a "one-versus-all" method to evaluate diagnostic performance for each FLL diagnosis [27]. For example, when evaluating diagnostic performance for HCC, other diagnoses were counted as a single non-HCC class:
$$\text{sensitivity}_{HCC} = \frac{TP_{HCC}}{TP_{HCC} + FN_{HCC}}$$
where $\text{sensitivity}_{HCC}$ is the diagnostic sensitivity for HCC.
$TP_{HCC}$ is the number of true positive diagnoses for HCC, where the definitive diagnosis is HCC and the AI system correctly diagnosed the lesion as HCC.
$FN_{HCC}$ is the number of false negative diagnoses for HCC, where the definitive diagnosis is HCC, but the AI system falsely diagnosed the lesion as either cyst, hemangioma, FFS or FFI.
In cases where multiple diagnoses reached the confidence threshold and hence were predicted by the AI system, only the diagnosis with the highest confidence value was selected as the AI prediction.
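A minimal sketch of the one-versus-all counting with highest-confidence selection is shown below. The data layout (a dict of per-class confidences for each lesion), the function names and the threshold of 0.5 are assumptions for illustration; the study's actual threshold is given in its supplementary material.

```python
def top_prediction(confidences, threshold):
    """Return the highest-confidence class that reaches the threshold, else None.

    confidences: dict mapping class name -> confidence in [0, 1].
    """
    passing = {c: v for c, v in confidences.items() if v >= threshold}
    return max(passing, key=passing.get) if passing else None


def one_vs_all_counts(pairs, target):
    """Count TP, TN, FP, FN for `target` against all other classes combined.

    pairs: list of (true_label, predicted_label), one per lesion.
    """
    tp = sum(t == target and p == target for t, p in pairs)
    fn = sum(t == target and p != target for t, p in pairs)
    fp = sum(t != target and p == target for t, p in pairs)
    tn = sum(t != target and p != target for t, p in pairs)
    return tp, tn, fp, fn


# Toy example (not study data); 0.5 is a hypothetical threshold.
toy_label = top_prediction({"HCC": 0.7, "hemangioma": 0.6, "cyst": 0.1}, 0.5)
toy_pairs = [("HCC", "HCC"), ("HCC", "hemangioma"), ("cyst", "cyst"),
             ("hemangioma", "HCC"), ("FFS", "FFS")]
toy_counts = one_vs_all_counts(toy_pairs, "HCC")
```

In the toy example both HCC and hemangioma reach the threshold, but only HCC, the more confident of the two, is kept as the prediction.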
Calculation of overall detection rate and overall diagnostic performance. After calculating detection and diagnostic performance metrics for each definitive diagnosis of FLLs, we pooled the performance results from the 5 FLL diagnoses into an overall performance result. Because the numbers of each FLL diagnosis in our dataset were imbalanced, overall performance, including overall detection rate, overall diagnostic sensitivity and specificity, was pooled as an unweighted average to minimize the effect of the imbalanced numbers of FLL diagnoses. For example,
$$DR_{overall} = \frac{DR_{HCC} + DR_{cyst} + DR_{hemangioma} + DR_{FFS} + DR_{FFI}}{5}$$
where $DR_{overall}$ is the overall detection rate. $DR_{HCC}$, $DR_{cyst}$, $DR_{hemangioma}$, $DR_{FFS}$ and $DR_{FFI}$ are the detection rates for HCC, cyst, hemangioma, FFS and FFI, respectively.
$$sens_{overall} = \frac{sens_{HCC} + sens_{cyst} + sens_{hemangioma} + sens_{FFS} + sens_{FFI}}{5}$$
where $sens_{overall}$ is the overall diagnostic sensitivity. $sens_{HCC}$, $sens_{cyst}$, $sens_{hemangioma}$, $sens_{FFS}$ and $sens_{FFI}$ are the diagnostic sensitivities for HCC, cyst, hemangioma, FFS and FFI, respectively.
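To illustrate why an unweighted average limits the influence of class imbalance, the sketch below compares it with a pooled (count-weighted) rate. All counts are hypothetical and chosen only to make the contrast visible; they are not taken from the study.

```python
def unweighted_mean(per_class_rates):
    """Unweighted (macro) average: every diagnosis contributes equally."""
    return sum(per_class_rates.values()) / len(per_class_rates)


def pooled_rate(per_class_counts):
    """Pooled (micro) rate: dominated by the most frequent diagnosis."""
    hits = sum(h for h, _ in per_class_counts.values())
    total = sum(n for _, n in per_class_counts.values())
    return hits / total


# Hypothetical (detected, total) counts per diagnosis, not study data.
toy_class_counts = {
    "HCC": (80, 100),
    "cyst": (450, 500),
    "hemangioma": (70, 100),
    "FFS": (30, 50),
    "FFI": (25, 50),
}
toy_rates = {k: h / n for k, (h, n) in toy_class_counts.items()}
macro = unweighted_mean(toy_rates)       # each class weighted 1/5
micro = pooled_rate(toy_class_counts)    # cysts dominate the total
```

In this toy case the pooled rate is pulled upward by the large, easily detected cyst class, whereas the unweighted average reflects the weaker performance on the rarer classes.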

Statistical analysis
Performance of the CNN was reported as detection rates, false positive detection rates, diagnostic sensitivities, specificities, accuracies, positive predictive values (PPVs), and negative predictive values (NPVs) with 95% confidence intervals (95% CI). Detection and diagnostic performance for each type of FLL as well as overall performance for all FLL diagnoses were calculated. Performance on the internal test set and the external validation set was compared using a two-tailed z-test for the difference of proportions. Python version 3.7 (Python Software Foundation, Delaware, USA) and IBM SPSS Statistics for Windows, version 22 (SPSS Inc., Chicago, Ill., USA) were used for data analyses. A p-value of <0.05 was considered statistically significant.
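A two-tailed z-test for the difference of two proportions can be sketched with the standard pooled-variance formula. The study does not state which variant was used, so this is an assumption. The example applies it to the false positive counts reported in the results (226/6191 and 970/18922), a comparison the paper itself does not report, purely to show the mechanics.

```python
import math


def two_proportion_z_test(x1, n1, x2, n2):
    """Two-tailed z-test for p1 = x1/n1 versus p2 = x2/n2 (pooled variance)."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # Two-tailed p-value from the standard normal: 2 * (1 - Phi(|z|)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# False positive images: internal test set vs external validation set.
z_stat, p_val = two_proportion_z_test(226, 6191, 970, 18922)
```

Note that this test applies directly to simple proportions; the overall detection rate and sensitivity in the study are unweighted averages across diagnoses, so their comparison may have required a different variance estimate.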

Baseline characteristics
A total of 40,397 images with 20,432 FLLs were included in the training set, while 6,191 images with 845 FLLs and 18,922 images with 1,195 FLLs were included in the internal test set and external validation set, respectively. Baseline characteristics of each dataset are described in Table 1.

Performance of the CNN in detection and diagnosis on the internal test set and external validation set is summarized in Table 2.
Lesion detection performance. On the internal test set, the CNN had an overall lesion detection rate of 87.0% (95%CI: 84.3-89.6). The median IoU was 0.788 (range: 0.202-0.978) (S3 Fig in S1 File), suggesting an exceptional agreement between the predicted and true location of the FLL. Compared to the internal test set, the overall detection rate in the pooled external validation set was significantly lower (75.0% (95%CI: 71.7-78.3), p < 0.001), with the median IoU of 0.781 (range: 0.201-0.970) (S3 Fig in S1 File).
The false positive detection rate was 3.7% (226/6191) and 5.1% (970/18922) in the internal test set and external validation set, respectively. The images with false positive detections were reviewed. Blood vessels in the liver were the structures most commonly misidentified as FLLs (Table and S4 Fig in S1 File). Likewise, 114 and 273 images with false negative detections in the internal test set and the external validation set were reviewed. Characteristics associated with missed detection included being a small lesion <1 cm (27.4%), having an uncommon location for that particular diagnosis (8.0%), lesion with atypical characteristics (7.8%), ill-defined lesion (7.5%), and lesion obscured by shadow artifacts or not completely visible (6.2%) (S3 Table and S5 Fig in S1 File).
On the internal test set, the CNN achieved an overall diagnostic sensitivity of 83.9% (95%CI: 80.3-87.4) and specificity of 97.1% (95%CI: 96.5-97.7) (Table 2). The overall performance of the CNN in diagnosing any FLLs in the external validation set was similar to that of the internal test set, with a sensitivity, specificity, accuracy, PPV and NPV of 84.9% (95%CI: 81.6-88.2), 97.1% (95%CI: 96.5-97.6), 95.3% (95%CI: 94.7-95.9), 81.9% (95%CI: 78.4-85.4), and 97.1% (95%CI: 96.6-97.7), respectively. In subgroup analyses of each type of FLL, the diagnostic performance in the external validation set was also comparable to that in the internal test set, as displayed in Table 2.
The confusion matrix for classification results in the internal test set and external validation set is shown in Table 3. After reviewing misclassified images, we found that the most common cause of misclassification was atypical characteristics of FLLs (30.1%, 56/186) (S4 Table and S6 Fig in S1 File).
Subgroup analyses. The AI system detection and diagnostic performance was further stratified by FLL sizes (S5 Table in S1 File) and background liver parenchyma (cirrhosis vs. non-cirrhosis) (S6 Table in

Discussion
The CNN developed in our study demonstrated consistently high diagnostic performance on USG images from both an internal test set and an external validation set. It achieved overall diagnostic sensitivity and specificity of 83.9% and 97.1% on the internal test set and 84.9% and 97.1% on the external validation set. Regarding the detection task, our AI system was able to detect 85.3% of HCCs in the internal test set and 78.3% in the external validation set (p = 0.16). However, averaged across all included FLL diagnoses, the detection rate in the external validation set was significantly lower than in the internal test set (75.0% vs 87.0%; p < 0.001). Factors that may be at play include the greater heterogeneity of image characteristics from different ultrasound machine models in the external validation set compared to the training set (S1 Table in S1 File). This finding underscores the importance of image diversity in the training dataset. To enhance practicality, we propose to train the AI system with additional USG videos, which contain numerous image frames, to better detect FLLs.
For the diagnosis task, the performance results were consistent between the internal test set and the external validation set. The AI system achieved overall sensitivities of 83.9% and 84.9%, and specificities of 97.1% and 97.1% on the internal test set and external validation set, respectively. Our AI system had lower sensitivity for FLL diagnosis than the sensitivities of 93.8%-98.8% shown in previous studies, with comparable specificities of 94.3-98.9% in the previous reports [13][14][15]. The lower sensitivity may have been due to the wider spectrum of FLL diagnoses and characteristics. In the two previous studies, only HCCs, cysts and hemangiomas were selected for testing [14,15]. In the current study, FFSs and FFIs were additionally included as both diagnoses are encountered frequently in liver cancer surveillance settings with prevalence rates of FFS and FFI previously reported as 6.3% and 9.2%, respectively [17,28].
Misclassifications of FLLs by the AI system may be explained by the fact that different types of FLLs can appear very similar on USG images. Moreover, some lesions may have atypical characteristics. We found that HCCs and hemangiomas were sometimes interchangeably misdiagnosed (Table 3). This may be because our sample contained a considerable number of hemangiomas with atypical characteristics (18.8% of all hemangiomas), with 11.8% of hemangiomas appearing as hypoechoic lesions in a fatty liver background and 7% appearing as giant hemangiomas with heterogeneous echogenicity, in contrast to the typical well-defined round hyperechoic lesion (S6 Fig in S1 File). This is supported by our findings that the diagnostic sensitivity for HCC increased as lesion size increased, while the diagnostic sensitivity for hemangioma decreased as lesion size increased. We had specifically designed our AI system to output diagnoses of detected FLLs as differential diagnoses. This should be clinically useful, as physicians will be able to decide on the most likely diagnosis of an FLL by combining the AI diagnosis with their clinical information. We further analyzed whether HCC appeared in the top-k predicted differential diagnoses. Top-1 (equal to the diagnostic sensitivity reported in the main results section), top-2 and top-3 sensitivities for diagnosing HCC were 73.6%, 90.8% and 96.6%, respectively, in the internal test set and 81.5%, 89.0% and 93.6%, respectively, in the external validation set (S7 Table in S1 File). This provides evidence that the AI system can characterize HCC with a low miss rate.
The unique approach of our study is the development and testing of an AI system that can both detect and diagnose FLLs from USG still images. This AI system can automatically detect and classify FLLs without human guidance on the location of FLLs. Images of all ranges of quality were included, which strengthens our findings on the practicality of this AI method. We found that the CNN was able to handle such variation reasonably well. We believe that with more data, the performance of the AI system could be further improved.
The AI development flow can be divided into the following stages: 1) a pre-clinical stage using a single-site retrospective dataset, 2) validation on external cohorts, and 3) evaluation of the usefulness of AI systems in real clinical settings by prospective or randomized controlled trial designs [29]. In this study, we validated the performance on external validation cohorts (i.e. the second stage of the AI development flow) with satisfactory results. Currently, our AI system works off-line on still USG images. Since the ultimate goal is to implement an AI system in clinical practice, we are now incorporating USG videos as training materials to enable our AI system to perform real-time analysis while a USG procedure is being performed.
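The top-k sensitivity described above can be sketched as follows. The paper does not state whether the confidence threshold is applied before ranking, so this sketch simply ranks all classes by confidence, which is an assumption; the data layout and function name are hypothetical and the toy lesions are not study data.

```python
def top_k_sensitivity(lesions, target, k):
    """Fraction of `target` lesions whose top-k ranked diagnoses include `target`.

    lesions: list of (true_label, confidences) where confidences maps
    class name -> confidence (hypothetical format).
    """
    target_conf = [conf for true, conf in lesions if true == target]
    if not target_conf:
        return float("nan")
    hits = 0
    for conf in target_conf:
        # Rank classes from most to least confident.
        ranked = sorted(conf, key=conf.get, reverse=True)
        if target in ranked[:k]:
            hits += 1
    return hits / len(target_conf)


# Toy lesions (not study data): four HCCs and one cyst.
toy_lesions = [
    ("HCC", {"HCC": 0.8, "hemangioma": 0.3, "cyst": 0.1}),
    ("HCC", {"HCC": 0.4, "hemangioma": 0.6, "cyst": 0.1}),
    ("HCC", {"HCC": 0.2, "hemangioma": 0.5, "FFS": 0.4}),
    ("HCC", {"HCC": 0.1, "hemangioma": 0.5, "FFS": 0.4, "cyst": 0.3}),
    ("cyst", {"cyst": 0.9, "HCC": 0.1}),
]
```

By construction, top-k sensitivity is non-decreasing in k, which matches the pattern of the reported top-1, top-2 and top-3 values.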

Conclusion
Given the structured training framework, the CNN has shown good performance for the detection and diagnosis of FLLs in USG images. HCCs can be detected and diagnosed with satisfactory performance. To fulfill our goal of assisting in the detection and diagnosis of FLLs during USG performed by non-radiologists, an AI system for real-time detection and analysis is warranted.