Annotation of enhanced radiographs for medical image retrieval with deep convolutional neural networks

The number of images taken per patient scan has rapidly increased due to advances in software, hardware and digital imaging in the medical domain. Accurate medical image annotation systems are needed, as manual annotation is impractical, time-consuming and prone to errors. This paper presents modeling approaches that automatically classify and annotate radiographs using several classification schemes, and which can be further applied to automatic content-based image retrieval (CBIR) and computer-aided diagnosis (CAD). Different image preprocessing and enhancement techniques were applied to augment grayscale radiographs by virtually adding two extra layers. The Image Retrieval in Medical Applications (IRMA) code, a mono-hierarchical multi-axial code, served as a basis for this work. To extensively evaluate the image enhancement techniques, five classification schemes including the complete IRMA code were adopted. The deep convolutional neural network systems Inception-v3 and Inception-ResNet-v2 were applied, along with Random Forest models with 1000 trees trained on extracted Bag-of-Keypoints visual representations. The classification model performances were evaluated using the ImageCLEF 2009 Medical Annotation Task test set. The applied visual enhancement techniques achieved better annotation accuracy in all classification schemes.


Introduction
Compared with the last decade, ten times more medical images are taken, increasing the number of images per body region per patient to 200-1000 [1]. This huge increase can be traced back to two major factors: rapid advances in technology and the significant importance of medical images. Medical images contain relevant information that is valuable to physicians. They provide a reliable source of anatomical and functional information for accurate diagnosis, effective treatment planning and research work [2,3]. The advances of software and hardware in the information technology sector and of digital imaging in the medical domain have made the acquisition and storage of images in hospitals possible [4]. This large image collection aids medical professionals and improves diagnosis. However, radiologists are challenged by the amount of data. They have to maintain a high interpretation accuracy of radiological images, but also maximize efficiency given the increasing number of images per body region. Computer-based assistance is needed for image interpretation, categorization and annotation [5], as these are beneficial for content-based image retrieval (CBIR) systems and computer-aided diagnosis (CAD) [6].
Deep learning techniques [7] have improved prediction accuracies in object detection [8], speech recognition [9] and in domain applications such as medical imaging [10,11]. Hence, two Deep Convolutional Neural Network (dCNN) systems were adopted for image classification. To compare and evaluate the performance of the applied dCNN systems, a traditional classifier was modeled in addition.
This paper evaluates the effect of several image enhancement techniques on the prediction accuracy obtained on radiographs. To analyze this effect, several classification schemes were derived from the ImageCLEF 2009 Medical Annotation Task dataset. All images used at the training and testing stages were preprocessed with the presented image enhancement techniques. Finally, the obtained image annotation accuracies are compared and discussed.

Related work
Several approaches targeting Information Retrieval (IR) in the medical domain have been designed. KHRESMOI was a large EU-funded project aimed at creating a multilingual and multimodal search system for biomedical information and documentation [12]. The GNU Image-Finding Tool (GIFT), an outcome of the Viper Project, enables users to perform query-by-example (QBE) search and improves result quality with relevance feedback [13]. In [14], the Parallel Distributed Image Search Engine (ParaDISE) was proposed. This search engine enables the indexing and retrieval of images using visual and text features. Lucene Image Retrieval (LIRE), a lightweight open source library, provides image retrieval using visual features such as color and texture [15]. The IRMA code, a mono-hierarchical multi-axial classification code for medical images, was proposed in the Image Retrieval in Medical Applications (IRMA) project [16]. The IRMA code describes the modality of the image, the orientation of the image, the examined body region and the biological system investigated.
Positive results have been achieved by image preprocessing using input color enhancement techniques. In [17], superior values were obtained by using dual deep convolutional neural networks and color input enhancement [18] to detect malignancy in digital mammography images. As computer-aided assistance is needed in image interpretation [19] and improved prediction accuracies have been obtained using deep convolutional neural networks [7], the objective of this paper is to create an automatic image annotation system using deep learning and image enhancement techniques. These annotated radiographs are fundamental for medical image retrieval systems.
The aim of this presented approach is to apply several image enhancement techniques on radiographs, to increase the overall prediction accuracy of classification models. This is fundamental for implementing image retrieval systems.

Dataset
The dataset adopted for evaluation was distributed at the ImageCLEF 2009 Medical Annotation task [20,21]. The training set consists of 12,671 grayscale images and the official evaluation set has 1,732 grayscale images. Each radiograph in the training set is annotated with a 13-character string.

Classification schemes
Five different classification schemes are used for evaluation. They were derived by using the complete IRMA code, as well as by splitting the code into its four axes.
IRMA. The 13-character code used for annotation is known as the IRMA code and was proposed in [16]. The IRMA coding system is hierarchical and consists of four axes: the technical code (T) for image modality, the directional code (D) for body orientations, the anatomical code (A) referring to the body region examined, and the biological code (B) for the biological system examined [16]. The code results in a string of 13 characters, i.e., TTTT-DDD-AAA-BBB, which can be seen in Fig 1. The IRMA classification scheme contains altogether 197 individual classes, which represent the total distinct combinations of all four axes.
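The TTTT-DDD-AAA-BBB layout described above suggests how the four per-axis labels can be derived from a complete code. The following is a minimal sketch, assuming the axes are separated by hyphens exactly as in that pattern; the names `IrmaCode` and `split_irma_code` and the example code value are hypothetical and serve only to illustrate the layout.

```python
from typing import NamedTuple


class IrmaCode(NamedTuple):
    technical: str    # T axis, 4 characters
    directional: str  # D axis, 3 characters
    anatomical: str   # A axis, 3 characters
    biological: str   # B axis, 3 characters


def split_irma_code(code: str) -> IrmaCode:
    """Split a 'TTTT-DDD-AAA-BBB' string into its four axes."""
    parts = code.strip().split("-")
    # Reject anything that does not match the 4-3-3-3 axis layout.
    if [len(p) for p in parts] != [4, 3, 3, 3]:
        raise ValueError(f"not a TTTT-DDD-AAA-BBB code: {code!r}")
    return IrmaCode(*parts)


# Hypothetical example value, used only to show the layout.
example = split_irma_code("1121-127-700-500")
print(example.anatomical)  # label that the A-scheme would use for this image
```

Under this assumption, the complete string is the label for the IRMA scheme, while each field of the tuple is the label for the corresponding single-axis scheme.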
(T) technical scheme. The (T) technical classification scheme is the technical axis of the IRMA code. It consists of a 4-character string and denotes physical source, modality position, techniques and sub-techniques [16]. The T-scheme has 6 classes. A random excerpt of radiographs from the training set annotated with the T-scheme is shown in Fig 2.
(D) directional scheme. The (D) directional classification scheme is a 3-character string and denotes the orientation plane of the radiographs, such as coronal, sagittal and transversal [16]. This scheme is made up of 34 classes. A random excerpt of radiographs from the training set annotated with the D-scheme is shown in Fig 3.
(A) anatomical scheme. The (A) anatomical classification scheme stands for the complete coding of the anatomical regions present in the human body. The A-scheme defines nine major body regions, where each region has two hierarchical sub-regions [16]. In total, the anatomical scheme has 97 individual classes and each class is represented by a 3-character string. A random excerpt of radiographs from the training set annotated with the A-scheme is shown in Fig 4.
(B) biological scheme. The (B) biological classification code categorizes the organic system scanned into ten major parts [16]. The B-scheme contains 11 classes and is represented by a 3-character string.

Image enhancement
In this section, the three experiments adopted for enhancing visual representation before the classification and annotation of the radiographs are explained.

Image layering
For image recognition tasks, convolutional neural networks trained on large datasets produce favorable results. Considering the number of images in the ImageCLEF 2009 Medical Annotation Task, transfer learning with pre-trained networks, such as Inception-v3 [22] and Inception-ResNet-v2 [23], was chosen. These pre-trained Deep Convolutional Neural Network (dCNN) models were designed to extract, amongst other features, color information from the images [24,25]. However, the radiographs distributed for the ImageCLEF 2009 Medical Annotation Task are grayscale images and have a single color channel with values in [0, 255]. To fully utilize the capabilities of dCNNs, two extra color layers are added to each radiograph, completing the RGB frame with the enhanced slices.
The first extra layer was obtained using the image processing technique Contrast Limited Adaptive Histogram Equalization (CLAHE) [18]. CLAHE is a contrast enhancement method, modified from Adaptive Histogram Equalization (AHE). It is designed to be broadly applicable and has demonstrated effectiveness, especially for medical images [26]. The second layer was generated by applying the Non Local Means (NL-MEANS) preprocessing method. This is a digital image denoising method, based on a non-local averaging of all pixels in an image [27]. The effect of applying NL-MEANS to a randomly chosen radiograph from the ImageCLEF 2009 Medical Annotation Task training set is shown in Fig 7. The NL-MEANS output images were obtained using the following parameters:
The augmented RGB image is obtained by adding the two layers to the original grayscale radiograph, as shown in Fig 8.
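The layering step itself, i.e. combining the original radiograph with two enhanced copies into one three-channel image, can be sketched as follows. This is only an illustration under stated assumptions: the channel order (original, first enhanced layer, second enhanced layer) is assumed, images are plain nested lists, and the two enhancers in the demo are simple stand-ins, not CLAHE or NL-MEANS. All function names are hypothetical.

```python
from typing import Callable, List

Gray = List[List[int]]       # H x W grayscale image, values in [0, 255]
Rgb = List[List[List[int]]]  # H x W x 3 image


def layer_image(gray: Gray,
                first_enhancer: Callable[[Gray], Gray],
                second_enhancer: Callable[[Gray], Gray]) -> Rgb:
    """Stack the original image and two enhanced copies as three channels."""
    first = first_enhancer(gray)
    second = second_enhancer(gray)
    # Assumed channel order: [original, first layer, second layer].
    return [
        [[g, a, b] for g, a, b in zip(row_g, row_a, row_b)]
        for row_g, row_a, row_b in zip(gray, first, second)
    ]


def stretch_contrast(gray: Gray) -> Gray:
    """Stand-in enhancer (NOT CLAHE): linear stretch to the full [0, 255] range."""
    lo = min(min(row) for row in gray)
    hi = max(max(row) for row in gray)
    if hi == lo:  # flat image, nothing to stretch
        return [row[:] for row in gray]
    return [[(v - lo) * 255 // (hi - lo) for v in row] for row in gray]


def identity(gray: Gray) -> Gray:
    """Stand-in enhancer (NOT NL-MEANS): returns an unchanged copy."""
    return [row[:] for row in gray]


rgb = layer_image([[50, 100], [150, 200]], stretch_contrast, identity)
print(rgb[0][1])  # the three channel values of the pixel at row 0, column 1
```

In a real pipeline the two callables would wrap actual CLAHE and NL-MEANS implementations from an image processing library; passing them in as functions keeps the stacking logic independent of that choice.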

Image padding
The radiographs distributed for the ImageCLEF 2009 Medical Annotation Task vary in height and width. Scans of the upper and lower extremities are usually narrow, with a smaller width, while head scans are wider, with a smaller height. To obtain a uniform size over all images, a fixed size was defined. All radiographs in the dataset were resized to [512 x 512] by padding the input images, which can be seen in Fig 9. The images are padded by repeating their own content; alternatives are padding with a constant value or noise, as well as image squashing.
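Padding by repetition can be sketched as tiling the image until the target size is filled. This is a minimal illustration, assuming the original image is placed at the top-left corner and already fits inside the target square; the source does not state the placement, and the name `pad_by_repetition` is hypothetical.

```python
from typing import List

Gray = List[List[int]]  # H x W grayscale image


def pad_by_repetition(img: Gray, size: int = 512) -> Gray:
    """Fill a size x size canvas by repeating the image to the right and downwards."""
    h, w = len(img), len(img[0])
    if h > size or w > size:
        raise ValueError("image must already fit inside the target size")
    # Wrapping the row and column indices repeats the image content.
    return [[img[r % h][c % w] for c in range(size)] for r in range(size)]


# Tiny demo on a 2 x 3 image padded to 4 x 4.
small = [[1, 2, 3],
         [4, 5, 6]]
for row in pad_by_repetition(small, size=4):
    print(row)
```

Compared with constant-value padding, this keeps the intensity statistics of the padded area similar to those of the radiograph itself, which is one plausible reason for preferring it.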
Both image layering and padding, as explained in subsections Image Layering and Image Padding, are applied successively; the output image is shown in Fig 10.

TensorFlow
For the dCNNs, TensorFlow-Slim (TF-Slim), a lightweight package for defining, training and evaluating models in TensorFlow [28] with pre-trained models, was adopted. To optimize prediction performance, the models were fine-tuned with all trainable weights and the best hyperparameter configuration in the second training phase.
Inception-v3. The pre-trained model Inception-v3 [22], which was trained for the ImageNet [24] Large Visual Recognition Challenge 2012 [29], was used to fine-tune the classification model. To optimize classification accuracy, a grid search was used to obtain the best hyperparameter configuration. For the Inception-v3 classification models, the following hyperparameter configuration was applied:
For all other parameters not mentioned above, the default values as proposed in TF-Slim [28] were adopted.
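A grid search of the kind mentioned for Inception-v3 can be sketched generically as an exhaustive loop over all hyperparameter combinations. The grid values, the parameter names `learning_rate` and `batch_size`, and the scoring function below are assumptions made for illustration; they are not the configuration used in this work. In practice `evaluate` would fine-tune a model and return its validation accuracy.

```python
from itertools import product
from typing import Callable, Dict, List, Tuple


def grid_search(grid: Dict[str, List],
                evaluate: Callable[[Dict], float]) -> Tuple[Dict, float]:
    """Try every combination in the grid and return the best one with its score."""
    names = list(grid)
    best_cfg, best_score = None, float("-inf")
    for values in product(*(grid[name] for name in names)):
        cfg = dict(zip(names, values))
        score = evaluate(cfg)
        if score > best_score:
            best_cfg, best_score = cfg, score
    return best_cfg, best_score


# Hypothetical grid and toy scoring function standing in for model training.
toy_grid = {"learning_rate": [0.1, 0.01, 0.001], "batch_size": [16, 32]}


def toy_score(cfg: Dict) -> float:
    return -abs(cfg["learning_rate"] - 0.01) - abs(cfg["batch_size"] - 32) / 1000


best_cfg, best_score = grid_search(toy_grid, toy_score)
print(best_cfg)
```

The cost grows with the product of the grid sizes, so each additional hyperparameter multiplies the number of training runs.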
Inception-ResNet-v2. The pre-trained model Inception-ResNet-v2 [23], which is a variation of Inception-v3 using the ideas presented in [30,31], was used to fine-tune the classification model. For the Inception-ResNet-v2 classification models, the following hyperparameter configuration was applied:
For all other parameters not mentioned above, the default values as proposed in TF-Slim [28] were adopted.

Random Forest
Random forest (RF) [32] models with 1000 deep trees were trained to compare accuracy amongst classification models. These RF models were trained using visual image representations obtained with the Bag-of-Keypoints (BoK) [33] approach. For whole-image classification tasks, the BoK approach has achieved high classification accuracy [34,35]. BoK is based on vector quantization of affine invariant descriptors of image patches [33]. Simplicity and invariance to affine transformations are advantages that come with this approach.
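The vector quantization at the core of BoK can be sketched as follows: each local descriptor is assigned to its nearest codeword, and the image is represented by the histogram of these assignments. This is a simplified illustration under several assumptions: the codebook is taken as given, nearest codewords are found by brute force, the histogram is normalized to sum to one, and toy 2-D vectors stand in for real descriptors. The function names are hypothetical.

```python
from typing import List, Sequence

Vector = Sequence[float]


def nearest_codeword(descriptor: Vector, codebook: List[Vector]) -> int:
    """Index of the codeword with the smallest squared L2 distance."""
    def sq_dist(center: Vector) -> float:
        return sum((d - c) ** 2 for d, c in zip(descriptor, center))
    return min(range(len(codebook)), key=lambda i: sq_dist(codebook[i]))


def bok_histogram(descriptors: List[Vector], codebook: List[Vector]) -> List[float]:
    """Normalized histogram of codeword assignments for one image."""
    counts = [0] * len(codebook)
    for descriptor in descriptors:
        counts[nearest_codeword(descriptor, codebook)] += 1
    total = sum(counts)
    if total == 0:  # image without descriptors
        return [0.0] * len(codebook)
    return [c / total for c in counts]


# Toy data: two codewords and four 2-D descriptors.
toy_codebook = [[0.0, 0.0], [10.0, 10.0]]
toy_descriptors = [[1.0, 1.0], [2.0, 0.0], [9.0, 9.0], [0.0, 1.0]]
print(bok_histogram(toy_descriptors, toy_codebook))
```

The resulting fixed-length vector has one entry per codeword regardless of how many descriptors an image yields, which is what allows a standard classifier to be trained on it.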
All functions applied to render visual models are from the VLFEAT library [36]. Dense SIFT (dSIFT) [37] descriptors were extracted uniformly at several resolutions with an interval of 4 pixels using the VL-PHOW function. Computation was sped up by running k-means clustering with Approximate Nearest Neighbor (ANN) [38] search on randomly chosen descriptors using the VL-KMEANS function. This partitions the observations into k clusters so that the within-cluster sum of squares is minimized.
A maximum of 20 iterations was defined to allow the k-means algorithm to converge, and cluster centers were initialized using random data points [39]. With k = 1,000, a codebook containing 1,000 keypoints was generated. Using the VL-KDTREEBUILD function, the codebook was further optimized by building a kd-tree with the L2 distance metric for quick nearest neighbor lookup. Parameters used to tune BoK and RF are:

Results
Image class prediction was computed using five classification schemes: the complete IRMA code and its four axes separately. The performances of the modeled classifiers on the different classification schemes are listed in Tables 1-3, for Random Forest, Inception-v3 and Inception-ResNet-v2, respectively.
Evaluation was performed on the official test set of the ImageCLEF 2009 Medical Annotation Task, and all models were trained with the complete training set distributed for that task. The best prediction performances per classifier model and image input obtained on the different classification schemes are displayed in Tables 4-8 for easier understanding.

Discussion
It can be seen from all result tables that better prediction accuracies are obtained with the enhanced radiographs. This is observed for all three classification models and all five schemes adopted. However, no single enhancement technique outperforms the rest; the best technique varies with the classification scheme used, which can be explained by the no free lunch theorem [40]. Certain enhancement techniques perform better for some classification schemes. Image Layered achieves the best results when trained with Bag-of-Keypoints and Random Forest. Image Padding performs best with models trained with the deep learning system Inception-v3. For models trained with Inception-ResNet-v2, Image Layered/Padding leads to better results. Best prediction performance was obtained with the following model and enhancement technique combination:
As the number of classes increases, the prediction accuracy rate decreases. The anatomical and IRMA schemes are class imbalanced, having few or no images representing some classes. Hence, the uncertainty of the models is high for these images. The prediction results of the IRMA scheme are the lowest, as it contains the highest number of sparsely represented classes. However, a hierarchical classification can be used to tackle this task, as the models for the individual axes perform well.
Based on the results shown, a more robust and certain model can be obtained with a balanced class distribution of the images in the training set. An ensemble of models trained with several image enhancement techniques should be applied with majority vote, to achieve the optimal combination of training model and enhancement technique.
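The majority vote suggested above can be sketched as follows, assuming each model outputs one class label per image. The tie-breaking rule (the label of the earliest model in the list wins) and the function names are assumptions for illustration; the labels in the demo are hypothetical.

```python
from collections import Counter
from typing import List, Sequence


def majority_vote(labels: Sequence[str]) -> str:
    """Most frequent label; ties go to the label that appears first."""
    if not labels:
        raise ValueError("at least one prediction is required")
    counts = Counter(labels)
    top = max(counts.values())
    for label in labels:  # scanning in model order implements the tie-break
        if counts[label] == top:
            return label


def ensemble_predict(per_model: List[Sequence[str]]) -> List[str]:
    """Combine per-model prediction lists (one label per image) image by image."""
    return [majority_vote(list(votes)) for votes in zip(*per_model)]


# Hypothetical predictions of three models for three images.
layered = ["A", "B", "C"]
padded = ["A", "C", "C"]
layered_padded = ["B", "C", "A"]
print(ensemble_predict([layered, padded, layered_padded]))
```

Listing the models in order of their individual validation accuracy would make the tie-break favor the strongest single model.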

Conclusion
In this paper, grayscale radiograph enhancement methods aiming to achieve better classification and annotation performance are presented. Two extra color layers are added to simulate RGB-channeled images, as Deep Convolutional Neural Networks (dCNN) use color information for training. Due to variations in width and height, the radiographs are padded with cropped patches to fill up the defined size of [512 x 512].
The dCNN systems Inception-v3 and Inception-ResNet-v2 were applied as image classification models. The traditional machine learning algorithm Random Forest (RF), trained with Bag-of-Keypoints visual representations, was adopted for performance comparison. Five classification schemes, each having a different number of classes and categorization focus, were applied to evaluate the image enhancement techniques.
This work shows that enhancing the radiographs before training and classification yields positive results. This is observed for the models trained with the deep learning systems Inception-v3 and Inception-ResNet-v2, as well as for the traditional combination of Bag-of-Keypoints and Random Forest. For all five classification schemes, better prediction accuracies were achieved when the enhanced radiographs were used.
Future evaluation of radiograph annotation can be based on multi-modal image representation and hierarchical class annotation, as positive results have been presented in recent approaches.