Automatic image annotation method based on a convolutional neural network with threshold optimization

In this study, a convolutional neural network with threshold optimization (CNN-THOP) is proposed to address the overlabeling or downlabeling that arises in multilabel image annotation when labels are assigned by applying a ranking function to the prediction probabilities. The model fuses a threshold optimization algorithm into the CNN structure. First, batch normalization (BN) is added to the CNN structure to accelerate convergence, and the optimal model obtained by training is used to predict the test set images and obtain a group of prediction probabilities. Second, threshold optimization is performed on the obtained prediction probabilities to derive an optimal threshold for each class of labels, forming a group of optimal thresholds. When the prediction probability for a class of labels is greater than or equal to the corresponding optimal threshold, this class of labels is used as an annotation result for the image. During the annotation process, the multilabel annotation of the image to be annotated is realized by loading the optimal model and the optimal thresholds. Verification experiments are performed on the MIML, COREL5K, and MSRC datasets. Compared with the MBRM, the CNN-THOP increases the average precision on MIML, COREL5K, and MSRC by 27%, 28% and 33%, respectively. Compared with the E2E-DCNN, the CNN-THOP increases the average recall rate by 3% on both COREL5K and MSRC. The most precise annotation effect for CNN-THOP is observed on the MIML dataset, with a complete matching degree reaching 64.8%.


Introduction
With the continued development of network technology and the growing popularity of multimedia devices, network image data are growing at an exponential rate. Taking WeChat (a communication software) as an example, the daily number of uploaded images in WeChat moments now exceeds a hundred million [1]. In this information explosion era, organizing and retrieving unlabeled images has become a research interest in the field of image management [2]. Unlike previous single-label classified images, most images currently contain rich semantic content, where a common image normally contains several keywords or labels [3]. On the one hand, artificial annotation has low efficiency, and it can only complete annotation for a limited number of images. On the other hand, although artificial annotation achieves relatively high annotation accuracy, its outcomes are likely to be affected by subjective elements. Furthermore, the manpower and time costs are high. To solve these problems, experts and scholars have proposed automatic multilabel image annotation, namely, assigning multiple labels to an image that contains rich semantic information to complete image annotation by computer [4,5]. These labels, covering multiple image semantics, can be used to reasonably and effectively manage these image data. Thus, the massive number of images in these networks can be better used and analyzed, and the subjective errors and costs of artificial annotation can be greatly reduced. Based on the aforementioned information, this study focuses on automatic multilabel image annotation based on deep learning.

Related works
Automatic image annotation can be widely applied in the fields of image retrieval and image classification. To date, a number of automatic annotation methods have been proposed, and these methods can be roughly divided into two categories: one is based on traditional machine learning, and the other is based on deep learning. For the traditional machine learning-based category, Zou et al. [6] proposed a multiview multilabel (MVML) learning algorithm, integrating multifeature (view) and ensemble learning simultaneously and utilizing the complementarity among the views and the base learners of ensemble learning to improve the annotation accuracy. The accuracy can be enhanced by integrating multiple classifiers used for prediction. However, the training process of the model may be complex and inefficient. Tan et al. [7] proposed an approach called multilabel classification based on low-rank representation (MLC-LRR). First, a low-rank constrained coefficient matrix is calculated by using low-rank representation in the image feature spaces. Then, a feature-based graph is defined, the global relationship between images is captured, and a semantic graph is constructed. Finally, the multilabel classifier is trained by combining these two graphs. However, this method hinges heavily on the proportion of annotated images and does not consider the semantic differences between the underlying features and high-level features. Hu et al. [8] proposed a MIML-KNN method based on metric learning. The label Laplacian matrix is learned to derive the label correlations by minimizing the label manifold regularizer. They proposed a novel MIML objective function and constructed the MIML-KNN classifier using Hausdorff distances. This method requires a large amount of computation for massive training data. In addition, it has poor fault tolerance: wrongly labeled training samples near the sample being classified can lead to classification errors. Yang et al. [9] proposed automatic image annotation based on multiview deep representation. To process various keywords and select appropriate features in the image, they suggested a multiview stacked autoencoder (MVSAE) framework, which was used to establish a correlation between the underlying visual features and the high-level semantic information to realize automatic image annotation. Tian et al. [10] discussed a Gaussian mixture model (GMM)-based automatic image annotation method. They used the GMM for training model construction and rival penalized expectation maximization (RPEM) for posterior probability assessments. Additionally, they constructed label similarity graphs to avoid polysemy occurring during annotation and then used the rank-two relaxation heuristics algorithm to deeply explore the correlation between candidate labels. Joshua et al. [11] proposed an automatic image annotation method based on a multiclass support vector machine with hybrid kernels. They used the local binary pattern-discrete wavelet transform (LBP-DWT) technique to extract image features from the horizontal, vertical and diagonal directions. Tang et al. [12] proposed a semisupervised adaptive hypergraph learning method for automatic image annotation, and they used limited annotated data and abundant unlabeled data to improve annotation performance. However, due to the existence of semantic gaps, the annotation effect is still undesirable for images with complex backgrounds. The above traditional machine learning methods usually rely on manually designed features during the image feature extraction process, which unavoidably gives rise to subjective errors, resulting in errors in image information extraction and poor experimental accuracy.
In recent years, deep learning-based methods have become a major focus in research on automatic image annotation [13]. Image features are extracted by convolution operations [14], and the relationship between image features and labels is established by training a deep neural network model. In 2006, Hinton [15] first proposed to effectively train features in a training set using a deep neural network. Later, Markatopoulou et al. [16] proposed a deep convolutional neural network (DCNN) architecture in which the trained DCNN was treated as an independent classifier to evaluate the direct output of the whole network and train it as a feature generator. Wang et al. [17] developed a dual model based on the multilabel selection algorithm, integrating a discriminative model with a nearest-neighbor-based model. Laib et al. [18] proposed a latent topic model based on latent Dirichlet allocation and a convolutional neural network for event classification and image annotation. Based on initial labels extracted from the CNNs and initial labels of possibly user-defined tags, the event categories and final annotations of the images were estimated through a refinement process based on the expectation-maximization (EM) algorithm. Ke et al. [19] developed an end-to-end automatic image annotation model based on a deep convolutional neural network (E2E-DCNN) and multilabel data augmentation. A deep CNN structure was adopted for adaptive feature learning, in which the cross-entropy loss functions were first used to construct an end-to-end annotation structure for training, and Wasserstein generative adversarial networks were used for multilabel data augmentation. Deep neural network models have made headway in the field of image labeling, but some deficiencies remain. Existing deep learning annotation methods mainly improve the models themselves and do not fully consider the differences in semantic content between images. Whether a threshold or a ranking function is used to label the test images, all images are treated indiscriminately: the same threshold, such as 0.5, is set for every label, or the ranking function uniformly assigns the labels with the top few probabilities. As a result, too many or too few labels are assigned when the number of labels of an image is unknown. The deep neural network is a model containing multilayer nonlinear operations. It has a powerful representation ability and can learn many complex structures. However, a more complex structure may easily result in overfitting [20].
To solve the problems of too many or too few labels and of overfitting caused by deep networks, this study proposes a convolutional neural network with threshold optimization (CNN-THOP). First, to speed up the training of CNNs and prevent overfitting to some extent, batch normalization (BN) [21] is added before the activation layer of traditional CNNs. The purpose of introducing BN is to standardize and linearly transform the data so that the inputs to the activation function fall within a region where the function is sensitive to its input. Thus, a small change in the input produces a larger gradient, thereby mitigating the vanishing gradient problem during backpropagation. Next, the CNN is constructed to learn image features, and a backpropagation algorithm is used to train the model to obtain parameters. Finally, the model obtained by training is used to predict the test set to obtain a probability matrix. Then, threshold optimization is performed on the probability matrix. Based on the correlation between the labels predicted under different thresholds and the corresponding actual labels, an optimal threshold for each class of labels is determined. Compared with previous methods using a fixed threshold or a fixed number of labels, the threshold optimization method in this study provides more flexible and precise image annotation.
The novelty of this study is as follows: 1. A BN layer is added into the CNN model. To solve the problem of the gradient loss of the bottom network during backpropagation due to an increase in the number of network layers, this study introduces a BN layer between the convolution layer and the activation layer to standardize the data before entering the next layer of the network. This treatment enables the feature values of each layer to fall within the domain where the activation function is sensitive. Thus, even a small change can cause the loss function to produce a great change.
2. An effective threshold optimization algorithm is proposed. To avoid the drawbacks of a possible empty label set for some images due to a fixed threshold or overlabeling or downlabeling caused by the Top k algorithm, this study proposes a threshold optimization algorithm. Using this method, each class of labels is separately analyzed. The optimal threshold for each class is determined based on the correlations of the predicted labels under different thresholds with the authentic labels. Then, annotation is completed with a label whose prediction probability is no less than the optimal threshold. This addition addresses the problems faced by the fixed threshold method and the Top k algorithm, realizes flexible annotation and improves annotation accuracy.

Convolutional neural network
A convolutional neural network (CNN) [22], a kind of feedforward neural network, is essentially a multilayer perceptron. Its origins lie in the research by Hubel and Wiesel [23] in the 1960s on neurons responsible for local sensitivity and direction selection in the cerebral cortex of cats, and breakthroughs were later made by LeCun et al. [24] on the MNIST handwritten digit dataset. The main architecture of CNNs includes an input layer, a convolutional layer, a pooling layer, a fully connected layer and a final output layer. The network is deepened by stacking convolutional layers and pooling layers. The local connection and weight sharing adopted by CNNs remove some connections between neurons, thereby reducing the risk of overfitting. Moreover, parameter sharing among different neurons decreases the number of weights, making the network easier to optimize. A CNN automatically extracts image features through the convolutional operation of the convolutional layer, which reduces information loss compared with artificial feature extraction in the traditional method and has achieved great success in the field of computer vision. The convolutional layer, the core of a CNN, extracts features from the output of the previous layer in the CNN. During this process, multiple convolution kernels are used for convolutional operations to finally obtain multiple feature maps. The convolutional operation is shown in Formula (1) [25]:
$$y_j^i = f\Big(\sum_{m \in M_j} y_m^{i-1} * k_j^i + b_j^i\Big) \qquad (1)$$
where $y_j^i$ represents the feature map output by the $j$th convolutional kernel at the $i$th layer; $M_j$ represents all the feature maps at the $(i-1)$th layer; $k_j^i$ represents a convolutional kernel at the $i$th layer; $b_j^i$ represents the bias corresponding to the $y_j^i$ features at the $i$th layer; $f(\cdot)$ represents the activation function; and $*$ represents the convolution operation.
To obtain as many features as possible, multiple convolutional kernels are used during the convolution process, which inevitably causes information redundancy. To reduce the feature dimension, a pooling operation is adopted after convolution. At present, the commonly used pooling operations include max-pooling and mean-pooling. After the pooling process, the dimensions of the feature maps are reduced. As max-pooling can retain the texture information of the image well, the max-pooling method is adopted in this study.
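The convolution, activation and pooling steps described above can be illustrated with a minimal NumPy sketch. It is only an illustration under stated assumptions: a single-channel input, no padding, a toy averaging kernel, and ReLU as the function f(); the function names are hypothetical and not taken from the study.

```python
import numpy as np

def conv2d_valid(x, k, b=0.0):
    """Single-channel 2-D convolution as used in CNN libraries
    (cross-correlation), without padding, plus a scalar bias b."""
    kh, kw = k.shape
    out_h, out_w = x.shape[0] - kh + 1, x.shape[1] - kw + 1
    out = np.empty((out_h, out_w))
    for r in range(out_h):
        for c in range(out_w):
            out[r, c] = np.sum(x[r:r + kh, c:c + kw] * k) + b
    return out

def relu(z):
    """Rectified linear unit, playing the role of f() in this sketch."""
    return np.maximum(z, 0.0)

def max_pool_2x2(x):
    """Non-overlapping 2x2 max-pooling; odd trailing rows/columns are dropped."""
    h, w = x.shape[0] // 2 * 2, x.shape[1] // 2 * 2
    return x[:h, :w].reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

image = np.arange(36, dtype=float).reshape(6, 6)  # toy 6x6 "image"
kernel = np.ones((3, 3)) / 9.0                    # toy 3x3 averaging kernel
feature_map = relu(conv2d_valid(image, kernel))   # shape (4, 4)
pooled = max_pool_2x2(feature_map)                # shape (2, 2)
```

Pooling halves each spatial dimension here, which is the dimension reduction the text refers to.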
The loss function measures the degree of difference between the predicted value and the real value of the output. For a binary classification problem, the sigmoid activation function and cross-entropy loss function are usually adopted at the output layer to calculate the value of the prediction label. The multilabel annotation problem in this study can be transformed into a binary classification problem on each label. Therefore, the output layer also adopts the sigmoid activation function to calculate the predicted value, and each label is treated independently, free from mutual influence. Accordingly, the binary cross-entropy loss function is used as the loss function of the network model in this study, as in Formula (2) [26]:
$$loss = -\sum_{i=1}^{n}\big[y_i \log \hat{y}_i + (1 - y_i)\log(1 - \hat{y}_i)\big] \qquad (2)$$
where $n$ is the number of labels, $y_i$ is the real value of the $i$th class label, $\hat{y}_i$ is the predicted value of the $i$th class label, and $loss$ is the loss function of a single sample.
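As a small worked example of the loss for one sample, the sketch below applies a sigmoid to per-label scores and evaluates the binary cross-entropy. It assumes the loss is averaged over the n labels, which is the convention of the Keras binary cross-entropy; a summed form differs only by the constant factor n. The clipping constant `eps` and the toy values are assumptions for illustration.

```python
import numpy as np

def sigmoid(z):
    """Map raw output-layer scores to per-label probabilities in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-np.asarray(z, dtype=float)))

def binary_cross_entropy(y_true, y_pred, eps=1e-7):
    """Binary cross-entropy of a single sample, averaged over its labels.
    Predictions are clipped to [eps, 1 - eps] to avoid log(0)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.clip(np.asarray(y_pred, dtype=float), eps, 1.0 - eps)
    per_label = -(y_true * np.log(y_pred) + (1.0 - y_true) * np.log(1.0 - y_pred))
    return per_label.mean()

scores = np.array([2.0, -1.0, 0.5, -3.0, 0.0])  # toy scores for 5 labels
y_true = np.array([1, 0, 1, 0, 0])              # toy ground-truth labels
y_pred = sigmoid(scores)
loss = binary_cross_entropy(y_true, y_pred)
```

Because each label contributes its own term, a confident mistake on one label is penalized regardless of how the other labels are predicted.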

CNN model fusing the threshold optimization
A number of mature CNN networks have been used for feature extraction, including AlexNet [24], VGGNet [27], GoogLeNet [28] and ResNet [29], which have been considered to reach a new height of effectiveness in image classification. Considering that VGGNet has a simpler architecture than GoogLeNet and ResNet but with sufficient width and depth for feature extraction, we use VGG16 as the base for model improvement. The architecture of the CNN adopted in this study is shown in Fig 1. The input layer of the network model consists of images of the same size, all of which are 224×224×3, where 3 represents three channels, i.e., R, G and B. The middle processing layer includes five groups of convolutions, in an imitation of the VGG16 network architecture. After each group of convolutions, a max-pooling is connected. A total of five pooling layers are adopted, and BN is added before each activation function to speed up the convergence. The convolutional kernels used in the five groups of convolutional operations are the classic 3×3 size, and the number of convolutional kernels is 64, 128, 256, 512, and 512. The pooling layer adopts max-pooling, and the pooling windows are all a 2×2 size. A dropout operation is conducted to prevent overfitting, and the probability is set to 0.5. Subsequently, a flattening operation is performed to flatten the data for full connection. In the final output layer, there are two fully connected layers. The first one uses 1024 nodes, and the number of nodes adopted in the last fully connected layer is flexibly designed according to the number of classes designed in the dataset. In the whole network structure, only the activation function of the last output layer adopts the sigmoid activation function, and the other activation functions all adopt a rectified linear unit (ReLU). The optimizer adopts a stochastic gradient descent (SGD). The initial value of the learning rate is 0.005, and the learning rate automatically updates and decreases.
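The feature-map sizes implied by this architecture can be checked with a few lines of arithmetic. The sketch assumes, as in the standard VGG16, that the 3×3 convolutions are padded so that they preserve the spatial size and that each 2×2 max-pooling uses a stride of 2; neither detail is stated explicitly in the text, and the function name is hypothetical.

```python
def vgg_like_shapes(input_hw=224, filters=(64, 128, 256, 512, 512)):
    """Feature-map shape after each of the five conv groups + max-pooling.
    Assumes size-preserving 3x3 convolutions and 2x2 pooling with stride 2."""
    h = w = input_hw
    shapes = []
    for n_filters in filters:
        # The convolution group leaves h and w unchanged (padding assumption);
        # the pooling layer then halves each spatial side.
        h, w = h // 2, w // 2
        shapes.append((h, w, n_filters))
    return shapes

shapes = vgg_like_shapes()
# -> [(112, 112, 64), (56, 56, 128), (28, 28, 256), (14, 14, 512), (7, 7, 512)]

flattened = shapes[-1][0] * shapes[-1][1] * shapes[-1][2]  # inputs to the dense layer
dense_params = flattened * 1024 + 1024                     # weights + biases, 1024 nodes
```

Under these assumptions the flattening step yields 25,088 features, so the first fully connected layer alone holds about 25.7 million parameters, which is one reason dropout is useful at this point in the network.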

Modifications
Accelerated convergence speed based on BN. During the training of a deep neural network, the input distribution of the hidden-layer neurons keeps changing, which prevents the network from learning in a stable way. As the network deepens, the distribution of the inputs to the activation functions gradually drifts toward the saturated regions of the nonlinear function, so the gradient shrinks during backpropagation until it vanishes. Moreover, each time a new batch of data is trained, the network has to relearn the distribution of that batch. Consequently, as the number of layers increases, the gradient gets smaller and training becomes increasingly difficult, with a slower convergence speed. Therefore, this study integrates BN [21] to fix the input distribution of each hidden-layer neuron by standardizing the output of the neurons on the previous layer toward a standard normal distribution. With this step, the activation input values are pulled back from the saturated region of the nonlinear function to its approximately linear region, thereby increasing the derivative value and enhancing the gradient. Meanwhile, to preserve the expressive ability of the network, a further linear transformation is applied before the result is input to the neurons of the next layer. In this way, the drift of the data distribution with increasing depth is addressed, and the distribution of the input data of each layer is fixed so that the activation input values all fall within a region that is sensitive to the input. A small change in the input then produces a large change in the loss function; the increased gradient greatly accelerates convergence and mitigates the vanishing of the gradient in the lower layers of the network during backpropagation.
The purpose of BN is to standardize the data output from the preceding layer of the network. During the standardization process, the mean of each mini-batch of activation values input into the network is calculated first [21]:
$$\mu_B = \frac{1}{m}\sum_{i=1}^{m} x_i$$
Based on the mini-batch mean, the variance of the activation values of the mini-batch is calculated as follows [21]:
$$\sigma_B^2 = \frac{1}{m}\sum_{i=1}^{m}(x_i - \mu_B)^2$$
Based on the obtained mean and variance, the data of the mini-batch are standardized as follows [21]:
$$\hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \varepsilon}}$$
To improve the representation ability of the network, the standardized values are subjected to a linear transformation [21]:
$$y_i = \gamma \hat{x}_i + \beta$$
where $x_i$ is the activation value of a hidden layer neuron before transformation, $m$ is the number of instances in the mini-batch, $\mu_B$ is the mean of the $m$ instances in the mini-batch, $\sigma_B^2$ is the variance of the $m$ instances in the mini-batch, $\hat{x}_i$ is the standardized value obtained by subtracting the mini-batch mean from the original activation $x_i$ and dividing the result by the square root of the variance plus $\varepsilon$, $\varepsilon$ is a small constant added for numerical stability, $\gamma$ and $\beta$ are parameters learned during the training stage, and $y_i$ is the normalized network response.
The specific BN operations are as follows:
Input: Samples $B = \{x_1, \ldots, x_m\}$ to be entered into the activation function, and the parameters to be learned, $\gamma$ and $\beta$;
Output: Normalized network response $y_i$;
Step 1: Calculate the sample mean $\mu_B$;
Step 2: Calculate the sample variance $\sigma_B^2$;
Step 3: Standardize the sample data to obtain $\hat{x}_i$;
Step 4: Continuously iterate and train the parameters $\gamma$ and $\beta$, and obtain the output $y_i$ through the linear transformation with $\gamma$ and $\beta$.
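The four steps above can be sketched as a training-time forward pass in NumPy. This is a minimal illustration rather than the implementation used in the study: the parameters γ and β are held fixed instead of being learned, the value of ε is an assumed default, and the mini-batch is random toy data.

```python
import numpy as np

def batch_norm_train(x, gamma, beta, eps=1e-5):
    """BN forward pass on one mini-batch x of shape (m, n_neurons)."""
    mu = x.mean(axis=0)                     # Step 1: mini-batch mean
    var = x.var(axis=0)                     # Step 2: mini-batch variance (1/m form)
    x_hat = (x - mu) / np.sqrt(var + eps)   # Step 3: standardize
    y = gamma * x_hat + beta                # Step 4: scale and shift
    return y, mu, var

rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=3.0, size=(32, 4))  # toy batch: m = 32, 4 neurons
y, mu, var = batch_norm_train(x, gamma=np.ones(4), beta=np.zeros(4))
```

With γ = 1 and β = 0, each column of `y` has approximately zero mean and unit variance regardless of the scale of `x`, which is the stabilizing effect on the activation inputs described in the text.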
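After standardization, the activation x of a neuron approximately follows a normal distribution and falls within the approximately linear region of the nonlinear activation function, thereby increasing the derivative value, improving gradient flow during backpropagation and accelerating network convergence. Furthermore, to improve the representation ability of the network, a linear transformation is again performed. In the network test stage, the BN for the test samples is as follows [21]:
$$y = \gamma\,\frac{x - \mathrm{E}[x]}{\sqrt{\mathrm{Var}[x] + \varepsilon}} + \beta$$
where $\mathrm{E}[x]$ and $\mathrm{Var}[x]$ are the mean and variance estimated from the training mini-batches rather than computed from the test samples.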

Optimal threshold set by fusing threshold optimization
Previously, a fixed threshold (e.g., 0.5) or Top k (e.g., k = 5) was normally used to determine the label assignment for automatic image annotation. A serious drawback of using a fixed threshold is that it can result in an empty label set for some images when the prediction probabilities of the labels are all lower than the preset threshold. On the other hand, Top k does not consider the differences between images in content and semantics; that is, it assigns the same number of labels to all images, which may lead to overlabeling or downlabeling. Targeting these fixed threshold and Top k problems, a threshold optimization algorithm is used in this study to set an optimal threshold for each class of labels. The CNN is used to predict the images in the test set and obtain an array of probabilities. Each array element is the prediction probability of one class of label for one image. It is necessary to set an optimal threshold for each class of label to determine whether this label is assigned to the image. The threshold optimization algorithm designed in this study is described as follows:
Input: The label probability array of the test set predicted by the model and the real label array y_test;
Output: The optimal threshold array best_threshold;
Step 1: Define the array of candidate thresholds, threshold;
Step 2: Read out the i-th column (i starts from 1) of the probability array, which holds the prediction probabilities of the i-th label for all images in the test set, and denote it y_prob;
Step 3: Take the j-th value in threshold (j starts from 1) and compare it with the elements of y_prob in turn. If an element is greater than or equal to the j-th threshold, it is set to 1; otherwise, it is set to 0. The result is recorded as y_pred;
Step 4: Compute the Matthews correlation coefficient between the predicted labels y_pred and the real labels y_test[:,i], and store the result in a;
Step 5: Repeat Steps 3 and 4 until all values of j are traversed;
Step 6: Find the index of the maximum value in a;
Step 7: Take the value of threshold at this index as best_threshold[i], the threshold of the i-th class label;
Step 8: Repeat Steps 2-7 until all labels are traversed.
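The per-class search described above can be sketched in Python as follows. The names y_prob, y_test, threshold and best_threshold follow the text; everything else is an assumption made for illustration: the candidate grid 0.1 to 0.9, the toy arrays, the hand-written Matthews function, and the convention of returning 0 when the coefficient is undefined.

```python
import numpy as np

def matthews_corrcoef(y_true, y_pred):
    """Matthews correlation coefficient for binary labels.
    Returns 0.0 when the denominator is zero (assumed convention)."""
    y_true = np.asarray(y_true).astype(bool)
    y_pred = np.asarray(y_pred).astype(bool)
    tp = float(np.sum(y_true & y_pred))
    tn = float(np.sum(~y_true & ~y_pred))
    fp = float(np.sum(~y_true & y_pred))
    fn = float(np.sum(y_true & ~y_pred))
    denom = np.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom

def optimize_thresholds(y_prob, y_test, threshold):
    """For each label column, keep the candidate threshold with the highest MCC."""
    n_labels = y_prob.shape[1]
    best_threshold = np.empty(n_labels)
    for i in range(n_labels):                    # loop over label classes
        a = [matthews_corrcoef(y_test[:, i], y_prob[:, i] >= t)
             for t in threshold]                 # one MCC score per candidate
        best_threshold[i] = threshold[int(np.argmax(a))]
    return best_threshold

threshold = np.arange(1, 10) / 10.0              # hypothetical grid 0.1, ..., 0.9
y_prob = np.array([[0.90, 0.20],                 # toy probabilities: 4 images, 2 labels
                   [0.80, 0.35],
                   [0.30, 0.40],
                   [0.10, 0.05]])
y_test = np.array([[1, 0], [1, 1], [0, 1], [0, 0]])

best_threshold = optimize_thresholds(y_prob, y_test, threshold)
y_pred = (y_prob >= best_threshold).astype(int)  # final per-class decisions
```

In this toy case the second label never exceeds a probability of 0.5, so a fixed threshold of 0.5 would never assign it, whereas the per-class search selects a lower threshold for that label and recovers the true labels.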
It is worth mentioning that the Matthews correlation coefficient is used in Step 4. Compared with other correlation coefficients, such as Pearson's correlation coefficient, the Matthews correlation coefficient takes into account the true and false positives and negatives of the label. In addition, this coefficient remains informative when the classes in a dataset are imbalanced and therefore serves as a balanced criterion of measurement. It is regarded as one of the best correlation assessment measures in the field of image classification.
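For reference, the standard definition of the Matthews correlation coefficient for one label class is given below, where TP, TN, FP and FN denote the numbers of true positives, true negatives, false positives and false negatives obtained at a given threshold. The notation is the conventional one and is not taken from the study.

```latex
\mathrm{MCC} = \frac{TP \cdot TN - FP \cdot FN}
                    {\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}
```

The coefficient ranges from −1 (total disagreement) through 0 (no better than chance) to +1 (perfect agreement), and all four cells of the confusion matrix enter the formula.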
The basic flow chart of threshold optimization is shown in Fig 2.

Multilabel image annotation framework
The multilabel automatic image annotation framework designed in this study is shown in Fig 3. First, the training dataset is input into the CNN for training to obtain a labeling model. In this study, BN is added before the activation layer of the CNN to accelerate the convergence speed. Second, the trained model is used to predict the test dataset to obtain the prediction probability for each label. Then, the threshold optimization algorithm is applied to the prediction probabilities to obtain the optimal threshold for each class of labels, which, to a certain extent, solves the problem of overlabeling or downlabeling caused by a fixed number of labels. Finally, the labels whose prediction probability is greater than or equal to the corresponding optimal threshold are assigned to the image to obtain the final annotation results.
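The final annotation step can be sketched as below, contrasting per-class thresholds with a fixed threshold and with Top k. The label names, probabilities and threshold values are hypothetical and serve only to illustrate the behavior; the function names are not from the study.

```python
import numpy as np

def annotate(probabilities, thresholds, label_names):
    """Assign every label whose probability reaches its own threshold."""
    keep = np.asarray(probabilities) >= np.asarray(thresholds)
    return [name for name, k in zip(label_names, keep) if k]

def annotate_top_k(probabilities, label_names, k):
    """Assign the k most probable labels, regardless of their probabilities."""
    order = np.argsort(probabilities)[::-1][:k]
    return [label_names[i] for i in order]

label_names = ["desert", "mountains", "sea", "sunset", "trees"]  # hypothetical
probabilities = np.array([0.05, 0.42, 0.38, 0.10, 0.02])         # one toy image
best_threshold = np.array([0.50, 0.40, 0.35, 0.45, 0.30])        # hypothetical

per_class = annotate(probabilities, best_threshold, label_names)
fixed = annotate(probabilities, np.full(5, 0.5), label_names)
top3 = annotate_top_k(probabilities, label_names, k=3)
```

For this image the fixed threshold of 0.5 yields an empty label set, Top 3 adds a low-probability third label, and the per-class thresholds return the two labels with moderately high probabilities.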

Experimental data
To verify the effectiveness of the CNN-THOP proposed in this study for image annotation, we use free, publicly available datasets: MIML [30] on natural scenes provided by Learning and Mining from Data (LAMDA) of Nanjing University, COREL5K [31] collated by the Corel Company and MSRC [32] from Microsoft Research Cambridge. Details of the datasets are shown in Table 1.

Experimental design
In this study, we conduct an image annotation simulation experiment based on the deep learning library Keras. To objectively evaluate the experimental results, we use a variety of evaluation indexes, including the average precision (AP), average recall rate (AR) and F1. Meanwhile, to describe the annotation effect more accurately, a new evaluation index, the complete matching degree (CMD), is defined in this study as an additional evaluation standard: for each test image, the predicted labels are checked against the real labels of the image, and the CMD is the degree to which the predicted labels are completely consistent with the real labels.
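One possible reading of these indexes is sketched below. The exact formulas are not given in the text, so the sketch assumes the definitions common in the image annotation literature: AP and AR are the per-label precision and recall averaged over labels, F1 is their harmonic mean, and CMD is the fraction of test images whose predicted label set matches the real label set exactly. The toy arrays are made up for illustration.

```python
import numpy as np

def annotation_metrics(y_true, y_pred):
    """Return (AP, AR, F1, CMD) for binary label matrices of shape (images, labels)."""
    y_true = np.asarray(y_true).astype(bool)
    y_pred = np.asarray(y_pred).astype(bool)
    tp = np.sum(y_true & y_pred, axis=0).astype(float)  # correct labels per class
    n_pred = np.sum(y_pred, axis=0).astype(float)       # predicted per class
    n_true = np.sum(y_true, axis=0).astype(float)       # real per class
    # Classes that are never predicted (or never present) score 0 here.
    precision = np.divide(tp, n_pred, out=np.zeros_like(tp), where=n_pred > 0)
    recall = np.divide(tp, n_true, out=np.zeros_like(tp), where=n_true > 0)
    ap, ar = precision.mean(), recall.mean()
    f1 = 0.0 if ap + ar == 0 else 2 * ap * ar / (ap + ar)
    cmd = np.mean(np.all(y_true == y_pred, axis=1))     # exact-match ratio
    return ap, ar, f1, cmd

y_true = np.array([[1, 0, 1], [0, 1, 0], [1, 1, 0], [0, 0, 1]])  # toy real labels
y_pred = np.array([[1, 0, 1], [0, 1, 1], [1, 0, 0], [0, 0, 1]])  # toy predictions
ap, ar, f1, cmd = annotation_metrics(y_true, y_pred)
```

In this toy case two of the four images are matched exactly, so the CMD is 0.5 even though the per-label precision and recall are both above 0.8, which shows why CMD is a stricter criterion.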

Experimental verification of BN operation
To verify that adding BN accelerates convergence, we use the MIML dataset to train the network architecture with BN and the architecture without BN. The variation in accuracy with the number of iterations is shown in Fig 4. Due to the differences in training duration, we conduct 200 iterations for the network with BN and 100 iterations for the network without BN to illustrate the role of BN. The abscissa in Fig 4(A) and 4(B) is Epoch, and the ordinate is Accuracy. As shown in Fig 4(A), without BN, the accuracy reaches only 75% after 20 iterations and reaches 80% after approximately 40 iterations. As shown in Fig 4(B), with BN, only 5 iterations are required for the accuracy to reach 75% and approximately 15 iterations to reach 80%, indicating that BN can greatly accelerate the convergence speed.

Fig 2. Flow chart of threshold optimization.

PLOS ONE

Verification of the optimal threshold
Given the relatively large number of label classes in the COREL5K and MSRC datasets and the numerous thresholds involved, it is inconvenient to display them one by one, so an ellipsis is used here. For the MIML dataset, as only 5 classes of labels are involved, all of them are displayed. The three datasets are used to verify the effectiveness of the CNN-THOP proposed in this study in setting the optimal threshold for each class of labels. To find the optimal thresholds, experiments are carried out to compare several groups of thresholds. Each group of thresholds in Table 2 comprises the optimal thresholds corresponding to the model with the same number of iterations. The best group is identified by loading the optimal thresholds corresponding to each model together with the optimal model. As shown in Table 2, the newly added evaluation index CMD makes the evaluation of the experimental results more rigorous. First, the overall analysis of the experimental results for each group of thresholds reveals that the differences are not large, indicating that the optimal model obtained in training performs well. Therefore, this CNN model can be considered suitable for the automatic annotation of multilabel images in these datasets. Second, although the differences between the results obtained for each group of thresholds are not large, the thresholds marked with an asterisk in Table 2 yield the highest CMD and are therefore selected as the best thresholds for each dataset. With these thresholds, the CMD reaches 64.8%, 41.2% and 58.6% on the three datasets in this experiment, which again indicates the effectiveness of the optimal thresholds.

Comparison with other CNNs
In this study, we use a modified VGG16 as the image feature extractor. We investigate the influence of different CNN architectures on the experimental outcomes based on three datasets. The results are summarized in Table 3. As shown in Table 3, on datasets with an appropriate size, the experimental results improve as the network architecture deepens. This is because deeper-level architecture can extract higher-level features. On MSRC, however, a deeper-level architecture (ResNet101) does not perform better. This is possibly because the scale of the dataset is limited, whereas the network layers are too deep, which leads to overfitting. Therefore, we use VGG16 for modification to construct the network architecture.

Comparison with other image annotation methods
Since it is inconvenient to fully display the hundreds of classes involved in the three datasets, we randomly extract some classes from these three datasets and conduct experiments to compare the annotation precision of the CNN-THOP proposed in this study with that of the traditional multiple Bernoulli relevance model [33] (MBRM) and spatial spectrum kernel [34] (SSK) combined with context-based keyword propagation [35] (CBKP), as well as the deep learning methods CNN-AT and E2E-DCNN. The results are shown in Table 4. As shown in Table 4, compared with the other four algorithms, the CNN-THOP method proposed in this study exhibits a noticeably higher annotation precision for most classes in the three datasets. For some classes with obvious features, such as desert and sunset in the MIML dataset, flower, bird and people in the COREL5K dataset, and tree and car in the MSRC dataset, the annotation precision is noticeably higher than that of other classes. For some classes, such as mountain, sea, beach, dinosaur, and cliff, the annotation precision is low due to the existence of similar features. For example, it is difficult to distinguish the two classes cliff and mountain, and there exists a semantic gap between sea and beach owing to similar features. The annotation precision for the dog class in the MSRC dataset is the lowest, reaching only 44.4%, which may be attributed to the small number of images in the dog class and insufficient learning of the features of this class.
In addition, we randomly select 2,000 pictures in 20 classes from the three datasets, i.e., MIML, COREL5K and MSRC, to constitute a new dataset for an experimental comparison of a single class accuracy, as shown in Fig 5.

As shown in Fig 5, the annotation precision based on the deep learning method is superior to that of the traditional algorithm, showing that the features extracted by the CNN are more comprehensive and approximate to people's semantic understanding of images compared to those features extracted artificially. Compared with the E2E-DCNN and CNN-AT, the CNN-THOP proposed in this study improved the annotation precision for images of a single class. First, the CNN-THOP is based on the VGG16 model. By setting parameters and adjusting the network structure, the model structure is more applicable to the dataset in this study. Second, the model, by fusing the threshold optimization, sets an optimal threshold for each class to avoid omissions during the annotation process. Therefore, the annotation precision is significantly improved.
To verify the effectiveness of the CNN-THOP in automatic image labeling, we compare it with traditional methods, including the multiple Bernoulli relevance model (MBRM) and the spatial spectrum kernel (SSK) combined with context-based keyword propagation (CBKP), denoted SSK+CBKP, as well as deep learning methods commonly used in recent years, such as the convolutional neural network with adaptive thresholding (CNN-AT) and the end-to-end automatic image annotation model based on a deep convolutional neural network (E2E-DCNN). The experimental results are shown in Table 5.
As shown in Table 5, the CNN solutions are more suitable for multilabel annotation than the traditional machine learning methods, as the CNNs achieve noticeably better results on the investigated indexes. In the natural scene image MIML dataset, the average precision of the CNN-THOP is improved by 27%, 18%, 7%, 7%, 4% and 2% compared with that of the other six methods. In the COREL5K dataset, the recall is improved by 33%, 25%, 21%, 15%, 3% and 2%.

Compared with the MBRM model, the average precision and average recall of the CNN-THOP are improved by 34% and 33%, respectively, in the MSRC dataset, and its average recall is improved by 3% compared with the E2E-DCNN. Overall, the proposed method achieves its best performance on the MIML dataset. This is because this dataset involves fewer label classes than the other datasets, which reduces the complexity. In contrast, the number of label categories in COREL5K reaches 260, including a large number of low-frequency labels. Therefore, the effect of the proposed method on this dataset is not satisfactory.
To further validate the effectiveness of the proposed CNN-THOP, pairwise t-tests are performed to assess the precision of the different methods over the three datasets. The results are shown in Table 6.
As shown in Table 6, the CNN-THOP significantly improves the annotation precision for the three datasets compared with the MBRM, SSK+CBKP and the method used in the literature [38] (p<0.05). Although the annotation precision of the CNN-THOP does not show a significant difference compared with the CNN-AT, CNN-ECC and E2E-DCNN, the precision increases by 11%, 9% and 6%, respectively. These findings indicate that the proposed method in this study is effective for multilabel image annotation.
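As an illustration of the kind of test reported in Table 6, the following minimal sketch computes a paired t statistic for two methods evaluated on the same items. It assumes that the "pairwise t-test" is a standard paired t-test on matched precision values; the pairing unit, the function name `paired_t_statistic`, and the numbers are hypothetical and are not taken from the tables in this study.

```python
import math
from statistics import mean, stdev


def paired_t_statistic(a, b):
    """Return (t, degrees of freedom) for a paired t-test on matched samples."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two equally sized samples with at least two pairs")
    diffs = [x - y for x, y in zip(a, b)]
    sd = stdev(diffs)  # sample standard deviation (n - 1 in the denominator)
    if sd == 0:
        raise ValueError("all differences are identical; t is undefined")
    n = len(diffs)
    return mean(diffs) / (sd / math.sqrt(n)), n - 1


# Made-up precision values for illustration only (not results from this study).
method_a = [0.82, 0.75, 0.68, 0.90, 0.71]
method_b = [0.60, 0.58, 0.49, 0.72, 0.50]

t, df = paired_t_statistic(method_a, method_b)

# Two-sided critical value of Student's t for alpha = 0.05 and df = 4.
T_CRIT_DF4 = 2.776
significant = abs(t) > T_CRIT_DF4
print(f"t = {t:.2f}, df = {df}, significant at 0.05: {significant}")
```

Comparing |t| with a tabulated critical value avoids any dependency beyond the standard library; a statistics package would normally report the p value directly.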
In addition, we mix MIML, COREL5K and MSRC to form a large dataset containing 7,591 images in 287 classes and conduct an experimental comparison in terms of average precision, average recall, F1 value and CMD, as shown in Fig 6. As shown in Fig 6, the annotation precisions of the CNN-AT, the E2E-DCNN, and the CNN-THOP proposed in this study are much higher than those of the traditional methods, i.e., the MBRM and SSK+CBKP, indicating that deep learning methods currently outperform traditional machine learning methods in image labeling. This once again demonstrates that CNNs are superior to manual methods for feature extraction. Of the four evaluation indexes, CMD is the lowest because it is the most rigorous: it counts only annotations in which neither excessive labeling nor labeling omissions occur. The CMD achieved therefore reflects accurate labeling and verifies the effectiveness of the method in this study once again.
As shown in Fig 7, the method proposed in this study is more effective than other methods in automatic image annotation, as most images are annotated with completely correct labels, without redundant or missing labels. Compared with other methods, this method solves the problem caused by a fixed number of labels. The remaining cases of individual wrong or missing labels may arise because the number of images for the class in question is too small for the model to fully learn its features. The labeling results of the traditional MBRM and the deep learning E2E-DCNN clearly show numerous cases of overlabeling or downlabeling caused by the fixed number of labels, making it difficult to achieve precise annotation. Overall, the method in this study is more accurate and effective at automatic image annotation.
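To make concrete why CMD is stricter than precision and recall, the following minimal sketch computes the three indexes on a toy example. It assumes that CMD is the fraction of images whose predicted label set equals the ground-truth set exactly, and that average precision and recall are per-label values averaged over labels; these definitions, the function names, and the toy labels are assumptions made for illustration.

```python
from statistics import mean


def complete_matching_degree(predicted, truth):
    """Fraction of images whose predicted label set equals the true set exactly."""
    if len(predicted) != len(truth) or not truth:
        raise ValueError("need two non-empty lists of equal length")
    exact = sum(1 for p, t in zip(predicted, truth) if set(p) == set(t))
    return exact / len(truth)


def per_label_precision_recall(predicted, truth):
    """Per-label precision and recall, each averaged over the labels it is defined for."""
    labels = set().union(*map(set, truth), *map(set, predicted))
    precisions, recalls = [], []
    for label in labels:
        tp = sum(1 for p, t in zip(predicted, truth) if label in p and label in t)
        n_pred = sum(1 for p in predicted if label in p)
        n_true = sum(1 for t in truth if label in t)
        if n_pred:  # precision is undefined for labels that are never predicted
            precisions.append(tp / n_pred)
        if n_true:  # recall is undefined for labels that never occur
            recalls.append(tp / n_true)
    return mean(precisions), mean(recalls)


# Toy annotations (hypothetical): one exact match, one extra label, one missing label.
truth = [{"sea", "beach"}, {"tree"}, {"car", "tree", "people"}]
predicted = [{"sea", "beach"}, {"tree", "car"}, {"car", "tree"}]

cmd = complete_matching_degree(predicted, truth)
avg_precision, avg_recall = per_label_precision_recall(predicted, truth)
print(f"precision={avg_precision:.3f} recall={avg_recall:.3f} CMD={cmd:.3f}")
```

In this toy case a single extra or missing label leaves precision and recall high but removes the image from the CMD count, which is consistent with CMD being the lowest of the reported indexes.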

Conclusions
To address the problems caused by assigning a fixed number of labels according to a ranking function during multilabel image annotation, we propose in this study the application of a CNN-THOP for image annotation. First, a CNN model is used to predict the probability for each class of labels. Given the merits of the VGG16 network architecture, the CNN structure in this study is an improved version of VGG16. BN added within the CNN significantly accelerates the convergence speed, and the network structure and parameters are adjusted to make them more suitable for the datasets in this study. Next, a threshold optimization algorithm finds an optimal threshold for each class of labels. Finally, a class of labels is assigned only when its prediction probability is greater than or equal to the corresponding optimal threshold. The setting of the optimal threshold solves the problem of overlabeling or downlabeling caused by a fixed number of labels, making labeling more rational and effective. The experimental results for three public datasets (MIML, COREL5K and MSRC) indicate that the average precision of the CNN-THOP is 80.7%, 52.7% and 76.1%, respectively, and the average recall is 77.6%, 58.3% and 78.3%, respectively. The F1 value reaches 78.7%, 55.3% and 77.1%, respectively, and the CMD reaches 64.8%, 41.2% and 58.6%, respectively. Compared with other methods, the evaluation indexes in this study are greatly improved, demonstrating that the method proposed in this study can be used for effective image annotation. The deficiency of this study is that the improved VGG16 model is a shallow CNN and cannot fully extract higher-level image features, so the accuracy of its prediction probabilities declines, resulting in deviations in the subsequent threshold optimization. Future work will address two aspects. 1) In terms of feature extraction, manual feature extraction and convolution operations will be combined to improve the feature extraction and avoid omissions. We will further deepen the CNN structure in this study and extract higher-level features with reference to the characteristics of VGG16. 2) We will improve the threshold optimization algorithm so that it finds the optimal threshold for each class of labels more reasonably and efficiently.
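The thresholding rule summarized in these conclusions can be sketched as follows. The search procedure is an assumption made for illustration: it is a simple grid search that picks, for each class, the candidate threshold with the highest F1 on held-out prediction probabilities, which is not necessarily the threshold optimization algorithm used in this study. All names (`search_thresholds`, `annotate`) and all numbers are hypothetical.

```python
def f1_at_threshold(probs, targets, threshold):
    """F1 of one class when a label is assigned if probability >= threshold."""
    tp = sum(1 for p, y in zip(probs, targets) if p >= threshold and y == 1)
    fp = sum(1 for p, y in zip(probs, targets) if p >= threshold and y == 0)
    fn = sum(1 for p, y in zip(probs, targets) if p < threshold and y == 1)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def search_thresholds(prob_rows, target_rows, candidates):
    """Assumed search: per class, keep the candidate threshold with the best F1."""
    thresholds = []
    for c in range(len(prob_rows[0])):
        probs = [row[c] for row in prob_rows]
        targets = [row[c] for row in target_rows]
        # max() returns the first best candidate, i.e. the lowest threshold on ties.
        thresholds.append(
            max(candidates, key=lambda t: f1_at_threshold(probs, targets, t))
        )
    return thresholds


def annotate(prob_row, thresholds, class_names):
    """Assign every class whose probability is >= its own threshold."""
    return [n for p, t, n in zip(prob_row, thresholds, class_names) if p >= t]


# Hypothetical held-out probabilities and ground truth (rows: images, columns: classes).
class_names = ["sea", "beach", "mountain"]
val_probs = [
    [0.90, 0.40, 0.10],
    [0.70, 0.35, 0.20],
    [0.20, 0.30, 0.80],
    [0.10, 0.05, 0.60],
]
val_targets = [
    [1, 1, 0],
    [1, 1, 0],
    [0, 0, 1],
    [0, 0, 1],
]
candidates = [i / 20 for i in range(1, 20)]  # 0.05, 0.10, ..., 0.95

thresholds = search_thresholds(val_probs, val_targets, candidates)
print(thresholds)
print(annotate([0.85, 0.38, 0.22], thresholds, class_names))
print(annotate([0.05, 0.10, 0.90], thresholds, class_names))
```

Because each class has its own threshold, the number of labels assigned to an image varies with its content: the first test image receives two labels and the second receives one, whereas a ranking function with a fixed label count would assign the same number to both.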