CM-supplement network model for reducing the memory consumption during multilabel image annotation

With the rapid development of the Internet and the increasing popularity of mobile devices, the availability of digital image resources is increasing exponentially. How to rapidly and effectively retrieve and organize image information has been a hot issue that urgently must be solved. In the field of image retrieval, image auto-annotation remains a basic and challenging task. Targeting the drawbacks of the low accuracy rate and high memory resource consumption of current multilabel annotation methods, this study proposed a CM-supplement network model. This model combines the merits of cavity convolutions, Inception modules and a supplement network. The replacement of common convolutions with cavity convolutions enlarged the receptive field without increasing the number of parameters. The incorporation of Inception modules enables the model to extract image features at different scales with less memory consumption than before. The adoption of the supplement network enables the model to obtain the negative features of images. After 100 training iterations on the PASCAL VOC 2012 dataset, the proposed model achieved an overall annotation accuracy rate of 94.5%, which increased by 10.0 and 1.1 percentage points compared with the traditional convolution neural network (CNN) and double-channel CNN (DCCNN). After stabilization, this model achieved an accuracy of up to 96.4%. Moreover, the number of parameters in the DCCNN was more than 1.5 times that of the CM-supplement network. Without increasing the amount of memory resources consumed, the proposed CM-supplement network can achieve comparable or even better annotation effects than a DCCNN.


Introduction
As multimedia technology rapidly develops and image acquisition devices become increasingly convenient, digital image resources have increased exponentially. How to rapidly retrieve objects of users' interest from a large number of images has become an important research direction in the field of image processing. By automatically annotating the keywords reflecting semantic content, the auto-annotation technique narrows the gap between the low-level visual features of the image and the high-level semantic annotations [1]. It enhances the efficiency and accuracy of image retrieval and therefore gains broad application prospects in the fields of image and video retrieval, scene understanding and human-computer interaction [2]. Nevertheless, image auto-annotation remains a challenging topic due to the existence of "semantic gaps", although it has long been a hot spot of research in the field of computer vision. As an image consists of complex semantic information, a single annotation often fails during the labeling task. Therefore, the adoption of a multilabel annotation approach becomes necessary. The labels in multilabel auto-annotation contain most of the information contained the annotated image. Thus, the image can be rapidly identified. In addition, this approach satisfies the retrieval requirement for images on the Internet with large-scale semantic information [3].
To date, scholars have proposed a number of models for image auto-annotation. These models can be roughly divided into three categories: generative models, nearest models and discriminative models. With generative models, the visual information of the image, such as colors, shapes, textures and spatial relations, is extracted. Then, the joint probability distribution of the visual features and tags or the conditional probability distribution of different tags is calculated, based on which the tags are scored for complete annotation [4]. The multiple Bernoulli relevance model and cross-media relevance model are typical examples of this category. [5,6] Later, Moran et al. [7] proposed a modified sparse kernel learning continuous relevance model (SKL-CRM), which enhanced the performance of image annotation by learning the optimum combination among feature kernels. The fuzzy cross-media relevance model (FCRM) utilizes nonparametric Gaussian kernels to perform continuous estimation of the feature generation probability [8], which further improves the annotation accuracy. Although generative models involve a relatively simple annotation process, the gap between the lowlevel features and high-level semantics of the image, as well as semantic dependence, often leads to inaccurate joint probabilities [9]. In recent years, with the increase in training data, the nearest models have gained increasing popularity. In this category, image annotation is treated as a retrieval task, and the basic idea of the tag propagation mechanism is to find images that resemble the test image and then to annotate the test image with the tags corresponding to the resembling images [10]. The nearest models include the joint equal contribution (JEC) [11], tag propagation metric learning (TagProp_ML) [4] and two-pass K-nearest neighbor (2PKNN) [12] as typical examples. In particular, the 2PKNN Metric Learning (2PKNN_ML) method has been considered an advanced and representative method of nearest models in recent years. With the 2PKNN_ML method, after the semantic neighbor images of the test image are found, the distance weights between features are optimized via metric learning [13]. However, the nearest models suffer from the loss of much valuable information during image visual feature extraction, which is very likely to result in an unsatisfactory annotation effect [14]. Discriminative models treat each tag as a class and consider image annotation as a multiclassification task. Based on multiple classifier training, the models classify the test image into a class to which a certain tag belongs [15]. Typical examples of nearest models include the Knearest neighbor (KNN) [16], K-means clustering [17], support vector machine (SVM) [18] algorithms, as well as the modified versions of these methods. However, all these methods are restricted and influenced by the number and training effect of the classifiers, particularly under the condition of imbalanced training samples, where the training effect of the classifiers corresponding to low-frequency tags is unsatisfactory, which in turn affects the total annotation accuracy; in addition, with the increase in the number of class tags, the number of required classifiers also increases, which makes the annotation model more complex [19].
With the continuous development of deep learning in recent years, convolutional neural networks (CNNs) have been extensively applied in the field of computer vision [20]. In 2012, Hinton et al used a multilayer CNN for image classification of the ImageNet dataset [21], and an impressive recognition effect was achieved [22]. Later, a large number of research projects focused on modifying CNNs in terms of structure and performance. For instance, GoogLeNet, developed by Google, was the champion in the 2014 large-scale image recognition competition [23], and in the ImageNet 1000 Challenge, the deep-level CNN-based computer vision system developed by the Visual Computing Group at Microsoft Research Asia outperformed humans in terms of object recognition and classification for the first time [24]. Furthermore, deep-level CNN-based discriminative models have also made certain achievements in multilabel image auto-annotation. Based on a CNN, Li et al. designed a softmax regression-based multilabel ranking loss function network, which greatly improved the auto-annotation effect compared with traditional methods [25]. Murthy et al. [26] proposed the linear regression-based CNNregression (CNN-R) method, which optimized the model parameters via backpropagation (BP). Although these methods achieved improvement in image auto-annotation compared with traditional methods, most of them targeted improving the network itself or on singlelabel learning, seldom focusing on the application and improvement in multilabel learningbased image auto-annotation. Targeting the issue of low annotation accuracy of low-frequency tags caused by training sample imbalance in image semantic annotation, Cao et al. [27] designed a double-channel convolutional neural network (DCCNN) in 2019. However, due to the requirement of the simultaneous operation of the two channels during testing, great memory usage occurs.
Based on the aforementioned methods, the current study proposes a CM-supplement CNN, in which the characteristics of multilabel image annotation and the influence of training sample imbalance on image annotation were both taken into consideration. The proposed CNN possesses the following advantages: 1) The cavity convolution contained in the model expands the receptive field for feature extraction but without increasing the calculation burden and memory overhead; 2) Inception modules are incorporated into the model, which not only deepens the network but also increases the network width. These changes help extract features at different scales, strengthen the network robustness, and increase the network calculation speed along with a reduced memory overhead. 3) the supplement CNN-based structure can effectively solve the problem of the influence of the discarded negative information on the classification effect in traditional CNNs during image classification.

Supplement CNN
The supplement CNN is a modified version of the traditional CNN [28]; its structure is shown in Fig 1. As shown in Fig 1, the modifications of the current model mainly focus on the convolutional layer. The positive features extracted by the convolutional layer are reversed to obtain the negative features of the image, and then the positive and negative features are fused as input for the next layer. Compared with traditional CNNs, the current model maintains the negative information, which endows it with the ability to learn more features, thereby improving the image annotation and recognition accuracy.
Apart from the abovementioned difference, the supplement network also differs from the traditional CNN in the selection of the activation function. Normally, traditional CNN uses the ReLU function as the activation function. However, the ReLU function cannot maintain the negative information of the image, which can be observed from its expression. Instead, the supplement network uses ELU [29] as the activation function.
The ELU activation function (an exponential linear unit) is a correction function of ReLU. It keeps the negative information of the image by adding a nonzero output to a negative input, and its expression is as follows: ( where x represents the feature map of the image. According to Eq (1), the ELU function contains a negative exponential term, which can prevent the appearance of silent neurons, thereby enhancing the overall learning efficiency of the network.

CM-supplement CNN
Network structure design. The structure of the proposed CM-supplement network in this study is shown in Fig 2. The proposed model is a seven-layer structure with four convolutional layers and three fully connected layers. The detailed information is as follows.
In layers 1 and 2, cavity convolutions with an expansion coefficient of 2 are used as the convolution kernels. This treatment can expand the receptive field, without increasing the network memory overhead and calculation burden, to extract more image features. For example, when the expansion coefficient of a convolutional kernel with a size of 3 � 3 is set to 2, its receptive field will change from 3 � 3 to 7 � 7, but without increasing the number of the parameters. Furthermore, pooling layers do not need to be set for a network containing cavity convolutions because the function of pooling layers is just to expand the receptive field for feature extraction. In addition, some feature information can be lost during receptive field expansion by pooling layers, while this drawback can be well remedied by cavity convolutions.
In addition, in both layer 1 and layer 2, a supplement model structure is used to obtain the negative features of the image for their transfer into the next layer. The implementation of the supplement model is simple and does not need extra parameters to be added (and only involves the reversal operation of the features extracted by the cavity convolution followed by the fusion between the reversed features and the original ones). Therefore, the supplement model was used as a component of the CM-supplement model in this study.
In layer 3, a common CNN is used for the transition from the fusion structure of the cavity convolution with the supplement structure to the Inception module. Furthermore, its use can reduce the calculation burden by reducing the dimensions before the features enter the Inception module.
The fourth layer contains Inception modules. The reason for the incorporation of the Inception module is that it possesses the capability to extract image features at different scales. Furthermore, the number of parameters and computational load of the module are fewer than those of commonly used CNN convolutions. In this study, the dataset for training was PASCAL VOC 2012. Considering that the amount of training data was not large, the Inception v1 module was utilized for its simple structure with a small number of parameters.
Fully connected layers constitute layers 5-7. According to reference [27], a dropout layer was added after layers 5 and 6 to prevent overfitting.
Principles of the modified algorithm. Image annotation consists of two key steps: image feature extraction and the establishment of the mapping relation between the features and tags. In essence, the CM-supplement network designed in this study is subject to modifications within the scope of feature extraction: The cavity convolution, Inception v1 structure and supplement structure are combined for the extraction of finer multiscale features to enhance the accuracy of image annotation. These modifications are embodied in the forward propagation process of the image input model during network training as well as in the network parameter adjustment process according to the difference between the network-predicted outcome and the actual image tag, i.e., the backpropagation process.
(1) Forward propagation process The forward propagation of the CNN model is primarily accomplished by the operation of the convolutional layer and pooling layer. Suppose that the first layer is a convolutional layer; its calculation formula is as follows: where M j represents the input image dataset, x l j represents the jth feature map of the ith layer, � denotes the convolution operation, x lÀ 1 i is the ith feature map of the (l-1)st layer, k l ij represents the convolution kernel that connects the jth feature of the lth layer with the ith feature of the (l-1)st layer, b l j represents the bias, and f(�)represents the nonlinear activation function of the neuron.
Compared with the forward propagation process of the traditional CNN structure, the adoption of the supplement structure enables the CM-supplement network to fuse the positive and negative features of the image, and the calculation formula is as follows: where M j is the set of the feature maps output by the preceding layer, x l j is the jth feature map of the lth layer calculated based on the (l-1)st layer, � denotes the convolution operation, k l ij is the convolutional kernel corresponded by the jth feature map of the lth layer based on the convolution calculation of the ith feature map of the (l-1)st layer, b l j represents neural bias, and f(�) is the activation function that maps the data onto a certain range after the convolution operation. In Eq (3), the plus symbol does not mean a simple mathematical operation but a connection (concatenation) operation.
(2) Backpropagation process The errors in CNN backpropagation primarily consist of output layer errors and hidden layer errors. Suppose that the target function of CNN is a variance cost function as follows: where w represents the weight, b represents the bias, x is the input feature of the image, y represents the output value and h w,b (x) represents the actual value. The specific error adjustment process is described as follows: a. Calculation formula for the output errors: where d l j is the error produced by the jth neuron of the lth layer and z l j represents the input of the jth neuron of the lth layer, which is associated with the weight and bias, respectively. b. Calculation formula for error propagation from the pooling layer to the convolutional layer: where b l j is the feature of the jth neuron of the lth layer, d l j is the error of the jth neuron of the lth layer, f'(�) is the derivative of the activation function, unsample(�)denotes the upsampling operation, and � represents the multiplication operation.
c. Calculation formula for error updating from the convolutional layer to the pooling layer: where f'(�) is the derivative of the activation function, rot180(�) means 180-degree rotation of the convolution kernel, and δ l+1 represents the error in the next layer.
Then, the updated equations of W (weight) and b (bias) of the network are as follows: where W l new is the updated weight, W l old is the weight before updating, α is the learning rate of the network, b l new is the updated bias, and b l old is the bias before updating. As the proposed CM-supplement network extracts more feature information during the forward propagation process than during the backpropagation process, predicted outcomes close to actual values can be obtained, and the error between the predicted value and the actual value can then be reduced, which in turn influences the backpropagation process and accelerates the convergence rate of the network.

Multilabel image annotation
The framework of automatic multilabel annotation proposed in this study is shown in Fig 3. First, the positive and negative features of all the training sets are extracted with the proposed CM-supplement network. Second, the extracted features are concatenated to establish the semantic mapping between the image features and the tags, and the mapping is used as the input to construct the annotation model. Then, the nonannotated images are input into the trained network for image tag predictions.
The proposed framework modifies the convolutional layer of the CNN model: The feature map obtained by the convolutional layer is negated, and the negative feature map combined with the original feature map is introduced into the ELU activation function and then transferred to the next layer. Such treatment enables the framework to obtain more feature information of the image than other methods and reduces the total error, which strengthens the learning of valuable feature information and makes the acquisition of the semantic information implied in the image easier, thereby benefiting automatic image annotation.
The algorithm is described as follows: Step 1: Count the number of samples corresponding to each tagging word in the training set and determine the tag set; Step 2: Employ the convolutional layer and pooling layer for forward propagation and combine the cavity convolution operation, the Inception v1 structure and the supplement structure to extract the positive and negative features; Step 3: Concatenate the positive and negative features of the image; Step 4: Realize backpropagation by updating the errors of the output layer and hidden layer and continuously adjust the network parameters according to the differences between the network-predicted results and the actual tags of the image; Step 5: Repeat Steps 2-4, and train the network model until it is stable; Step 6: Input the image to be tested into the trained network for annotation to obtain the annotation outcomes.

Experimental environment and design
The experimental PC environment consisted of a Windows 10 system, i7-8750 processors, 8 GB of memory, and an NVIDIA GTX1060 GPU. The Python programming language combined with the TensorFlow deep learning framework from Google was adopted to realize the idea behind the algorithm proposed in this study. The proposed CNN model was established.
For comparison, the parameters of the CM-supplement model were basically consistent with those of the CNN in reference [27]. In layer 1, the size of the convolution kernels (n = 20) was set to 10 � 10 with a step length of 4. In layer 2, the step length of the convolution kernels was set to 2 with a size of 5 � 5 (n = 40). In layer 3, 60 common convolution kernels were used with sizes and step lengths and 5 � 5 and 1, respectively. Following the third convolutional layer was a pooling layer with a size of 7 � 7 and a step length of 3. The Inception v1 structure followed the pooling layer. The last three layers were fully connected layers. A dropout layer was added following the first two fully connected layers to prevent overfitting, whose parameter set used the k-fold cross-validation method. After repeated experiments, k was determined to be 10 in this study. The learning rate of the CM-supplement model was set to 0.00001, which was two orders of magnitude lower than that of the compared CNN model. The reason is that when the learning rate of the CM-supplement network is set to 0.001 during training, the NaN error will occur in the corresponding loss function, which means that the obtained loss value would go beyond the range the computer can display, thereby becoming the non-numeric type. By reducing the learning rate, the loss function values can return to normal.
The mean average precision (MAP) was used as the assessment index for multilabel image annotation [27], and its value is the average of the average precision values.
where r represents a group of set thresholds, AP is the average accuracy rate, and Pinterp(r) represents the maximal precision value corresponding to each threshold: wherer denotes a certain threshold and pðrÞ is the accuracy rate corresponding to eachr.
where q represents a certain category and Q is the number of categories.
The adoption of the MAP as an assessment index was based on the consideration that it can overcome the limitation of the isolated accuracy rate, recall rate and F1-score to obtain an index that reflects the overall performance.

Experimental data source
To validate the effectiveness of the proposed CM-supplement network model, the free, open access PASCAL VOC dataset [30] (http://host.robots.ox.ac.uk/pascal/VOC/voc2012/index. html) was used for the experiment. PASCAL VOC 2012 contains a total of 20 categories and 22,531 images, with approximately 1,000 images in each category.

Results analysis
In this study, comparisons were made in terms of the annotation accuracy rate and memory consumption.
Annotation accuracy. ( [27], [29] and [32], noticeably improved the accuracy rate, as the extracted image features by these methods are more abstract and more comprehensive than those features extracted by the other methods, which are closer to the high-level semantics of the images understood by human beings. The proposed CM-supplement network achieved an even higher accuracy rate because it can extract more feature information during forward propagation than these methods; therefore, the predicted results are closer to the actual values. (2) Comparison of the changes in the MAP according to the number of iterations Because the methods used in references [27] and [33] are based on artificial feature extraction, when they encounter a large number of samples, the training time will be long. Therefore, the changes in the MAP according to number of iterations were compared only between the proposed CM-supplement network and the other convolution neural network methods, and the results are shown in Fig 4. As shown in Fig 4, the CM-supplement network converged much faster than the other four methods, especially the methods of CNN and literature [29]. Even at the beginning of the training, the CM-supplement model obtained a high accuracy rate, which showed a relatively stable tendency. Furthermore, after training and stabilization, the final MAP of the CM-supplement network remained higher than that of any other model.
(3) Comparisons of the annotation accuracy rates after stabilization training To further validate the performance of the proposed model, the poststabilization annotation accuracy rates of the seven models based on convolutional neural networks for the 20 categories in the PASCAL VOC 2012 dataset were compared (Table 2 and Fig 5).
As shown in Table 2, the accuracy rates of the methods from references [31] and [33] were noticeably lower than those involving convolution neural network-based methods. For the labels with sufficient training and a high original annotation accuracy rate, such as plane, cat, cow and train, the annotation accuracy rates of the five networks were comparable. However, for those labels corresponding to a relatively small number of training images, the CM-    Fig 5, the annotation rates of the CM-supplement network for the 20 annotated words were much higher than those of the methods from reference [31] and [33], were superior to those of the methods from reference [29] and [32] and were comparable to those of the DCCNN. Overall, the average accuracy rate of the former was only 1.7 percentage points higher than that of the latter. However, the proposed model had a smaller number of parameters and consumed fewer memory resources than the DCCNN.
Comparisons of the amount of memory consumed. Although the CM-supplement and DCCNN models were comparable in terms of the annotation effect, they differed greatly in the number of network parameters (Tables 3 and 4).
As shown in Tables 3 and 4, the total number of parameters in the DCCNN was more than 1.5 times that of the CM-supplement network. This result fully indicates the effectiveness of the proposed model in image auto-annotation; it achieved a comparable or even better annotation effect compared with the DCCNN but without increasing the amount of memory resources consumed.

Comparison with the actual annotation effects
To validate the annotation effect of the method proposed in this study, the automatic annotation outcomes of the proposed CM-supplement network for high-frequency and low-frequency words were compared with those based on the methods from reference [27] and [32], as the latter two methods achieved a relatively high annotation accuracy rate. The results are summarized in Tables 5 and 6. Table 5 shows that the annotation outcomes of the three methods were similar in terms of high-frequency words. However, as shown in Table 6, compared with the other two methods, the CM-supplement network exhibited better image descriptions and more complete annotations than the other methods. Moreover, for low-frequency words, such as chair, sofa, tv monitor and dining table, the proposed method had a higher recognition rate. Therefore, although the DCCNN and the method from reference [32] also achieved a more accurate and comprehensive description of the images, the proposed method provided the most complete description of the images.  [10,10,3,20] Convolution kernel [12,12,3,20] ((10 � 10 � 3+1)+(12 � 12 � 3+1)) � 20 = 14680 Convolution layer 2 Convolution kernel [5,5,20,40] Convolution kernel [5, 5, 20,

Conclusion
In this study, a CM-supplement network was proposed for multilabel image auto-annotation based on the characteristics of multilabel learning as well as the consideration of the memory resource consumption, and the typical, frequently used multilabel image dataset PASCAL VOC 2012 was selected for result validation. The comparisons between the proposed CM-supplement network and double-channel CNN proved the improved overall annotation efficiency and reduced memory resource consumption of the model proposed in this study. Based on the outcomes of this study, future research may be carried out in the following three aspects: (1) Larger-scale datasets can be used. As CNN training requires a large amount of data, larger datasets can leverage the advantages of neural networks. They aid in obtaining better parameters and avoiding overfitting, which can further improve the stability of the solution; (2) Due attention can be given to word-word symbiotic relations and the image-image distance to further enhance the annotation accuracy rate; (3) A semi or even unsupervised strategy can be adopted. As the resources of labeled images are limited in quantity, artificial annotation requires a large amount of labor and materials resources. The introduction of a semi-or unsupervised strategy can realize satisfactory annotation outcomes by resorting to a small portion of labeled images.