Pneumonia image classification based on an improved deep convolutional neural network model

Pneumonia remains the leading infectious cause of death in children under the age of five, killing about 700,000 children each year and affecting roughly 7% of the world's population. Chest X-ray images are key to diagnosing this disease, but even skilled doctors read them with a degree of subjectivity; computer-aided medical diagnosis that automatically detects lung abnormalities can therefore improve diagnostic accuracy. This research introduces a deep learning technique that combines the Xception neural network with long short-term memory (LSTM) to automatically diagnose pneumonia from patients' X-ray images. First, the model uses the Xception network to extract deep features from the data and passes the extracted features to the LSTM, which examines them and retains the most informative ones. Second, the traditional cross-entropy loss cannot by itself balance the mismatch between categories in the training samples; this research therefore draws on Pearson's feature selection ideas and fuses a correlation term with the cross-entropy loss to optimize this problem. The experimental results show an accuracy of 96%, an area under the receiver operating characteristic curve of 99%, a precision of 98%, a recall of 91%, and an F1 score of 94%. Compared with existing methods, the research achieves the expected results on the currently available datasets and offers doctors higher reliability in the task of classifying childhood pneumonia.


Introduction
Pneumonia refers to inflammation of the bronchioles, bronchi, alveoli, and lung interstitium. It is mostly caused by pathogenic microbial infection, physical and chemical irritation, impaired immune function, allergies, or drug factors. Among its forms, bacterial and viral pneumonia are the most common and pose a great threat to the health of children and the elderly. In recent years, despite powerful antibiotics and effective vaccines, the overall mortality rate of pneumonia has increased. Deep learning, built on neural networks with deep structures, now offers tools to address this. For radiologists, X-ray image analysis and inspection are an indispensable part of daily work; the work is cumbersome and critical, and it consumes a great deal of time and energy. Researchers have therefore proposed several computer algorithms for analyzing X-ray images [12]. Machine learning (ML) technology is used in many fields of medicine and has achieved excellent results in detecting pneumonia, breast cancer, heart disease, cataracts, and other diseases. Combining ML technology with medical care can bring affordable diagnosis to more people at minimal cost, especially in regions such as India, South Asia, and Sub-Saharan Africa.
In this research, we propose a method that uses transfer learning [13,14] for initialization and merges the Xception algorithm [15] with the long short-term memory algorithm (LSTM) [16] to distinguish, from X-ray images of children's lungs, whether a patient is infected with pneumonia. The results of the model are compared and analyzed comprehensively against other studies in the literature. The Xception-LSTM model used in this experiment can better assist doctors in diagnosing and treating patients with pneumonia and can detect affected children in a timely and effective manner, so that they receive prompt treatment. Even if this model cannot completely replace manual classification, it can still reduce the burden of detection on doctors, improve safety and reliability, and classify pneumonia more accurately.

Related work
In-depth research on CNN algorithms has been applied widely in fields such as industry, medicine, edge computing, and flow control, and studies using convolutional neural networks have achieved early results in medicine, for example in the detection of diabetic retinopathy [17]. Many researchers also regard lung sounds as an important means of diagnosing lung disease: classification based on a single entropy feature is less accurate for lung sounds, whereas combining seven entropies can reach an accuracy as high as 94.95% [18]. Heavy smokers can undergo low-dose chest CT, which can automatically detect coronary artery, thoracic aorta, and heart valve calcification, allowing reliable cardiovascular risk assessment [19]. One study combined a supervised back-propagation neural network with an unsupervised competitive neural network to diagnose pneumonia [20]. Lakhani and Sundaram used neural networks and AlexNet to partition and augment the data, obtaining an area under the curve of 0.94-0.95 without any pre-training [21]. In medical imaging, a diagnostic tool based on a deep learning framework was established to better screen patients for common treatable blinding retinal diseases [22]. In the literature [23], Santosh and Antani proposed using frontal chest radiographs to screen HIV-positive patients for tuberculosis. When diagnosing pneumonia, chest X-ray is likewise an indispensable step. In the literature [24], the VGG16 model applied even to unclear images achieved 90.54% accuracy, 98.71% recall, and 87.69% precision in diagnosing pneumonia. Husanbir Singh Pannu et al. [25] established a reliable adaptive neuro-fuzzy inference system (ANFIS) with particle swarm optimization (PSO) to improve the classification rate. Shawni Dutta et al.
[26] used a combined model of long short-term memory (LSTM) and gated recurrent units (GRU), incorporating memory cells into neural networks for training and testing. Literature [27] proposed a hybrid neuro-heuristic approach with image descriptors of the spatial distribution of chromaticity, saturation, and brightness values to accurately detect diseased and degraded tissue in lung X-ray images; the Moth-Flame algorithm proved more effective for pneumonia detection. In the literature [28], the fitness function and search model of bio-inspired methods serve as decision support for the detection of lung lesions and can better guide doctors toward areas that may contain lesions. In [29], a fuzzy logic segmentation method and a probabilistic neural network (PNN) form an evaluation model for lung cancer: the fuzzy set acts as a segmentation tool, and the results are sent to the neural network for evaluation. As a variant of the recurrent neural network (RNN), LSTM can better mine temporal information in sequences [30][31][32] and realize image classification.
In general, deep learning methods have made considerable achievements in medical imaging, and this research innovates and improves further on that basis. The third part focuses on data preprocessing to enhance the readability and clarity of the data. The fourth part combines a new deep learning model with transfer learning and LSTM technology to optimize the accuracy of the model and improve its optimization ability; to strengthen the relevance between categories, the idea of feature selection is appropriately adapted. The ultimate goal of this experiment, and the main motivation for this work, is to use artificial intelligence to better automate disease classification in the medical field.

Dataset
The dataset used in this study was published by Kermany et al. in 2018 [33]. It contains 5863 X-ray images (JPEG) in two categories (normal images / images with pneumonia), derived from children aged 1 to 5 years at the Guangzhou Women and Children's Medical Center. All chest X-ray imaging [33] was performed as part of the patients' routine clinical care. To improve the analysis and evaluation of chest radiograph accuracy, a total of 5856 chest X-ray images were collected and labeled after screening for quality. Owing to external factors such as the scanning location, personal habits, and patients' other diseases, some images are unclear; during the research we therefore enhance the images to improve readability.

Data image preprocessing
In medical images, we usually hope that the brightness of the image is evenly distributed. In practice, however, brightness varies randomly, which reduces image quality, produces visual noise, and can trouble the doctor during diagnosis [34,35]. Therefore, before the experiment we performed image denoising. Since some areas in a lung X-ray image look very similar, the non-local means denoising algorithm was selected to reduce diagnostic error: take a small window around each pixel, scan the image for similar windows, average all of those windows, and replace the pixel with the computed result [36].
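The window-averaging idea described above can be sketched as a minimal (unoptimized) non-local means filter in NumPy. The patch size, search window, and filtering strength h below are illustrative choices, not the paper's settings; a production pipeline would use an optimized library routine.

```python
import numpy as np

def nlm_denoise(img, patch=3, window=7, h=10.0):
    """Minimal non-local means: replace each pixel by a weighted average
    of pixels whose surrounding patches look similar to its own."""
    pad = patch // 2
    padded = np.pad(img.astype(float), pad, mode="reflect")
    rows, cols = img.shape
    half = window // 2
    out = np.empty_like(img, dtype=float)
    for i in range(rows):
        for j in range(cols):
            ref = padded[i:i + patch, j:j + patch]  # patch around (i, j)
            weights, values = [], []
            for a in range(max(0, i - half), min(rows, i + half + 1)):
                for b in range(max(0, j - half), min(cols, j + half + 1)):
                    cand = padded[a:a + patch, b:b + patch]
                    dist2 = np.mean((ref - cand) ** 2)  # patch similarity
                    weights.append(np.exp(-dist2 / (h * h)))
                    values.append(img[a, b])
            out[i, j] = np.average(values, weights=weights)
    return out

# Demo: Gaussian noise added to a flat region is visibly suppressed
rng = np.random.default_rng(0)
clean = np.full((16, 16), 100.0)
noisy = clean + rng.normal(0.0, 10.0, clean.shape)
denoised = nlm_denoise(noisy)
```

Because similar patches get large weights, averaging happens mostly over genuinely similar regions, which is why the method preserves edges better than a plain local mean.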
We store the noise-reduced images in a new folder. To ensure image quality, we artificially enhance the images by computational means; data augmentation has become the simplest and most direct way to improve model performance. Appropriate augmentation also helps address the overfitting problem and improves the generalization ability learned during training, making the model's application range wider and more accurate [37], as summarized in Table 1.
In the training set, we rescale the images and apply random transformations. The study sets a rescaling factor of 1/255, which maps pixel values into [0, 1]. The rotation range is the angle through which an image is randomly rotated about its center during training, set to 7 degrees. The height shift moves the image vertically by up to a fraction 0.2 of its height; the width shift is the same, except that the movement is horizontal, up to 0.2 of the width. Zooming enlarges or reduces the length and width of the image without cropping it; during augmentation the size is randomly enlarged or reduced by up to a factor of 0.45. Finally, the image may be flipped in the horizontal direction [37].
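The augmentation settings above map naturally onto Keras' ImageDataGenerator; that this was the exact API used is an assumption, and the random stand-in batch below is only for demonstration (real use would call flow_from_directory on the dataset).

```python
import numpy as np
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Augmentation matching the text: rescale by 1/255, rotate up to
# 7 degrees about the centre, shift up to 0.2 of the height/width,
# zoom by up to 0.45, and flip horizontally.
train_gen = ImageDataGenerator(
    rescale=1.0 / 255,
    rotation_range=7,
    height_shift_range=0.2,
    width_shift_range=0.2,
    zoom_range=0.45,
    horizontal_flip=True,
)

# Demo on a random stand-in batch of four 224 x 224 RGB images
images = np.random.randint(0, 256, (4, 224, 224, 3)).astype("float32")
batch = next(train_gen.flow(images, batch_size=4, shuffle=False))
```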

Transfer learning
Transfer learning, as the name suggests, transfers the parameters of a trained model (a pre-trained model) to a new model to help train the new model. In the medical field we often need a large number of images for analysis and diagnosis; training a deep CNN [38] model such as ResNet, Xception, or VGG also requires large datasets, because these models contain millions of trainable parameters, and a small dataset is not enough to obtain good generalization. We first study the application of the Xception model and consider combining it with other structures, such as wavelet transforms, residual networks, and long short-term memory networks (LSTM). This article combines the Xception and LSTM models and compares the result with other CNN structures. Depthwise separable convolution was first proposed by Dr. Laurent Sifre in the literature [39].
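A minimal sketch of such a transfer-learning setup in Keras follows. The head layer sizes are illustrative assumptions, and `weights=None` is used here only to keep the sketch offline; real transfer learning would pass `weights="imagenet"` so the convolutional filters arrive pre-trained.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

# Backbone: Xception without its classification head.
base = tf.keras.applications.Xception(
    weights=None,           # use weights="imagenet" in practice
    include_top=False,
    pooling="avg",
    input_shape=(224, 224, 3))
base.trainable = False      # freeze the transferred feature extractor

# New task-specific head (sizes are illustrative)
model = models.Sequential([
    base,
    layers.Dense(128, activation="relu"),
    layers.Dense(1, activation="sigmoid"),  # normal vs pneumonia
])
```

Freezing the backbone means only the small head is trained, which is what makes transfer learning viable on a few thousand images.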
In 2014 the Google team proposed GoogLeNet, which improved on the traditional convolutional layer. The so-called Inception structure uses convolution kernels of different sizes within the same layer, an innovation that gave CNNs a small breakthrough. Improved versions, Inception V2 [40], Inception V3 [41], and Inception V4 [42], followed. In Inception V2, where the traditional structure often uses a 5 × 5 convolution kernel, two 3 × 3 convolutions are used instead, increasing the depth of the network, improving its effect to a certain extent, and establishing a more complex nonlinear transformation; in addition, Batch Normalization is used to reduce the difficulty of training the neural network, as in Eq (1).
In Eq (1), the batch-normalized output is $y = \gamma\hat{x} + \beta$, where $\gamma$ and $\beta$ are learnable parameters and $\hat{x} = (x - \mu_B)/\sqrt{\sigma_B^2 + \varepsilon}$ is the input $x$ normalized over a batch. Inception V3 offers more variety in the Inception module, combined with branching technology [43]: a larger two-dimensional convolution is split into two smaller one-dimensional convolutions, and the standard Inception block is simplified, as shown in Fig 2, by using only one form of convolution (for example, a 3 × 3 kernel) and removing the average pooling structure, which eliminates a large number of parameters, accelerates computation, and reduces overfitting; at the same time, a layer of nonlinear transformation is added to expand the expressive ability of the model. Xception is an improvement on Inception V3 that uses depthwise separable convolution, an "extreme" version of the Inception module shown in Fig 3: it first uses a 1 × 1 convolution to capture cross-channel correlation and then independently captures the spatial correlation of each output channel [44]. The Xception network in this experiment mainly uses separable convolutions, which replace the convolution operations in Inception V3 and further improve the effect of the model without increasing the complexity of the network, as shown in Table 2.
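The parameter savings behind depthwise separable convolution can be checked with a small back-of-the-envelope calculation; the layer size used here (a 3 × 3 layer mapping 256 channels to 256) is illustrative, not taken from the paper.

```python
def conv_params(k, c_in, c_out):
    """Weights in a standard k x k convolution (bias ignored)."""
    return k * k * c_in * c_out

def separable_conv_params(k, c_in, c_out):
    """Depthwise separable convolution: one k x k spatial filter per
    input channel, then a 1 x 1 pointwise convolution across channels."""
    return k * k * c_in + c_in * c_out

standard = conv_params(3, 256, 256)            # 589,824 weights
separable = separable_conv_params(3, 256, 256)  # 67,840 weights
```

For this layer the separable form uses roughly an eighth of the weights, which is why Xception can match Inception V3's capacity at similar cost.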

Xception+LSTM
A convolutional neural network (CNN) for image recognition is a multi-layer neural network composed of convolutional layers for feature extraction and sub-sampling layers for feature processing, and it holds an indispensable position in computer vision [45]. This study adopts a CNN-based Xception backbone, which can effectively capture pneumonia image features. CNNs, however, lack the ability to learn dependencies across sequences and cannot handle sequential data, whereas LSTM is a special recurrent neural network that can learn long-term dependencies in data while retaining information in sequence form. Therefore, to better extract pneumonia image features, we propose in this article a model combining the Xception network with a long short-term memory network (LSTM), as shown in Fig 4. The network's batch_size, image-quality enhancement, and other parameters are then fine-tuned on the training images to improve the model's accuracy and generalization ability. Finally, a neural attention mechanism merges and analyzes the features generated by the Xception and LSTM models, and the features are normalized and serially fused into fusion features through the fully connected layer FC [46].
The Xception model's feature extraction integrates cross-channel and spatial correlation mappings. The architecture has 36 convolutional layers divided into 14 modules, grouped overall into an input block (Entry Flow), an intermediate block, and an output block. Except for the first and last, all of these modules have linear residual connections, and the architecture is based entirely on depthwise separable convolutional layers [15], which form the basis of the network's feature extraction. In the first part of the Xception network, an input image of 224 × 224 × 3 is sent to the input block, passes through the intermediate block eight times, and finally emerges from the output block as four-dimensional pneumonia feature data, which is converted into three-dimensional data readable by the LSTM layer and sent to the second half of the model architecture. The LSTM layer contains 200 units and operates in three stages: the forgetting stage, the selective memory stage, and the output stage. The output data then passes through four fully connected (dense) layers; the number of neurons in the first three layers halves from layer to layer, and with the ReLU activation function the previous local features are reassembled through the weight matrix into a complete representation. The last layer performs predictive classification with a sigmoid activation function.
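The pipeline above can be sketched in Keras as follows. The concrete dense sizes 256/128/64 are assumptions (the text gives only the halving ratio), and `weights=None` keeps the sketch offline where the real model would start from pre-trained weights.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

# Xception backbone: for a 224 x 224 x 3 input it emits a 7 x 7 x 2048
# feature map (the "four-dimensional" features, counting the batch axis).
backbone = tf.keras.applications.Xception(
    weights=None, include_top=False, input_shape=(224, 224, 3))

x = layers.Reshape((7 * 7, 2048))(backbone.output)  # 4-D map -> 3-D sequence
x = layers.LSTM(200)(x)                             # 200-unit LSTM layer

# Four dense layers; the first three halve the neuron count layer by layer
x = layers.Dense(256, activation="relu")(x)
x = layers.Dense(128, activation="relu")(x)
x = layers.Dense(64, activation="relu")(x)
out = layers.Dense(1, activation="sigmoid")(x)      # pneumonia vs normal

model = models.Model(backbone.input, out)
model.compile(optimizer="rmsprop", loss="binary_crossentropy",
              metrics=["accuracy"])
```

The Reshape layer is the bridge: it reinterprets the 7 × 7 spatial grid as a 49-step sequence of 2048-dimensional feature vectors that the LSTM can scan.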
In the LSTM layer, unimportant input features are selected and forgotten. The cell state $p_m$ changes very slowly: the new state is the previous state $p_{m-1}$, scaled by the forget gate, plus a gated update, as in Eq (2),

$$p_m = n_m \odot p_{m-1} + g_m \odot c_m \qquad (2)$$

The gate values $n_m$ and $g_m$ are formed by multiplying the spliced vector $[h_{m-1}, x_m]$ by a weight matrix $W$ and passing the result through a sigmoid activation, which converts it into a value between 0 and 1 used as a gated state. As in Eq (3), the candidate

$$c_m = \tanh(W_c[h_{m-1}, x_m] + b_c) \qquad (3)$$

converts the result into a value between −1 and 1 through a tanh activation. The gated state of Eq (4),

$$h_m = o_m \odot \tanh(p_m), \quad o_m = \sigma(W_o[h_{m-1}, x_m] + b_o) \qquad (4)$$

controls the transmission state, retaining long-term memory and forgetting unimportant information.
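A single LSTM time step implementing this gating can be written out in NumPy; standard gate names (forget, input, output) are used, and the mapping onto the symbols above is approximate.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W, b):
    """One LSTM step. W maps the spliced vector [h_prev, x] to the four
    gate pre-activations (stacked); b is the matching bias."""
    z = W @ np.concatenate([h_prev, x]) + b
    H = h_prev.size
    f = sigmoid(z[0:H])            # forget gate: what to discard
    i = sigmoid(z[H:2 * H])        # input gate: what to write
    o = sigmoid(z[2 * H:3 * H])    # output gate: what to expose
    c_tilde = np.tanh(z[3 * H:4 * H])  # candidate values in (-1, 1)
    c = f * c_prev + i * c_tilde   # slowly changing cell state
    h = o * np.tanh(c)             # hidden state passed onward
    return h, c

# Demo with random weights: 3-dim input, 4 hidden units
rng = np.random.default_rng(1)
H, D = 4, 3
W = rng.normal(size=(4 * H, H + D))
b = np.zeros(4 * H)
h, c = lstm_step(rng.normal(size=D), np.zeros(H), np.zeros(H), W, b)
```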

Optimizing the loss function
In this experiment, to improve the classification results, the optimizer and loss function were optimized and improved to different degrees. As Table 3 shows, other optimizers extract features poorly on the same dataset, so the RMSprop (root mean square propagation) optimizer was selected. Its purpose is to bring the parameters closer to the optimal value and to speed up gradient descent by adjusting the step size when updating parameters: the step is smaller in directions of drastic change and larger in directions of gentle change.
In plain gradient descent, progress along the horizontal axis is accompanied by large oscillations along the vertical axis. RMSprop automatically adjusts the parameters through Eqs (5) and (6) below, effectively alleviating the gradient disparity between variables.
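The adaptive step this describes can be sketched as a plain-NumPy update rule; the learning rate and decay values in the demo are illustrative, and the quadratic toy objective stands in for the network loss.

```python
import numpy as np

def rmsprop_update(param, grad, cache, lr=0.05, alpha=0.9, eps=1e-8):
    """One RMSprop step: keep an exponential moving average of the
    squared gradient, then divide the step by its square root, so that
    steep directions take smaller steps and flat ones larger."""
    cache = alpha * cache + (1 - alpha) * grad ** 2
    param = param - lr * grad / (np.sqrt(cache) + eps)
    return param, cache

# Demo: minimize f(v, b) = v^2 + 10 b^2, whose gradient is much
# steeper in b (the "vertical" axis) than in v.
theta = np.array([1.0, 1.0])
cache = np.zeros(2)
for _ in range(50):
    grad = np.array([2 * theta[0], 20 * theta[1]])
    theta, cache = rmsprop_update(theta, grad, cache)
loss = theta[0] ** 2 + 10 * theta[1] ** 2
```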
$$T_{dv} = \alpha T_{dv} + (1-\alpha)(dv)^2 \qquad (5)$$
$$T_{db} = \alpha T_{db} + (1-\alpha)(db)^2 \qquad (6)$$

Here $v$ denotes the parameter along the horizontal axis; its rate of change is very small, so $dv$ is small and $T_{dv}$ is also small. The parameter $b$ fluctuates greatly along the vertical axis, so the slope in the $b$ direction is particularly large; among these differentials, $db$ is large and $dv$ is small. To balance the ranges of change between $T_{dv}$ and $T_{db}$, the parameters $v$ and $b$ are updated as

$$v := v - \gamma \frac{dv}{\sqrt{T_{dv}} + \varepsilon} \qquad (7)$$
$$b := b - \gamma \frac{db}{\sqrt{T_{db}} + \varepsilon} \qquad (8)$$

The divisor for $v$ is a small number, so generally the change in $v$ is large; the divisor for $b$ is a larger number, so the update of $b$ in Eq (8) is slowed and the vertical movement becomes relatively smooth. The hyperparameters $\alpha$ and $\gamma$ damp the swing as the parameters descend and allow a larger learning rate to speed up the algorithm.

In the logistic regression binary classification problem, we usually use the sigmoid function, derived from the principle of maximum entropy, to map the output into the interval [0, 1] and classify with 0.5 as the threshold. When binary_crossentropy is used in this experiment, the weight update speeds up when the error is large and slows when it is small; however, the traditional cross-entropy also has certain problems: although it has significant advantages at the feature level, the similarity between related classes within a vector is still lacking. The chi-square test can compare two or more samples (composition ratios), analyze the correlation of two categorical variables, and measure the degree of agreement between actual and predicted frequencies (goodness of fit). Assume the model's predicted output probabilities are $\hat{p} = [\hat{p}_1, \hat{p}_2, \ldots, \hat{p}_t]$ and the true label probabilities are $p = [p_1, p_2, \ldots, p_t]$, where $t$ is the total amount of data. On this basis, this research proposes a method built on binary_crossentropy that incorporates the feature selection method proposed by Pearson (Eq (9)), expressed as Eq (10):

$$\chi^2 = \sum_{i=1}^{t} \frac{(p_i - \hat{p}_i)^2}{\hat{p}_i} \qquad (9)$$
$$L = -\frac{1}{t}\sum_{i=1}^{t}\left[p_i \log \hat{p}_i + (1-p_i)\log(1-\hat{p}_i)\right] + \chi^2 \qquad (10)$$

Here $\chi^2$ represents the degree of difference between the actual and predicted values, $p$ the current true label value, $\hat{p}$ the current predicted label value, and $L$ the loss function of this experiment.
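A NumPy sketch of a loss of this shape follows, assuming the fusion is a simple additive combination; the weight `lam` is a hypothetical knob, and in a Keras training loop the same arithmetic would be expressed with tensor ops.

```python
import numpy as np

def binary_crossentropy(p, p_hat, eps=1e-7):
    """Standard BCE over true labels p and predicted probabilities p_hat."""
    p_hat = np.clip(p_hat, eps, 1.0 - eps)
    return -np.mean(p * np.log(p_hat) + (1.0 - p) * np.log(1.0 - p_hat))

def pearson_chi2(p, p_hat, eps=1e-7):
    """Pearson-style statistic: squared deviation scaled by the prediction."""
    return np.sum((p - p_hat) ** 2 / (p_hat + eps)) / p.size

def combined_loss(p, p_hat, lam=1.0):
    """Additive fusion of cross-entropy with the chi-square term."""
    return binary_crossentropy(p, p_hat) + lam * pearson_chi2(p, p_hat)

# Demo: better-calibrated predictions should score a lower loss
y = np.array([1.0, 0.0, 1.0, 1.0])
good = np.array([0.9, 0.1, 0.8, 0.95])
bad = np.array([0.3, 0.8, 0.4, 0.2])
```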

Number of balanced categories
Dividing the dataset reveals that the number of pneumonia patients is significantly higher than the number of normal subjects; the sample is imbalanced, so it is not an unbiased estimate of the overall population, which may degrade the model's predictive ability. If the minority class is given a very high class weight, the algorithm is likely to be biased toward the minority class, increasing the error on the majority class. In this case, we can try to solve the problem by adjusting the sample weights. To judge the imbalance accurately, Fig 5 first makes the category imbalance in the training set visually obvious. For further verification, following [3], the imbalance ratio IR of the training set can be computed and compared with 1.5 to judge whether the training set is balanced.
$$IR = \frac{|Max_{class}|}{|Min_{class}|} \qquad (11)$$

The computed IR does not fall within the required range of < 1.5, so the data in the two training-set categories are not balanced. Imbalanced data volumes change classification performance. Therefore, to balance the category ratio more reasonably, weight values are used and the class-weight hyperparameter is modified. By default, class_weights is "None", i.e. the two classes carry equal weight. When the classification situation of the training set is uneven, None can be changed to "balanced", which better optimizes the weight ratio of the two classes. The weight under "balanced" can be computed as

$$W_m = \frac{i\_sum}{i\_class \times i\_Sum_m} \qquad (12)$$

where:
• $W_m$: the final calculated weight value of each class.
• $i\_sum$: the size of all samples in the dataset.
• $i\_class$: the number of categories in the total sample.
• $i\_Sum_m$: the number of samples belonging to category m.
The weight values of normal subjects and pneumonia patients in the training set under "balanced", calculated in this way, are W1 = 1.94 (N) and W2 = 0.68 (P). When class weight = "balanced", the model automatically assigns class weights inversely proportional to the class frequencies, deliberately increasing the influence of the minority class and reducing that of the majority class, thereby mitigating the imbalance between categories.
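The calculation can be reproduced directly; the per-class counts below (1341 normal, 3875 pneumonia) are an assumption about the Kermany training split that is consistent with the reported weights of roughly 1.94 and 0.68.

```python
def balanced_class_weights(counts):
    """scikit-learn style 'balanced' weighting:
    W_m = i_sum / (i_class * i_Sum_m),
    i.e. each class weight is inversely proportional to its frequency."""
    i_sum = sum(counts.values())    # all samples in the training set
    i_class = len(counts)           # number of categories
    return {m: i_sum / (i_class * n) for m, n in counts.items()}

# Assumed per-class counts for the training split
train_counts = {"NORMAL": 1341, "PNEUMONIA": 3875}
weights = balanced_class_weights(train_counts)
```

The resulting dictionary can be passed straight to Keras' `fit(..., class_weight=weights)` (with integer class indices as keys) so that each normal example counts roughly three times as much as each pneumonia example in the loss.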

Cost-sensitive learning
Classification is an indispensable part of the machine learning process, yet differences among samples can easily affect classification results. In this experiment, to better balance the dataset, we first processed and augmented the data (through rotation, translation, zooming, etc.), reducing the data loss rate to a certain extent. Second, the samples were balanced by setting the class-weight hyperparameter to "balanced": under this adjustment, the more input samples a category has, the smaller its penalty term, which compensates well for the learning bias caused by imbalanced input samples. Finally, on the basis of the binary_crossentropy loss, the goodness of fit is checked and the correlation between the two loss terms is further fused, improving the correlation between actual and predicted samples and enhancing the learning ability.

Confusion matrix, ROC
The more accurate the experimental results, the higher the quality of the experiment; the four basic indicators were introduced above. The confusion matrix counts the observations that the classification model assigns to the right and wrong classes. In this section, we analyze the confusion matrix and ROC curve of this experiment based on the classification model above. The ROC curve visualizes the true positive rate against the false positive rate as the decision threshold varies. The larger the FPR on the horizontal axis, the more negative samples are predicted as positive; the larger the TPR on the vertical axis, the more positive samples are predicted as positive. The ROC curve therefore gives an overall, more stable comparison of the performance of the classifiers tested in the experiments, as shown in Fig 7. AUC (area under the curve of ROC) is the area under the ROC curve and a standard criterion for judging the merits of a binary prediction model. As can be seen in Fig 7, the AUC of this model reaches 99% on our data; its value reflects intuitively the classification ability expressed by the ROC curve, with larger values indicating better model performance.
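Both quantities are one call each in scikit-learn; the toy scores below stand in for the model's sigmoid outputs and are not the paper's data.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, roc_auc_score

# Toy labels and scores standing in for the model's sigmoid outputs
y_true = np.array([0, 0, 1, 1, 1, 0, 1, 0])
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.9, 0.2, 0.7, 0.3])
y_pred = (y_score >= 0.5).astype(int)   # threshold at 0.5

cm = confusion_matrix(y_true, y_pred)   # rows: true class, cols: predicted
tn, fp, fn, tp = cm.ravel()             # the four basic indicators
auc = roc_auc_score(y_true, y_score)    # area under the TPR-vs-FPR curve
```

Note that the AUC is computed from the continuous scores, not from the thresholded predictions, which is why it summarizes the classifier across all thresholds at once.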

LOSS curve, experimental results
Connecting and fusing the Xception convolutional neural network with the LSTM model yields a clearly higher accuracy than models that use convolutional neural networks alone for prediction, as shown in Table 4. It is also effective to train the network with the RMSprop (root mean square propagation) optimizer and a loss function that extends binary_crossentropy with the chi-square goodness-of-fit idea. Fig 8 shows the convergence of the training loss in this experiment.
The initial training loss is about 0.27 and the test loss about 0.13. During the first five iterations the loss falls rapidly; thereafter the decline is slower but still steady overall. The accuracy rises quickly to as high as 90% within the first three iterations and then shows a slow overall upward trend. In general, Fig 8 shows the accuracy increasing over time while the overall loss keeps decreasing.
It can be seen from Table 4 that this experiment compares the effects of different mainstream models on the same large dataset. If the classification of pneumonia detection can be made more effective and accurate, the probability of children dying from pneumonia can be greatly reduced: computer-aided medical diagnosis automatically detects lung abnormalities and improves the accuracy of doctors' diagnoses. In addition, we discuss other literature in which different model structures and algorithms detect pneumonia in images. The literature [22] uses deep learning diagnostic tools to further demonstrate the broad applicability of artificial intelligence to diagnosing pneumonia in children from chest X-rays. The literature [37] uses an improved CNN as a feature extractor with a sigmoid-activation classifier, and [49] proposes multi-kernel deconvolution (MD-Conv), which greatly improves the recall rate. The literature [47] compares Xception with the VGG16 model: the tests show that the accuracy of the VGG16 network exceeds that of Xception, yet Xception achieves more success in detecting pneumonia cases. This experiment proposes a further improvement, combining the model structure of ordinary Xception with the LSTM algorithm.
This study also explored achievements on tuberculosis [50] using the same model technology. Our experiments on medical image recognition and classification show accuracy, recall, precision, and F1 on pneumonia chest X-ray images of 0.96, 0.91, 0.98, and 0.94, respectively. Applying the same method to a dataset of tuberculosis patients, model testing yields accuracy, recall, precision, and F1 of 0.99, 0.99, 1.00, and 0.99, further providing a more transparent and recognizable diagnosis.

Summary and discussion
This experiment designed a new network model to better classify children's chest X-ray images and improve the accuracy between categories, making a modest contribution to research on medical image classification. In the experiment, we mainly used the Xception network model to initialize the image information, replacing conventional convolution with kernels that extract spatial information and channel information separately. On top of this lightweight neural network we combined LSTM, a variant of the recurrent neural network, which, in addition to extracting the basic image feature sequence, better mines temporal information in the sequence. Because image acquisition consumes a great deal of manpower and material resources, acquiring extensive training data is difficult; this experiment addresses the problem with transfer learning and data augmentation. Overfitting is further avoided by adding penalty weights to the cost hyperparameters, and the RMSprop optimizer's decay-controlled square root of the sum of squared historical gradients gives each parameter its most suitable learning rate. Since the traditional cross-entropy loss cannot effectively capture the correlation of the training samples, we combine it with the chi-square test, fusing the correlation between the classification variables into the loss. Our method outperforms the benchmark model and some of the latest published results: on the same benchmark, the accuracy of this experiment reaches 0.96. In future work, we will further strengthen data preprocessing, improve the model's generalization ability, and continue optimizing our method so that this research can be applied to a wider range of medical technologies.