Automatic detection and classification of manufacturing defects in metal boxes using deep neural networks

This paper develops a new machine vision framework for the efficient detection and classification of manufacturing defects in metal boxes. Previous techniques, based either on visual inspection or on hand-crafted features, are both inaccurate and time consuming. In this paper, we show that an autoencoder deep neural network (DNN) architecture can not only classify manufacturing defects, but also localize them with high accuracy. Compared to traditional techniques, DNNs are able to learn, in a supervised manner, the visual features that achieve the best performance. Our experiments on a database of real images demonstrate that our approach outperforms the state of the art while remaining computationally competitive.


Introduction
Automatic inspection and defect detection using image processing is an area of machine vision that is being widely adopted in many industries. It is used for high throughput quality control in production systems such as the detection of flaws on manufactured surfaces, e.g. metallic rails [1] or steel surfaces [2]. The idea is to design autonomous devices that automatically detect and examine specific visual patterns from images and videos in order to overcome the limitations and improve the performance of the traditional inspection systems that depend heavily on human inspectors.
Various systems have been previously developed for the automatic inspection of the surfaces of different products. They are generally composed of a pipeline of several steps, each of which introduces its own challenges. First, the acquisition step requires efficient calibration of various types of sensors such as cameras and lighting systems [3,4]. The quality of the images produced by the acquisition system directly impacts the performance of the subsequent analysis steps.
Once images are acquired, the second step is the extraction of visual features, which are then used as input to classifiers that return the likelihood of the presence of a defect at each pixel of the image. These classifiers can be supervised or non-supervised. Since the first use of classical CNNs in inspection tasks [8,14], they have shown good classification accuracy in many applications, such as the detection of defects on photometric stereo images of rail surfaces [1]. Another application of CNNs is the classification of steel images, using a Pyramid of Histograms of Orientation Gradients (HOG) as the feature extractor and Max-Pooling Convolutional Neural Networks as the classifier. The approach achieves a 7% error rate, which is significantly better than other classifiers such as SVM [2].
Pretrained deep Convolutional Neural Networks have also achieved excellent performance in image classification and recognition tasks. For instance, Kasthurirangan et al. [15] used pretrained CNN architectures to classify cracks in pavement images. First, some preprocessing methods were applied to the image dataset to enhance the information in the images. The pretrained VGG16 was then used to extract, from the images, features that can distinguish one image class from another. Different classifiers, such as a single-layer Neural Network (NN), SVM, and Random Forest (RF), were tested, demonstrating the performance of CNNs in this inspection task.
Autoencoder architectures have proved their efficiency with multi-modal data in many classification and retrieval tasks [11]. Their successors, stacked autoencoders [16], train each autoencoder on the output of the hidden layer of the previous autoencoder in the stack. With this architecture, the level of complexity and abstraction of the learned representations increases along the stack of autoencoders. Autoencoder trees, in analogy to decision trees, are a new form of neural network that learns hierarchical representations at different resolutions, where layers are replaced by tree nodes [17,18].
In this paper, we propose a framework for detecting defects on metal boxes and for classifying images as defective or non-defective. The proposed framework uses Deep Neural Networks, which can learn, in a supervised manner, the appropriate features (and thus the image representation) as well as the classification function. First, we apply some preprocessing techniques to the images of a metallic box to improve their quality. Then, we use an adapted method based on a multi-layer network for learning. Recall that using an autoencoder leads to dimensionality reduction, since the number of neurons in the hidden layers is smaller than that of the input layer. The classification decision is made using a probabilistic model that takes the output of the last layer and returns the likelihood of the image showing a defective metallic box. Finally, we also propose an algorithm that detects the defect, if present, in the image.

Data acquisition
Metallic boxes, or steel cans, are extensively used as containers for the distribution and storage of food. The metallic boxes used in this study were collected by a company that produces steel cans with or without colored printing. The collected metallic boxes are suitable for packing tuna, salmon, fish, shrimp, mushroom, and other foodstuffs. All images were acquired with an installed and configured image acquisition device using a Raspberry Pi camera module, and processed with OpenCV, in an unconstrained environment on the company's premises. For each metallic box, the device captures and stores two images that cover the entire box, see Fig 1. Such images can correspond to either non-defective (Fig 1(a)) or defective boxes. The defects can be small (Fig 1(b) and 1(c)) or big (Fig 1(d), 1(e) and 1(f)). In this paper, we consider two different classes of images: defective and non-defective.

Data description and representation
Due to non-controlled external factors, each step in the acquisition process may introduce noise (random changes) to the raw data (e.g. pixel values). To reduce the effects of such noise on the quality of the inspection process, we first pre-process the images by applying a set of denoising filters and reduce the spread by treating the database as a normal distribution of intensities [19,20]. We refer to this step as the image normalization process. Another motivation for image normalization is to standardize the input of the autoencoder, used in the learning process, in order to reduce variability between intensity distributions [21].
Next, we observe that most of the defects are localized in areas that are dominated by horizontal edges, see Fig 1 for some examples. To highlight these regions, we use the gradient of the images instead of the images themselves [22]. Other representations and features have been proposed in the literature, and we will use some of them for evaluation in our experiments. Since we are interested in the horizontal contours where most of the defects are localized, we extract features by applying a Gabor filter with a π/2 orientation. Fig 2 shows two examples of extracted features, where the first row shows the input images and the second and third rows show the gradient and the Gabor features, respectively. Another way to capture relevant features is to decompose the image using adaptive bases such as Fourier, wavelets, or polynomials. Wavelet decompositions have been successfully used for many tasks such as dimensionality reduction, image/video classification, and many other areas of image analysis [23]. This motivated their use in our experimental evaluation, where they proved successful.
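As a concrete illustration of the Gabor step, the oriented filtering can be sketched in a few lines of numpy; the kernel size, wavelength, and bandwidth below are illustrative assumptions, not the values used in our experiments:

```python
import numpy as np
from scipy.ndimage import convolve

def gabor_kernel(ksize=21, sigma=4.0, theta=np.pi / 2, lam=10.0, gamma=0.5, psi=0.0):
    """Real Gabor kernel oriented at angle `theta` (illustrative parameters)."""
    half = ksize // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    x_t = x * np.cos(theta) + y * np.sin(theta)     # rotated coordinates
    y_t = -x * np.sin(theta) + y * np.cos(theta)
    return np.exp(-(x_t**2 + (gamma * y_t)**2) / (2 * sigma**2)) * np.cos(2 * np.pi * x_t / lam + psi)

def gabor_features(image, theta=np.pi / 2):
    """Magnitude of the Gabor response, used here as the feature map."""
    k = gabor_kernel(theta=theta)
    return np.abs(convolve(image.astype(float), k, mode="nearest"))

# toy image with a single horizontal edge
img = np.zeros((32, 32))
img[16:, :] = 1.0
feat = gabor_features(img)
```

With θ = π/2 the carrier of the kernel varies vertically, so the filter responds strongly to horizontal edges such as those that dominate the defect regions.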

Features learning using an autoencoder
An autoencoder is a feedforward neural network composed of one or more connected hidden layers. It uses a nonlinear mapping between the original data as input and specific learned features as output [24,25]. Several previous works have used autoencoders for feature learning or for dimensionality reduction, since the number of neurons in the hidden layer is smaller than the dimension of the input. In fact, several studies [26,27] have shown that an autoencoder with a nonlinear characteristic is more efficient than linear dimensionality reduction techniques such as Principal Component Analysis (PCA) when the input data have a nonlinear or sparse structure. Overall, this provides significant benefits for the inspection task in terms of computational time and the memory required for storage, and it removes redundant information by maximizing the covariance.
Stacked autoencoders [16,28] are deep neural architectures composed of a succession of autoencoders. Each autoencoder is trained with the output of the hidden layer of the previous autoencoder in the stack. This architecture, which can be seen as a composition of encoders, allows learning complex concepts in a progressive manner. Consequently, the output representations are more relevant when a single autoencoder is not sufficient to capture the interesting structures that maximize the covariance between the components.
In this work, we follow the same strategy: we train an autoencoder with several layers to reduce dimensionality and to extract relevant features under constraints formulated by a cost function. In the training phase, we use a sparse autoencoder whose training criterion involves a sparsity penalty. Given a finite set of N images, represented by vectors x_i with i = 1, …, N, and their corresponding labels y_i ∈ {0, 1}, learning is done by stacking multiple layers, and the cost function E is defined by

E(W, b) = ‖Y − r(X)‖₂² + λ‖W‖₂²,

where r is the activation function, X and Y represent the observations, W and b are the network parameters, and ‖W‖₂² is a regularization term. The Lagrangian parameter λ weights the bound on W against proximity to the input.
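To make the training criterion concrete, the following minimal numpy sketch trains a one-hidden-layer autoencoder by gradient descent on a reconstruction loss with the λ‖W‖₂² penalty; the tied-weight decoder, layer size, and learning rate are simplifying assumptions for illustration only:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_autoencoder(X, n_hidden=4, lam=1e-4, lr=0.1, epochs=1000, seed=0):
    """One-hidden-layer autoencoder with tied weights, minimizing
    ||X - reconstruction||^2 + lam * ||W||^2 by plain gradient descent."""
    rng = np.random.default_rng(seed)
    n_features = X.shape[1]
    W = rng.normal(scale=0.1, size=(n_features, n_hidden))
    b1 = np.zeros(n_hidden)           # encoder bias
    b2 = np.zeros(n_features)         # decoder bias
    for _ in range(epochs):
        H = sigmoid(X @ W + b1)       # hidden (reduced) representation
        Xr = H @ W.T + b2             # linear reconstruction
        err = Xr - X
        dH = (err @ W) * H * (1 - H)  # backprop through the encoder
        gW = X.T @ dH + err.T @ H + 2 * lam * W  # reconstruction grad + penalty
        W -= lr * gW / len(X)
        b1 -= lr * dH.mean(axis=0)
        b2 -= lr * err.mean(axis=0)
    return W, b1, b2

# toy data with low-dimensional (rank-2) structure
rng = np.random.default_rng(1)
X = rng.normal(size=(50, 2)) @ rng.normal(size=(2, 6))
W, b1, b2 = train_autoencoder(X)
X_rec = sigmoid(X @ W + b1) @ W.T + b2
```

The hidden layer has fewer units than the input, so the learned representation is a reduced-dimension encoding, in the spirit of the text above.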
This cost function is a deterministic functional on parametric functions of the form r(X) = WᵀX + b, and it can be minimized using numerical methods to search for the optimal parameters Ŵ and b̂. Focusing on neural networks, a wide variety of models can be learned by varying different factors such as the activation function, the number of hidden units, and the choice of the regularization term. The details of parameter initialization and of the optimization process have been discussed and studied in the literature without a definitive best choice emerging. We consider this question out of the scope of this paper and give the details of our model in Section 3.
One important aspect of autoencoders is dimensionality reduction and the features computed at the hidden layers, which are useful for inspection tasks. If the output of the autoencoder is considered as a deterministic function, one can plug it into a probabilistic model and perform learning and regression within the same framework. This idea is the main motivation of the proposed method, and we will show that it helps improve the classification performance.
The idea of learning relevant features from data automatically as an output of an autoencoder can be illustrated by visualizing the hidden layers. For an autoencoder with several layers, every layer encodes different features. For example, Fig 3 shows (a) the output of the second hidden layer of the autoencoder trained on a set of grayscale images, (b) their corresponding gradients, and (c) gradient after binarization. Note that at this stage, the classifier is not yet defined but we expect the output to highlight relevant features. Even if it is hard to judge which one is better, we can say that the area around contours is further enhanced for different inputs.

Regression using Gaussian processes
We assume that we have N observations (X₁, Y₁), …, (X_N, Y_N). Here X is the observation (e.g. an input image) and Y is its label. To be more specific, we restrict our analysis to the case where X has compact support in a finite-dimensional Euclidean space and Y is the class label, with Y = 0 for non-defective boxes and Y = 1 for defective boxes.
Any regression model is based on building a predictive model: learn a probabilistic model from Y observed at different locations of X that is able to predict y*, the class label, of a new vector x*. Before giving the details of the proposed model, we first review classical regression methods, which consider X as an element of a high-dimensional space. Then we provide a brief introduction to dimensionality reduction techniques. Next, we introduce the features automatically learned by an autoencoder as a parametric transformation of the input. Finally, we give the details of Manifold Regression, the proposed method for jointly learning a regression model and a suitable feature representation ϕ of the data. Though mildly technical, this review is useful, as it focuses on the particular subset of the background relevant to our model developments. We begin by reviewing the logistic regression model initially introduced in [29]. The goal is to find the best-fitting model representing the relationship between the output variable Y ∈ {0, 1} and a set of input variables X = (X₁, …, X_p).

MLE-based regression.
We suppose that we have N mutually independent samples (X₁, Y₁), …, (X_N, Y_N) with the same law as (X, Y). We consider the problem of fitting a logistic model y = σ(xᵀβ), where σ is the sigmoid function, which is equivalent to estimating β from the observed samples. Let (x_i, y_i) be the observed values of the variables (X_i, Y_i) for each observation i ∈ {1, …, N}. We denote by π_β(x_i) the probability that Y_i = 1 for a given X = x_i:

π_β(x_i) = σ(x_iᵀβ) = 1 / (1 + exp(−x_iᵀβ)),

and, modeling Y_i | X = x_i as a Bernoulli distribution, Y_i | X = x_i ∼ B(π_β(x_i)), we write the negative log-likelihood as:

l(β) = −∑_{i=1..N} [ y_i log π_β(x_i) + (1 − y_i) log(1 − π_β(x_i)) ].

The gradient of l at β is then given by:

∇l(β) = Xᵀ(π_β − Y),

where π_β = (π_β(x₁), …, π_β(x_N))ᵀ, X = (x₁, …, x_N)ᵀ and Y = (y₁, …, y_N)ᵀ. Typically, to find the MLE we search for the critical point β̂ of the gradient: ∇l(β̂) = 0. The minimization of l(β) then yields the ordinary MLE β*. If X is of maximal rank, β ↦ l(β) is strictly convex, i.e. β* exists and is unique [29]. We note that finding β* explicitly is not straightforward, and it is consequently very common to use iterative algorithms based on Newton and gradient descent methods.
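The quantities above translate directly into code. The sketch below, on toy data of our own choosing, computes l(β) and its gradient, fits β by plain gradient descent, and checks the gradient formula against a numerical derivative:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def neg_log_likelihood(beta, X, y):
    """l(beta) = -sum_i [y_i log pi(x_i) + (1 - y_i) log(1 - pi(x_i))]."""
    p = sigmoid(X @ beta)
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

def grad_nll(beta, X, y):
    """grad l(beta) = X^T (pi_beta - y)."""
    return X.T @ (sigmoid(X @ beta) - y)

def fit_mle(X, y, lr=0.5, steps=2000):
    """Plain gradient descent on the negative log-likelihood."""
    beta = np.zeros(X.shape[1])
    for _ in range(steps):
        beta -= lr * grad_nll(beta, X, y) / len(y)
    return beta

# toy 1-D problem with an intercept column
x = np.linspace(-2.0, 2.0, 40)
X = np.column_stack([np.ones_like(x), x])
y = (x > 0).astype(float)
beta_hat = fit_mle(X, y)
```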

Penalized MLE-based regression.
Inspired by ridge logistic regression [29], the weighted logistic method is more suitable when the components of X are highly correlated or when there is a large number of explanatory variables. Recall that in both cases X is not of maximal rank, which means that there is no guarantee that β* exists, nor that it is unique. To address this issue, the idea of ridge logistic regression is to add a regularization term. The regularization has the effect of controlling the model and improving performance in the presence of over-fitting. Consequently, we consider a modified version of Eq 5.
The penalized cost is

l_λ(β) = l(β) + λ‖β‖₂²,

where the regularization parameter satisfies 0 < λ < 1. We denote by β_{λ,*} the optimal solution. For a suitable choice of λ, the estimator β_{λ,*} should improve on the unpenalized MLE, i.e. MSE(β_{λ,*}) < MSE(β*). Following the same steps for the modified cost, the gradient of l_λ(β) is:

∇l_λ(β) = ∇l(β) + 2λβ,

with ∇l(β) the gradient of l(β) as detailed in Eq 5. The estimator is then a solution of ∇l_λ(β) = 0, and the Hessian of l_λ(β) is

H_λ(β) = XᵀΛ(β)X + 2λI, with Λ(β) = diag(π_β(x_i)(1 − π_β(x_i))).

As in the previous optimization formulation, we use an iterative approach based on Newton's method. For numerical efficiency, we use an approximation of the Newton-MLE based on the Taylor expansion of ∇l_λ(β₁) at β₀:

∇l_λ(β₁) ≈ ∇l_λ(β₀) + H_λ(β₀)(β₁ − β₀).

In particular, setting ∇l_λ(β₁) = 0 gives the first-order approximation

β₁ = β₀ − H_λ(β₀)⁻¹ ∇l_λ(β₀).

This update is iterated until convergence.
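The penalized Newton iteration can be sketched as follows; the penalty weight and toy data are illustrative choices, not those used in our experiments:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_ridge_logistic(X, y, lam=0.1, steps=25):
    """Newton iterations for the penalized cost l(beta) + lam * ||beta||^2:
    beta_1 = beta_0 - H_lam(beta_0)^{-1} grad l_lam(beta_0)."""
    n, p = X.shape
    beta = np.zeros(p)
    for _ in range(steps):
        pi = sigmoid(X @ beta)
        grad = X.T @ (pi - y) + 2.0 * lam * beta              # grad of l_lam
        D = pi * (1.0 - pi)                                   # Bernoulli variances
        H = (X * D[:, None]).T @ X + 2.0 * lam * np.eye(p)    # Hessian of l_lam
        beta = beta - np.linalg.solve(H, grad)                # Newton update
    return beta

x = np.linspace(-2.0, 2.0, 40)
X = np.column_stack([np.ones_like(x), x])
y = (x > 0).astype(float)
beta_pen = fit_ridge_logistic(X, y)
```

Note that with λ > 0 the cost is strictly convex even when the classes are linearly separable, so the Newton iterates converge to a finite stationary point.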

Gaussian processes classifier
In this section we make the connection with binary classification, as described in the previous section, which forms the foundation of our Gaussian process (GP) model. Gaussian processes are a state-of-the-art non-parametric probabilistic regression method. In order to capture the correlation between the observed Xs, build a probabilistic model, and perform optimal predictions for non-observed data, we study a GP as a distribution over the transformed variables, ϕ ∼ GP(m, C), fully defined by a mean function m (in our case m = 0) and a covariance function C. The popularity of such processes stems primarily from two essential properties. First, a Gaussian process is completely determined by its mean and covariance functions. This facilitates model fitting, as only the first- and second-order moments of the process require specification. Second, solving the prediction problem is straightforward, since the optimal predictor at an unobserved position is a linear function of the observed values.
In the simple case of a real random process, Y(x), x ∈ ℝ, is a Gaussian process if all its finite-dimensional distributions are multivariate normal. That is, for distinct observations x₁, x₂, …, x_n, the random vector (Y₁, Y₂, …, Y_n), with Y_i = Y(x_i), has a multivariate normal distribution with mean vector m = E[Y] and covariance matrix C with C_{i,j} = C(Y_i, Y_j). A Gaussian process is said to be stationary if m is independent of x and the covariance C(Y(x+h), Y(x)) = C(h) < ∞ is independent of x (i.e., the process Y(x) is translation invariant). This condition is usually called second-order, or weak, stationarity. Considering the new formulation, we introduce a new latent variable ϕ. The idea behind GP prediction is to place a GP prior on ϕ, i.e. ϕ ∼ GP(m, C_θ), where m is the mean, equal to zero in our case, and C_θ is a family of covariance functions with parameter θ. There is a wide choice of covariance functions but, as in this study, the Matérn covariance function is a preferred choice due to its smoothness and asymptotic properties [30,31]. We then consider prediction from the observed locations; obviously, we are looking for a prediction method that works well on average. One of the main difficulties is the choice of the covariance function. Although a prediction model may yield an unbiased predictor with a correct predictive variance, this only holds if the choice of the covariance function is optimal. To deal with this issue, a common choice is a parametric covariance function, leading to a search for an optimal parameter θ̂. Many numerical methods have been used to search for θ̂, and the most studied and popular one is the Maximum Likelihood Estimator (MLE). We keep the same notation as in the previous section and, by abuse of notation, write ϕ = (ϕ₁, …, ϕ_N)ᵀ = (ϕ(x₁), …, ϕ(x_N))ᵀ.
Therefore, the goal is to estimate the hyperparameters θ = (α, τ, ν) of the covariance function C_θ by minimizing the negative log-likelihood L_θ:

L_θ = ½ log det C_θ + ½ ϕᵀ C_θ⁻¹ ϕ + (N/2) log 2π,

where C_{θ,i,j} = C_θ(d(ϕ_i, ϕ_j)) is the covariance matrix of ϕ₁, ϕ₂, …, ϕ_N and θ is the full parameter vector, taking values in the parameter space Θ = {τ > 0, α > 0, ν = 1 + k, k ∈ ℕ}. Our goal is then to find the maximum likelihood estimator (MLE) θ̂ of θ. Since there is no analytical solution for Eq 15, we use a Newton-based method to find the MLE. The optimization over ν is not straightforward in this case, and we therefore estimate this value using k-fold cross-validation rather than the MLE. To summarize, the GP predictive distribution at a new observation ϕ* = ϕ(X*) has mean and variance

m(ϕ*) = c*ᵀ C_θ⁻¹ Y,    σ²(ϕ*) = c** − c*ᵀ C_θ⁻¹ c*,

where c** = C(X*, X*) and c* = (C(X*, X₁), …, C(X*, X_N))ᵀ. Given the mean and the variance, we make predictions by computing the conditional expectation.
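The predictive equations can be illustrated with a small numpy sketch for the one-dimensional case, using the ν = 3/2 Matérn covariance (a common closed-form special case); the hyperparameter values and toy data are illustrative:

```python
import numpy as np

def matern32(d, alpha=1.0, tau=1.0):
    """Matern covariance with smoothness nu = 3/2:
    C(d) = alpha * (1 + sqrt(3) d / tau) * exp(-sqrt(3) d / tau)."""
    r = np.sqrt(3.0) * np.asarray(d, dtype=float) / tau
    return alpha * (1.0 + r) * np.exp(-r)

def gp_predict(x_train, y_train, x_new, alpha=1.0, tau=1.0, jitter=1e-8):
    """Predictive mean and variance:
    m* = c*^T C^{-1} y,  v* = c** - c*^T C^{-1} c*."""
    C = matern32(np.abs(x_train[:, None] - x_train[None, :]), alpha, tau)
    C += jitter * np.eye(len(x_train))                    # numerical stability
    c_star = matern32(np.abs(x_new[:, None] - x_train[None, :]), alpha, tau)
    mean = c_star @ np.linalg.solve(C, y_train)
    cov_term = np.einsum("ij,ji->i", c_star, np.linalg.solve(C, c_star.T))
    var = matern32(0.0, alpha, tau) - cov_term
    return mean, var

x_train = np.array([0.0, 1.0, 2.0])
y_train = np.array([0.0, 1.0, 0.0])
mean, var = gp_predict(x_train, y_train, np.array([1.0, 10.0]))
```

As the text notes, the predictor interpolates observed locations with near-zero variance and reverts to the prior (zero mean, prior variance α) far from the data.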

Results
We evaluate the performance and efficiency of the proposed approach on a database of 2042 real images. The database contains 530 images of defective metallic boxes and 1512 images of non-defective ones. First, we look at the ability of our approach to classify defective and non-defective images. The results demonstrate that the autoencoder can successfully learn relevant features from different inputs; combined with the GP classifier, it reaches good performance. Second, we evaluate the proposed method for detecting and localizing defects in images. In both cases, we evaluate the autoencoder combined with the GP classifier, using the Matérn covariance function [32] and the Newton method to search for the maximum likelihood estimator. We train the autoencoder with 75% of the images, i.e. 1531 images: 1134 without defects and 397 with defects. The rest of the images in the dataset are used for testing. In order to remove the test bias, we use 100-fold cross-validation, i.e. we run the method 100 times. At each run, we randomly select the training set (75% of the entire dataset) and use the remaining 25% for testing. We then average the performance over the 100 runs. To evaluate the classification quality of the different models, we consider the False Negative (FN) and False Positive (FP) rates, where:
• The FN rate corresponds to the number of images labeled as non-defective but classified as defective.
• The FP rate corresponds to the number of images labeled as defective but classified as non-defective.
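These two rates can be computed as follows; normalizing each count by its class size is our own assumption, since the text defines the rates through counts:

```python
import numpy as np

def fn_fp_rates(y_true, y_pred):
    """Per-class error rates following the paper's convention
    (defective = 1, non-defective = 0).

    FN rate: labeled non-defective but classified defective.
    FP rate: labeled defective but classified non-defective."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    fn_rate = np.mean(y_pred[y_true == 0] == 1)   # non-defective flagged defective
    fp_rate = np.mean(y_pred[y_true == 1] == 0)   # defective missed
    return fn_rate, fp_rate

# toy example: 3 non-defective and 2 defective images
fn, fp = fn_fp_rates([0, 0, 0, 1, 1], [0, 1, 0, 1, 0])
```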
We compare the proposed method with other state-of-the-art methods using the same experimental protocol. We use different image representations combined with two classifiers: (1) the Matlab implementation of the K-Nearest Neighbor (KNN) classifier, using the Euclidean distance to find the nearest neighbors, and (2) the Matlab implementation of the Support Vector Machine (SVM) classifier.

Image classification
This section presents the performance of the proposed method when applied to classifying images. In our implementation, we used a model consisting of two stacked autoencoders with 50 and 60 hidden units, respectively. The sparsity penalty weight λ is set to 10⁻⁴ for all experiments. The optimal parameters Ŵ and b̂ were obtained by minimizing the cost with an iterative Newton-based method, initialized from a standard normal distribution.
There are three key steps that can affect the performance of any classification method: (i) The representation of the images and the definition of the feature space. (ii) The analysis of these observations in the feature spaces. (iii) The classifiers used to classify the observations. To illustrate their importance in the application context, we use two classifiers: the K-Nearest Neighbor (KNN) and the Support Vector Machines (SVM) classifiers, and six different features to represent each image: 1. Pre-processed image intensity.

Histogram of Oriented Gradients (HOG) computed from (a).
5. Coefficients from the decomposition of (a) into a linear combination of the Haar wavelet basis.
6. Gabor descriptor computed from (a) using two directions, π/2 and 3π/4.
These features are then compared to the method proposed in this article, which can take any input as a feature vector. It then searches for the optimal parameters and features during the training stage, and predicts the correct label for the test data. Experimental results show that the KNN classifier performs worse than SVM for all types of features. Compared to KNN, SVM shows better classification performance but remains below the proposed method. Table 1 summarizes the classification performance of these methods in terms of False Negative (FN) and False Positive (FP) rates.
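For reference, the Haar coefficients of feature (5) can be obtained with a one-level 2-D Haar decomposition, sketched here in numpy; boundary handling and normalization conventions vary, and this version assumes even image dimensions and uses pair averages:

```python
import numpy as np

def haar2d(img):
    """One level of the 2-D Haar decomposition: returns the approximation
    (LL) and detail (LH, HL, HH) sub-bands. Assumes even height and width."""
    a = (img[0::2, :] + img[1::2, :]) / 2.0   # vertical pair averages
    d = (img[0::2, :] - img[1::2, :]) / 2.0   # vertical pair details
    LL = (a[:, 0::2] + a[:, 1::2]) / 2.0      # low-low: coarse approximation
    LH = (a[:, 0::2] - a[:, 1::2]) / 2.0      # horizontal details
    HL = (d[:, 0::2] + d[:, 1::2]) / 2.0      # vertical details
    HH = (d[:, 0::2] - d[:, 1::2]) / 2.0      # diagonal details
    return LL, LH, HL, HH

# a constant image has all its energy in the approximation band
img = np.full((4, 4), 3.0)
LL, LH, HL, HH = haar2d(img)
```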
According to Table 1, the proposed method outperforms the other methods regardless of the features used to represent the images. In particular, we observe that the best accuracies are achieved when the Haar decomposition and/or the HOG descriptors are used as input, either with SVM for classification or with our proposed autoencoder model. The ROC curves of Fig 4 show that the proposed method has the most predictive power and generalization capability, with a value of 0.86, followed by SVM with 0.815 and KNN with 0.794. This indicates that the proposed method succeeds in learning relevant features, reducing dimensionality, and predicting more accurately.

Detection and localization of defects
Finally, we extend the proposed approach to detecting and localizing defects in images. First, to build the training (and testing) data, we asked an expert to manually localize, in each input image, the regions that contain defects. We then divided all the images into patches of size 32 × 32. Patches that contain defects are treated as negative examples, while the remaining patches are treated as positive examples. All the patches were subject to the same preprocessing step as in the previous experiments. We selected HOG descriptors and Haar coefficients to represent an image, since they achieved the best classification accuracy, see Table 1.
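The patch extraction step can be sketched as follows; dropping border regions that do not fill a complete 32 × 32 patch is one plausible choice, as the text does not specify how borders were handled:

```python
import numpy as np

def extract_patches(image, size=32):
    """Split an image into non-overlapping size x size patches, scanning
    row by row; incomplete border patches are dropped."""
    h, w = image.shape[:2]
    return [image[r:r + size, c:c + size]
            for r in range(0, h - size + 1, size)
            for c in range(0, w - size + 1, size)]

# a 64 x 96 image yields a 2 x 3 grid of 32 x 32 patches
patches = extract_patches(np.zeros((64, 96)))
```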
To show the effectiveness of our approach, we use the same database for classifying and detecting defective sub-regions. We also compare our method with the HOG-SVM and Haar-SVM models, and with a pretrained CNN, VGG-16. The VGG 16-layer network is a deep convolutional neural network pretrained on ImageNet, a large image dataset composed of 1000 classes and 1.3M images [33]. We use the MatConvNet implementation [33] of VGG-16, with an additional fine-tuning step, combined with a softmax classifier. Table 2 summarizes the defect detection rates for all methods. These results are obtained using 100-fold cross-validation with the same setup as the previous experiment. From Table 2, we observe that our approach outperforms the other methods. The ROC curves in Fig 5 confirm that both VGG-softmax and our model reach a good accuracy, with a slight advantage for our method.
Finally, Fig 6 shows examples of localized defects on the original test images.

Discussion
The new framework proposed in this article, which enables the inspection and classification of defects in metallic boxes, can be generalized to any application that deals with detecting non-standard subregions. The representation part, i.e. feature extraction, was used to illustrate the ability of our framework to capture discriminant information from the input. Other features from the literature could also be used or adapted for the application at hand. The investigation of the best representation is out of the scope of this work. Nevertheless, the features can be selected independently and then used following the same procedure described in this work. One can also use feature selection [34,35] and similarity learning [36] methods to automatically select the features that achieve the best performance. In addition to the accuracy of the defect detection results, the proposed framework has at least three advantages over the state of the art:
• First, it provides more flexibility than pre-trained CNNs: our method can be adapted to the dimension of the input. In fact, it can accept input of any dimension and choose the output dimension by fixing the size of the different layers. This leads to a reduced dimension, which in turn improves the computational efficiency.
• The input can be a vector of any features or a mixture of them. This can lead to a different network architecture but the overall process remains the same.
• The classifier includes a mapping function that captures the non-linearity of the data, making this framework more general than linear classifiers.
It should be noted that there is still room for further improvement, especially in the prediction part, where we used a Newton-based method to search for the optimal parameters of the Gaussian process. More sophisticated stochastic tools, such as MCMC, may improve the quality of the estimator.
Fig 7 highlights some failure cases (false positives and false negatives) of the proposed method. The figure shows that it is not always straightforward to decide whether a subregion contains a defect or not (see Fig 7, top row). In fact, some defects are very small, and the images were manually classified by a single expert. This means that it can be hard even for an expert to decide visually what constitutes a defect when it is small or has an unusual shape (see Fig 7, bottom row). An extension of this work would be to ask several experts to label the images independently and to learn the uncertainty at the same time as the labels. This would make the framework more complete, but it would require more complex methodologies to handle the hard challenges resulting from such formulations.

Table 2. Performance of the methods for detecting defective patches. Rates are obtained using 100-fold cross validation. Here, we provide the average performance and the standard deviation (Std) over the 100 runs.

Conclusion
We proposed in this article a new machine learning method for detecting and localizing defects in images of metallic boxes. The proposed method is based on (1) an autoencoder that automatically learns features from the input, and (2) a Gaussian process classifier. Different image representations were used as input to the autoencoder, and two of them were selected for detection: the HOG descriptor and the decomposition in a wavelet basis. To show the effectiveness of our approach, we used the same database for classifying and detecting defective subregions. We also compared our method with other state-of-the-art techniques, namely the HOG-SVM and Haar-SVM models and a pretrained CNN, VGG-16. The experimental results demonstrate that the proposed method achieves the best performance: it successfully learns relevant features from different inputs and, combined with the GP classifier, yields accurate classification.