A shallow convolutional neural network for blind image sharpness assessment

Blind image quality assessment can be modeled as feature extraction followed by score prediction. Handcrafting features for an optimal representation of perceptual image quality requires considerable expertise and effort. This paper addresses blind image sharpness assessment by using a shallow convolutional neural network (CNN). The network takes a single feature layer to unearth intrinsic features for image sharpness representation and utilizes a multilayer perceptron (MLP) to rate image quality. Different from traditional methods, the CNN integrates feature extraction and score prediction into one optimization procedure and retrieves features automatically from raw images. Moreover, its prediction performance can be enhanced by replacing the MLP with a general regression neural network (GRNN) or support vector regression (SVR). Experiments on Gaussian blur images from LIVE-II, CSIQ, TID2008 and TID2013 demonstrate that CNN features with SVR achieve the best overall performance, indicating high correlation with human subjective judgment.


Introduction
A picture is worth a thousand words. With the rapid pace of modern life and the massive dissemination of smartphones, digital images have become a major source of information acquisition and distribution. Since an image is prone to various kinds of distortion from its capture to its final display on digital devices, a lot of attention has been paid to the assessment of perceptual image quality [1][2][3][4][5][6][7][8].
Blind image quality assessment (BIQA) mainly consists of two steps: feature extraction (T) and score prediction (f). Before rating an image, T and f should be prepared. The former selects optimal features for image quality representation, while the latter builds the functional relationship between the features and subjective scores. With considerable expertise and effort, a BIQA system can be built. A test image (I) is input to the system and represented with features T(I); the function f then quantifies the features and outputs a numerical score (s) denoting the predicted quality of the test image. The procedure for score prediction can be formulated as follows,

s = f(T(I)).

Blind image sharpness assessment (BISA) is studied in this paper. Among various kinds of distortion, sharpness is commonly degraded by camera out-of-focus, relative target motion and lossy image compression. It is crucial to readability and content understanding. Sharpness is inversely related to blur, which is typically determined by the spread of edges in the spatial domain and, accordingly, the attenuation of high-frequency components. Karam [43] took account of the maximum local variation (MLV) of each pixel and utilized the standard deviation of ranking-weighted MLVs as the sharpness score. Li et al. [44] proposed the sparse representation based image sharpness (SPARISH) model, which utilizes dictionary learning on natural image patches. Gu et al. [45] designed an autoregressive based image sharpness metric (ARISM) via image analysis in the autoregressive parameter space. Li et al. [46] presented a blind image blur evaluation (BIBLE) index which characterizes blur with discrete moments, because noticeable blur affects the moment magnitudes of images.
Deep learning has revolutionized image representation and shed light on utilizing high-level features for BIQA [47,48]. Li et al. [49] adapted the Shearlet transform for spatial feature extraction and employed a deep network for image score regression. Hou and Gao [50] recast BIQA as a classification problem and used a saliency-guided deep framework for feature retrieval. Li et al. [51] took the Prewitt magnitudes of segmented images as the input of a convolutional neural network (CNN). Lv et al. [52] explored the local normalized multi-scale difference-of-Gaussian response as features and designed a deep network for image quality rating. Hou et al. [53] designed a deep learning model trained with a deep belief net and then fine-tuned it for image quality estimation. Yet it is found that some deep learning based methods still require handcrafted features [49][50][51][52] or involve redundant operations [50,52,53].
This paper presents a shallow CNN to address BISA. On the one hand, several studies indicate that image sharpness is generally characterized by the spread of edge structures [35-38, 44,46], and, interestingly, what a CNN learns in its first layer is mainly edges [47,48]. Thus, it is intuitive to design a single-feature-layer CNN for image sharpness estimation. On the other hand, small data sets make deep networks hard to converge and increase the risk of over-fitting, whereas a shallow CNN can be well trained with limited samples [54]. To the best of our knowledge, the most similar work is Kang's CNN [55]. That network utilizes two fully-connected layers and obtains dense features by both maximum and minimum pooling before image scoring. In comparison, our network is much simpler in architecture and more suitable for the analysis of small databases. Besides, our CNN is verified with Gaussian blur images from four popular databases. After features are retrieved for sharpness representation, the prediction performance of the multilayer perceptron (MLP) is compared to both the general regression neural network (GRNN) [56] and support vector regression (SVR) [57]. In the end, the effect of color information on our CNN and the running time are reported.

A shallow CNN
The simplified CNN consists of one feature layer, which is made up of convolutional filtering and average pooling. As shown in Fig 1, a gray-scale image is pre-processed with local contrast normalization. Then, a number of image patches are randomly cropped for feature extraction. At last, the features are fed as input to an MLP for score prediction. By supervised learning, the parameters of the network are updated and fine-tuned with back-propagation.

Feature extraction
Local contrast normalization. It has a decorrelating effect in spatial image analysis: a local non-linear operation removes local mean displacements and normalizes the local variance [25,58]. As in [52,55], the local normalization is formulated as follows,

Ĩ(i, j) = (I(i, j) − μ(i, j)) / (σ(i, j) + C),

where

μ(i, j) = (1 / ((2P + 1)(2Q + 1))) Σ_{p=−P..P} Σ_{q=−Q..Q} I(i + p, j + q),

σ(i, j) = sqrt( (1 / ((2P + 1)(2Q + 1))) Σ_{p=−P..P} Σ_{q=−Q..Q} (I(i + p, j + q) − μ(i, j))² ).

In the equations, I(i, j) is the pixel intensity at (i, j), Ĩ(i, j) is its normalized value, μ(i, j) is the local mean, σ(i, j) is the local standard deviation and C is a positive constant (C = 10). Besides, [2P + 1, 2Q + 1] is the window size and P = Q = 3.
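As a concrete illustration, the normalization can be sketched in a few lines of numpy. This is a naive loop-based version for clarity; replicate padding at the image borders is an assumption, since the paper does not specify border handling:

```python
import numpy as np

def local_contrast_normalize(img, P=3, Q=3, C=10.0):
    """Local contrast normalization over a [2P+1, 2Q+1] window (P = Q = 3, C = 10).

    Replicate padding at the borders is an assumption; the paper does not
    state how borders are handled.
    """
    img = img.astype(np.float64)
    h, w = img.shape
    padded = np.pad(img, ((P, P), (Q, Q)), mode="edge")
    out = np.empty_like(img)
    for i in range(h):
        for j in range(w):
            window = padded[i:i + 2 * P + 1, j:j + 2 * Q + 1]
            # subtract the local mean and divide by the local deviation
            out[i, j] = (img[i, j] - window.mean()) / (window.std() + C)
    return out
```

In practice the same operation is usually vectorized (e.g. with box filters) rather than looped, but the result is identical.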
Feature representation. Each patch randomly cropped from the pre-processed image passes through convolutional filtering and pooling before full connection to the MLP. The feature vector of an image patch is generated and formulated as

X = T(I_p) = [x_1, x_2, …, x_n],

where I_p is an image patch, n is the feature dimension and x_l is the l-th component of the feature vector X.
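The feature path can be sketched as follows: each normalized patch is convolved with K_n kernels, average-pooled, and flattened into the vector X. The random kernels, 'valid' convolution and non-overlapping 2x2 pooling below are illustrative assumptions; in the actual network the kernel values are learned by back-propagation:

```python
import numpy as np

def patch_features(patch, kernels, pool=2):
    """Sketch of the feature layer: convolution with K_n kernels,
    average pooling, then flattening into a feature vector X.
    """
    feats = []
    kh, kw = kernels.shape[1:]
    ph, pw = patch.shape
    for k in kernels:
        # 'valid' 2-D cross-correlation with one kernel
        oh, ow = ph - kh + 1, pw - kw + 1
        resp = np.empty((oh, ow))
        for i in range(oh):
            for j in range(ow):
                resp[i, j] = np.sum(patch[i:i + kh, j:j + kw] * k)
        # non-overlapping average pooling
        th, tw = oh // pool, ow // pool
        pooled = resp[:th * pool, :tw * pool].reshape(th, pool, tw, pool).mean(axis=(1, 3))
        feats.append(pooled.ravel())
    return np.concatenate(feats)  # feature vector X of dimension n
```

With a 16x16 patch, 16 kernels of size 7x7 and 2x2 pooling, X has 16 x 5 x 5 = 400 components.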

Score prediction
Multilayer perceptron (MLP). Fig 2 illustrates an MLP with one hidden layer. The output f(X) with regard to the input feature X can be expressed as

f(X) = f_mlp(w · X + b),

where f_mlp denotes an activation function, while w and b respectively stand for the weight vector and the bias vector.

General regression neural network (GRNN). GRNN is a powerful regression tool based on statistical principles [56]. It takes only a single pass through a set of feature instances and requires no iterative training. GRNN consists of four layers, as shown in the figure. For an input feature vector X, its output f(X) can be described as

f(X) = Σ_i Y_i exp(−D_i² / (2σ²)) / Σ_i exp(−D_i² / (2σ²)), with D_i² = (X − X_i)ᵀ(X − X_i),

where Y_i is the weight between the i-th neuron in the pattern layer and the numerator neuron in the summation layer, and σ is a spread parameter. In GRNN, only σ is tunable, and a larger value leads to a smoother prediction.

Support vector regression (SVR). SVR is effective for numerical prediction in high-dimensional spaces [57,59]. For an input X, the goal of ε-SVR is to find a function f(X) that deviates by at most ε from the subjective score Y for all the training patches. The function is defined by

f(X) = wᵀφ(X) + γ,

where φ(·) is a nonlinear function, w is a weight vector and γ is a bias. The aim is to find w and γ from the training data such that the error is less than the predefined value ε. The radial basis function is used as the kernel, K(X_i, X) = exp(−ρ||X_i − X||²), where ρ is a positive parameter that controls the radius and X_i is a training sample. ρ and ε are determined by using a validation set to trade off the prediction error [60].
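Of the three predictors, GRNN is the simplest to make concrete: prediction reduces to a kernel-weighted average of the training scores. The sketch below assumes Gaussian pattern units, the standard choice in [56]:

```python
import numpy as np

def grnn_predict(X, train_X, train_Y, sigma=0.01):
    """GRNN prediction: a kernel-weighted average of training scores.

    sigma is the spread parameter, the only tunable value in GRNN;
    sigma = 0.01 is the setting selected later in the paper.
    """
    d2 = np.sum((train_X - X) ** 2, axis=1)   # squared distances D_i^2
    w = np.exp(-d2 / (2.0 * sigma ** 2))      # pattern-layer activations
    # summation layer: numerator (weighted scores) over denominator (weights)
    return float(np.dot(w, train_Y) / (np.sum(w) + 1e-12))
```

Because no iterative training is involved, "training" a GRNN amounts to storing the feature/score pairs; all the work happens at prediction time.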

Network training
The CNN is trained end-to-end by supervised learning with stochastic gradient descent. Assume there is a set of features {X_i}, i = 1, …, n, with corresponding scores {Y_i}, i = 1, …, n. Training aims to minimize the loss function L(w, b), the sum of squared errors between the predicted score s_i and the subjective score Y_i:

L(w, b) = Σ_{i=1..n} (s_i − Y_i)².

Using gradient descent, the update from the l-th to the (l + 1)-th iteration of each weight component can be described as

Δw^{(l+1)} = μ Δw^{(l)} − η ∂L/∂w, w^{(l+1)} = w^{(l)} + Δw^{(l+1)},

where μ is the momentum, indicating the contribution of the previous weight update to the current iteration, and η denotes the learning rate.
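The weight update can be sketched as a plain momentum step. The values eta = 0.01 and mu = 0.9 follow the settings reported later in the paper; the rule itself is the standard momentum SGD update:

```python
def momentum_step(w, velocity, grad, eta=0.01, mu=0.9):
    """One gradient step with momentum: the new update combines the
    previous update scaled by the momentum mu with the current
    gradient scaled by the learning rate eta.
    """
    velocity = mu * velocity - eta * grad
    return w + velocity, velocity
```

Applied to any differentiable loss, repeated calls drive the weight toward a minimum while the momentum term smooths the trajectory across iterations.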

Images for performance evaluation
Gaussian blur images are collected from four popular databases. LIVE-II [10] and CSIQ [61] contain 29 and 30 reference images, respectively, each distorted at 5 blur levels and scored with differential mean opinion scores (DMOS). Both TID2008 [62] and TID2013 [63] have 25 reference images and use mean opinion scores (MOS) for scoring. Each reference image in TID2008 and TID2013 is degraded at 4 and 5 blur levels, respectively. The proposed methods are compared with state-of-the-art BISA metrics, including those of [42], MLV [43], SPARISH [44], ARISM [45] and BIBLE [46]. In the end, the running time of the involved algorithms and the effect of color information on our CNN are studied.

Performance criteria
Two criteria are recommended for IQA performance evaluation by the video quality experts group (VQEG, http://www.vqeg.org). The Pearson linear correlation coefficient (PLCC) evaluates the prediction accuracy, while the Spearman rank-order correlation coefficient (SROCC) measures the prediction monotonicity. Values of both criteria range in [0, 1], and a higher value indicates better rating prediction.
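Both criteria are directly available in scipy.stats, so an evaluation helper is a two-liner. This is a generic sketch, not the paper's own evaluation code:

```python
from scipy.stats import pearsonr, spearmanr

def plcc_srocc(predicted, subjective):
    """Return the two VQEG-recommended criteria: PLCC for prediction
    accuracy and SROCC for prediction monotonicity.
    """
    plcc = pearsonr(predicted, subjective)[0]
    srocc = spearmanr(predicted, subjective)[0]
    return plcc, srocc
```

Note that SROCC depends only on rank order, which is why it is insensitive to the nonlinear mapping applied before computing PLCC.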
A nonlinear regression is first applied to map the predicted scores to subjective human ratings using a five-parameter logistic function as follows,

Q(s) = q_1 (1/2 − 1/(1 + exp(q_2 (s − q_3)))) + q_4 s + q_5,

where s and Q(s) are the input score and the mapped score, and the parameters q_i (i = 1, 2, 3, 4, 5) are determined during curve fitting.

Software and platform
Software is run on a Linux system (Ubuntu 14.04) with 8 Intel Xeon(R) CPUs (3.7 GHz), 16 GB DDR RAM and one GPU card (Nvidia 1070). Kang's CNN is implemented by us following the paper [55]. Both CNN models are implemented with Theano 0.8.2 (Python 2.7.6) and are publicly accessible on GitHub for fair comparison (https://github.com/Dakar-share/Plosone-IQA). Other codes are implemented in Matlab. The ten BISA methods are provided by their authors and evaluated without any modification; GRNN uses the function newgrnn and SVR comes from LIBSVM [59].

Parameter tuning
Several parameters are experimentally determined: the patch number per image (P_n), the kernel number (K_n) and the kernel size ([K_x, K_y]) in feature extraction, and the iteration number (N_i) in network training. In addition, the spread parameter (σ) in GRNN and the cost parameter (c) in ε-SVR are also studied. Note that in the network, we set the image patch size to [16, 16], the learning rate η = 0.01, the bias γ = 0.1 and the momentum μ = 0.9; other parameters are set by default.

Parameters in CNN. Fig 5 shows the CNN performance when the iteration number (N_i) varies from 10^3 to 10^4 and the patch number per image (P_n) changes from 10^2 to 10^3. Little change is found after N_i reaches 4000. On the other side, P_n = 400 is a good point to trade off PLCC and SROCC. Therefore, we use N_i = 4000 and P_n = 400 hereafter. Table 1 shows the CNN performance with regard to the kernel number (K_n) and the kernel size ([K_x, K_y]). When K_n = 16, the CNN performs well, while it is unstable when K_n = 32. On the other hand, the prediction performance of the CNN is insensitive to changes of the kernel size [K_x, K_y]. So we set K_n = 16 and K_x = K_y = 7.
Parameters in GRNN and SVR. The spread parameter (σ) in GRNN and the cost parameter (c) in ε-SVR are studied with the learned CNN features. Fig 6 shows PLCC and SROCC values when σ or c changes. The left plot indicates that GRNN performs best when σ = 0.01. The right plot shows that PLCC and SROCC increase as log10(c) increases, while SROCC stays stable for log10(c) > 1. Thus, σ = 0.01 in GRNN and c = 50 in ε-SVR.

Table 3 reports SROCC, where bolded values indicate the best prediction monotonicity. BIBLE [46] shows superiority over the other algorithms based on handcrafted features, followed by SPARISH [44] and ARISM [45]. Kang's CNN [55] achieves the highest SROCC on Gaussian blur images from LIVE-II and TID2013, while it gets the second-lowest SROCC on images from CSIQ among all metrics. On the contrary, the SROCC values of our CNN methods are robust across databases. In particular, CNN features with SVR outperform the other methods on CSIQ and TID2008, and rank second and third on TID2013 and LIVE-II, respectively. Overall, learned CNN features with SVR reach an average SROCC of 0.9310, higher than CNN features with GRNN (0.9283), BIBLE (0.9160) and the other methods.

Time consumption
The time spent on score prediction of image sharpness is shown in Fig 8. Among traditional methods, several algorithms show promise for real-time image sharpness estimation, such as LPC, MLV, SVC and FISH, which require less than 1 s per image. For CNN-based methods, both models take about 0.02 s to rate an image. It should be noted that the major time cost of the CNN models lies in local contrast normalization, which takes about 8 s per image. Moreover, GRNN and SVR require additional prediction time even after the model is well trained. Fortunately, with the help of code optimization and advanced hardware, it is feasible to accelerate these algorithms to satisfy real-time requirements.

Effect of color information
Chroma is an important underlying property of the human visual system [64,65] and is highly correlated with image quality perception [30,44]. The effect of color information on image sharpness estimation is studied with our CNN. The performance of the CNN with gray and color inputs is shown in Fig 9. It is observed that chromatic information enhances the CNN's performance on image sharpness estimation. The improvement in PLCC ranges from 0.013 (LIVE-II) to 0.040 (TID2008), while the improvement in SROCC ranges from 0.014 (CSIQ) to 0.067 (TID2008).

Future work
The proposed shallow CNN methods have achieved state-of-the-art performance on simulated Gaussian blur images from four popular databases. Our future work will integrate handcrafted features and CNN features for improved prediction capacity. Deeper networks will also be considered to learn more representative features of image sharpness. In addition, with public access to the real-life blur image databases BID2011 [37] and CID2013 [66], it will be interesting to extend the proposed algorithm to more general and more practical applications [32,67,68].

Conclusion
A shallow convolutional neural network is proposed to address blind image sharpness assessment. Its retrieved features combined with support vector regression achieve the best overall performance, indicating high correlation with subjective judgment. In addition, incorporating color information benefits image sharpness estimation with the shallow network.