Full-Reference Image Quality Assessment with Linear Combination of Genetically Selected Quality Measures

Information carried by an image can be distorted by the various processing steps introduced by different electronic means of storage and communication. Therefore, the development of algorithms which can automatically assess the quality of an image in a way that is consistent with human evaluation is important. In this paper, an approach to image quality assessment (IQA) is proposed in which the quality of a given image is evaluated jointly by several IQA approaches. First, in order to obtain such joint models, an optimisation problem of IQA measure aggregation is defined, in which a weighted sum of their outputs, i.e., objective scores, is used as the aggregation operator. The weight of each measure is then considered a decision variable in a problem of minimising the root mean square error between the obtained objective scores and subjective scores. Subjective scores reflect the ground truth and involve the evaluation of images by human observers. The optimisation problem is solved using a genetic algorithm, which also selects the measures used in the aggregation. The obtained multimeasures are evaluated on the four largest widely used image benchmarks and compared against state-of-the-art full-reference IQA approaches. The results of the comparison reveal that the proposed approach outperforms the competing measures.


Introduction
Visual information is often subject to many processing steps, e.g., acquisition, enhancement, compression, or transmission. After processing, some information carried by the content of an image can be distorted. Therefore, its quality should be evaluated from the point of view of human perception. There are three categories of image quality assessment (IQA) measures (metrics or models), depending on the availability of a pristine, i.e., distortion-free, image: (1) full-reference, (2) no-reference, and (3) reduced-reference models. In this paper, the full-reference approach is considered, in which a reference image is provided for each distorted image in a benchmark dataset.
Application of the peak signal-to-noise ratio (PSNR) is one of the simplest approaches to IQA. However, the output of PSNR is not well correlated with human evaluation; therefore, this technique often serves as a baseline model for comparison. In [1], Damera-Venkata et al. presented a measure based on an image degradation model.
The rest of this paper is organised as follows. In the section Methods, a formulation of the optimisation problem and the development of the proposed approach are presented. Experimental results with related discussions are covered in the section Results and Discussion. Finally, the last section concludes the paper.

Methods
Since digital processing can alter the appearance of an image, which may lead to different opinions on its quality, many IQA algorithms have been proposed for automatic assessment [25]. In order to compare IQA approaches, dedicated image databases have been developed. They contain reference images, their corresponding distorted images, and ground-truth information obtained from human observers. Information on the perceived quality is reported as mean opinion scores (MOS) or differential mean opinion scores (DMOS).
A desired IQA metric should produce objective scores which are consistent with human ratings (subjective scores). In this work, it is assumed that a joint metric can provide better results, in terms of prediction quality, than any single metric that contributes to the multimeasure.
Let Q be the output of an aggregated decision of n IQA measures, n ∈ ℕ. It can be expressed as:

Q = A(Q_1, Q_2, ..., Q_n),    (1)

where A is an aggregation operator and Q_i denotes the vector of objective scores of the i-th measure. The operator often has the form of a weighted sum [26-28]; therefore, Q can be expressed as follows:

Q = Σ_{i=1}^{n} x_i Q_i,    (2)

where x = [x_1, x_2, ..., x_n], x ∈ ℝ^n, denotes a vector of weights. The vector x contains the decision variables in the optimisation problem of finding an effective fusion of IQA measures. Since many fusions can be proposed, a given x has to be evaluated. For this purpose, one of the indices typically used for the quality evaluation of IQA measures can be applied. In order to measure the consistency of the output of the examined IQA model with human assessment, the following indices of prediction accuracy, monotonicity, and consistency are often considered [29, 30]: the Spearman Rank order Correlation Coefficient (SRCC), the Kendall Rank order Correlation Coefficient (KRCC), the Pearson linear Correlation Coefficient (PCC), and the Root Mean Square Error (RMSE). The evaluation indices are calculated after a nonlinear mapping between the vector of objective scores, Q, and the MOS or differential MOS (DMOS), S, using the following mapping function for the nonlinear regression [30]:

Q_p = β_1 (1/2 - 1/(1 + exp(β_2 (Q - β_3)))) + β_4 Q + β_5,    (3)

where β = [β_1, β_2, ..., β_5] are the parameters of the regression model [29], and Q_p is the mapped equivalent of Q. SRCC is calculated as follows:

SRCC = 1 - (6 Σ_{i=1}^{m} d_i^2) / (m (m^2 - 1)),    (4)

where d_i is the difference between the ranks of the i-th image in Q and S, and m is the total number of images. KRCC, in turn, uses the number of concordant pairs in the dataset, m_c, and the number of discordant pairs, m_d, as illustrated by Eq (5):

KRCC = (m_c - m_d) / (0.5 m (m - 1)).    (5)

PCC is defined as:

PCC = (Q̄_p^T S̄) / (sqrt(Q̄_p^T Q̄_p) sqrt(S̄^T S̄)),    (6)

where Q̄_p and S̄ denote the mean-removed vectors. RMSE is given by Eq (7):

RMSE = sqrt((1/m) Σ_{i=1}^{m} (Q_{p,i} - S_i)^2).    (7)
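The four indices above can be sketched in plain Python (a minimal illustration assuming no tied ranks; the function and variable names are ours, not from the paper):

```python
import math

def ranks(v):
    # 1-based rank positions; assumes no ties for simplicity.
    order = sorted(range(len(v)), key=lambda i: v[i])
    r = [0] * len(v)
    for rank, i in enumerate(order, start=1):
        r[i] = rank
    return r

def srcc(q, s):
    # Spearman: 1 - 6*sum(d_i^2) / (m*(m^2 - 1)), d_i = rank difference.
    m = len(q)
    d = [rq - rs for rq, rs in zip(ranks(q), ranks(s))]
    return 1 - 6 * sum(di * di for di in d) / (m * (m * m - 1))

def krcc(q, s):
    # Kendall: (concordant - discordant) / (m*(m-1)/2).
    m = len(q)
    mc = md = 0
    for i in range(m):
        for j in range(i + 1, m):
            prod = (q[i] - q[j]) * (s[i] - s[j])
            if prod > 0:
                mc += 1
            elif prod < 0:
                md += 1
    return (mc - md) / (m * (m - 1) / 2)

def pcc(q, s):
    # Pearson correlation on mean-removed vectors.
    mq, ms = sum(q) / len(q), sum(s) / len(s)
    qc = [x - mq for x in q]
    sc = [x - ms for x in s]
    num = sum(a * b for a, b in zip(qc, sc))
    den = math.sqrt(sum(a * a for a in qc)) * math.sqrt(sum(b * b for b in sc))
    return num / den

def rmse(q, s):
    # Root mean square error between mapped objective and subjective scores.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(q, s)) / len(q))
```

In practice, library routines with proper tie handling (e.g., those in statistics packages) would be preferred; the sketch only mirrors Eqs (4)-(7).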
Higher SRCC, KRCC, and PCC values are considered better, in contrast to RMSE, for which lower values are better.
One of these performance indices can be used as the objective function in the considered optimisation problem. Preliminary experiments revealed that maximisation of SRCC or KRCC may lead to fusions with unacceptably high RMSE values. RMSE, on the other hand, requires the determination of β. Finally, RMSE was used as the objective function, and the components of β were considered decision variables in addition to the weights of the fused IQA measures:

minimise over x and β: RMSE(F(Q, β), S),    (8)

where F denotes the nonlinear mapping of Eq (3). A linear combination may produce negative weights, which can be unintuitive in terms of the contribution of the IQA measures taking part in the aggregation. Therefore, different combination types were considered: a convex combination, in which the weights are positive and sum to one; an affine combination, which preserves the sum condition; and a conical combination, with positive weights. Preliminary results confirmed that the proposed approach provides the best performance without constraining the weights.
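The objective of Eq (8) can be sketched as follows (an illustrative Python fragment; the mapping follows Eq (3), and the argument names are ours):

```python
import math

def map_scores(q, beta):
    # Five-parameter logistic mapping of Eq (3):
    # Q_p = b1*(1/2 - 1/(1 + exp(b2*(Q - b3)))) + b4*Q + b5
    b1, b2, b3, b4, b5 = beta
    return [b1 * (0.5 - 1.0 / (1.0 + math.exp(b2 * (x - b3)))) + b4 * x + b5
            for x in q]

def objective(x, beta, scores, s):
    # scores: one objective-score vector per fused IQA measure.
    # Q = sum_i x_i * Q_i (Eq (2)); the fitness is RMSE(F(Q, beta), S) (Eq (8)).
    m = len(s)
    q = [sum(xi * qi[j] for xi, qi in zip(x, scores)) for j in range(m)]
    qp = map_scores(q, beta)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(qp, s)) / m)
```

With β_1 = 0, β_4 = 1, and β_5 = 0, the mapping reduces to the identity, which is convenient for sanity checks.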
Finally, the vector x^best_d, where d denotes a dataset, was obtained in the following steps: (1) selection of 20% of the reference images from a given dataset together with their distorted equivalents; (2) evaluation of the images using N = 16 full-reference IQA measures; (3) selection of n ≤ N IQA measures and determination of the weights of the linear combination of their objective scores and of β. The objective scores of the used measures were, where needed, scaled to the [0, 1] range.
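The scaling in step (3) can be done with a simple min-max transformation (a sketch; the paper does not specify the exact scaling formula, so this is an assumption):

```python
def minmax_scale(v):
    # Scale a vector of objective scores to the [0, 1] range.
    lo, hi = min(v), max(v)
    if hi == lo:
        return [0.0 for _ in v]  # degenerate constant vector
    return [(x - lo) / (hi - lo) for x in v]
```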
The optimisation problem was solved using a genetic algorithm (GA) [28, 33], since the number of possible solutions grows exponentially with the number of used IQA measures. The GA uses a population of individuals, where each individual represents a single solution; here, an individual is a real-valued vector whose dimensions correspond to the weights of the IQA measures, x, and the β values. From generation to generation, after applying selection, crossover, and mutation operators, better solutions emerge. The GA was run for 200 generations, with a population of 100 individuals, an elite count equal to 0.05 of the population size, and a crossover fraction of 0.8. Scattered crossover, Gaussian mutation, and stochastic uniform selection rules were used [33]. The parameters of the GA were determined experimentally by observing the convergence of the objective function over the generations. All presented calculations were performed using Matlab software (version 7.14) with the GA Toolbox [34]. After 100 runs, the best solution, x^best_d, was selected. Fig 1 presents a flowchart of the approach, showing the process in which the introduced fusion measure is obtained and its usage for image quality assessment.
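A heavily simplified GA of this kind can be sketched in Python. The sketch keeps the paper's elitism, scattered crossover, and Gaussian mutation, but fixes β to the identity mapping, omits measure selection, and replaces stochastic uniform selection with a simple truncation-style rule; all names and the mutation scale (0.1) are our assumptions:

```python
import math
import random

def fitness(ind, scores, s):
    # Individual = measure weights; fitness = RMSE of the weighted sum vs. S.
    m = len(s)
    q = [sum(w * qi[j] for w, qi in zip(ind, scores)) for j in range(m)]
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(q, s)) / m)

def ga_minimise(scores, s, pop_size=100, generations=200, elite_frac=0.05,
                crossover_frac=0.8, seed=0):
    rng = random.Random(seed)
    dim = len(scores)
    pop = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(pop_size)]
    n_elite = max(1, int(elite_frac * pop_size))
    for _ in range(generations):
        pop.sort(key=lambda ind: fitness(ind, scores, s))
        nxt = [ind[:] for ind in pop[:n_elite]]          # elitism
        while len(nxt) < pop_size:
            p1, p2 = rng.sample(pop[:pop_size // 2], 2)  # truncation-style selection
            if rng.random() < crossover_frac:            # scattered crossover
                child = [p1[i] if rng.random() < 0.5 else p2[i]
                         for i in range(dim)]
            else:
                child = p1[:]
            child = [g + rng.gauss(0, 0.1) for g in child]  # Gaussian mutation
            nxt.append(child)
        pop = nxt
    best = min(pop, key=lambda ind: fitness(ind, scores, s))
    return best, fitness(best, scores, s)
```

On toy data where the subjective scores are an exact linear combination of two "measures", the sketch recovers weights yielding a near-zero RMSE.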
In the experiments, the following four image benchmarks were used: TID2013 [35], TID2008 [36], CSIQ [17], and LIVE [3]. The numbers of reference images, distortions, and subjects for each dataset are shown in Table 1. Each database contains reference images, their corresponding distorted images, and subjective scores. In an offline training process, the proposed approach is obtained using a subset of images from a benchmark dataset. The images are assessed by full-reference IQA measures. Then, the genetic algorithm selects IQA measures and assigns weights to them. The obtained weights for the linear combination of the selected measures are then used in image quality assessment tasks. Finally, four fusion measures, namely Linearly Combined Similarity Measures (LCSIM1-LCSIM4), were obtained (Eqs (9)-(12)), together with their corresponding β components.

Results and Discussion
This section presents an experimental evaluation of the proposed approach in comparison with state-of-the-art techniques, as well as a discussion of the influence of the aggregated IQA measures and β on the resulting fusion models.

Comparative evaluation
For evaluation, the four largest image benchmarks (TID2013, TID2008, CSIQ, and LIVE) and four performance indices (SRCC, KRCC, PCC, and RMSE) were used. Table 2 presents evaluation results for the best ten models and the LCSIMs. The top two models for each criterion are shown in boldface. The table also contains direct and weighted averages of the obtained values; for the weighted average, the number of images in a database is used as its weight. The overall results for RMSE do not take the LIVE dataset into account due to its different score range.
The obtained results show that LCSIM3 clearly outperformed the other measures, yielding the best results on LIVE and CSIQ. It was also the second-best measure on the TID2008 dataset, after LCSIM2. LCSIM1 outperformed the other measures on TID2013. The overall results are biased towards techniques that performed well on TID2013, the largest benchmark, i.e., LCSIM1, VSI, and IFS. Among the measures that took part in the LCSIM1 fusion, VSI and MAD are worth noticing. The good performance of the LCSIM family should be confirmed using statistical significance tests. In order to evaluate the statistical significance of the obtained IQA models, hypothesis tests based on the prediction residuals of each measure after the nonlinear mapping were conducted using a left-tailed F-test [17]. In the test, a smaller residual variance denotes a better prediction. Table 3 presents the results of these tests, where the symbol "1", "0", or "-1" denotes that the IQA fusion measure in the row is statistically better with a confidence greater than 95%, indistinguishable, or worse than the IQA measure in the column. The significance tests confirm the good performance of the developed family of multimeasures. LCSIM3 was significantly better than the other measures on the TID2008, LIVE, and CSIQ databases. Its results on TID2013 were also good. However, since it was developed using subjective scores from a dataset that does not contain many of the distortions present in the TID2013 benchmark, its objective scores were less correlated in this case than those of VSI, FSIMc, or IFS. Consequently, the LCSIM that was obtained on TID2013 (LCSIM1) performed worse than the other measures on the LIVE benchmark. Fig 2 presents the scatter plots for LCSIM3 and the two best-performing IQA models for each benchmark.
It can be seen that, for the databases other than TID2013, the compared models yielded less accurate quality predictions than LCSIM3 for large DMOS values and small MOS values (i.e., in the presence of severe distortions). Fig 3, in turn, contains the absolute values of the difference between subjective scores and objective scores for the five best IQA measures after the nonlinear fitting (Eq (3)). Here, the values were obtained for 50 images from the most popular LIVE dataset. The figure shows how the scores obtained by the IQA measures differ from the expected scores; smaller values are considered better. The running times of the evaluated measures are reported in Table 4. It can be seen that MAD and VIF are the most demanding techniques. Taking into account that the processing time requirements for image quality assessment algorithms are less demanding than those for video quality assessment techniques, the timings obtained on an ordinary 2200 MHz CPU seem acceptable. The LCSIMs aggregate several IQA measures; therefore, their running time will be longer in the case of sequential execution of the used measures, or close to the execution time of the MAD measure in the case of a more memory-consuming parallel implementation. It would be desirable to compare the proposed multimeasures with other related fusion IQA measures. Table 5 contains such a comparative evaluation based on SRCC values. SRCC was used as the basis for comparison since many papers do not report other performance indices. The two best results for a given benchmark dataset are written in boldface; some results were not reported in the referred works and are therefore denoted by "-". IQA measures which were developed using images from the benchmark in the column are excluded from the comparison.
Moreover, the overall results were calculated excluding TID2013, since some measures have not been evaluated on it. Furthermore, in order to provide a fair comparison, the overall results exclude works in which the authors obtained a separate IQA measure for each benchmark without providing a cross-database evaluation, e.g., [18, 19, 21-23] or [37]. Results for approaches that are not dataset-independent are written in italics.
The evaluation results show that LCSIM3 and LCSIM2 outperformed the other approaches which use fusion of IQA measures. Among the other measures, DOG-SSIM and ESIM provided good results on the TID2013 benchmark, and the approach developed by Barri et al. turned out to be the second-best technique on the CSIQ dataset. The outstanding performances of LCSIM3 and LCSIM2 are also confirmed by the overall results. Here, they are followed by ESIM, LCSIM4, LCSIM1, and

Influence of parameters and IQA measures on fusion
The results presented so far confirm the good performance of the obtained IQA fusion measures in comparison with state-of-the-art fusion and single IQA measures. However, it would be desirable to explain why some measures took part in the fusion more often than others. The contribution of the aggregated models also requires some attention, since the linear combination can produce unintuitive negative weights. First, in order to show the contribution of a given measure, SRCC values between objective and subjective scores were obtained for each distortion type. This may explain why some measures were involved in a fusion, and also shows, from the distortion-type point of view, how well the developed LCSIMs perform in comparison with the IQA measures used in the optimisation. Table 6 contains SRCC values of the best ten IQA models and the LCSIMs obtained on the benchmark datasets. The two best IQA measures for each distortion type are written in boldface.
The results for distortion types reveal that VSI, FSIMc, GSM, VIF, IFS, and SFF are among the best single IQA models. They were also often part of the fusion models, as can be seen in Eqs (9)-(12). The LCSIM family was better than, or close to, the best IQA models and showed outstanding performance on the CSIQ dataset. To further investigate why some measures were fused together, SRCC values between the IQA models on the CSIQ dataset were obtained; they are shown in Table 7. This time the sign of the correlation was preserved, since it may suggest why some measures have negative weights in the fusions. Negative correlations can also be seen in Fig 2. Similar pairwise relations between the IQA models were noticed on the other datasets. It can be seen that some measures are less correlated with each other while preserving good correlation with the subjective scores. VIF is the measure least correlated with MSSIM and MAD, and all three of these measures perform well on the CSIQ dataset; their SRCC values are written in boldface in Table 7. The IQA measures in the pairs MAD-VIF and MSSIM-VIF are complementary and thus likely to be fused together.
These findings were confirmed in an experiment in which a predefined number of IQA measures, k ∈ ℕ, could take part in the fusion. Such reduced fusion models are helpful in determining the contribution of each fused measure. In the experiment, k varied from 2 to 5. In order to estimate the influence of an IQA measure on the results obtained by the fusion model, the percentage change of RMSE when the measure is excluded was calculated. Table 8 contains such reduced LCSIMs for the CSIQ dataset, their RMSE values, and the contributions. The table also contains LCSIM3, since it was developed on images from CSIQ.
The results shown in Table 8 confirm that IQA measures which achieve good performance on the CSIQ dataset and are less correlated with each other are likely to be aggregated. In the obtained fusion measures, the weights do not reflect the contribution of the selected IQA measures well, as can be seen in the case of the three-measure (k = 3) fused models, where VIF and MAD, despite lower weights, contributed more than SFF. The sign of a weight depends on the correlation of the measure with the subjective scores (MOS or DMOS), but it can also act as compensation, making the resulting vector of objective scores closer to the vector of subjective scores, since the optimisation utilises the RMSE between them to find better aggregated models.
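A leave-one-out contribution of this kind can be sketched as follows (an illustrative Python fragment; here the dropped measure's weight is simply zeroed without refitting, which is an assumption, as the paper does not state whether the reduced models were re-optimised):

```python
import math

def rmse_of_fusion(weights, scores, s):
    # RMSE between the weighted sum of measure scores and subjective scores.
    m = len(s)
    q = [sum(w * qi[j] for w, qi in zip(weights, scores)) for j in range(m)]
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(q, s)) / m)

def contributions(weights, scores, s):
    # Relative RMSE change (in %) when each measure is dropped in turn.
    base = rmse_of_fusion(weights, scores, s)
    out = []
    for i in range(len(weights)):
        reduced = list(weights)
        reduced[i] = 0.0
        out.append((rmse_of_fusion(reduced, scores, s) - base) / base * 100.0)
    return out
```

A measure whose removal increases the RMSE the most contributes the most to the fusion.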
It is worth noticing that the RMSE results obtained by all measures developed in the experiments with a predefined number of IQA measures are better than the results of state-of-the-art approaches on this dataset (see Table 8). MAD, VIF, and MSSIM contributed the most to the LCSIM measures obtained on the CSIQ dataset. This can also be observed for the remaining LCSIM measures, where the three best-contributing single IQA models are as follows: MAD (19.76%), IFS (16.90%), and PSNR (16.71%) for LCSIM1; VIF (15.87%), MAD (8.31%), and SSIM (4.49%) for LCSIM2; and VIF (38.54%), MAD (33.87%), and GSM (4.22%) for LCSIM4.
The β used in the calculation of RMSE (and PCC) also influenced the results. In order to show this influence, each component of β = [β_1, β_2, ..., β_5] determined in the optimisation for a given LCSIM was varied in the range 0.1 to 20 with a step of 0.1, while the other components remained unchanged. Table 9 presents the minimum, maximum, mean, and standard deviation of the RMSE values for each component, calculated on the benchmark datasets. It can be seen that β_4 has the largest influence on LCSIM1, β_2 on LCSIM2, and β_3 on LCSIM3, while all components are similarly important for LCSIM4.
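Such a sensitivity sweep can be sketched as follows (illustrative Python; the logistic mapping follows Eq (3), and the toy inputs in the usage note are our own):

```python
import math

def map_scores(q, beta):
    # Five-parameter logistic mapping of Eq (3).
    b1, b2, b3, b4, b5 = beta
    return [b1 * (0.5 - 1.0 / (1.0 + math.exp(b2 * (x - b3)))) + b4 * x + b5
            for x in q]

def rmse(q, s):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(q, s)) / len(q))

def beta_sensitivity(q, s, beta, component):
    # Vary one beta component over 0.1..20 (step 0.1), others fixed;
    # return (min, max, mean, std) of the resulting RMSE values.
    values = []
    for step in range(1, 201):
        b = list(beta)
        b[component] = step * 0.1
        values.append(rmse(map_scores(q, b), s))
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    return min(values), max(values), mean, std
```

For example, sweeping β_4 when the optimal mapping is the identity reaches an RMSE minimum of zero at β_4 = 1.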

Conclusions
In this paper, a multimeasure resulting from a fusion of full-reference IQA measures was presented. The fusion was formulated as an optimisation problem that was solved using a genetic algorithm, which was also responsible for the selection of appropriate IQA measures. Evaluation of the proposed approach on the four largest widely used image benchmarks reveals that the LCSIM family of measures performs better than the compared state-of-the-art IQA models in terms of prediction quality, as reflected by SRCC, KRCC, PCC, and RMSE. The contribution of the aggregated IQA measures was also investigated. A further extension of the approach could involve using other IQA measures for fusion; therefore, Matlab source code that allows running the optimisation with any newly developed measure with known objective scores for the used image benchmarks, and evaluating the results, is available for download at http://marosz.kia.prz.edu.pl/LCSIM.html. Another direction of future research would be to develop a fusion measure oriented towards a given type of distortion, or a measure which aggregates full-reference IQA measures with a small memory footprint and short computation time.