Predicting Complexity Perception of Real World Images

The aim of this work is to predict the complexity perception of real world images. We propose a new complexity measure in which different image features, based on spatial, frequency and color properties, are linearly combined. In order to find the optimal set of weighting coefficients we apply Particle Swarm Optimization. The optimal linear combination is the one that best fits the subjective data obtained in an experiment where observers evaluated the complexity of real world scenes on a web-based interface. To test the proposed complexity measure we performed a second experiment on a different database of real world scenes, where the linear combination previously obtained is correlated with the new subjective data. Our complexity measure outperforms not only each single visual feature but also two visual clutter measures frequently used in the literature to predict image complexity. To analyze the generality of our proposal, we also considered two different sets of stimuli composed of real texture images. Tuning the parameters of our measure for this kind of stimuli, we obtained a linear combination that still outperforms the single measures. In conclusion, our measure, properly tuned, can predict the complexity perception of different kinds of images.


Introduction
The study of image complexity perception can be useful in many different domains. Within the human-computer interaction field, Forsythe et al. [1] proposed an automated system to predict perceived complexity and applied it in icon design and usability. Reinecke et al. [2] quantified the visual complexity of website screenshots, formulating a model for the prediction of visual appeal in order to improve the user experience on the web. It has also been deemed useful in computer graphics, where a better understanding of visual complexity can aid the development of more advanced rendering algorithms [3] or image-based 3D reconstruction [4]. Digital watermarking methods can also benefit from an estimation of image complexity, as it has been related to the amount of information that can be hidden in an image [5]. Nowadays it finds application in content-based image retrieval [6], where a model of the joint complexity of images gives distances that can be used to estimate the degree of similarity between images. Complexity perception has also been studied for paintings, where machine learning schemes have been proposed to investigate the relationship between human visual complexity perception and low-level image features [37].
Taking into account the multi-dimensional aspect of complexity, we propose a complexity measure based on a combination of several features related to spatial, frequency and color properties in order to predict the complexity perception of real world images. We here consider two different kinds of real world stimuli: real scenes (RS) and real texture patches (TXT). The aim of our work is to propose a general-purpose metric that, tuned with respect to the kind of stimuli considered, correlates with the subjective data better than single measures. We here propose a linear combination of visual features to predict image complexity perception, where the weighting coefficients can reveal the role of each feature. Starting from a given set of stimuli, we apply Particle Swarm Optimization (PSO) [38,39] to find the weighting coefficients of the linear combination that best fits the subjective data. We set up an experiment where observers evaluated the complexity of real world scenes on a web-based interface and were also asked to verbally describe the criteria that guided their evaluation.
Analyzing the most common criteria reported by the observers in the questionnaire, it is possible to associate some of them with single image features and to compare the frequency of these criteria with the weighting coefficients of the linear combination. To test our proposal we performed a second experiment on a new database of real world scenes, where the linear combination previously obtained is correlated with new subjective data. To verify the usefulness of our complexity measure in predicting the complexity of a different type of stimuli, we performed two more experiments on two different datasets of texture images. We chose texture images because they present a wide range of complexity levels, like real world scenes, but with a different semantic content. We again apply PSO to find the new weighting coefficients of the proposed linear combination on the first of these two texture sets, and we test the obtained measure on the other texture dataset. To the best of our knowledge, no supervised or unsupervised measure to evaluate the complexity perception of real world images has been presented in the literature. Thus, as a benchmark for evaluating our proposal, we consider two measures of visual clutter: FC and SE [27]. We have chosen these two measures as they have been frequently used in the literature, where they have shown correlation with image complexity perception [3,32,40], even though they are not unsupervised measures and were not specifically designed to predict image complexity.

Materials and Methods
In this work we performed four experiments where the task is to evaluate image complexity. Each experiment is characterized by a different set of visual stimuli.

Stimuli
The images used as stimuli are all of high quality, acquired with professional and semi-professional cameras.
In Experiment 1 we used 49 images depicting real world scenes (RS1 dataset), belonging to the personal photo collection of the authors (RSIVL [41]). For Experiment 2 we considered another 49 real world scenes (RS2 dataset). These images correspond to the reference high quality images of the LIVE [42][43][44] (29 images) and the IVL [45,46] (20 images) databases. Images belonging to RS1 and RS2 were chosen to sample different contents, both in terms of low level features (frequencies, colors) and higher level ones (faces, buildings, close-ups, outdoor scenes, landscapes). They include pictures of faces, people, animals, close-up shots, wide-angle shots, nature scenes, man-made objects, images with distinct foreground/background configurations, and images without any specific object of interest. Experiments 3 and 4 consider two different datasets of real texture images, which represent a kind of stimuli with contents significantly different from those of RS1 and RS2.
In Experiment 3, we consider 54 real texture images (TXT1 dataset), belonging to the Vis-Tex data set [47]. This data set consists of 864 images representing 54 classes of natural objects or scenes captured under non-controlled conditions with a variety of devices. From each of the 54 classes, one image has been chosen as representative of the corresponding group. In Experiment 4 we use texture images belonging to the Raw Food Texture database (RawFooT) [48,49]. It includes images of 68 samples of food textures, acquired under 46 lighting conditions. In our work we have used as stimuli 58 texture images acquired under the D65 lighting condition and frontal direction (TXT2 dataset).

Participants
Participants were recruited from the Informatics Department of the University of Milano Bicocca and were either students, researchers or administrative employees. No participants under the age of 18 were involved in our study and no health or medical data was collected from participants.
Through the web interface, informed consent was given by all participants. The data was collected anonymously.
For all experiments, six Ishihara tables were preliminarily presented to the observers to detect color vision deficiencies. Participants who did not correctly report all six tables were discarded from the subject pool.
All the experiments reported in this article were conducted in accordance with the Declaration of Helsinki and the local guidelines of the University of Milano Bicocca (Italy). No ethical approval was required for the present study. All the stimuli and subjective data are available at our web site [50].

Experimental setup
In all four experiments observers were asked to judge images individually presented on a web interface.
Before the start of the experiment, a grayscale chart was shown to allow the observers to calibrate the brightness and the contrast of the monitor. The observers were asked to regulate the contrast of their monitor to distinguish the maximum number of bands and discern details in shadows and in highlights. In Fig 1 we report the web-interface and the contrast chart used in the experiment.
After calibration, the stimuli were shown in random order, different for each subject. Subjects reported their complexity judgment (score) by dragging a slider onto a continuous scale in the range [0-100]. Stimuli were presented for an unlimited time, up to response submission. The position of the slider was automatically reset after each evaluation at the midpoint of the scale.
In order to get the observers accustomed to the experiment, seven practice trials were presented at the beginning of each experiment, with images not included in the dataset. The corresponding data were discarded and not considered for any further analysis. At the end of the experimental session, the observers were asked to verbally describe the characteristics of the stimuli that affected their evaluation of visual complexity.

Subject scoring
Raw subjective scores were first normalized per observer. The raw subjective complexity score r_ij for the i-th subject (i = 1, . . ., S, with S the number of subjects) and the j-th image I_j (j = 1, . . ., N, with N the number of dataset images) was converted into its corresponding Z-score as follows:

z_ij = (r_ij − r̄_i) / σ_i    (1)

where r̄_i and σ_i are the mean and the standard deviation of the complexity scores over all images ranked by the i-th subject.
Data were cleaned using a simple outlier detection algorithm. A score for an image was considered to be an outlier if it fell outside an interval of two standard deviations width about the mean score for that image across all subjects.
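The per-observer normalization, outlier rejection and averaging described above can be sketched as follows. This is an illustrative numpy implementation, not the authors' code; the function name is ours, and the exact width of the rejection interval (the paper's "two standard deviations width") is exposed as a hedged parameter.

```python
import numpy as np

def mean_opinion_scores(raw, n_std=2.0):
    """raw: S x N array of complexity scores (S subjects, N images).
    Returns the per-image mean of outlier-cleaned Z-scores (Eqs 1-2)."""
    # Per-subject Z-score normalization (Eq 1).
    z = (raw - raw.mean(axis=1, keepdims=True)) / raw.std(axis=1, keepdims=True)
    # Outlier rejection: drop scores farther than n_std standard deviations
    # from the per-image mean across subjects (interpretation of the paper's
    # "two standard deviations width" interval is an assumption here).
    mu = z.mean(axis=0, keepdims=True)
    sd = z.std(axis=0, keepdims=True)
    z_clean = np.where(np.abs(z - mu) <= n_std * sd, z, np.nan)
    # Mean score y_j for each image (Eq 2), averaging the surviving scores.
    return np.nanmean(z_clean, axis=0)
```

A score matrix in which every subject ranks the images in the same order yields monotonically increasing mean scores, as expected.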
The remaining Z-scores were then averaged across subjects to yield the mean score y_j for each image j:

y_j = (1 / S_j) Σ_i z_ij    (2)

where S_j is the number of Z-scores retained for image j after outlier removal.

Our proposal of complexity measure
Due to the multi-dimensional aspect of complexity, we here propose a complexity measure based on a linear combination of K different features related to spatial, frequency and color properties. This Linear Combination (LC) can be written as follows:

LC(I_j) = Σ_{k=1..K} a_k M_k(I_j)    (3)

where I_j is the j-th image of the considered dataset (j = 1, . . ., N) and M_k is the measure of the k-th feature. Letting x_j = LC(I_j) and x_kj = M_k(I_j), Eq (3) can be rewritten in compact form as:

x_j = Σ_{k=1..K} a_k x_kj    (4)

The set of optimal parameters {a_k} = A* ∈ ℝ^K of Eq (4) was chosen so as to optimally fit the subjective data, using a population-based stochastic optimization technique called Particle Swarm Optimization (PSO) [38,39].
In PSO, a population of individuals is initialized as random guesses to the problem solution and a communication structure is also defined, assigning neighbors for each individual to interact with. These individuals are candidate solutions. An iterative process to improve these candidate solutions is set in motion. The particles iteratively evaluate the fitness of the candidate solutions and remember the location where they had their best success. The individual's best solution is called the particle best. Each particle makes this information available to its neighbors. They are also able to see where their neighbors have had success. Movements through the search space are guided by these successes. The swarm is typically modeled by particles in multidimensional space that have a position and a velocity. These particles fly through hyperspace and have two essential reasoning capabilities: their memory of their own best position and their knowledge of the global or their neighborhood's best position. Members of a swarm communicate good positions to each other and adjust their own position and velocity based on these good positions.
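As a concrete illustration of the procedure just described, a minimal global-best PSO can be sketched as follows. This is not the authors' implementation; the function name, parameter values and bounds are our own illustrative choices.

```python
import numpy as np

def pso_maximize(fitness, dim, n_particles=30, iters=200,
                 w=0.7, c1=1.5, c2=1.5, bounds=(-1.0, 1.0), seed=0):
    """Minimal global-best PSO sketch.
    fitness: callable mapping a weight vector of shape (dim,) to a scalar."""
    rng = np.random.default_rng(seed)
    lo, hi = bounds
    pos = rng.uniform(lo, hi, (n_particles, dim))   # random candidate solutions
    vel = np.zeros((n_particles, dim))
    pbest = pos.copy()                              # each particle's best position
    pbest_val = np.array([fitness(p) for p in pos])
    g = pbest[np.argmax(pbest_val)].copy()          # swarm (global) best
    g_val = pbest_val.max()
    for _ in range(iters):
        r1 = rng.random((n_particles, dim))
        r2 = rng.random((n_particles, dim))
        # inertia + memory of own best + attraction toward the swarm's best
        vel = w * vel + c1 * r1 * (pbest - pos) + c2 * r2 * (g - pos)
        pos = np.clip(pos + vel, lo, hi)
        vals = np.array([fitness(p) for p in pos])
        improved = vals > pbest_val
        pbest[improved] = pos[improved]
        pbest_val[improved] = vals[improved]
        if pbest_val.max() > g_val:
            g_val = pbest_val.max()
            g = pbest[np.argmax(pbest_val)].copy()
    return g, g_val
```

On a smooth toy objective such as a negated quadratic, the swarm converges to the known maximizer, which is how one can sanity-check the loop before plugging in the PCC fitness.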
Recalling that one of the criteria widely used to evaluate the performance of a measure to fit subjective data is the linear Pearson Correlation Coefficient (PCC), we have chosen it as the fitness function to be maximized. To take into account the non linear mapping between objective and subjective data, the complexity measure x j is previously transformed using a logistic function f [51].
The fitness function is thus:

PCC(A) = Σ_j (f(x_j) − f̄)(y_j − ȳ) / √( Σ_j (f(x_j) − f̄)² · Σ_j (y_j − ȳ)² )    (5)

where A is a feasible solution, f(x_j) is the logistically transformed value of the combined objective measure LC for the j-th image, and f̄ and ȳ are the means of the respective data sets. The optimal parameter values A* are thus obtained as:

A* = arg max_{A ∈ ℝ^K} PCC(A)    (6)

Note that our fitness function introduces a simple form of regularization of the searched model, which mitigates possible overfitting. In fact, by optimizing the PCC defined in Eq (5) we look for the solution that minimizes the squared errors between the subjective data and a monotonic curve described by the logistic function.
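The fitness evaluation for a candidate weight vector A can be sketched as below. Note that in the paper the logistic parameters are fitted to the data [51]; here they are fixed illustrative placeholders, and the function names are ours.

```python
import numpy as np

def pearson(a, b):
    """Linear Pearson Correlation Coefficient (Eq 5)."""
    a = a - a.mean()
    b = b - b.mean()
    return float(a @ b / np.sqrt((a @ a) * (b @ b)))

def logistic(x, b1=1.0, b2=0.0, b3=1.0, b4=0.0):
    """Monotone logistic mapping f applied before computing the PCC.
    In the paper its parameters are fitted to the data; these defaults
    are illustrative placeholders."""
    return b4 + b3 / (1.0 + np.exp(-b1 * (x - b2)))

def fitness(A, M, y):
    """A: (K,) weights; M: (K, N) normalized single measures, one row per
    feature; y: (N,) mean subjective scores."""
    x = A @ M                 # linear combination LC(I_j) (Eq 4)
    return pearson(logistic(x), y)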
To benchmark our proposal we consider two clutter measures developed by Rosenholtz et al. [27]. Both of them are defined as a combination of different image features. They were not specifically designed to predict image complexity but have been frequently used in the literature as they show good correlation with complexity perception [3,32]. The MATLAB implementation provided by the authors has been used:
• Feature Congestion (FC): three clutter maps of the image, representing color, texture and orientation congestion, are evaluated across scales and properly combined to get a single measure.
• Subband Entropy (SE): it is related to the number of bits required for subband image coding. After decomposing the luminance and the chrominance channels into wavelet subbands, the entropy is computed within each band and a weighted sum of these entropies is proposed as the clutter measure.

Objective measures
In what follows we list and briefly describe the 11 measures used in the linear combination of Eq (3). These measures evaluate simple visual features and were chosen because they have already been used as complexity measures in the literature or can be related to complexity. The first six are computed on grayscale images and do not take color information into account. M 1 to M 4 measure properties of the Grey Level Co-occurrence Matrix (GLCM), one of the earliest techniques used for image texture analysis and classification [52]; the MATLAB function graycoprops was used to compute them.
M 5 : Frequency Factor, the ratio between the frequency corresponding to 99% of the image energy and the Nyquist frequency [53].
M 6 : Edge Density, obtained by applying the Canny edge detector to the grayscale image with the parameters indicated by Rosenholtz et al. [27].
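For illustration, the four GLCM properties returned by MATLAB's graycoprops (contrast, correlation, energy, homogeneity) can be reproduced with a minimal numpy sketch. This single-offset, symmetric, normalized version is our own simplification, not the exact configuration used for M 1 to M 4 in the paper.

```python
import numpy as np

def glcm_features(img, levels=8, dx=1, dy=0):
    """Minimal GLCM sketch: quantize to `levels` gray levels, accumulate
    co-occurrences for one (dx, dy) offset, symmetrize and normalize."""
    q = (img.astype(float) / img.max() * (levels - 1)).astype(int)
    P = np.zeros((levels, levels))
    h, w = q.shape
    for i in range(h - dy):
        for j in range(w - dx):
            P[q[i, j], q[i + dy, j + dx]] += 1
    P = P + P.T                      # symmetric co-occurrences
    P /= P.sum()                     # joint probability matrix
    i, j = np.indices(P.shape)
    mu_i = (i * P).sum()
    mu_j = (j * P).sum()
    s_i = np.sqrt(((i - mu_i) ** 2 * P).sum())
    s_j = np.sqrt(((j - mu_j) ** 2 * P).sum())
    return {
        "contrast": ((i - j) ** 2 * P).sum(),
        "correlation": ((i - mu_i) * (j - mu_j) * P).sum() / (s_i * s_j),
        "energy": (P ** 2).sum(),
        "homogeneity": (P / (1.0 + np.abs(i - j))).sum(),
    }
```

A binary checkerboard is a handy test case: every horizontal neighbor pair joins the two extreme gray levels, so contrast is maximal and correlation is exactly −1.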
The following two measures, M 7 and M 8 , describe image features that take color information into account when present:
M 7 : Compression Ratio, evaluated as the ratio between the size of the image JPEG-compressed with quality factor Q = 100 and the size of the uncompressed image [54].
M 8 : Number of Regions, calculated using the mean shift algorithm [55].
Finally, measures M 9 to M 11 mainly evaluate color image properties:
M 9 : Colorfulness, a linear combination of the mean and standard deviation of the pixel cloud in the color plane [56].
M 10 : Number of Colors, the number of distinct colors in the RGB image, as described in [57].
M 11 : Color Harmony, which quantifies how harmonious the combination of colors in the image is.
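The color-based measures can be sketched as follows. The distinct-triplet count is a naive stand-in for M 10 , and the colorfulness formula follows the well-known Hasler and Süsstrunk formulation, which we assume corresponds to [56]; neither is the authors' exact implementation.

```python
import numpy as np

def number_of_colors(rgb):
    """Count distinct RGB triplets (a simple stand-in for M 10)."""
    return len(np.unique(rgb.reshape(-1, 3), axis=0))

def colorfulness(rgb):
    """Hasler-Suesstrunk style colorfulness: a linear combination of the
    mean and standard deviation of the pixel cloud in the rg/yb plane."""
    r, g, b = (rgb[..., k].astype(float) for k in range(3))
    rg = r - g                      # red-green opponent axis
    yb = 0.5 * (r + g) - b          # yellow-blue opponent axis
    sigma = np.sqrt(rg.std() ** 2 + yb.std() ** 2)
    mu = np.sqrt(rg.mean() ** 2 + yb.mean() ** 2)
    return sigma + 0.3 * mu
```

As a quick check, any uniformly gray image has zero colorfulness and a single distinct color.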

Experiment 1
In Experiment 1 we use the 49 images of real world scenes belonging to the RS1 dataset. The collected subjective data are processed to obtain the mean scores (see Eq (2)). The RS1 images, ordered with respect to increasing mean scores, are reported in Fig 2. The image on the top left corresponds to the minimum mean score, while the image on the bottom right is the one with the highest score. We then use the mean scores of Experiment 1 in the PSO optimization to set the optimal parameters A* = {a k } of our complexity measure (Eq (3)); the resulting weighting coefficients are reported in Table 1. We call the linear combination obtained using these parameters LC RS1 .
Since the single measures used to obtain the linear combination LC RS1 have been previously normalized, from Table 1 we can infer the role of each of them when predicting image complexity. The highest contribution to the linear combination comes from M 8 (Number of Regions), followed by M 5 (Frequency Factor) and M 10 (Number of Colors), while measure M 11 (Harmony) is the one with the lowest weight. The sign of the coefficients mainly depends on two different aspects. The first one is related to how each single measure correlates with the subjective evaluations. In Fig 3 the scatter plots between mean scores and each of the 11 single measures are shown, together with the monotonic functions that best fit the data. Some of the measures show a monotonically increasing correlation, while others show a monotonically decreasing one. The second aspect is related to the partial correlation between some features. A negative sign in the linear combination can also reflect the attempt of the PSO algorithm to reduce redundancy.
In Fig 4 the scatter plot between subjective (mean scores) and objective (LC RS1 ) data of the RS1 images is reported. The monotone function that best fits the data is also shown with a continuous line. To benchmark our proposal, we also plot in the same Figure the scatter plots between mean scores and FC and SE respectively. To quantify how well these complexity measures correlate with the subjective data, we show in Table 2 (first row) the corresponding PCCs. The p-values are all p < 0.001. From the comparison, we observe that LC RS1 outperforms both FC and SE.
To better analyze the results, we report in Table 3 (first row) the correlation performance, expressed in terms of PCCs, of each single measure. The p-values are all p < 0.001 except for M 9 and M 11 where p < 0.1. The results show that our proposal outperforms each single metric and confirm our initial hypothesis that a pool of measures can better predict image complexity perception.
The verbal descriptions recorded during Experiment 1 were mapped into a list of criteria that aggregates concepts with the same meaning. We summarize in Table 4 the most common criteria used, in terms of their frequency with respect to the observers. We underline that each observer could have used more than one criterion. The quantity of objects, details and colors are the criteria that seem to dominate complexity perception in Experiment 1. Moreover, the most frequent criterion of the verbal descriptions, quantity of objects, is in accordance with the highest coefficient a 8 obtained with the PSO.

Experiment 2
Experiment 2 is used to test the linear combination LC RS1 . The 49 images here considered (RS2 dataset) depict real world scenes. As in Experiment 1, the subjective data are processed to obtain the mean scores. The RS2 images, ordered with respect to increasing mean scores, are reported in Fig 5. The image on the top left corresponds to the minimum mean score, while the image on the bottom right is the one with the highest score.
The scatter plot between mean scores and our complexity measure (LC RS1 ) is reported in Fig 6. The monotone function that best fits the data is also shown with a continuous line. The scatter plots between mean scores and FC and SE respectively are also included in the Figure for comparison. In Table 2 (second row) the corresponding PCCs are reported; the p-values are all p < 0.001. Table 5 summarizes the most common criteria collected during Experiment 2. From the analysis of the table we confirm the results of Experiment 1. As in the first Experiment, the most relevant criteria used during the complexity assessment are quantity of objects, details and colors. They are also adopted with similar frequencies by the observers. Two more criteria are also used: familiarity and texture.

Experiment 3
In Experiment 3 we investigate how the complexity measure LC RS1 behaves when a different kind of stimuli is used. We thus consider as stimuli the 54 texture images of the TXT1 dataset. Mean scores are obtained as in Experiments 1 and 2. In Fig 8 the stimuli are shown in increasing order of complexity, according to the mean scores. The image on the top left corresponds to the minimum mean score, while the image on the bottom right has the highest one. We then correlate the linear combination LC RS1 with the subjective data collected for the texture stimuli and we find that it does not perform well: PCC = 0.36 with p = 0.001. For comparison, in Table 2 (third row) we also report the PCCs corresponding to FC and SE applied to the TXT1 dataset. The p-values are all p < 0.001. We observe that also for FC and SE the correlation performance decreases with respect to the case of the real world scene datasets.
Given the different kind of stimuli here used (texture vs real scenes), we propose to tune the weighting coefficients of the linear combination on this new set of stimuli. As before, we performed 1000 runs of the PSO to obtain the new set of parameters. Within the 1000 runs, the average PCC is 0.79 with a standard deviation of 0.02 (minimum PCC = 0.67, maximum PCC = 0.83). In Table 6 we report the 11 weighting coefficients averaged over the 1000 runs. We call the linear combination obtained using these coefficients LC TXT1 .
Comparing Tables 1 and 6, we observe that the linear combinations LC RS1 and LC TXT1 reflect the different nature of the stimuli (RS versus TXT) in the sign and absolute values of the coefficients. The highest contribution to the linear combination in LC TXT1 comes from M 5 (Frequency Factor), followed by M 10 (Number of Colors) and M 6 (Edge Density). The correlation coefficient PCC between LC TXT1 and the mean scores is now increased and is equal to 0.81. The scatter plots between the mean scores of Experiment 3 and LC TXT1 , FC and SE, as well as the best fits of the data, are shown in Fig 9. In Table 7 we summarize the results, reporting the PCCs of LC TXT1 , LC RS1 , FC, and SE applied to the TXT1 dataset. The p-values are all p < 0.001. We observe that the linear combination tuned on the texture set of stimuli outperforms all the others. We also correlate the subjective data collected for the TXT1 dataset with the 11 single measures, reporting their performance in Table 3 (third row). We observe that in general the single measures do not perform very well. Moreover, for measures M 2 (Correlation) and M 11 (Color Harmony) we were not able to find a significant correlation. Only three of them show a PCC greater than or equal to 0.5 with p-values p < 0.001. These three measures are: M 3 (Energy), M 6 (Edge Density) and M 7 (Compression Ratio). For the remaining measures the p-values are p < 0.01.
We summarize in Table 8 the verbal descriptions of the observers. With this kind of stimuli the most frequent criteria adopted are regularity, understandability, quantity of details and familiarity, in agreement with the results of Guo et al. [34].

Experiment 4
Experiment 4 is used to further test the linear combination LC TXT1 . The 58 images of the TXT2 dataset used in Experiment 4 are reported in increasing order of complexity in Fig 10. The subjective scores of this experiment are used to test the linear combination LC TXT1 . Its correlation performance is shown in Table 9, where it is compared with the performance on the training set TXT1. Our results confirm that also in the case of texture stimuli the proposed linear combination LC TXT1 outperforms, on the test set, all the single measures considered (the 11 measures, FC and SE) as well as LC RS1 , as can be seen from Tables 2 and 3.

Discussions
From our investigation two aspects of image complexity can be underlined. Many different perceptual properties are involved in image complexity evaluation and their relative influence  depends on the type of stimuli. These considerations are supported by both our computational proposal and the analysis of the verbal descriptions.
Analyzing the subjective results of all four experiments, we can try to extract some general considerations about image complexity perception. We separate the following analysis with respect to the different kinds of stimuli used. In the case of real world scenes, from Figs 2 and 5 we observe that images with few objects and close-ups are judged as the least complex, while buildings and streetscapes mainly belong to the most complex ones. These results are in agreement with those obtained by Purchase et al. [31], who addressed image complexity within the field of web interface design.
Analyzing the verbal descriptions reported by the observers while evaluating image complexity (Tables 4 and 5), we note that quantity of objects, details and colors are the criteria that seem to dominate complexity perception in both Experiments 1 and 2. In Experiment 2, two further criteria are also used: familiarity and texture. We observe that many of these verbal descriptions agree with the different definitions of image complexity found in the literature and reported above (see Introduction): Snodgrass et al. [17] refer to visual complexity as the amount of detail in an image; Heaps and Handel [18] define complexity as the degree of difficulty in providing a verbal description of an image (understandability); Forsythe [20] argues that image complexity should be considered in relation to familiarity. We also note that similar criteria have been found in the study by Oliva et al. [30], who used indoor scenes as stimuli. In fact, the authors reported that the criteria corresponding to variety and quantity of objects and colors dominated the representation of complexity, followed by concepts like clutter, symmetry, open space, organization and contrast.
Trying to associate these verbal descriptions with the single objective measures, we can associate the criteria quantity of objects, details and colors with M 8 (Number of Regions), M 10 (Number of Colors), and M 6 (Edge Density) respectively. The descriptions order and regularity can be put in correspondence with the visual clutter measures FC and SE. While quantity of objects, details and colors and order and regularity can be associated with bottom-up cognitive mechanisms, understandability and familiarity, which also play an important role, are clearly related to top-down processes, and none of the considered measures alone is able to capture these concepts. Moreover, several observers reported both types of criteria (bottom-up and top-down), confirming that bottom-up and top-down mechanisms interact in perception.
Regarding the texture stimuli, from Figs 8 and 9 we can notice that images with regular pattern and symmetries have been judged as less complex, while images with more details and less ordered structures have been judged as more complex. These findings are in accordance with those obtained by Heaps and Handel [36]. They ranked the complexity of 24 texture images, printed in grayscale and belonging to the same VisTex database as ours and they observed that "textures with repetitive and uniform oriented patterns were judged less complex than disorganized patterns".
The order of importance of the verbal descriptions of Experiment 3 has changed with respect to the corresponding ones of Experiments 1 and 2 (see Tables 4, 5 and 8). For texture images regularity is the most relevant criterion (reported by 60% of the observers), followed by understandability (47%). Instead, for real world scenes these two criteria are among the least used (19% and 9% respectively in Experiment 1, and 19% and 22% in Experiment 2). These results are in accordance with those obtained by Yin et al. [25], who used sample images from Brodatz's album [59]. They found that regularity, understandability, roughness, directionality, and density are the main characteristics that affect the visual complexity perception of texture images.
The differences between the complexity perception of real scenes and texture patches are mainly related to the different image content and are reflected by the different order of importance and frequency of the criteria reported in the verbal descriptions. In particular, real scene images are more easily understandable than texture images. Analyzing the verbal descriptions reported by the observers, we note that understandability is present in all the experiments. However, in the case of texture images it was more frequent, suggesting that observers probably pay more attention to this aspect while evaluating texture complexity. Instead, in the case of real world scenes understandability was used less frequently, as real scenes are probably intrinsically more understandable.
Comparing the performance of the single features, in terms of PCCs (see Table 3), we observe that in general those obtained for the real world datasets (RS1 and RS2) are higher than the corresponding ones obtained for texture images. Following a similar trend, our linear combination trained on real world scenes (LC RS1 ) shows significantly lower performance when applied to texture images (see Table 2). However, we have demonstrated that if the parameters of the linear combination are optimized with respect to this new dataset (TXT1), a significant improvement is reached. The final performance is thus comparable with that previously obtained on real world images.
Chikhman et al. [29] likewise concluded that for different types of images, different measures of complexity may be required. In fact, they found that for outline objects the best predictor of their experimental data was the number of turns in an image, while for the hieroglyph set of stimuli the best correlation was given by the product of the squared spatial frequency median and the image area. Focusing on streetscape images, Cavalcante et al. [32] proposed to combine contrast and spatial frequency into a single objective measure. Within the same kind of streetscape images they found that their proposal is more effective and robust for nighttime scenes than for daytime ones, and thus they concluded that "objective measures based on reduced sets of low-level image characteristics are unlikely to be satisfactory for all possible streetscapes".
To assess complexity of painting images, Guo et al. [37] proposed to combine several visual features in a machine learning approach to classify images into three groups of high, medium and low complexity. In this way the authors were able to achieve a high classification performance. Moreover, studying the role of the single feature, they found that features related to hue and local color variations mostly conditioned the classification results in the specific case of painting images.
In our study we have considered real world images that cover several contents, including textures. We have considered eleven visual features, but our computational proposal can be extended either using different features or increasing the number of them, to cope with other different kind of stimuli.
Finally we remark that in our work we have considered databases of high quality images. An interesting aspect that should be investigated in the future is the interference between image quality and complexity perception: how do different kind of distortions influence image complexity perception? How does image complexity influence quality perception? The results of this analysis in fact could give an important insight in interpreting how signals and artifacts mutually interfere while evaluating image complexity.

Conclusions
Real world images show a high variability in depicted content. When observers are asked to assess image complexity, it emerges that complexity perception is guided by different criteria. Some of them are related to visual features representing bottom-up processes, while others are more related to top-down ones. Moreover, several criteria are adopted simultaneously by each observer, showing the multidimensional aspect of complexity. Thus, in this work we have proposed a complexity measure that linearly combines several visual features. When applied to real world scenes and textures, this measure is able to predict complexity perception, outperforming each single feature. The weighting coefficients of the linear combination depend on the kind of stimuli, and the relative contribution of the measures mostly reflects the criteria used by the observers. As future work, to define a more general complexity measure, we plan to mix images belonging to real scene and texture databases, on which to conduct new psycho-physical experiments. Furthermore, we plan to investigate non-linear models to combine the single measures, such as Support Vector Machines or Genetic Programming, to take into account different modalities of interaction among visual properties.