Abstract
Image aesthetics assessment (IAA) has become a hot research area in recent years due to its extensive application potential. However, existing IAA methods often overlook the importance of spatial information in evaluating image aesthetics. To address this limitation, this study proposes a novel method called the Deep Convolutional Capsule Network (DCCN), which integrates an improved Inception module with a capsule routing mechanism to enhance the representation of spatial features—an essential yet frequently underexplored aspect in aesthetic evaluation. This design enables the model to effectively extract both global and local aesthetic features while maintaining spatial relationships. To the best of our knowledge, this is the first attempt to apply capsule networks in the IAA domain. Experiments conducted on two benchmark datasets, CUHK-PQ and AVA, demonstrate the effectiveness of the proposed method. The DCCN achieves a classification accuracy of 94.79% on CUHK-PQ, and on AVA, it obtains a Pearson Linear Correlation Coefficient (PLCC) of 0.8408 and a Spearman Rank-Ordered Correlation Coefficient (SROCC) of 0.7394. While the DCCN shows promising results, it exhibits sensitivity to style variations and resolution changes and has relatively high inference complexity due to dynamic routing, which may affect deployment in real-time applications.
Citation: Hu Y, Dong W, Zhang Y, Lu L (2025) Image aesthetic quality assessment: A method based on deep convolutional capsule network. PLoS One 20(9): e0331897. https://doi.org/10.1371/journal.pone.0331897
Editor: Matteo Bodini, Universita degli Studi di Milano, ITALY
Received: January 30, 2025; Accepted: August 22, 2025; Published: September 22, 2025
Copyright: © 2025 Hu et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: All datasets and code used in this study are publicly available. The CUHK-PQ dataset and the source code of our work are deposited on Zenodo: https://zenodo.org/records/16990506. The AVA dataset can be downloaded from GitHub: https://github.com/imfing/ava_downloader.
Funding: This work was supported in part by the Research Project of Beijing Institute of Graphic Communication (Grant no. E6202405) awarded to WD, the Disciplinary Construction and Postgraduate Education Project of Beijing Institute of Graphic Communication (Grant no. 21090525014, 21090325003) awarded to WD and LL, the Doctoral Degree Programs Cultivation Project of the First-level Discipline of Information and Communication Engineering of Beijing Institute of Graphic Communication (Grant no. 21090525004) awarded to WD and LL, and the Research Platform Construction Project of Beijing Institute of Graphic Communication (Grant no. KYCPT202509) awarded to LL. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Competing interests: The authors have declared that no competing interests exist.
Introduction
With the rapid development of Internet technology and the widespread use of smartphones and digital cameras, capturing and sharing images has become remarkably convenient, resulting in an explosive growth of digital images. However, the aesthetic quality of these images varies significantly. Consequently, selecting high-aesthetic-quality images from massive image collections has emerged as a crucial research topic in the fields of image processing and computer vision [1]. Image Aesthetic Assessment (IAA) seeks to computationally model human aesthetic preferences by predicting overall aesthetic scores and attribute-based ratings, reflecting the inherently subjective nature of visual aesthetics [2]. Fig 1 presents representative examples of images with high and low aesthetic quality. On social media platforms, high-aesthetic-quality images are more likely to attract user attention and receive more likes and shares. In industries such as advertising and media publishing, such images can significantly enhance user experience and optimize advertising performance. In the field of artistic design, aesthetically pleasing images can improve the visual appeal and artistic value of creative works. For individual users, posting high-aesthetic-quality images helps establish a positive personal image and increases influence and recognition within social networks. Therefore, in-depth research on image aesthetic quality assessment is of great theoretical significance and holds broad application prospects. It can be applied to improve content quality, optimize recommendation systems, and enhance users' photographic and editing skills [3].
(a) High aesthetic quality images. (b) Low aesthetic quality images.
Aesthetic quality is fundamentally different from general image quality. Traditional image quality assessment (IQA) focuses on detecting degradations such as noise, blur, or compression artifacts, aiming to measure objective visual fidelity [4]. In contrast, IAA deals with highly subjective perceptions, encompassing emotional responses, compositional preferences, and stylistic appreciation [5]. While IQA emphasizes technical correctness, IAA emphasizes visual appeal and personal preference. Therefore, aesthetic assessment presents more nuanced challenges and requires models to capture complex, abstract, and subjective features beyond simple distortions.
In recent years, some research has explored aesthetic assessment from multiple perspectives. For example, Li et al. [6] proposed an attribute-assisted multimodal memory network to integrate visual and textual cues for aesthetic prediction. Soydaner and Wagemans [2] presented a multi-task convolutional neural network that simultaneously predicts overall aesthetic scores and aesthetic attributes. Pan et al. [7] designed an adversarial learning framework to enhance attribute representation for aesthetic assessment. These works highlight the importance of attributes, semantics, and multi-task learning in aesthetic modeling. Although our method focuses on overall aesthetic score prediction, it provides a flexible foundation and can be further extended to attribute-aware and multimodal settings in future research.
In previous studies, researchers have employed handcrafted methods to extract aesthetic features from images. This category of image aesthetic quality assessment algorithms focuses on extracting predefined features that are constructed based on human visual perception and aesthetic theory. The core idea of handcrafted-feature-based methods is to evaluate image aesthetic quality using visual elements such as composition, color, and texture. For instance, Datta et al. utilized 56 features including texture, color, rule of thirds, and region contrast to represent image aesthetics [8]; Ke et al. employed image clarity, contrast, color, and average brightness as aesthetic features [9]; Tong et al. used three features, contrast, vividness, and saliency, to distinguish aesthetic quality [10]; Marchesotti et al. adopted Scale-Invariant Feature Transform (SIFT) and content-based local image descriptors to classify aesthetic quality [11]. Handcrafted-feature-based approaches are generally built upon explicit aesthetic theories, providing relatively strong interpretability for the assessment results. However, these methods suffer from two main limitations. First, handcrafted feature extraction heavily relies on specific aesthetic theories, requiring researchers to possess substantial domain knowledge. Moreover, the aesthetic features considered are often not comprehensive enough to accommodate diverse aesthetic standards. Second, for complex aesthetic attributes such as emotional expression, it is difficult to extract effective features using handcrafted methods.
In contrast, deep learning models are capable of extracting comprehensive aesthetic features from images, thus overcoming the limitations of traditional algorithms. As a result, many recent studies have adopted deep learning-based approaches. In general-purpose image aesthetic quality assessment algorithms based on deep learning, researchers utilize deep learning techniques to automatically extract aesthetic features from images. These models can learn complex aesthetic representations directly from large-scale annotated datasets. Compared with handcrafted-feature-based methods, deep learning approaches can capture more abstract and subtle aesthetic differences, thereby achieving higher accuracy in image aesthetic quality assessment tasks. For instance, Tian et al. proposed a dual-path deep convolutional neural network model [12]; Kao et al. introduced a multi-task deep learning framework [13]; Lu et al. developed a deep multi-patch aggregation network [14]; Wang et al. presented a multi-scene deep learning model [15]; Zhang et al. proposed a multimodal self-and-collaborative attention network that models the relationship between image and textual features [16]; Liu et al. introduced an attention mechanism-based model with holistic nested edge detection, focusing on local and edge features of images [17]; Yang et al. proposed an aesthetic quality assessment model based on color composition and spatial formation, evaluating the aesthetic level by analyzing image color and spatial structure [18]; Li et al. designed a multi-task self-supervised model guided by photographic knowledge [2]; Celona et al. presented a model combining image semantics, artistic style, and composition attributes [19]; Pfister et al. proposed a self-supervised learning method for aesthetic feature representation [20]; Yan et al. developed a semantic-aware multi-task convolutional neural network [21]; and Chen et al. introduced a multi-task network based on scene, aim, and emotion information [22].
In the earlier development of deep learning algorithms, researchers commonly utilized Convolutional Neural Networks (CNNs) to process image data for aesthetic quality assessment. CNNs can automatically extract key features from raw data through multiple layers of convolution, capturing low-level features such as edges and textures, and progressively combining them into more complex high-level features, thereby deepening the understanding of visual patterns [23]. However, CNNs often suffer from significant loss of spatial information due to pooling operations [24]. Although these operations are effective for dimensionality reduction and feature abstraction, they inadvertently strip away spatial details from image data. This loss of fine-grained spatial information is particularly detrimental in tasks that rely on precise element arrangement, such as image aesthetics assessment, where spatial composition plays a crucial role [25].
The spatial information of an image is a critical aspect of image analysis, as it involves the relative positions and layout of elements within the image, directly influencing human visual perception and aesthetic experience [26]. One fundamental principle is balance, which refers to the even distribution of elements within the spatial layout of an image. When the elements in an image are visually distributed evenly or symmetrically, they produce a sense of stability, harmony, and visual pleasure. Another key principle is emphasis and focus. By controlling the relative size, position, and contrast of image elements, viewers’ visual attention and focus can be guided. When important elements are placed in prominent spatial positions within an image, such a layout can draw the viewer’s gaze and convey important information [27]. Proper use of focus helps viewers better understand the theme or narrative conveyed by an image. Therefore, spatial information is of great importance in image aesthetic assessment.
To address the aforementioned limitations of existing CNN- and Transformer-based methods in preserving spatial hierarchies, this study introduces the Capsule Network (CapsNet) [28]. While various CNN and Transformer-based models have been used for IAA, they often struggle to maintain fine-grained spatial relationships that are crucial to human aesthetic perception. Capsule networks, on the other hand, are particularly suited for such tasks due to their inherent ability to preserve spatial hierarchies and semantic entity relationships via dynamic routing. Specifically, it encodes object pose parameters (e.g., position and orientation) into capsule vectors, which are routed via a dynamic routing mechanism to preserve spatial relationships between features [29]. This mechanism avoids the excessive information loss often caused by pooling operations in CNNs and provides a more interpretable representation through vector outputs [30]. Furthermore, to enhance aesthetic feature extraction, we design an improved Inception module without max-pooling and with adjusted kernel sizes to capture multi-scale features while maintaining spatial integrity [31]. These components are integrated into the proposed Deep Convolutional Capsule Network (DCCN), which is designed to comprehensively model spatial structure and aesthetic semantics.
DCCN is designed to comprehensively extract image aesthetic features while preserving spatial information. The major contributions of this study are as follows:
- The first use of CapsNet in the image aesthetics assessment method is proposed.
- An improved inception module that enhances aesthetic feature extraction is presented.
- High accuracy was attained by the DCCN in the binary classification and distribution histogram prediction tasks, as reflected by the notable PLCC and SROCC scores.
Related work
Methods for the aesthetic assessment of images can be categorized into two groups: methods that extract aesthetic features manually, and methods that extract aesthetic features using deep learning.
Manual features
The first category uses manual features. Datta et al. [8] utilized 56 features, including light, colorfulness, saturation, hue, rule of thirds, familiarity measure, texture, size, aspect ratio, region composition, low-depth-of-field indicators, and shape convexity, to represent the aesthetic quality of an image. Ke et al. [9] considered six features in their aesthetics assessment approach: spatial distribution of edges, color distribution, hue count, blur, contrast, and average brightness. In the work of Tong et al. [10], four features were used to differentiate images with high and low aesthetic quality: blurriness, contrast, colorfulness, and saliency. Marchesotti et al. [11] employed generic content-based local image signatures to classify the aesthetic quality of an image. In manual feature-based methods, researchers must acquire extensive knowledge about photographic aesthetics, making it challenging to extract the overall aesthetic features of images.
Deep learning feature
The second category uses aesthetic features extracted using deep-learning methods. Tian et al. [12] utilized powerful deep convolutional neural networks, whereas Kao et al. [13] developed a multi-task deep learning framework. Lu [14] introduced a double-column deep convolutional neural network, and Wang et al. [15] created a multi-scene deep learning model. Zhang et al. [16] focused on image-text feature relationships with their multimodal self-and-collaborative attention network. Yang et al. [18] assessed image aesthetics using a model that analyzed color composition and spatial formation. Celona et al. [19] combined image semantics, artistic styles, and composition into the aesthetic assessment model. Pfister et al. [20] leveraged a self-attention mechanism for learning aesthetic features, and Yan et al. [21] integrated semantic awareness into a multi-task convolutional neural network. However, during the feature extraction process of these deep learning-based methods, spatial information is often lost, potentially overlooking its influence on image aesthetic quality.
Method
Based on the information provided in Table 1, it is apparent that the methods listed either do not explicitly consider spatial location information or incorporate it only to a limited extent. For instance, the manual features described by Datta et al. [8], Ke et al. [9], and Tong et al. [10] focused on various aesthetic attributes, such as light, colorfulness, saturation, and contrast, but there was no explicit mention of spatial location within the image. Similarly, Marchesotti et al. [11] mentioned the use of "generic content-based local image signatures," which may imply some level of spatial awareness, but did not clearly indicate the extent to which spatial location is factored into their analysis.
In the realm of deep learning, the methods listed tend to emphasize the high-level characteristics captured by deep neural networks. Tian et al. [12] noted that CNNs learn the spatial hierarchies in images. However, the emphasis on spatial location is inconsistent and often unclear. Lu [14] and Wang et al. [15] described deep learning models that are likely to learn spatial relationships; however, the explicit focus on spatial location information was not mentioned. Celona et al. [19] discussed a model that combines image semantics, artistic style, and composition, which suggests a broader understanding of spatial location, as composition involves the arrangement of elements in space. However, the specifics of how spatial information is processed are not detailed.
In addition, recent approaches such as the multi-task CNN framework [2], the Attribute-Assisted Multimodal Memory Network (AAMMN) [6], and the adversarial attribute-guided aesthetic framework [7] have explored auxiliary cues such as multi-task learning, semantic attributes, and multimodal information to enhance aesthetic prediction. These methods have achieved good results by modeling attribute-level supervision or using attention mechanisms. However, they still rely on convolutional backbones, where the loss of spatial detail due to pooling or flattened representations may limit the model's ability to retain spatial relationships critical to image composition.
In conclusion, although some methods may capture spatial information via CNN structures or image analysis, many do not provide explicit and comprehensive details of the spatial location information. To overcome these shortcomings, this study proposes a novel approach utilizing CapsNet.
To incorporate the spatial information of images, this study proposes a deep convolutional capsule network as the basic architecture, which integrates a CapsNet and an inception module. The structure of the DCCN model is shown in Fig 2.
The DCCN model consists of four layers: an initial convolutional layer, a primary capsule layer, a digital capsule layer, and an output layer. The initial convolutional layer performs convolutional operations consistent with a conventional neural network, which can extract image features and reduce the number of parameters. The primary capsule layer further performs the convolution operations. The result of the convolution operation is converted into the data format required by the digital capsule layer. The digital capsule layer employs a dynamic routing mechanism to train the network. Finally, the output layer generates the recognition results.
Initial convolutional layer
Since the introduction of AlexNet in 2012, the trend in neural network architecture design has been to increase network depth [32]. However, it has been observed that beyond a certain threshold, adding more layers can lead to a decrease in accuracy. Moreover, a network with a large number of parameters requires substantial hardware resources for training [33,34]. To address this limitation, Google proposed GoogLeNet [35], which offers an alternative to traditional neural network structures. GoogLeNet primarily utilizes inception modules [36], which combine different convolutional kernels and concatenate the results along a specific dimension. There are five versions of inception: inception-v1, inception-v2, inception-v3, inception-v4, and inception-resnet.
The inception-v1 module includes 1×1, 3×3, and 5×5 convolution kernel channels and a pooling channel [37]. This module can extract features from three different fields of view by employing convolution kernels of three different sizes. The structure of Inception-v1 is shown in Fig 3.
The CapsNet proposed by Hinton has only one convolutional layer in the initial convolution stage, and this convolutional layer contains only a single convolution kernel [28]. However, if only one convolutional kernel is used to extract aesthetic features, the obtained features may not be comprehensive, which can significantly affect the assessment accuracy. To address this limitation, in this paper the Inception-v1 module is used to replace the single convolution kernel [38] in the initial convolutional layer of CapsNet. However, directly incorporating the Inception-v1 module into the CapsNet has two limitations. First, the max-pooling operation within the Inception-v1 module leads to a loss of spatial information in the image. Second, the 1×1, 3×3, and 5×5 convolutional kernels offer limited fields of view, which are inadequate for effectively extracting features from larger fields of view. To better capture significant image aesthetic features, two improvements to the Inception-v1 module are made in the initial convolutional layer. First, to preserve spatial information, the pooling channel in the Inception-v1 module is removed. Second, to extract aesthetic features across both small and large fields of view, the 5×5 convolutional kernel in Inception-v1 is replaced with a larger 7×7 convolutional kernel. The improved Inception-v1 module is shown in Fig 4.
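As a rough sketch, the improved module can be viewed as three parallel "same"-padded convolution branches (1×1, 3×3, 7×7) whose outputs are concatenated along the channel axis, with no pooling branch. The NumPy illustration below (single input channel, one randomly initialized filter per branch, shapes purely for demonstration) shows that the spatial resolution is preserved:

```python
import numpy as np

def conv2d_same(x, k):
    """'Same'-padded 2D convolution of a single-channel map x with odd kernel k."""
    kh, kw = k.shape
    ph, pw = kh // 2, kw // 2
    xp = np.pad(x, ((ph, ph), (pw, pw)))
    h, w = x.shape
    out = np.empty((h, w))
    for i in range(h):
        for j in range(w):
            out[i, j] = np.sum(xp[i:i + kh, j:j + kw] * k)
    return out

def improved_inception(x, rng):
    """Three parallel branches (1x1, 3x3, 7x7), no pooling branch,
    stacked along a new channel axis."""
    branches = [conv2d_same(x, rng.standard_normal((s, s)))
                for s in (1, 3, 7)]
    return np.stack(branches, axis=-1)  # shape: H x W x num_branches

rng = np.random.default_rng(0)
x = rng.standard_normal((16, 16))
y = improved_inception(x, rng)
print(y.shape)  # (16, 16, 3): spatial size preserved, nothing pooled away
```

Because every branch uses "same" padding and the pooling channel is gone, no spatial positions are discarded before the capsule layers.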
Primary capsule layer
To reduce the number of parameters, a 9×9 convolutional kernel is utilized in the primary capsule layer to further convolve the fused features from the initial convolutional layer. This produces a feature map of size h × w × 64, where h and w represent the height and width of the feature map, respectively. To comply with the data format required by the digital capsule layer [39], this feature map is rearranged into an r × 8 matrix, where r = (h × w × 64) / 8 = h × w × 8.
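For concreteness, the rearrangement can be sketched in NumPy (the spatial sizes here are illustrative only):

```python
import numpy as np

h, w = 4, 4                      # height/width of the primary-capsule feature map
fmap = np.zeros((h, w, 64))      # h x w x 64 output of the 9x9 convolution
r = h * w * 8                    # r = (h * w * 64) / 8
capsules = fmap.reshape(r, 8)    # r capsule vectors of dimension 8
print(capsules.shape)            # (128, 8)
```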
Digital capsule layer
The vectors outputted from the primary capsule layer are fed into the digital capsule layer. The dynamic routing process is as follows.
- Input vectors $u_r$ are multiplied by the initialization matrices $W_{ir}$ to generate the shallow capsules $\hat{u}_{ir}$, as shown in (1), where $i$ indexes the deep capsules and $r$ indexes the input vectors:

$\hat{u}_{ir} = W_{ir}\,u_r$  (1)

- The matrices $b_{ir}$ are initialized and subjected to a softmax operation, which generates the coupling coefficients $c_{ir}$, as shown in (2). The shallow capsules $\hat{u}_{ir}$ are first weighted by $c_{ir}$ and then summed to obtain the deep capsule $s_i$, as shown in (3).

$c_{ir} = \dfrac{\exp(b_{ir})}{\sum_{i'}\exp(b_{i'r})}$  (2)

$s_i = \sum_{r} c_{ir}\,\hat{u}_{ir}$  (3)

- A squashing function is applied to the deep capsule $s_i$ to generate the output vector $v_i$. The squashing function is shown in (4).

$v_i = \dfrac{\lVert s_i\rVert^2}{1+\lVert s_i\rVert^2}\,\dfrac{s_i}{\lVert s_i\rVert}$  (4)

- The obtained vector $v_i$ is first multiplied (dot product) with the shallow capsules $\hat{u}_{ir}$ and then added to $b_{ir}$, yielding the updated $b_{ir}$, as shown in (5). The routing steps (2)-(5) are repeated three times.

$b_{ir} \leftarrow b_{ir} + \hat{u}_{ir}\cdot v_i$  (5)
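A minimal NumPy sketch of the routing steps above (random prediction vectors stand in for the shallow capsules; the capsule counts and dimensions are illustrative assumptions, not the model's actual sizes):

```python
import numpy as np

def squash(s, axis=-1, eps=1e-9):
    """Squashing function: shrink vector norm into [0, 1) while keeping direction."""
    n2 = np.sum(s ** 2, axis=axis, keepdims=True)
    return (n2 / (1.0 + n2)) * s / np.sqrt(n2 + eps)

def dynamic_routing(u_hat, iterations=3):
    """u_hat: (num_deep, num_shallow, dim) shallow-capsule predictions."""
    num_deep, num_shallow, _ = u_hat.shape
    b = np.zeros((num_deep, num_shallow))            # routing logits
    for _ in range(iterations):
        # softmax over the deep capsules for each shallow capsule
        c = np.exp(b) / np.exp(b).sum(axis=0, keepdims=True)
        # weighted sum of predictions -> deep capsules
        s = (c[..., None] * u_hat).sum(axis=1)
        v = squash(s)                                # squashed outputs
        # agreement update: reward predictions that align with the output
        b = b + np.einsum('ird,id->ir', u_hat, v)
    return v

rng = np.random.default_rng(0)
u_hat = rng.standard_normal((10, 32, 16))   # 10 deep capsules, 32 shallow, dim 16
v = dynamic_routing(u_hat)
print(np.linalg.norm(v, axis=-1).max())     # every output norm stays below 1
```

The squashing step bounds each output vector's length in [0, 1), which is what lets the norm be read as an existence probability.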
Loss function
The input x is fed into the DCCN model to generate the predicted aesthetic score $\hat{y}$, as shown in (6):

$\hat{y} = f(x;\,W_{ir})$  (6)

where f denotes the proposed DCCN model, and $W_{ir}$ represents the parameters that must be initialized. To reduce the loss between the predicted and ground-truth aesthetic scores of image x, the boundary loss is used as the loss function, which takes the form of (7):

$L = y\,\max(0,\,m^{+} - \lVert v\rVert)^2 + \lambda\,(1-y)\,\max(0,\,\lVert v\rVert - m^{-})^2$  (7)

where $y$ denotes the ground-truth aesthetic label, $m^{+}$ represents the upper boundary with a value of 0.9, $m^{-}$ stands for the lower boundary with a value of 0.1, and parameter λ is set to 0.5, which indicates that the loss of positive samples is twice as important as the loss of negative samples.
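Under the stated settings (upper boundary 0.9, lower boundary 0.1, λ = 0.5), the boundary loss for a single sample can be sketched as:

```python
M_POS, M_NEG, LAMBDA = 0.9, 0.1, 0.5   # boundaries and down-weighting factor

def margin_loss(v_norm, y):
    """Boundary (margin) loss for one sample.
    v_norm: length of the output capsule vector; y: ground-truth label (0 or 1)."""
    pos = y * max(0.0, M_POS - v_norm) ** 2
    neg = LAMBDA * (1 - y) * max(0.0, v_norm - M_NEG) ** 2
    return pos + neg

print(margin_loss(0.95, 1))  # 0.0: confident, correct positive incurs no loss
print(margin_loss(0.20, 1))  # 0.49: weak response for a positive is penalized
```

Note that a negative sample's penalty is halved by λ, which is what makes positive-sample errors twice as costly.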
The gradient descent method is used to update the matrix $W_{ir}$, as shown in (8):

$W_{ir}(j+1) = W_{ir}(j) - \alpha\,\dfrac{\partial L}{\partial W_{ir}(j)}$  (8)

This study uses the Adam optimizer to update $W_{ir}$, as shown in (9):

$W_{ir}(j+1) = W_{ir}(j) - \alpha\,\dfrac{m_{W_{ir}}(j)}{\sqrt{v_{W_{ir}}(j)} + \varepsilon}$  (9)

where ε denotes a small constant with a value of 1e-5, α represents the learning rate, and $m_{W_{ir}}(j)$ and $v_{W_{ir}}(j)$ stand for the first and second raw moments of the gradients, defined in (10) and (11), respectively:

$m_{W_{ir}}(j) = \beta_1\,m_{W_{ir}}(j-1) + (1-\beta_1)\,g_{W_{ir}}(j)$  (10)

$v_{W_{ir}}(j) = \beta_2\,v_{W_{ir}}(j-1) + (1-\beta_2)\,g_{W_{ir}}(j)^2$  (11)

where $m_{W_{ir}}(0)$ and $v_{W_{ir}}(0)$ are set to zero; $\beta_1$ and $\beta_2$ denote the exponential decay rates of $m_{W_{ir}}(j)$ and $v_{W_{ir}}(j)$; $g_{W_{ir}}(j)$ represents the gradient at the j-th step, and j ranges from 1 to S.
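The moment updates and the parameter update can be sketched as a single Adam step for a scalar parameter. This is a simplified illustration without bias correction; the decay rates β1 = 0.9 and β2 = 0.999 are the customary defaults and an assumption here, since the paper does not state them:

```python
def adam_step(w, g, m, v, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-5):
    """One Adam update of scalar parameter w given gradient g."""
    m = beta1 * m + (1 - beta1) * g          # first raw moment of the gradient
    v = beta2 * v + (1 - beta2) * g ** 2     # second raw moment of the gradient
    w = w - alpha * m / (v ** 0.5 + eps)     # parameter update
    return w, m, v

w, m, v = 1.0, 0.0, 0.0
for _ in range(10):            # repeated positive gradients push w downward
    w, m, v = adam_step(w, 2.0, m, v)
print(w < 1.0)                 # True
```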
The process of the proposed DCCN model can be roughly described as Algorithm 1.
Algorithm 1. Algorithm for proposed approach.
Process summary
In summary, the complete experimental procedure is as follows. In the data preprocessing stage, we resized the images from the CUHK-PQ and AVA datasets to 112 × 112 and normalized the pixel values to the range [0, 1]. The core architecture of the DCCN model includes the input layer, the improved inception module, and the capsule network. The inception module is responsible for extracting multi-scale aesthetic features from the images, while the dynamic routing mechanism of the capsule network helps retain the spatial hierarchy, further enhancing the modeling of spatial information. During training, we employed an appropriate loss function and used the Adam optimizer for parameter optimization. In subsequent experiments, we validated the model's effectiveness using evaluation metrics such as classification accuracy, PLCC, and SROCC.
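The preprocessing stage can be sketched as follows; nearest-neighbour resizing in plain NumPy stands in here for whatever resizing routine is actually used in the pipeline:

```python
import numpy as np

def preprocess(img, size=112):
    """Resize an H x W x 3 uint8 image to size x size (nearest neighbour)
    and normalize pixel values to [0, 1]."""
    h, w = img.shape[:2]
    rows = np.arange(size) * h // size       # source row for each output row
    cols = np.arange(size) * w // size       # source column for each output column
    resized = img[rows][:, cols]
    return resized.astype(np.float32) / 255.0

img = np.random.default_rng(0).integers(0, 256, (480, 640, 3), dtype=np.uint8)
x = preprocess(img)
print(x.shape)                               # (112, 112, 3)
print(float(x.min()) >= 0.0, float(x.max()) <= 1.0)
```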
Experimental results
In this section, we first outline the experimental configuration, datasets utilized, and evaluation metrics. Furthermore, experiments are conducted on the selection of inception convolution kernel sizes. Then, a discussion of the differing performances of the DCCN model on the CUHK-PQ and AVA datasets is presented. To conclude, this paper presents two sets of ablation studies that confirm the efficacy of the enhanced inception module and boundary loss function.
Experimental configuration
The experimental configurations are listed in Table 2. The experiments are conducted using Python 3.7.0 with TensorFlow-GPU 2.6.0 as the programming environment. An Intel Core i7-13700 CPU with 64.0 GB of RAM and an NVIDIA RTX 4090 GPU with 24 GB of memory are utilized. The size of the input images is set to 112 × 112 × 3. The training and validation sets are split in an 8:1 ratio, the batch size is set to 128, and the learning rate to 1e-8. The proposed DCCN model is trained for 100 epochs, with each epoch taking approximately 70-80 seconds.
Databases
The CUHK-PQ [40] dataset is an image aesthetics dataset published by the Chinese University of Hong Kong that includes 17,690 images. Each image was labeled with an aesthetic quality rating of either high or low, as well as semantic annotations. Semantic annotations include categories such as animals, architecture, humans, landscape, night, plants, and static [41]. Fig 5 presents images with different semantic annotations from the CUHK-PQ dataset.
The images in the top row have high aesthetic quality, and the images in the bottom row have low aesthetic quality.
The AVA dataset [42] utilized in the experiments is a large dataset for image aesthetic assessment, containing 250,000 images. Each image in the AVA dataset is annotated using aesthetic scores, semantic labels, and style labels. Aesthetic scores range from 1 to 10 and are provided by different individuals [43]. This dataset includes 66 semantic labels [44], such as animals, history, and music, and each image contains 0-2 semantic annotations. Additionally, there are 14 style labels, including complementary_colors, duotones, HDR, image_grain, light_on_white, long_exposure, macro, motion_blur, negative_image, rule_of_thirds, shallow_DOF, silhouettes, soft_focus, and vanishing_point. Fig 6 presents images from the AVA dataset. Table 3 lists the annotations in Fig 6(a). In the AVA dataset, a score of 5 is commonly considered the threshold distinguishing high and low aesthetic quality images. The score distribution for Fig 6(a) shows that a large number of raters (58 individuals) assigned a score of 7, indicating strong consensus among annotators on the image’s high aesthetic quality. This concentrated distribution in the higher score range reflects the image’s prominent aesthetic appeal and its capacity to attract viewer appreciation.
Evaluation indices
The accuracy of an image aesthetics assessment method refers to the capacity of the evaluation model to classify images correctly based on their aesthetic quality. It serves as the most direct index of performance and is quantified by the ratio of the number of images accurately classified by the model to the total number of images. Specifically, in the binary classification task of image aesthetics evaluation, accuracy pertains to the proportion of images that the model correctly differentiates as high or low quality. The formula for calculating the accuracy is shown in (12):

$\mathrm{Accuracy} = \dfrac{TP + TN}{P + N}$  (12)

where TP represents the number of high-quality images correctly identified as such, TN denotes the number of low-quality images correctly recognized as low quality, P stands for the total number of positive samples, and N is the total number of negative samples.
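In code, the accuracy computation reduces to a one-liner:

```python
def accuracy(tp, tn, p, n):
    """Fraction of correctly classified images: (TP + TN) / (P + N)."""
    return (tp + tn) / (p + n)

print(accuracy(tp=40, tn=45, p=50, n=50))  # 0.85
```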
To measure the strength of the relationship between predicted scores from an image aesthetics assessment method and human subjective scores, this study employs the Pearson linear correlation coefficient (PLCC) and Spearman rank order correlation coefficient (SROCC) as evaluation indices. PLCC and SROCC range from -1 to 1. A PLCC or SROCC value of 1 indicates a perfect positive correlation, meaning the predicted scores are completely proportional to human subjective scores; a value of -1 signifies a perfect negative correlation, meaning the predicted scores are inversely proportional to human subjective scores; and a value of 0 indicates no relationship.
Typically, PLCC measures the strength of the linear relationship between two variables, while SROCC assesses the monotonic relationship regardless of linearity. PLCC is sensitive to outliers because it directly involves numerical values of the raw data. By contrast, SROCC is less sensitive to outliers, as it focuses solely on the rank order of the data.
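Both indices can be computed directly; the small NumPy sketch below exploits the fact that SROCC is simply the PLCC of the rank-transformed data (tie handling omitted for brevity):

```python
import numpy as np

def plcc(x, y):
    """Pearson linear correlation coefficient."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    xc, yc = x - x.mean(), y - y.mean()
    return float((xc * yc).sum() / np.sqrt((xc ** 2).sum() * (yc ** 2).sum()))

def srocc(x, y):
    """Spearman rank-order correlation (no tie handling)."""
    rank = lambda a: np.argsort(np.argsort(a)).astype(float)
    return plcc(rank(x), rank(y))

scores = [1, 2, 3, 4, 5]
preds = [v ** 3 for v in scores]      # monotonic but nonlinear predictions
print(srocc(scores, preds))           # 1.0: ranking is in perfect agreement
print(plcc(scores, preds) < 1.0)      # True: the relation is not linear
```

The example illustrates the distinction drawn above: a perfectly monotonic but nonlinear prediction scores 1.0 on SROCC yet strictly below 1.0 on PLCC.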
Size selection of convolutional kernel in the inception module
The proposed DCCN model is tested on the aforementioned datasets for binary classification training. The binary assessment rule is defined in (13):

$\mathrm{rank} = \begin{cases} 1, & \mathrm{score} > 5 \\ 0, & \mathrm{score} \le 5 \end{cases}$  (13)

where rank denotes the aesthetic assessment level and score represents the mean aesthetic score of an image. A rank of 0 or 1 indicates a low or high aesthetic quality level, respectively.
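The rule translates directly into code, using the threshold of 5 noted earlier for the AVA dataset:

```python
def binary_rank(score, threshold=5.0):
    """Map a mean aesthetic score to a binary aesthetic level (0 = low, 1 = high)."""
    return 1 if score > threshold else 0

print(binary_rank(6.8))  # 1: high aesthetic quality
print(binary_rank(4.2))  # 0: low aesthetic quality
```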
In the DCCN model, only the size of the large-field convolution kernel in the initial convolutional layer is modified, whereas the rest of the model remains unchanged. The accuracy results obtained for the AVA dataset are presented in Table 4 and Fig 7.
From Table 4 and Fig 7, it can be observed that the 7×7 convolutional kernel achieves the highest accuracy (77.35%). Hence, 7×7 is used as the size of the large-field-of-view convolutional kernel in the proposed DCCN model.
We consider that this kernel size offers a good balance between local detail and global context. Smaller kernels (e.g., 1×1 or 3×3) are limited in capturing high-level aesthetic structures, such as overall composition and balance, while excessively large kernels (e.g., 9×9 or 13×13) may dilute spatial precision and introduce noise. The 7×7 kernel is sufficiently large to capture mid- to high-level aesthetic cues such as symmetry, color layout, and salient regions, while still retaining important spatial details. Therefore, it leads to better feature representation and contributes to improved model performance.
Performance of the DCCN model
The accuracy of the proposed DCCN model on both the CUHK-PQ and AVA datasets is shown in Table 5. To reduce the number of parameters, the input image is resized to 112 × 112.
From Table 5, it can be observed that the DCCN model achieves the best performance with an accuracy of 94.79% on the CUHK-PQ dataset. The second-best performance is obtained on the AVA dataset, with an accuracy of 77.35%. The model also attains a PLCC of 0.8408 and an SROCC of 0.7394 for predicting the distribution of aesthetic scores.
The performance gap between the two datasets can be attributed to their inherent differences. CUHK-PQ is a relatively clean and balanced dataset with binary labels (high or low quality), which simplifies the classification task. In contrast, AVA is more complex and challenging—it contains subjective annotations with aesthetic scores ranging from 1 to 10, introducing greater variability and label noise. Additionally, the AVA dataset includes diverse image content and styles, which increases the difficulty of capturing consistent aesthetic patterns. These factors contribute to the lower accuracy observed on AVA compared to CUHK-PQ.
A comparison of the different methods for the CUHK-PQ and AVA datasets is presented in Table 6. While the DCCN model exhibits competitive performance across most evaluation metrics, we note that some prior methods report slightly higher results on certain indicators, particularly on the AVA dataset. This variation may be attributed to differences in model design objectives and dataset characteristics. For instance, models incorporating additional semantic supervision or handcrafted priors may gain advantages in capturing subtle aesthetic cues embedded in the subjective and diverse AVA annotations. In contrast, our model emphasizes structural compactness and spatial hierarchy preservation, aiming for a lightweight yet robust solution that generalizes well across tasks and platforms.
Nevertheless, the DCCN model achieves an impressive accuracy of 94.79% on the CUHK-PQ dataset, surpassing the Su method [45], Marchesotti method [11], Zhang method [46], Tian method [12], and Wang method [15] by 2.73%, 2.85%, 4.89%, 4.48%, and 2.20%, respectively. This indicates that the DCCN is especially effective in binary classification tasks such as CUHK-PQ.
The comparative analysis of different methods based on PLCC and SROCC indices, as shown in Table 6, reveals that the DCCN achieves an SROCC of 0.7394, which is 0.0892 lower than the highest score of 0.8286. Nevertheless, it demonstrates superior performance in terms of PLCC, reaching 0.8408. This suggests that the model can better capture the distribution trend of aesthetic scores, even if the ranking is not perfectly aligned.
Moreover, the DCCN model has the fewest parameters (5.83 million) among all methods, whereas the GloRe model [47] has 25.8 million and the HLA-GCN [50] has 38.2 million parameters. A model with fewer parameters is generally less complex, leading to faster inference and lower memory usage. This makes DCCN especially suitable for real-time applications and deployment on resource-constrained or lightweight devices, enhancing its practical value.
Data analysis reveals that the proposed DCCN method achieves higher accuracy on the CUHK-PQ dataset. It also exhibits a higher PLCC value, indicating a strong correlation between the model's predictions and the actual values. Moreover, the DCCN has fewer parameters than the competing methods, which results in faster execution times.
The enhanced performance of the DCCN model can be attributed to the strengths of capsule networks, particularly their ability to preserve spatial hierarchies in images. Unlike traditional CNNs that rely heavily on pooling operations, which can cause the loss of important spatial information, the dynamic routing mechanism in capsule networks ensures that the spatial relationships between image features are maintained. This makes capsule networks more suitable for tasks that require spatial understanding, such as image aesthetics assessment, where the positioning and arrangement of visual elements play a crucial role. Additionally, the use of capsules allows the model to capture more complex patterns and hierarchies, improving its ability to recognize fine-grained aesthetic features and making it less sensitive to small variations in image style or composition. The combination of these advantages contributes significantly to the DCCN model’s ability to effectively assess image aesthetics.
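The routing mechanism described above can be sketched in a few lines. The following NumPy version implements routing-by-agreement as introduced by Sabour et al. (2017); the shapes are illustrative and do not reflect the DCCN's actual layer sizes:

```python
import numpy as np

def squash(s, axis=-1, eps=1e-8):
    """Non-linearity that keeps vector orientation but maps norms into [0, 1)."""
    sq = np.sum(s * s, axis=axis, keepdims=True)
    return (sq / (1.0 + sq)) * s / np.sqrt(sq + eps)

def dynamic_routing(u_hat, iterations=3):
    """Routing-by-agreement between capsule layers (Sabour et al., 2017).

    u_hat: prediction vectors, shape (n_in, n_out, d_out).
    Returns the output capsules, shape (n_out, d_out).
    """
    n_in, n_out, _ = u_hat.shape
    b = np.zeros((n_in, n_out))  # routing logits
    for _ in range(iterations):
        c = np.exp(b) / np.exp(b).sum(axis=1, keepdims=True)  # coupling coeffs
        s = np.einsum("ij,ijd->jd", c, u_hat)                 # weighted sum
        v = squash(s)                                         # output capsules
        b += np.einsum("ijd,jd->ij", u_hat, v)                # agreement update
    return v
```

Because routing replaces pooling with this iterative agreement step, the coupling coefficients themselves encode where each low-level feature routes, which is why spatial relationships survive.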
To verify the statistical significance of the differences between DCCN and recent methods, a Kruskal-Wallis H test was performed on GloRe, HLA-GCN, Munan, IAACS, and DCCN, with accuracy, PLCC, SROCC, and the number of parameters as the dependent measures. A larger H-statistic typically signifies substantial differences between groups, and a p-value below 0.05 warrants rejection of the null hypothesis. This study obtained an H value of 13.81 and a p-value of 0.0032, well below the 0.05 threshold, providing robust statistical evidence that significant disparities exist among the methods.
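For readers wishing to reproduce this kind of analysis, SciPy's `scipy.stats.kruskal` computes the H statistic and p-value directly. The groups below are illustrative only; the paper's per-method measurements are not reproduced here:

```python
from scipy.stats import kruskal

# Illustrative per-method measurement groups; NOT the paper's actual data,
# which the text summarizes only as H = 13.81, p = 0.0032.
glore   = [0.765, 0.762, 0.768]
hla_gcn = [0.771, 0.774, 0.769]
dccn    = [0.834, 0.841, 0.838]

h_stat, p_value = kruskal(glore, hla_gcn, dccn)
# Reject the null hypothesis of identical distributions when p < 0.05.
```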
Table 7 compares the performance of 15 image aesthetic assessment models on the AVA dataset, with all models pre-trained on ImageNet-1K. The table categorizes the models into three types: CNN-based, Transformer-based, and the Capsule Network-based model proposed in this study. Each model is evaluated using PLCC, SROCC, and a derived metric named Ratio, defined as SROCC divided by classification accuracy. This metric reflects the model's ranking ability normalized against classification performance, offering a more comprehensive view of generalization across both tasks.
The DCCN model, which is the focus of this study, belongs to the Capsule-based category. It achieves a PLCC of 0.840 and an SROCC of 0.739 on the AVA dataset, surpassing most CNN-based and Transformer-based models in terms of PLCC. Despite having only 5.83M parameters, DCCN demonstrates a highly favorable balance between predictive performance and computational efficiency. Specifically, the Ratio value reaches 0.955, indicating that the model delivers exceptional performance relative to its parameter size, even outperforming many models with substantially larger complexities.
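The Ratio value can be verified directly from the reported figures; the small discrepancy with 0.955 comes from the rounding used in Table 7:

```python
srocc = 0.7394     # DCCN's SROCC on AVA
accuracy = 0.7735  # DCCN's binary classification accuracy on AVA
ratio = srocc / accuracy  # about 0.956, matching the reported ~0.955
```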
In conclusion, the DCCN model excels in performance, particularly in terms of the PLCC metric, while maintaining a much smaller model size compared to other state-of-the-art methods. This result highlights the efficiency of the capsule network in capturing spatial features, enhancing aesthetic quality evaluation while minimizing computational cost. The DCCN’s competitive performance, paired with its small parameter size, makes it a highly efficient model for real-time applications, especially when considering resource-constrained environments.
Ablation study
In this subsection, we explore the efficacy of the DCCN model using four sets of ablation experiments. The first set aims to assess the impact of the improved inception module on the DCCN model. The second set evaluates the effects of different loss functions on the performance of the DCCN model. The third set investigates the influence of varying the number of dynamic routing iterations in the capsule network, testing the model’s performance with 2, 3, 4, and 5 iterations. Finally, the fourth set explores the impact of modifying the number of capsules in the DigitCaps layer, analyzing how different capsule configurations affect the model’s overall performance.
The first set of ablation experiments is conducted to verify the effectiveness of the improved inception module, which comprises three trials. Initially, a capsule network was selected as the backbone of our architecture because it can have better capability in capturing image features and simultaneously preserving spatial positional information. Subsequently, to extract more effective aesthetic features, an inception module was incorporated into the capsule network. Finally, an enhancement of the inception module is proposed to prevent the loss of spatial information that may result from max-pooling operations. The results of these experiments are presented in Table 8 and Fig 8.
It is evident that the DCCN model outperforms both the CapsNet and CapsNet with the inception module in terms of performance. This indicates that the DCCN model is capable of effectively capturing the aesthetic features of images while maintaining stable performance. The effectiveness of the improved inception module has also been validated through our experiments.
The second set of ablation experiments is conducted to verify the effectiveness of the boundary loss; it comprises three distinct trials, and the results are presented in Table 9 and Fig 9. With the DCCN model structure unaltered, this study compares its performance using EMD loss, cross-entropy loss, and boundary loss. For the true distribution P and the predicted distribution P̂, the EMD loss is computed as shown in (14),
where p_k and p̂_k denote the probability masses of P and P̂ at the k-th point, respectively. The cross-entropy loss is computed as shown in (15).
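Since Eqs. (14) and (15) did not survive extraction here, the sketches below follow the standard formulations: the CDF-based EMD loss with r = 2 popularized by NIMA, and the usual cross-entropy. That the paper uses exactly these constants is an assumption:

```python
import numpy as np

def emd_loss(p, q, r=2):
    """Earth Mover's Distance between score distributions via their CDFs
    (the r = 2 formulation popularized by NIMA; assumed to match Eq. (14))."""
    cdf_diff = np.cumsum(p) - np.cumsum(q)
    return float(np.mean(np.abs(cdf_diff) ** r) ** (1.0 / r))

def cross_entropy_loss(p, q, eps=1e-12):
    """Cross-entropy between true distribution p and prediction q
    (assumed form of Eq. (15)); eps guards against log(0)."""
    return float(-np.sum(p * np.log(q + eps)))
```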
The third set of ablation experiments investigates the influence of the number of dynamic routing iterations in the capsule network. This experiment tests the DCCN model with different numbers of routing iterations: 2, 3, 4, and 5. The aim is to analyze how varying the number of dynamic routing iterations impacts the model’s performance in terms of classification accuracy, PLCC and SROCC.
The results of these experiments are presented in Table 10. As shown in the results, the DCCN model achieves its best performance with 3 iterations of dynamic routing, which leads to an optimal balance between spatial representation and computational efficiency. Increasing the number of routing iterations beyond 3 does not significantly improve performance but increases the computational cost. This demonstrates that the DCCN model benefits from an optimal number of routing iterations, where 3 iterations provide the best trade-off between model accuracy and computational efficiency.
The fourth set of ablation experiments investigates the effect of different structural configurations of the DigitCaps layer while keeping the total number of capsules constant (64). Specifically, we evaluate three capsule arrangements: 32 × 2, 16 × 4, and 8 × 8. As presented in Table 11, the 8 × 8 structure used in the DCCN model achieves the best performance across all metrics, indicating that a more balanced, square-shaped capsule layout facilitates more stable dynamic routing and more effective spatial representation.
From the results presented in Tables 8 to 11, it is evident that each component of the DCCN model contributes significantly to its overall performance. The boundary loss demonstrates superior effectiveness, with the DCCN achieving an accuracy of 77.35%, a PLCC of 0.8408, and an SROCC of 0.7394 (Table 9 and Fig 9), outperforming both the EMD and cross-entropy losses. Similarly, the improved inception module markedly enhances feature extraction capability, as shown by the substantial performance gap between the baseline capsule network and the final DCCN architecture. Furthermore, the experiments varying the number of routing iterations and the configuration of the DigitCaps layer confirm that the default settings used in the DCCN, namely three routing iterations and an 8 × 8 DigitCaps layout, achieve the best balance between accuracy and stability. These ablation results comprehensively validate the architectural and training choices made in the design of the DCCN model.
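For reference, the boundary loss evaluated here is, in the capsule-network literature, the margin loss of Sabour et al. (2017). A minimal sketch with the standard constants follows; the DCCN's exact margin values are an assumption:

```python
import numpy as np

def margin_loss(v_norms, targets, m_pos=0.9, m_neg=0.1, lam=0.5):
    """CapsNet margin ('boundary') loss (Sabour et al., 2017).

    v_norms: lengths of the output capsules, shape (n_classes,).
    targets: one-hot labels, shape (n_classes,).
    m_pos = 0.9, m_neg = 0.1 and the down-weight lam = 0.5 are the
    standard defaults; the DCCN's exact constants are an assumption.
    """
    pos = targets * np.maximum(0.0, m_pos - v_norms) ** 2
    neg = lam * (1 - targets) * np.maximum(0.0, v_norms - m_neg) ** 2
    return float(np.sum(pos + neg))
```

The loss vanishes when the correct capsule is long (> 0.9) and the others are short (< 0.1), and grows quadratically with any margin violation.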
Predictions of the model
Fourteen images with different semantic annotations from the CUHK-PQ and AVA datasets are randomly selected for prediction.
The assessment process is illustrated in Fig 10. The ratio of high to low aesthetic quality images is 1:1, comprising seven images of high aesthetic quality and seven of low aesthetic quality. Fig 11 shows these selected images and the probabilities obtained from the proposed DCCN model. The original labels and predictions of these fourteen images are provided in Table 12, where 1 and 0 denote high and low aesthetic quality, respectively.
(a)–(g) are low aesthetic quality images, and (h)–(n) are high aesthetic quality images.
From the experimental results in Table 12, it can be seen that ten images are correctly classified and four images are misclassified. Thus, the proposed DCCN model achieves a prediction accuracy of 71.43%, as listed in Table 12. This further confirms the effectiveness of the DCCN model.
The confusion matrix heatmap of these fourteen images is shown in Fig 12. From this heatmap, it can be observed that the DCCN model performs better in predicting images with high aesthetic quality, although its performance is weaker for images with low aesthetic quality.
Conclusion and future work
This study proposes a novel method for image aesthetics assessment based on CapsNet, designated as the DCCN model. Within the DCCN framework, an improved inception module is introduced to enhance the extraction of aesthetic features. To the best of our knowledge, this is the first application of CapsNet in the image aesthetics assessment domain. The effectiveness of the DCCN model is validated through binary classification and distribution prediction tasks. In the binary classification task, the model achieves an accuracy of 94.79% on CUHK-PQ and 77.35% on AVA. For aesthetic score distribution prediction on the AVA dataset, the DCCN attains a PLCC of 0.8408 and an SROCC of 0.7394, demonstrating its robust performance.
Although the proposed DCCN model exhibits notable performance, especially on the CUHK-PQ dataset, several limitations warrant further investigation. One concern is the model's sensitivity to variations in image styles, such as vintage filters or grayscale effects, which may hinder its generalization when encountering stylistic distributions not represented in the training set. Additionally, the reliance on fixed-size inputs (112 × 112) may reduce the model's adaptability across datasets with diverse image resolutions. Furthermore, the dynamic routing mechanism, despite enhancing spatial representation, imposes a relatively high computational cost during inference, which may limit deployment in real-time or resource-constrained scenarios.
Furthermore, future research should also investigate the impact of dataset bias on the model’s performance. The DCCN model currently shows sensitivity to different dataset distributions, which can lead to performance degradation. To address this issue, domain adaptation techniques could be explored to improve the model’s generalization across various datasets and reduce the impact of dataset-specific bias.
To further enhance the model’s effectiveness, future work can focus on expanding the diversity of the training dataset to include a broader spectrum of compositional styles and aesthetic patterns. Incorporating additional image-level information, such as semantic or emotional attributes, may also contribute to more nuanced aesthetic evaluation. Moreover, exploring advanced capsule-based architectures (e.g., Matrix Capsules) or hybrid frameworks that integrate DCCN with other neural networks may improve both performance and generalization. Addressing the aforementioned limitations could ultimately strengthen the practical applicability and robustness of the DCCN model in real-world aesthetic quality assessment tasks.
References
- 1.
Jin X, Zou D, Wu L, Zhao G, Li X. Aesthetic attributes assessment of images. In: MM Proceedings of the ACM International Conference on Multimedia. Nice, France, 2019. p. 311–9.
- 2. Soydaner D, Wagemans J. Multi-task convolutional neural network for image aesthetic assessment. IEEE Access. 2024;12:4716–29.
- 3. Deng Y, Loy CC, Tang X. Image aesthetic assessment: an experimental survey. IEEE Signal Process Mag. 2017;34(4):80–106.
- 4. Talebi H, Milanfar P. NIMA: neural image assessment. IEEE Trans Image Process. 2018;27(8):3998–4011. pmid:29994025
- 5. Lv P, Fan J, Nie X, Dong W, Jiang X, Zhou B, et al. User-guided personalized image aesthetic assessment based on deep reinforcement learning. IEEE Trans Multimedia. 2023;25:736–49.
- 6. Li L, Zhu T, Chen P, Yang Y, Li Y, Lin W. Image aesthetics assessment with attribute-assisted multimodal memory network. IEEE Trans Circuits Syst Video Technol. 2023;33(12):7413–24.
- 7. Pan B, Wang S, Jiang Q. Image aesthetic assessment assisted by attributes through adversarial learning. AAAI. 2019;33(01):679–86.
- 8.
Datta R, Joshi D, Li J, Wang JZ. Studying aesthetics in photographic images using a computational approach. Lecture Notes in Computer Science. Berlin, Heidelberg: Springer; 2006. p. 288–301. https://doi.org/10.1007/11744078_23
- 9. Ke Y, Tang X, Jing F. The design of high-level features for photo quality assessment. In: Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, New York, NY, USA, 2006. p. 419–26.
- 10.
Tong H, Li M, Zhang H, He J, Zhang C. Classification of digital photos taken by photographers or home users. Lecture Notes in Computer Science, Tokyo, Japan, 2005, p. 198–205.
- 11.
Marchesotti L, Perronnin F, Larlus D, Csurka G. Assessing the aesthetic quality of photographs using generic image descriptors. In: Proceedings of the IEEE International Conference on Computer Vision, Barcelona, Spain, 2011. p. 1784–91.
- 12. Tian X, Dong Z, Yang K, Mei T. Query-dependent aesthetic model with deep learning for photo quality assessment. IEEE Trans Multimedia. 2015;17(11):2035–48.
- 13. Kao Y, He R, Huang K. Deep aesthetic quality assessment with semantic information. IEEE Trans Image Process. 2017;26(3):1482–95. pmid:28092553
- 14.
Lu X, Lin Z, Jin H, Yang J, Wang JZ. RAPID: rating pictorial aesthetics using deep learning. In: MM Proceedings of the ACM International Conference on Multimedia, Orlando, FL, USA, 2014. p. 457–66.
- 15. Wang W, Zhao M, Wang L, Huang J, Cai C, Xu X. A multi-scene deep learning model for image aesthetic evaluation. Signal Processing: Image Communication. 2016;47:511–8.
- 16. Zhang X, Gao X, He L, Lu W. MSCAN: multimodal self-and-collaborative attention network for image aesthetic prediction tasks. Neurocomputing. 2021;430:14–23.
- 17.
Liu L, Guo X, Bai R, Li W. Image aesthetic assessment based on attention mechanisms and holistic nested edge detection. In: 2022 Asia Conference on Advanced Robotics, Automation, and Control Engineering (ARACE); 2022. p. 70–5.
- 18. Yang B, Zhu C, Li FWB, Wei T, Liang X, Wang Q. IAACS: image aesthetic assessment through color composition and space formation. Virtual Reality & Intelligent Hardware. 2023;5(1):42–56.
- 19. Celona L, Leonardi M, Napoletano P, Rozza A. Composition and style attributes guided image aesthetic assessment. IEEE Trans Image Process. 2022;31:5009–24. pmid:35867369
- 20.
Pfister PF, Kobs K, Hotho A. Self-supervised multi-task pretraining improves image aesthetic assessment. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, Virtual, Online, TN, USA, 2021. p. 816–25.
- 21. Yan W, Li Y, Yang H, Huang B, Pan Z. Semantic-aware multi-task learning for image aesthetic quality assessment. Connect Sci. 2022;34(1):2689–713.
- 22.
Chen Y, Pu Y, Zhao Z, Xu D, Man, Qian W. Image aesthetic assessment based on emotion-assisted multi-task learning network. In: 2021 6th International Conference on Multimedia Systems and Signal Processing. 2021. p. 15–21. https://doi.org/10.1145/3471261.3471263
- 23.
Ma S, Liu J, Chen CW. A-Lamp: adaptive layout-aware multi-patch deep convolutional neural network for photo aesthetic assessment. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017. p. 4535–44.
- 24. Tomasini UM, Petrini L, Cagnetta F, Wyart M. How deep convolutional neural networks lose spatial information with training. Mach Learn: Sci Technol. 2023;4(4):045026.
- 25.
Hosu V, Goldlucke B, Saupe D. Effective aesthetics prediction with multi-level spatially pooled features. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2019. p. 9375–83.
- 26.
Obrador P, Schmidt-Hackenberg L, Oliver N. The role of image composition in image aesthetics. In: 2010 IEEE International Conference on Image Processing. 2010. p. 3185–8. https://doi.org/10.1109/icip.2010.5654231
- 27. Sammartino J, Palmer SE. Aesthetic issues in spatial composition: effects of vertical position and perspective on framing single objects. J Exp Psychol Hum Percept Perform. 2012;38(4):865–79. pmid:22428674
- 28. Jin X, Li X, Lou H, Fan C, Deng Q, Xiao C, et al. Aesthetic attribute assessment of images numerically on mixed multi-attribute datasets. ACM Trans Multimedia Comput Commun Appl. 2022;18(3s):1–16.
- 29. Xi E, Bing S, Jin Y. Capsule network performance on complex data. arXiv preprint 2017. https://arxiv.org/abs/1712.03480
- 30.
Hinton GE, Sabour S, Frosst N. Matrix capsules with EM routing. In: International Conference on Learning Representations, 2018.
- 31.
Szegedy C, Vanhoucke V, Ioffe S, Shlens J, Wojna Z. Rethinking the inception architecture for computer vision. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016. p. 2818–26.
- 32. Zhao L, Shang M, Gao F, Li R, Huang F, Yu J. Representation learning of image composition for aesthetic prediction. Computer Vision and Image Understanding. 2020;199:103024.
- 33.
Xu M, Zhong J, Ren Y, Liu S, Li G. In: MM Proceedings of the ACM International Conference on Multimedia, Virtual, Online, USA, 2020. p. 798–806.
- 34. Zhu H, Zhou Y, Li L, Li Y, Guo Y. Learning personalized image aesthetics from subjective and objective attributes. IEEE Trans Multimedia. 2023;25:179–90.
- 35.
Szegedy C, Liu W, Jia Y, Sermanet P, Reed S, et al. Going deeper with convolutions. In: Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 2015. p. 1–9. https://doi.org/10.1109/CVPR.2015.7298594
- 36. Zhou J, Zhang Q, Fan J-H, Sun W, Zheng W-S. Joint regression and learning from pairwise rankings for personalized image aesthetic assessment. Comp Visual Med. 2021;7(2):241–52.
- 37. Sheng K, Dong W, Huang H, Chai M, Zhang Y, Ma C, et al. Learning to assess visual aesthetics of food images. Comp Visual Med. 2021;7(1):139–52.
- 38. Yang J, Zhou Y, Zhao Y, Lu W, Gao X. MetaMP: metalearning-based multipatch image aesthetics assessment. IEEE Trans Cybern. 2023;53(9):5716–28. pmid:35580097
- 39. Zhang C, Liu S, Li H. Quality-guided video aesthetics assessment with social media context. Journal of Visual Communication and Image Representation. 2020;71:102643.
- 40. Tang X, Luo W, Wang X. Content-based photo quality assessment. IEEE Trans Multimedia. 2013;15(8):1930–43.
- 41. Hou J, Ding H, Lin W, Liu W, Fang Y. Distilling knowledge from object classification to aesthetics assessment. IEEE Trans Circuits Syst Video Technol. 2022;32(11):7386–402.
- 42.
Murray N, Marchesotti L, Perronnin F. AVA: a large-scale database for aesthetic visual analysis. In: Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Providence, RI, USA, 2012. p. 2408–15.
- 43. Kim W-H, Choi J-H, Lee J-S. Objectivity and subjectivity in aesthetic quality assessment of digital photographs. IEEE Trans Affective Comput. 2020;11(3):493–506.
- 44. Li L, Zhu H, Zhao S, Ding G, Jiang H, et al. Personality driven multi-task learning for image aesthetic assessment. In: Proceedings of the IEEE International Conference on Multimedia and Expo, Shanghai, China; 2019. p. 430–5.
- 45.
Su H, Chen T, Kao C, Hsu W H, Chien S. Scenic photo quality assessment with bag of aesthetics-preserving features. In: MM Proceedings of the ACM International Conference on Multimedia Co-Located Workshops, Scottsdale, AZ, USA; 2011. p. 1212–6.
- 46. Zhang L, Gao Y, Zimmermann R, Tian Q, Li X. Fusion of multichannel local and global structural cues for photo aesthetics evaluation. IEEE Trans Image Process. 2014;23(3):1419–29. pmid:24723537
- 47. Kong S, Shen X, Lin Z, Mech R, Fowlkes C. Photo aesthetics ranking network with attributes and content adaptation. In: Lect. Notes Comput. Sci. Amsterdam, Netherlands. 2016. p. 662–79.
- 48. Kao Y, Huang K, Maybank S. Hierarchical aesthetic quality assessment using deep convolutional neural networks. Signal Processing: Image Communication. 2016;47:500–10.
- 49.
Chen Y, Rohrbach M, Yan Z, Yan S, Feng J, Kalantidis Y. Graph-based global reasoning networks. In: Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA; 2019. p. 433–42.
- 50.
She D, Lai Y, Yi G, Xu K. Hierarchical layout-aware graph convolutional network for unified aesthetics assessment. In: Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Virtual, Online, USA; 2021. p. 8471–80.
- 51.
Ma S, Liu J, Chen CW. A-lamp: adaptive layout-aware multi-patch deep convolutional neural network for photo aesthetic assessment. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2017. p. 722–31. https://doi.org/10.1109/cvpr.2017.84
- 52.
Chen Q, Zhang W, Zhou N, Lei P, Xu Y, Zheng Y, Fan J. Adaptive fractional dilated convolution network for image aesthetics assessment. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2020: 14114–23.
- 53.
She D, Lai YK, Yi G, Xu K. Hierarchical layout-aware graph convolutional network for unified aesthetics assessment. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2021. p. 8475–84.
- 54.
Hosu V, Goldlucke B, Saupe D. Effective aesthetics prediction with multi-level spatially pooled features. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2019.
- 55.
He S, Zhang Y, Xie R, Jiang D, Ming A. Rethinking image aesthetics assessment: models, datasets and benchmarks. In: Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI), 2022.
- 56. Hou J, Lin W, Yue G, Liu W, Zhao B. Interaction-matrix based personalized image aesthetics assessment. IEEE Trans Multimedia. 2023;25:5263–78.
- 57. Hou J, Ding H, Lin W, Liu W, Fang Y. Distilling knowledge from object classification to aesthetics assessment. IEEE Trans Circuits Syst Video Technol. 2022;32(11):7386–402.
- 58.
Yue X, Sun S, Kuang Z, Wei M, Torr PHS, Zhang W, Lin D. Vision transformer with progressive sampling. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2021. p. 387–96.
- 59.
Chen Z, Zhu Y, Zhao C, Hu G, Zeng W, Wang J, et al. DPT: deformable patch-based transformer for visual recognition. In: Proceedings of the 29th ACM International Conference on Multimedia, 2021. p. 2899–907. https://doi.org/10.1145/3474085.3475467
- 60. Liu Z, Lin Y, Cao Y, Hu H, Wei Y, Zhang Z, et al. Swin transformer: hierarchical vision transformer using shifted windows. arXiv preprint 2021.
- 61.
Ke J, Wang Q, Wang Y, Milanfar P, Yang F. MUSIQ: multi-scale image quality transformer. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2021: 5148–57.
- 62.
Xia Z, Pan X, Song S, Li LE, Huang G. Vision transformer with deformable attention. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022. p. 4794–803.
- 63.
Tu Z, Talebi H, Zhang H, Yang F, Milanfar P, Bovik A, et al. MaxViT: multi-axis vision transformer. In: Proceedings of the European Conference on Computer Vision (ECCV). 2022.