Critical analysis on the reproducibility of visual quality assessment using deep features

doi:10.1371/journal.pone.0269715

Fig 1.

This is a high-level flowchart of the procedures employed in four of the five referenced visual quality papers.

Video frames, image patches, or images are input into a pre-trained deep learning network whose head has been replaced with a classification or regression head. The entire network is fine-tuned and then used as a feature extractor. The approaches differ in whether they use the last layer or all layers as a feature source. The feature representations are then aggregated, where appropriate, and used to train the final quality predictor.

Table 1.

Performance results of DeepBIQ on LIVE-in-the-wild (rows 1-3) and TID2013 (rows 4-6) according to [6], as well as our own reimplementation of the approach as described in the paper.

The last column (‘ft’) indicates whether fine-tuning was performed correctly (green checkmark) or with data leakage Case I (red cross). The numbers in bold font in rows 3 and 6 give the true performance of DeepBIQ, well below the claimed 0.90/0.89 PLCC/SROCC for LIVE-in-the-wild and 0.96 PLCC and SROCC for TID2013.
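
The distinction between correct fine-tuning and fine-tuning with leakage can be illustrated with a minimal sketch. It assumes that data leakage Case I means the fine-tuning data overlaps the test set, as the caption of Fig 8 suggests. The function names `split_ids` and `has_leakage`, the 80/20 split, and the integer item ids are hypothetical and are not taken from any of the referenced papers.

```python
import random


def split_ids(item_ids, test_fraction=0.2, seed=0):
    """Shuffle the ids once and split them into (train, test) lists.

    Assumption: the split is made a single time, before any fine-tuning,
    and the same test ids are reused for the final quality predictor.
    """
    rng = random.Random(seed)
    shuffled = list(item_ids)
    rng.shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def has_leakage(fine_tune_ids, test_ids):
    """Return True if any test item was also used for fine-tuning."""
    return not set(fine_tune_ids).isdisjoint(test_ids)


all_ids = list(range(100))  # hypothetical image or video ids
train_ids, test_ids = split_ids(all_ids)

# Correct protocol: the feature extractor is fine-tuned on train_ids only.
print(has_leakage(train_ids, test_ids))

# Leaky protocol: the feature extractor is fine-tuned on the whole database.
print(has_leakage(all_ids, test_ids))
```

A check of this kind is cheap to run before every fine-tuning stage, since it only compares sets of identifiers.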

Fig 2.

The training progress during fine-tuning as reported in [9].

The blue lines show smoothed and per iteration training accuracies in dark and light color variants, respectively. Similarly, the orange lines depict smoothed and per iteration training losses in dark and light color variants, respectively. The dashed dark gray lines linearly connect the validation accuracies and losses indicated by the dark gray circle markers.

Fig 3.

A diagram of the MOS scale (numbers 1 to 5), and the class labels used in [9] and [10] for fine-tuning the network represented by the bins E to A.

Each bin interval is highlighted by a different color (red to green). The video MOS values are binned according to the five intervals (E to A) to form the target classes used during training. The quantization of MOS values to the five bins reduces labelling precision, which makes training more challenging. For instance, given three videos v₁, v₂, and v₃ at adjacent class boundaries, the difficulty of the classification task becomes apparent. The perceived quality of v₂ and v₃ is very similar, but they are split into different classes. Conversely, v₁ and v₂ have a less similar quality than the previous pair, but they are grouped into the same class.
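
The effect of the quantization can be reproduced with a small sketch. It assumes five equal-width bins over the MOS range 1 to 5; the exact bin boundaries used in [9] and [10] are not stated in this caption and may differ. The MOS values assigned to v1, v2, and v3 are made up for illustration.

```python
# Assumption: five equal-width bins over [1, 5], labelled E (worst) to A (best).
CLASS_LABELS = ["E", "D", "C", "B", "A"]
MOS_MIN, MOS_MAX = 1.0, 5.0


def mos_to_class(mos):
    """Map a MOS value in [1, 5] to one of the five class labels."""
    if not MOS_MIN <= mos <= MOS_MAX:
        raise ValueError("MOS must lie in [1, 5]")
    width = (MOS_MAX - MOS_MIN) / len(CLASS_LABELS)
    # Clamp so that the maximum MOS falls into the top bin.
    index = min(int((mos - MOS_MIN) / width), len(CLASS_LABELS) - 1)
    return CLASS_LABELS[index]


# Hypothetical videos: v2 and v3 differ by 0.10 MOS, v1 and v2 by 0.70 MOS.
videos = {"v1": 3.45, "v2": 4.15, "v3": 4.25}
for name, mos in videos.items():
    print(name, mos, mos_to_class(mos))
```

Under these assumed boundaries, v1 and v2 receive the same label although their scores differ more than those of v2 and v3, which end up in different classes.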

Fig 4.

Comparison of reimplementations of the fine-tuning procedure.

The top figure depicts the training progress of a fine-tuning procedure with data leakage, while the bottom figure shows the training progress of a fine-tuning procedure without data leakage.

Fig 5.

Average distribution of class predictions in percent across the five splits used for the fine-tuning of the feature extraction model.

The error bars denote the standard deviation.

Fig 6.

Performance comparison of SVRs trained using different kernel functions from our reimplementation.

Chart (a) shows the results when no fine-tuning is used for the feature extraction network. Chart (b) shows the performance with correctly applied fine-tuning, which is also the true performance of the approach. Charts (c) and (d) depict the performance when fine-tuning is performed with data leakage. The bars represent the average performance over five random training, validation, and test splits. Independent test sets are chosen prior to fine-tuning; for (d), tainted test sets are additionally chosen at random before SVR training. The red cross markers represent the corresponding numbers reported in [9], as measured from the figures in the paper.

Fig 7.

Performance comparison of our reimplementation of the approach described in [10].

Again, bar (a) depicts the performance when no fine-tuning is used for the feature extraction network. When correctly applying fine-tuning, we obtained the performance shown in bar (b), which is also the true performance of the approach. Bars (c) and (d) indicate the performance when fine-tuning is performed with data leakage. All bars represent the average performance over five random training, validation, and test splits. Independent test sets are chosen before fine-tuning; for (d), tainted test sets are additionally chosen at random before SVR training. The red cross markers represent the corresponding numbers reported in [10], as measured from the figures in the paper.

Table 2.

Performance results of various VQA algorithms on KoNViD-1k.

The data is taken from the references listed in the second column. In the upper half, the first column gives the abbreviated name of the algorithm. The lower half denotes the base architecture used to extract features (column ‘base’) and the model used to predict the overall quality (column ‘pred’). The last two columns indicate whether fine-tuning (column ‘ft’) was performed correctly (green checkmark) or with data leakage (red cross), and whether the test set (column ‘test’) was independent (green checkmark) or tainted (red cross). The two approaches indicated by * were published after the referenced publication and represent the current state of the art. –.–– indicates unreported values. The numbers in bold font in lines 15 and 20 give the true performance of CNN-SVR and CNN-LSTM, well below the claimed performance.

Table 3.

Performance results of MultiGAP-NRIQA on KADID-10k alongside our reimplementation, both with and without data leakage.

The last column indicates whether the splits were sampled correctly with respect to content (green checkmarks) or randomly, thereby giving rise to tainted test sets (red crosses). The numbers in bold font in lines 3 and 6 give the true performance of the method in [8], both with and without fine-tuning, well below 0.97/0.94 PLCC and 0.97/0.94 SROCC, respectively, as claimed. In line 7 we report the results of an implementation of an end-to-end regression network that combines the feature selection and the quality score prediction.
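
Content-aware sampling can be sketched as follows. The idea is to split on the source content rather than on individual distorted images, so that all distorted versions of one source image land on the same side of the split. The function `split_by_content`, the toy naming scheme, and the 80/20 ratio are hypothetical; the toy data below is not KADID-10k.

```python
import random


def split_by_content(image_to_content, test_fraction=0.2, seed=0):
    """Split images into (train, test) so that no content appears in both."""
    contents = sorted(set(image_to_content.values()))
    rng = random.Random(seed)
    rng.shuffle(contents)
    n_test = max(1, int(round(len(contents) * test_fraction)))
    test_contents = set(contents[:n_test])
    train = [img for img, c in image_to_content.items() if c not in test_contents]
    test = [img for img, c in image_to_content.items() if c in test_contents]
    return train, test


# Toy data (assumption): 10 source contents, each with 5 distorted versions.
image_to_content = {
    f"ref{r:02d}_dist{d}": f"ref{r:02d}" for r in range(10) for d in range(5)
}

train_images, test_images = split_by_content(image_to_content)

# With a content-aware split, the two sides share no source content.
shared = {image_to_content[i] for i in train_images} & {
    image_to_content[i] for i in test_images
}
print(len(train_images), len(test_images), sorted(shared))
```

A purely random split over the distorted images would, in contrast, typically place versions of the same source content in both sets, which is what the caption calls a tainted test set.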

Fig 8.

Performance (SROCC) of the DeepRN [7] model under data-leakage Case I.

The network is trained in two stages. In the first stage, the network is trained on the entire KonIQ-10k database for the number of epochs shown on the x-axis. Intermediate models are saved every ten epochs, and the second-stage training is performed on each of them, correctly using the official training, validation, and test sets. The first-stage models are trained using the cross-entropy loss and output five class likelihoods, one for each score range. To compare the reported performances of the two stages, we calculate the corresponding scalar scores from the class likelihoods. The likelihoods form a distribution over the respective integer scores. Thus, we compute the SROCC between the means of these distributions and the corresponding ground-truth MOS. Neither the test nor the training performance in the second stage matches the results reported in the DeepRN paper.
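
The conversion from class likelihoods to scalar scores, followed by the rank correlation, can be sketched in plain Python. The sketch assumes that the five classes correspond to the integer scores 1 to 5; the likelihood vectors and MOS values are made up for illustration, and the helper names are hypothetical. Ties are handled with average ranks.

```python
def expected_score(likelihoods, scores=(1, 2, 3, 4, 5)):
    """Mean of the score distribution defined by the class likelihoods."""
    total = sum(likelihoods)  # normalise in case the values do not sum to 1
    return sum(p * s for p, s in zip(likelihoods, scores)) / total


def average_ranks(values):
    """Ranks starting at 1; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson linear correlation coefficient of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def srocc(x, y):
    """Spearman rank-order correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical first-stage outputs for four images and their ground-truth MOS.
likelihoods = [
    [0.70, 0.20, 0.10, 0.00, 0.00],
    [0.10, 0.30, 0.40, 0.20, 0.00],
    [0.00, 0.10, 0.20, 0.50, 0.20],
    [0.00, 0.00, 0.10, 0.30, 0.60],
]
mos = [1.6, 3.1, 2.9, 4.4]

predicted = [expected_score(p) for p in likelihoods]
print(predicted, srocc(predicted, mos))
```

Using the distribution mean rather than the most likely class keeps the predictions on a continuous scale, which avoids large numbers of ties when ranking.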
