
Transferability of features for neural networks links to adversarial attacks and defences

  • Shashank Kotyan ,

    Roles Conceptualization, Data curation, Formal analysis, Investigation, Methodology, Validation, Visualization, Writing – original draft, Writing – review & editing

    Affiliation Department of Information Science and Engineering, Kyushu University, Fukuoka, Japan

  • Moe Matsuki,

    Roles Conceptualization, Data curation, Formal analysis, Investigation, Methodology, Resources, Software, Validation, Visualization, Writing – original draft, Writing – review & editing

    Affiliation SoftBank Group Corporation, Tokyo, Japan

  • Danilo Vasconcellos Vargas

    Roles Conceptualization, Funding acquisition, Project administration, Resources, Software, Supervision, Writing – review & editing

    Affiliations Department of Information Science and Engineering, Kyushu University, Fukuoka, Japan, Department of Electrical Engineering and Information Systems, School of Engineering, The University of Tokyo, Tokyo, Japan


The reason for the existence of adversarial samples is still barely understood. Here, we explore the transferability of learned features to Out-of-Distribution (OoD) classes. We do this by assessing neural networks’ capability to encode the existing features, revealing an intriguing connection with adversarial attacks and defences. The principal idea is that, “if an algorithm learns rich features, such features should represent Out-of-Distribution classes as a combination of previously learned In-Distribution (ID) classes”. This is because OoD classes usually share several regular features with ID classes, provided the learned features are general enough. We further introduce two metrics to assess how well the transferred features represent OoD classes. One is based on inter-cluster validation techniques, while the other captures the influence of a class over learned features. Experiments suggest that several adversarial defences reduce the success of some attacks and improve the transferability-of-features as measured by our metrics. Experiments also reveal a relationship between the proposed metrics and adversarial attacks (a high Pearson correlation coefficient and low p-value). Further, statistical tests suggest that several adversarial defences, in general, significantly improve transferability. Our tests suggest that models with higher transferability-of-features generally have higher robustness against adversarial attacks. Thus, the experiments suggest that the objectives of adversarial machine learning might be much closer to those of domain transfer learning than previously thought.

1 Introduction

Adversarial samples are noise-perturbed samples that can cause neural networks to fail at tasks like image classification. Since they were discovered by [1] some years ago, both the quality and variety of adversarial samples have grown. These adversarial samples can be generated by a specific class of algorithms known as adversarial attacks [2–5].

Most of these adversarial attacks can also be transformed into real-world attacks [6–8], which poses a serious problem and a security risk for current applications of neural networks. Despite the existence of many variants of defences to these adversarial attacks [9–20], ‘no known learning algorithm or procedure can defend consistently’ [21–26]. This shows that a more profound understanding of the adversarial algorithms is needed to formulate consistent and robust defences.

Several works have focused on understanding the reasoning behind such a lack of robust performance. It is hypothesised in [9] that neural networks’ linearity is one of the main reasons for failure. Another investigation [27] shows that deep neural networks learn false structures that are simpler to learn rather than the ones expected.

Moreover, research by [28, 29] unveils that adversarial attacks alter where the algorithm is paying attention. In [30], it is discussed that an adversarial sample may have a different interpretation of learned features than the benign sample. The authors show that the learned features of adversarial samples are remarkably similar to those of images of a different true class, linking adversarial robustness to the features learned by deep neural networks.

1.1 Overview

This article tries to open up a new perspective on understanding adversarial algorithms, based on evaluating the transferability-of-features from In-Distribution classes to Out-of-Distribution classes. We do this by verifying that this transferability is indeed linked with the adversarial attacks and defences for neural networks. Specifically, we propose a methodology loosely based on Zero-Shot Learning, entitled Raw Zero-Shot, for evaluating this transferability.

We conduct experiments over the soft-labels of an Out-of-Distribution class to assess the transferability-of-features for various classifiers. This is based on the hypothesis that, if a classifier is capable of learning useful features, an Out-of-Distribution class would also be associated with some of these features learned from In-Distribution classes (Amalgam Proportion) (Fig 1).

Fig 1. Illustration of the Raw Zero-Shot methodology using transferability-of-features.

In the figure, the unknown class (Giant Panda) is represented as a combination of known classes (Bear, Bird, and Zebra).

We call this type of inspection over an Out-of-Distribution class Raw Zero-Shot (Section 3). Furthermore, we also introduce two associated metrics to evaluate this transferability: one based upon the Clustering Hypothesis (Section 3.1), and the other based on the Amalgam Hypothesis (Section 3.2).

1.2 Contributions

  • Evaluate a wide assortment of datasets and classifiers and assess their transferability-of-features. (Section 5)
  • Evaluate different adversarial defences and understand their effect on transferability-of-features. Also, determine the statistical relevance of defences on transferability by conducting a paired samples t-test. (Section 6)
  • Reveal an intriguing connection between the transferability-of-features and attack susceptibility by calculating the Pearson coefficient of the proposed metrics measuring transferability-of-features with adversarial attacks. (Section 7)
  • Discuss some observations for different adversarial attacks, defences and architectures from the perspective of transferability-of-features. (Section 8)

2 Related works

2.1 Understanding adversarial attacks

Since the discovery of adversarial samples by [1], many researchers have tried to understand the adversarial attacks. It is hypothesised in [9] that neural networks’ linearity is one of the principal reasons for failure against an adversary, and non-linear neural networks are thus more robust than linear networks [31]. A geometric perspective is analysed in [32], where it is shown that adversarial samples lie in a shared subspace, along which the decision boundary of a classifier is positively curved. In [33], a relationship between sensitivity to additive perturbations of the inputs and the curvature of the decision boundary of deep networks is shown. Another aspect of robustness is discussed in [18], where the authors suggest that the capacity of the neural network’s architecture is relevant to its robustness. It is also stated in [34] that adversarial vulnerability is a significant consequence of the dominant supervised learning paradigm and a classifier’s sensitivity to well-generalising features in the known input distribution. In [35], the authors proposed a decision-based black-box adversarial attack, further suggesting that the decision boundary of neural networks might not be robust enough. Also, research by [36] argues that adversarial attacks are entangled with the interpretability of neural networks, as results on adversarial samples can hardly be explained. The bounds for robustness using this input feature space are also studied in [37]. Further, the existence of different internal representations learned by neural networks for an adversarial sample compared to a benign sample is shown in [30]. In this article, we explore a new perspective to understand adversarial attacks and defences based on the transferability-of-features of the neural networks.

2.2 Zero-Shot learning

Zero-Shot learning is a method to estimate Out-of-Distribution classes that do not appear in the training data. The motivation of Zero-Shot learning is to transfer knowledge from In-Distribution classes to Out-of-Distribution classes. Existing methods address the problem by estimating Out-of-Distribution classes from an attribute vector defined manually for all classes. For each class, whether such an attribute (like colour or shape) relates to the class is represented by one or zero. [38] introduced the Direct Attribute Prediction (DAP) model, which learns each parameter of the input sample for estimating the attributes of the sample from the generated feature vector. It then estimates an unknown class of the source data from the target data by using these parameters. This approach projects feature vectors generated by learned classes into the source domain to classify the unknown classes. Based on this research, other zero-shot learning methods have been proposed which use an embedded representation generated by a natural language processing algorithm instead of a manually created attribute vector [39–43]. The opposite direction was proposed in [44], which learned how to project from the source domain to the generated feature vector. [45] proposed a different strategy of constructing a histogram of the known-class distribution for an unknown class to estimate unknown classes. They assume that unknown classes are the same if the histograms generated in the prediction and source domains are similar. [46] suggested that generative models learn better about the Out-of-Distribution class by learning independent attributes in the zero-shot setting. It is also shown in [47] that generative models are favourable for zero-shot learning. Our Raw Zero-Shot test is distinguished from other zero-shot learning algorithms in that, in Raw Zero-Shot, the neural network has no access to features (attribute vectors) or additional supplementary knowledge.

3 Raw Zero-Shot

Raw Zero-Shot is a learning test in which only N − 1 of the N classes in the dataset are presented to the classifier during training; in other words, all the samples of one specific class are removed from the standard training dataset. Such a classifier trained on only N − 1 of the N classes is called a ‘Raw Zero-Shot Classifier’. Please note that a ‘Standard Classifier’, trained on all N classes, has N soft-label dimensions in the soft-label space. In contrast, a Raw Zero-Shot Classifier has only N − 1 soft-label dimensions in the soft-label space due to the forced exclusion of a class.

The excluded (now) Out-of-Distribution class can then be predicted as a combination of the remaining N − 1 soft-label dimensions of the learned In-Distribution classes. We call this combination the ‘Amalgam Proportion’ (Fig 1). Only the Out-of-Distribution class (the class excluded from N) is provided to the classifier during testing, and its Amalgam Proportion is recorded for the classifier. This process is iterated for all N classes, excluding a different class each time.
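The leave-one-class-out loop described above can be sketched as follows. Here `raw_zero_shot`, `dummy_train`, and the toy data are hypothetical stand-ins: the real experiments train a neural network on the N − 1 remaining classes, whereas the dummy "classifier" below merely returns uniform soft-labels.

```python
import numpy as np

def raw_zero_shot(train_fn, X, y, n_classes):
    """Leave-one-class-out loop of the Raw Zero-Shot test.

    For each class c, train a classifier on the remaining N-1 classes,
    then record the soft-labels (Amalgam Proportion) that classifier
    assigns to the held-out samples of class c.
    """
    amalgams = {}
    for c in range(n_classes):
        in_dist = y != c                      # keep only the N-1 known classes
        clf = train_fn(X[in_dist], y[in_dist])
        amalgams[c] = clf(X[~in_dist])        # soft-labels over N-1 dimensions
    return amalgams

# Toy stand-in "training": returns a classifier that outputs uniform
# soft-labels over the N-1 in-distribution classes.
def dummy_train(X_tr, y_tr):
    k = len(np.unique(y_tr))                  # N - 1 remaining classes
    return lambda X: np.full((len(X), k), 1.0 / k)

X = np.random.rand(30, 4)
y = np.repeat(np.arange(3), 10)               # 3 classes, 10 samples each
props = raw_zero_shot(dummy_train, X, y, n_classes=3)
```

Each entry `props[c]` then holds the Amalgam Proportions of the held-out class c, over which the metrics of Sections 3.1 and 3.2 are computed.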

The soft-labels of a classifier compose a space in which a given image is categorised as a weighted vector involving the previously learned classes. If neural networks can learn the features existing in the In-Distribution classes, it would be reasonable to consider that the Amalgam Proportion also describes a given image as a combination of the previously learned classes (Fig 1). Similar to a vector space in linear algebra, the soft-labels can be combined to describe Out-of-Distribution objects in this space.

In our hypothetical example (Fig 1), the Out-of-Distribution class (Giant Panda) is represented as a combination of In-Distribution classes (Bear, Zebra, Bird), where 60% of the features of Bear (like body-shape) and 39% of the features of Zebra (like stripes pattern) are ‘associated’ with the Giant Panda. This is analogous to how children associate unknown objects (Giant Panda) with a combination of recognised objects (Bear and Zebra) when asked to describe the unknown object with their learned knowledge [48, 49]. Further, a study by [50] shows that humans can combine available perceptual information with stored knowledge of experiential regularities, which helps us to describe things that are similar as close and things that are dissimilar as far apart.

Thus, intuitively, all the images of the class Giant Panda should have a similar Amalgam Proportion, as the hypothetical classifier can associate Giant Panda with some features of the Zebra and Bear classes, and all the images belong to a single class. Metrics are then computed over the Amalgam Proportion of the Out-of-Distribution class to assess transferability-of-features (Fig 2). These metrics are based on different hypotheses of what defines a feature or a class. In the same way as there are various aspects of robustness, there are also different interpretations of transferability-of-features. Therefore, our metrics are complementary, each highlighting a different perspective of the whole. The following subsections define them.

Fig 2. Illustration of proposed metrics, a) Davies–Bouldin Metric (DBM), and b) Amalgam Metric (AM).

Here, we use a neural network’s last layer (classification layer) to evaluate the transferability-of-features from known classes to an unknown class. A network with good transferability-of-features would have a consistent view of the unknown class and form a dense cluster.

3.1 Davies–Bouldin Metric (DBM)—Clustering hypothesis

We can use cluster validation techniques to assess the Amalgam Proportion, considering that the cluster of Amalgam Proportions of an Out-of-Distribution class would constitute a class in itself. Here, we choose for simplicity the Davies–Bouldin Index [51], one of the most used metrics in internal cluster validation. The Davies–Bouldin Metric (DBM) for an Out-of-Distribution class can be defined as

DBM = (1/n) Σᵢ ‖zᵢ − G‖,

in which n is the number of samples from the Out-of-Distribution class, G is the centroid of the cluster formed by the soft-labels of all the n samples, and zᵢ is the soft-label of a single sample of the Out-of-Distribution class. A denser cluster has a lower DBM score, representing a consistent view taken by the classifier in terms of the features learned from the In-Distribution classes.
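As a minimal sketch, the DBM described here (the mean distance of each sample's soft-label vector to the cluster centroid G) can be computed as below; the exact norm and normalisation used in the paper are assumptions.

```python
import numpy as np

def dbm_score(soft_labels):
    """Mean Euclidean distance of each sample's soft-label vector to the
    cluster centroid G; a lower score means a denser, more consistent
    cluster of Amalgam Proportions."""
    G = soft_labels.mean(axis=0)                       # cluster centroid
    return float(np.mean(np.linalg.norm(soft_labels - G, axis=1)))

# A tight cluster of Amalgam Proportions scores lower than a loose one.
tight = np.array([[0.60, 0.40], [0.62, 0.38], [0.58, 0.42]])
loose = np.array([[0.90, 0.10], [0.10, 0.90], [0.50, 0.50]])
```

For example, `dbm_score(tight)` is smaller than `dbm_score(loose)`, matching the interpretation that a denser cluster reflects a more consistent view of the unknown class.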

3.2 Amalgam Metric (AM)—Amalgam hypothesis

Differently from the previous metric, we base this metric on the hypothesis that the classes learned by a classifier share some similarities with the Out-of-Distribution class, and that the classifier can associate this similarity in its features while evaluating these Out-of-Distribution classes. This hypothesis follows from the fact that humans can combine available perceptual information with stored knowledge of experiential regularities, which helps us to describe things that are ‘similar’ as close and things that are ‘dissimilar’ as far apart [50]. However, what would constitute the baseline Amalgam Proportion for a given Out-of-Distribution class still needs to be determined to assess the extent to which the classifier exploits this similarity between classes.

To calculate the baseline Amalgam Proportion of a given Out-of-Distribution class, we use here the assumption that ‘Standard Classifiers’ should output a good approximation of the Amalgam Proportion, since the class is In-Distribution for the standard classifier. We thus compare the evaluated Amalgam Proportion of the Raw Zero-Shot Classifier and the baseline Amalgam Proportion of the Standard Classifier for a given class with our Amalgam Metric (AM) (Fig 2), defined as

AM = (1/n) Σᵢ ‖z′ᵢ − zᵢ‖,

in which z′ is the normalised soft-labels of non-target classes from the Standard Classifier, and z is the soft-labels of In-Distribution classes of the Raw Zero-Shot Classifier. Note that the given class is ‘known’ (target) by the Standard Classifier and ‘unknown’ to the Raw Zero-Shot Classifier.
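A minimal sketch of the AM computation, assuming it is a mean distance between the Standard Classifier's normalised non-target soft-labels and the Raw Zero-Shot Classifier's soft-labels (the paper's exact form may differ):

```python
import numpy as np

def am_score(z_standard, z_raw):
    """Assumed AM form: mean distance between the Standard classifier's
    normalised non-target soft-labels (z') and the Raw Zero-Shot
    classifier's soft-labels (z) for the same held-out samples."""
    z_std = z_standard / z_standard.sum(axis=1, keepdims=True)  # normalise z'
    return float(np.mean(np.linalg.norm(z_std - z_raw, axis=1)))

# Identical Amalgam Proportions give a zero score; a mismatch gives a
# non-zero score, indicating class-specific (special) features.
same = np.array([[0.5, 0.5], [0.2, 0.8]])
shifted = np.array([[0.9, 0.1], [0.8, 0.2]])
```

Here a score of zero means the Raw Zero-Shot Classifier reproduces the baseline Amalgam Proportion exactly, i.e. only general features were needed.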

Hence, the Amalgam Metric captures the existence of some unique learned features specific to a class, which in turn change the Amalgam Proportion between the Raw Zero-Shot Classifier and the Standard Classifier. A higher AM score corresponds to a classifier preferring to learn special features of a class over general features present across the distribution; conversely, a lower AM score corresponds to a classifier preferring to learn general features over special features. A non-zero AM score thus indicates that unique, class-specific features are learned by training the classifier on that specific class.

4 Experimental design

4.1 Considered datasets

We conduct experiments on three diverse datasets to evaluate the transferability-of-features for neural networks. We use Fashion MNIST (F-MNIST) [52], CIFAR-10 [53], and a customised Imagenet (Sub) dataset (combining 100 classes of the original ILSVRC2012 Imagenet dataset [54] into 10 superclasses) for our evaluations. More details about the customised Sub-Imagenet dataset are mentioned in S1 Appendix in S1 File. Note that the number of samples in the assumed unknown class differs across datasets (7000 for Fashion MNIST, 6000 for CIFAR-10, and roughly 13500 for Sub-Imagenet). We use the samples from both the training and the testing splits of the ‘unknown’ class for evaluation, because these samples are excluded from the training process.

4.2 Considered classifiers

We evaluate different architectures for different datasets. For the Fashion MNIST dataset, we evaluate a Multi-Layer Perceptron (MLP) and a shallow Convolutional Neural Network (ConvNet). For the CIFAR-10 dataset, we evaluate LeNet (a simpler architecture which is a historical mark) [55], VGG (a previous state-of-the-art architecture which is a historical mark) [56], All Convolutional Network (AllConv) (an architecture without max pooling and fully-connected layers) [57], Network in Network (NIN) (an architecture which uses micro neural networks instead of linear filters) [58], Residual Networks (ResNet) (an architecture based on skip connections) [59], Wide Residual Networks (WideResNet) (an architecture which also expands in width) [60], DenseNet (an architecture which is a logical extension of ResNet) [61], and Capsule Networks (CapsNet) (a completely different architecture based on dynamic routing and capsules) [62]. For our Sub-Imagenet dataset, we chose InceptionV3 [63] and ResNet-50 [59]. More details about the Standard and Raw Zero-Shot Classifiers are mentioned in S2 Appendix in S1 File.

4.3 Considered adversarial defences

We also evaluated the transferability-of-features for some adversarial defences on the CIFAR-10 dataset, such as Feature Squeezing (FS) [17], Spatial Smoothing (SS) [17], Label Smoothing (LS) [13], Thermometer Encoding (TE) [20], and Adversarial Training (AT) [18]. Please refer to [23] for a discussion about the performance of adversarial defences in general. We also evaluate classifiers trained with an augmented dataset having Gaussian Noise of σ = 1.0 (G Aug). More details about the adversarial defences are mentioned in S3 Appendix in S1 File.
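The Gaussian-noise augmentation baseline (G Aug) can be sketched as below; the [0, 1] pixel range and the clipping step are assumptions about the preprocessing, not details stated in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def gaussian_augment(images, sigma=1.0):
    """Augment a batch with additive Gaussian noise (sigma = 1.0 as in the
    'G Aug' setting); pixels are assumed normalised to [0, 1] and are
    clipped back into range after the noise is added."""
    noisy = images + rng.normal(0.0, sigma, size=images.shape)
    return np.clip(noisy, 0.0, 1.0)

batch = rng.random((8, 32, 32, 3))   # toy CIFAR-10-shaped batch
aug = gaussian_augment(batch)
```

The augmented batch keeps the original shape and valid pixel range, and would be mixed into the training data alongside the clean samples.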

4.4 Considered attacks

We also evaluated all our standard vanilla classifiers against well-known adversarial attacks such as Fast Gradient Method (FGM) [9], Basic Iterative Method (BIM) [7], Projected Gradient Descent Method (PGD) [18], DeepFool (DF) [64], and NewtonFool (NF) [65]. More details about the adversarial attacks are mentioned in S4 Appendix in S1 File.
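As an illustration of the simplest of these attacks, the Fast Gradient Method perturbs an input one step in the direction of the sign of the loss gradient; `grad_loss` below is a hypothetical stand-in for the gradient supplied by a model's autodiff, and the epsilon value and [0, 1] clipping are assumptions.

```python
import numpy as np

def fgm(x, grad_loss, eps=0.03):
    """Fast Gradient Method (untargeted, L-infinity): move the input one
    step of size eps along the sign of dLoss/dx, then clip back to the
    valid [0, 1] pixel range."""
    return np.clip(x + eps * np.sign(grad_loss(x)), 0.0, 1.0)

# With a constant positive gradient, every pixel shifts up by eps
# (up to clipping).
x = np.full((4, 4), 0.5)
adv = fgm(x, lambda z: np.ones_like(z), eps=0.1)
```

The iterative attacks (BIM, PGD) repeat essentially this step several times with a smaller step size and a projection back into the allowed perturbation ball.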

4.5 Technical assumption and limitations

We assume that an architecture will have a consistent view of all the Out-of-Distribution classes, since they belong to the same dataset from which the In-Distribution classes are sampled. An analysis of transferability-of-features for an unknown class belonging to a different dataset is left for future work. Further, in this article, we evaluate a single excluded class as the Out-of-Distribution (OoD) class with respect to several In-Distribution classes. A more complex analysis of multiple excluded classes, and of the effect on transferability-of-features of any relationship between the excluded (OoD) class and the In-Distribution classes, is left for future work.

5 Experimental results for vanilla classifiers

Table 1 shows the results of our metrics (DBM and AM) for vanilla classifiers. Note that we use the mean of the metric values across all N classes of a dataset as the characteristic metric value for an architecture. To enable the visualisation of DBM, we plot a projection of all the points in the decision space of the unknown class (N − 1 dimensions) into a two-dimensional space using Isometric Mapping (Isomap) (Fig 3). The characteristic of Isomap is that it seeks a lower-dimensional embedding that maintains geodesic distances between all sample points; that is, it preserves the high-dimensional distances between the points. Other manifold visualisations are added in S5 Appendix in S1 File.
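A two-dimensional Isomap projection of the held-out class's soft-labels, of the kind used for Fig 3, can be sketched with scikit-learn; the Dirichlet toy data below stands in for real Amalgam Proportions, and the neighbour count is an assumption.

```python
import numpy as np
from sklearn.manifold import Isomap

rng = np.random.default_rng(0)

# Toy stand-in for the (N-1)-dimensional soft-labels of the held-out class:
# 100 samples over 9 in-distribution classes (rows sum to 1).
soft_labels = rng.dirichlet(np.ones(9), size=100)

# Isomap seeks a 2-D embedding that preserves geodesic distances between
# the sample points, so dense clusters in soft-label space stay dense.
embedding = Isomap(n_components=2, n_neighbors=10).fit_transform(soft_labels)
```

The resulting `embedding` array can be scattered directly; a tight point cloud corresponds to a low DBM score for that class.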

Fig 3. Visualisation of the Davies–Bouldin Metric (DBM) results for vanilla classifiers using a topology preserving two-dimensional projection with Isometric Mapping (Isomap).

Each row represents a classifier trained with one label excluded, whose projection is visualised, while each column represents a different classifier architecture. A denser clustering represents a consistent view of the unknown class by the architecture.

Table 1. Mean and standard deviation of Davies–Bouldin Metric (DBM) and Amalgam Metric (AM) scores for vanilla Raw Zero-Shot Classifiers.

Similarly, we can also visualise AM in the form of histograms of soft-labels for the classifiers. The computed histograms (H′ and H) are plotted for every class and classifier to enable the visualisation of the Amalgam Metric (Fig 4). It is interesting to note that the histograms of CapsNet (Fig 4) are different from the other ones. This reveals that the metric can capture such representation differences. It can also be noted (Fig 4) that for most classes of CapsNet, the variation is lower than in the other architectures. This contributes to CapsNet’s good representation.

Fig 4. Histograms of soft-labels (H′ and H) from which the AM is calculated.

Each row shows the histograms of one classifier with one class excluded. Dark-shaded thinner bins are the soft-labels from the ground-truth (H′), i.e., from the classifier trained on all classes, while light-shaded broader bins are the soft-labels of the classifier trained on N − 1 classes (H).

Table 1 reveals that, for the CIFAR-10 dataset, CapsNet possesses the best transferability-of-features amongst all classifiers examined, as it has the lowest (best) score in both of our metrics. LeNet has the second-best transferability, while the other architectures possess similar transferability.

For the Sub-Imagenet dataset, both architectures (InceptionV3 and ResNet-50) are similarly clustered and predict the Amalgam Proportion similarly. However, ResNet-50 has marginally better transferability than InceptionV3, as it has better scores on both of our metrics. Similarly, for the Fashion MNIST dataset, both architectures (MLP and ConvNet) have a similar quality of transferability. While ConvNet seems marginally superior to the MLP at clustering the unknown classes more tightly (suggested by DBM), MLP seems marginally superior at predicting the Amalgam Proportion (suggested by AM).

A further study could analyse the characteristics of a neural network’s representation which make one class more robust than others. Further investigations could also analyse the effect of a class on an adversarial attack based on this, providing insight into which classes are robust to adversarial attacks. However, these analyses are beyond the scope of the current article and are hence left for future work.

6 Link between transferability of features and adversarial defences

Table 2 shows the results of our metrics (DBM and AM) for vanilla classifiers and for classifiers employing a variety of adversarial defences to improve the robustness of vanilla classifiers on CIFAR-10. We also analyse the statistical relevance of the change in metric values due to the introduction of adversarial defences. A paired samples t-test was conducted between our metrics’ distributions (DBM and AM) for vanilla classifiers (without adversarial defence) and adversarially defended classifiers (Table 2) to test the significance of the change in metric values due to adversarial defences [66]. The null hypothesis of the paired samples t-test assumes that the true mean difference between the distributions is equal to zero. Based on the results (Table 2), adversarial defences, ‘in general’, tend to improve the transferability-of-features for the neural networks evaluated using the Amalgam Proportion. They do so either by creating a denser cluster of the soft-labels (suggested by DBM), by learning more general/special features (suggested by AM), or both.
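A paired samples t-test of this kind can be run as below; the per-class scores are illustrative numbers only, not results from the paper.

```python
import numpy as np
from scipy import stats

# Hypothetical per-class DBM scores for a vanilla classifier and the same
# classifier trained with an adversarial defence (one value per class).
vanilla  = np.array([0.41, 0.38, 0.45, 0.40, 0.43, 0.39, 0.44, 0.42, 0.37, 0.46])
defended = np.array([0.33, 0.31, 0.36, 0.30, 0.35, 0.32, 0.34, 0.33, 0.29, 0.37])

# Paired samples t-test: the null hypothesis is that the true mean
# per-class difference between the two distributions is zero.
t_stat, p_value = stats.ttest_rel(vanilla, defended)
```

A low p-value rejects the null hypothesis, i.e. the defence changed the metric's distribution with statistical significance, as reported for LS and AT.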

Table 2. Mean and standard deviation of Davies–Bouldin Metric (DBM) and Amalgam Metric (AM) values for different Raw Zero-Shot Classifiers with and without the adversarial defences on CIFAR-10.

Raw DBM score values for weaker defences such as Gaussian Augmentation, Feature Squeezing, Spatial Smoothing and Thermometer Encoding lie within the standard deviation of vanilla classifiers, suggesting that they have minimal effect on the clustering of the Amalgam Proportion of Out-of-Distribution classes. At the same time, DBM score values for defences such as LS and AT are noticeably lower than for vanilla classifiers, suggesting they form a denser cluster of Amalgam Proportions compared to the vanilla classifiers. Thus, a better association of available features is observed for the more robust defences. From the perspective of AM score values, the results suggest that LS favours learning special features belonging to a class, while AT favours learning more general features. Interestingly, a generally low p-value for the paired samples t-test is observed for the adversarial defences, which suggests that the underlying transferability for adversarial defences differs from the vanilla classifiers with high statistical relevance.

7 Link between transferability of features and adversarial attacks

As the results in Table 2 suggest a link between the transferability-of-features and the adversarial defences, it is intuitive to assume that there also exists a link between the transferability-of-features and the adversarial attacks. We conducted a Pearson correlation coefficient test of our metrics (DBM and AM) of the vanilla classifiers with adversarial attacks to evaluate the statistical relevance of this link between transferability, evaluated using the Amalgam Proportion, and adversarial attacks [67]. The Pearson correlation analysis suggests a relationship between our metrics and the adversarial attacks in general.

We use the analysis of adversarial attacks in the form of the Mean L2 Score (the L2 difference between the original sample and the adversarial one) to compute the correlation [64]. The Pearson correlation coefficients of our metrics (DBM and AM) with the Mean L2 Score are shown in Table 3 for every architecture and attack. These relationships between the Raw Zero-Shot metrics (DBM and AM) and the adversarial metrics (Adversarial Accuracy and Mean L2 Score) mentioned in Table 3 are also visualised in Figs 5 and 6.
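The Pearson correlation test between a metric and the Mean L2 Score can be sketched as follows; the per-class values are illustrative only, and the sign of the correlation here is an arbitrary choice for the example, not a result from Table 3.

```python
import numpy as np
from scipy import stats

# Hypothetical per-class values: DBM of a vanilla classifier and the Mean L2
# Score (L2 distance between the original and adversarial sample) of an
# attack, one pair per class.
dbm     = np.array([0.30, 0.35, 0.42, 0.48, 0.55, 0.61, 0.66, 0.72, 0.78, 0.85])
mean_l2 = np.array([1.90, 1.75, 1.62, 1.50, 1.41, 1.30, 1.22, 1.10, 1.02, 0.95])

# Pearson correlation coefficient r and its two-sided p-value; |r| near 1
# with a low p-value indicates a strong, statistically relevant link.
r, p = stats.pearsonr(dbm, mean_l2)
```

Plotting `mean_l2` against `dbm` with a fitted line reproduces the kind of visualisation shown in Figs 5 and 6.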

Fig 5. Visualisation of Pearson correlation of Davies-Bouldin Metric (DBM) with Mean L2 Score of adversarial attacks (Table 3).

Here, the x-axis represents the Mean L2 Scores while the y-axis represents the DBM values. Each point represents a DBM value and Mean L2 Score for a labelled class. The slope of the line reflects the Pearson correlation coefficient, while lower dispersion of the points around the line indicates a low p-value.

Fig 6. Visualisation of Pearson correlation of Amalgam Metric (AM) with Mean L2 Score of adversarial attacks (Table 3).

Here, the x-axis represents the Mean L2 Scores while the y-axis represents the AM values. Each point represents an AM value and Mean L2 Score for a labelled class.

Table 3. Pearson correlation coefficient values of Davies–Bouldin Metric (DBM) and Amalgam Metric (AM) with Mean L2 Score of adversarial attacks for each vanilla classifier and attack pair.

8 General discussion on transferability-of-features

On carefully observing the metric values (Tables 1–3), we found that our assessment of representation quality using the Amalgam Proportion also explains some of the propositions made by other researchers. We highlight some of our key findings below.

Does a model with high capacity have a better transferability-of-features?

Our results reveal that a deeper network, which generally has a higher capacity [18], does not necessarily have better transferability-of-features: CapsNet and LeNet, which are much shallower than the other networks, are shown to have superior transferability to the deeper networks (Table 2).

Why does CapsNet have better transferability than other, deeper networks?

We observe that Capsule Networks (CapsNet) have the best transferability-of-features amongst the neural networks evaluated (Table 2). Our results suggest that CapsNet produces a denser cluster for the Amalgam Proportion and learns more general features. We believe it might be because of the dynamic nature (routing) of CapsNet. Thus, our results call for a more in-depth investigation of Capsule Networks and their property of transferring features.

How does augmenting the dataset with Gaussian Noise affect the transferability-of-features?

We observe that Gaussian Augmentation degrades the transferability-of-features of all the classifiers (Table 2). This supports our intuition (Section 3), as adding Gaussian noise to the images subdues the features of the image by blurring, making it harder for the classifier to interpret these features. Consequently, a weaker association of the transferability with these features is observed through the perspective of the Amalgam Proportion.

How does Label Smoothing improve the transferability-of-features?

Our results corroborate the analysis in [68] that Label Smoothing (LS) encourages the features to group in tight, equally distant clusters. The raw metric values from our experiments for LS suggest that classifiers employing LS do form a tighter cluster in the soft-label space (as suggested by DBM) (Table 2). At the same time, LS also favours the classifiers learning special features belonging to a class (as suggested by AM).

Is learning features near the feature centroid beneficial against adversarial attacks?

It is shown in [69] that forcing a loss function to keep features near the feature centroid is beneficial against adversarial attacks. We notice from Table 3 that the Davies–Bouldin Metric (DBM) relates quite reasonably to the adversarial attacks. As DBM precisely calculates the closeness of features to the feature centroid, this corroborates the results of [69].

9 On links of transferability-of-features with adversarial attacks and defences

Based on our experiments and results, we hypothesise that the cause of these links with transferability-of-features is the presence of a bias introduced in the training of neural networks. We call this bias ‘Dataset Bias’ and define it as a bias towards the classes and data distribution present in a dataset. It has already been proven theoretically that it is possible to separate any number of classes, provided enough samples are evaluated. However, this separation only exists inside the evaluated samples’ underlying data distribution and classes. With the introduction of noise or corruptions in the underlying data distribution, this separation of classes is not valid anymore, as the distribution is substantially modified. The areas of Zero-Shot Learning and Transfer Learning investigate this bias by introducing unknown class samples during inference. In contrast, in adversarial machine learning, the same bias is studied by introducing noisy adversarial samples.

10 Conclusions

This article proposes a novel Zero-Shot-learning-based method, entitled Raw Zero-Shot, to assess the transferability-of-features in neural networks. In order to assess this transferability, two associated metrics are formally defined based on different hypotheses for interpreting transferability-of-features. Our results suggest that CapsNet, a dynamic routing network, has the best transferability-of-features amongst the classifiers, which calls for a more in-depth investigation of Capsule Networks. Moreover, the behaviour of the different architectures spotted by the DBM can be visualised in the Isomap plots, which shows that the DBM indeed captures the existing differences in transferability-of-features.

Our experimental results reveal that,

  • Classifiers employing adversarial defences show improved transferability-of-features as evaluated by DBM, suggesting that to improve the robustness of classifiers, we must also improve their transferability-of-features.
  • Adversarial defences generally have a low p-value in the paired-samples t-test when compared to vanilla classifiers, suggesting that transferability is significantly affected by various adversarial defences.
  • A high Pearson correlation coefficient and a generally low p-value in the Pearson correlation test between DBM and the adversarial attacks suggest a link between the transferability-of-features and the adversarial attacks.
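The two statistical quantities above follow standard formulas. The sketch below computes them with NumPy only, using made-up illustrative numbers in place of our measured DBM and attack values:

```python
import numpy as np

# Hypothetical per-classifier measurements (illustrative only).
dbm        = np.array([0.5, 0.7, 0.9, 1.1, 1.3])        # transferability metric
attack_acc = np.array([0.85, 0.70, 0.55, 0.40, 0.30])   # attack success rate

# Pearson correlation coefficient between DBM and attack accuracy:
# a strong negative value here would mean higher transferability
# goes with lower attack success.
r = np.corrcoef(dbm, attack_acc)[0, 1]

# Paired-samples t statistic comparing vanilla classifiers with the
# same classifiers equipped with a defence (again, toy numbers).
vanilla  = np.array([1.10, 1.25, 0.95, 1.40])
defended = np.array([0.80, 0.95, 0.70, 1.05])
diff = defended - vanilla
t = diff.mean() / (diff.std(ddof=1) / np.sqrt(len(diff)))
```

The corresponding p-values are obtained from the reference distribution of each statistic, e.g. via `scipy.stats.pearsonr` and `scipy.stats.ttest_rel`.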

Hence, the proposed Raw Zero-Shot was able to assess the transferability-of-features of different neural network architectures and adversarial defences from the perspective of Out-of-Distribution classes, and to link this property of neural networks with adversarial attacks and defences. It also opens up new possibilities of using transferability-of-features both for the evaluation (i.e., as a quality assessment) and the development (e.g., as a loss function) of neural networks.


Acknowledgments
We would like to thank Prof. Junichi Murata for his kind support without which it would not be possible to conduct this research.


References
  1. Szegedy C, et al. Intriguing properties of neural networks. In: ICLR; 2014.
  2. Nguyen A, Yosinski J, Clune J. Deep neural networks are easily fooled: High confidence predictions for unrecognizable images. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; 2015. p. 427–436.
  3. Brown TB, Mané D, Roy A, Abadi M, Gilmer J. Adversarial patch. arXiv preprint arXiv:171209665. 2017.
  4. Moosavi-Dezfooli SM, Fawzi A, Fawzi O, Frossard P. Universal adversarial perturbations. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. IEEE; 2017. p. 1765–1773.
  5. Su J, Vargas DV, Sakurai K. One pixel attack for fooling deep neural networks. IEEE Transactions on Evolutionary Computation. 2019;23(5):828–841.
  6. Sharif M, Bhagavatula S, Bauer L, Reiter MK. Accessorize to a crime: Real and stealthy attacks on state-of-the-art face recognition. In: Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security. ACM; 2016. p. 1528–1540.
  7. Kurakin A, Goodfellow I, Bengio S. Adversarial examples in the physical world. arXiv preprint arXiv:160702533. 2016.
  8. Athalye A, Sutskever I. Synthesizing robust adversarial examples. In: ICML; 2018.
  9. Goodfellow IJ, Shlens J, Szegedy C. Explaining and Harnessing Adversarial Examples. arXiv preprint arXiv:14126572. 2014.
  10. Huang R, Xu B, Schuurmans D, Szepesvári C. Learning with a strong adversary. arXiv preprint arXiv:151103034. 2015.
  11. Papernot N, McDaniel P, Wu X, Jha S, Swami A. Distillation as a defense to adversarial perturbations against deep neural networks. In: 2016 IEEE Symposium on Security and Privacy (SP). IEEE; 2016. p. 582–597.
  12. Dziugaite GK, Ghahramani Z, Roy DM. A study of the effect of jpg compression on adversarial images. arXiv preprint arXiv:160800853. 2016.
  13. Hazan T, Papandreou G, Tarlow D. Perturbations, Optimization, and Statistics. MIT Press; 2016.
  14. Das N, Shanbhogue M, Chen ST, Hohman F, Chen L, Kounavis ME, et al. Keeping the bad guys out: Protecting and vaccinating deep learning with jpeg compression. arXiv preprint arXiv:170502900. 2017.
  15. Guo C, Rana M, Cisse M, van der Maaten L. Countering Adversarial Images using Input Transformations. In: International Conference on Learning Representations; 2018.
  16. Song Y, Kim T, Nowozin S, Ermon S, Kushman N. PixelDefend: Leveraging Generative Models to Understand and Defend against Adversarial Examples. In: International Conference on Learning Representations; 2018.
  17. Xu W, Evans D, Qi Y. Feature squeezing: Detecting adversarial examples in deep neural networks. arXiv preprint arXiv:170401155. 2017.
  18. Madry A, Makelov A, Schmidt L, Tsipras D, Vladu A. Towards Deep Learning Models Resistant to Adversarial Attacks. In: International Conference on Learning Representations; 2018.
  19. Ma X, Li B, Wang Y, Erfani SM, Wijewickrema S, Schoenebeck G, et al. Characterizing adversarial subspaces using local intrinsic dimensionality. arXiv preprint arXiv:180102613. 2018.
  20. Buckman J, Roy A, Raffel C, Goodfellow I. Thermometer encoding: One hot way to resist adversarial examples. In: International Conference on Learning Representations; 2018.
  21. Carlini N, Wagner D. Towards evaluating the robustness of neural networks. In: 2017 IEEE Symposium on Security and Privacy (SP). IEEE; 2017. p. 39–57.
  22. Tramèr F, Kurakin A, Papernot N, Goodfellow I, Boneh D, McDaniel P. Ensemble Adversarial Training: Attacks and Defenses. In: International Conference on Learning Representations; 2018.
  23. Athalye A, Carlini N, Wagner D. Obfuscated gradients give a false sense of security: Circumventing defenses to adversarial examples. In: ICML; 2018.
  24. Uesato J, O'Donoghue B, Kohli P, Oord A. Adversarial Risk and the Dangers of Evaluating Against Weak Attacks. In: International Conference on Machine Learning; 2018. p. 5032–5041.
  25. Vargas DV, Kotyan S. Robustness Assessment for Adversarial Machine Learning: Problems, Solutions and a Survey of Current Neural Networks and Defenses. arXiv preprint arXiv:190606026. 2019.
  26. Tramer F, Carlini N, Brendel W, Madry A. On adaptive attacks to adversarial example defenses. arXiv preprint arXiv:200208347. 2020.
  27. Thesing L, Antun V, Hansen AC. What do AI algorithms actually learn? On false structures in deep learning. arXiv preprint arXiv:190601478. 2019.
  28. Vargas DV, Su J. Understanding the One-Pixel Attack: Propagation Maps and Locality Analysis. arXiv preprint arXiv:190202947. 2019.
  29. Kotyan S, Vargas DV. Deep neural network loses attention to adversarial images. arXiv preprint arXiv:210605657. 2021.
  30. Sabour S, Cao Y, Faghri F, Fleet DJ. Adversarial manipulation of deep representations. arXiv preprint arXiv:151105122. 2015.
  31. Guo Y, Zhang C, Zhang C, Chen Y. Sparse DNNs with improved adversarial robustness. In: Advances in Neural Information Processing Systems; 2018. p. 242–251.
  32. Moosavi-Dezfooli SM, Fawzi A, Fawzi O, Frossard P, Soatto S. Robustness of classifiers to universal perturbations: A geometric perspective. In: International Conference on Learning Representations; 2018.
  33. Fawzi A, Moosavi-Dezfooli SM, Frossard P, Soatto S. Empirical study of the topology and geometry of deep networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; 2018. p. 3762–3770.
  34. Ilyas A, Santurkar S, Tsipras D, Engstrom L, Tran B, Madry A. Adversarial examples are not bugs, they are features. In: Advances in Neural Information Processing Systems; 2019. p. 125–136.
  35. Chen J, Jordan MI, Wainwright MJ. HopSkipJumpAttack: A query-efficient decision-based attack. In: 2020 IEEE Symposium on Security and Privacy (SP). IEEE; 2020. p. 1277–1294.
  36. Tao G, Ma S, Liu Y, Zhang X. Attacks meet interpretability: Attribute-steered detection of adversarial samples. In: Advances in Neural Information Processing Systems; 2018. p. 7717–7728.
  37. Fawzi A, Fawzi H, Fawzi O. Adversarial vulnerability for any classifier. In: Advances in Neural Information Processing Systems; 2018. p. 1178–1187.
  38. Lampert CH, Nickisch H, Harmeling S. Learning to detect unseen object classes by between-class attribute transfer. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition. IEEE; 2009. p. 951–958.
  39. Norouzi M, Mikolov T, Bengio S, Singer Y, Shlens J, Frome A, et al. Zero-shot learning by convex combination of semantic embeddings. arXiv preprint arXiv:13125650. 2013.
  40. Fu Y, Yang Y, Hospedales T, Xiang T, Gong S. Transductive multi-label zero-shot learning. arXiv preprint arXiv:150307790. 2015.
  41. Akata Z, Reed S, Walter D, Lee H, Schiele B. Evaluation of output embeddings for fine-grained image classification. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; 2015. p. 2927–2936.
  42. Zhang Z, Saligrama V. Zero-shot recognition via structured prediction. In: European Conference on Computer Vision. Springer; 2016. p. 533–548.
  43. Bucher M, Herbin S, Jurie F. Improving semantic embedding consistency by metric learning for zero-shot classification. In: European Conference on Computer Vision. Springer; 2016. p. 730–746.
  44. Shigeto Y, Suzuki I, Hara K, Shimbo M, Matsumoto Y. Ridge regression, hubness, and zero-shot learning. In: Joint European Conference on Machine Learning and Knowledge Discovery in Databases. Springer; 2015. p. 135–151.
  45. Zhang Z, Saligrama V. Zero-shot learning via semantic similarity embedding. In: Proceedings of the IEEE International Conference on Computer Vision; 2015. p. 4166–4174.
  46. Xian Y, Lampert CH, Schiele B, Akata Z. Zero-Shot Learning—A Comprehensive Evaluation of the Good, the Bad and the Ugly; 2020.
  47. Yan C, Chang X, Li Z, Guan W, Ge Z, Zhu L, et al. ZeroNAS: Differentiable generative adversarial networks search for zero-shot learning. IEEE Transactions on Pattern Analysis and Machine Intelligence. 2021. pmid:34762584
  48. Walker CM, Gopnik A. Toddlers infer higher-order relational principles in causal learning. Psychological Science. 2014;25(1):161–169. pmid:24270464
  49. Walker CM, Bridgers S, Gopnik A. The early emergence and puzzling decline of relational reasoning: Effects of knowledge and search on inferring abstract concepts. Cognition. 2016;156:30–40. pmid:27472036
  50. Casasanto D. Similarity and Proximity: When Does Close in Space Mean Close in Mind? Memory & Cognition. 2008;36(6):1047–1056.
  51. Davies DL, Bouldin DW. A cluster separation measure. IEEE Transactions on Pattern Analysis and Machine Intelligence. 1979;(2):224–227. pmid:21868852
  52. Xiao H, Rasul K, Vollgraf R. Fashion-MNIST: a Novel Image Dataset for Benchmarking Machine Learning Algorithms. CoRR. 2017;abs/1708.07747.
  53. Krizhevsky A, Hinton G, et al. Learning multiple layers of features from tiny images; 2009.
  54. Russakovsky O, Deng J, Su H, Krause J, Satheesh S, Ma S, et al. Imagenet large scale visual recognition challenge. International Journal of Computer Vision. 2015;115(3):211–252.
  55. LeCun Y, Bottou L, Bengio Y, Haffner P. Gradient-based learning applied to document recognition. Proceedings of the IEEE. 1998;86(11):2278–2324.
  56. Simonyan K, Zisserman A. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:14091556. 2014.
  57. Springenberg JT, Dosovitskiy A, Brox T, Riedmiller M. Striving for simplicity: The all convolutional net. arXiv preprint arXiv:14126806. 2014.
  58. Lin M, Chen Q, Yan S. Network in network. arXiv preprint arXiv:13124400. 2013.
  59. He K, Zhang X, Ren S, Sun J. Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; 2016. p. 770–778.
  60. Zagoruyko S, Komodakis N. Wide residual networks. arXiv preprint arXiv:160507146. 2016.
  61. Huang G, Liu Z, Van Der Maaten L, Weinberger KQ. Densely connected convolutional networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; 2017. p. 4700–4708.
  62. Sabour S, Frosst N, Hinton GE. Dynamic routing between capsules. In: Advances in Neural Information Processing Systems; 2017. p. 3856–3866.
  63. Szegedy C, Vanhoucke V, Ioffe S, Shlens J, Wojna Z. Rethinking the inception architecture for computer vision. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; 2016. p. 2818–2826.
  64. Moosavi-Dezfooli SM, Fawzi A, Frossard P. DeepFool: a simple and accurate method to fool deep neural networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; 2016. p. 2574–2582.
  65. Jang U, Wu X, Jha S. Objective metrics and gradient descent algorithms for adversarial examples in machine learning. In: Proceedings of the 33rd Annual Computer Security Applications Conference. ACM; 2017. p. 262–277.
  66. David HA, Gunnink JL. The paired t test under artificial pairing. The American Statistician. 1997;51(1):9–12.
  67. Freedman D, Pisani R, Purves R. Statistics. 4th ed. WW Norton & Company, New York; 2007.
  68. Müller R, Kornblith S, Hinton GE. When does label smoothing help? In: Advances in Neural Information Processing Systems; 2019. p. 4696–4705.
  69. Agarwal C, Nguyen A, Schonfeld D. Improving Robustness to Adversarial Examples by Encouraging Discriminative Features. In: 2019 IEEE International Conference on Image Processing (ICIP). IEEE; 2019. p. 3801–3805.