Classification of masked image data

Data classification is one of the most commonly used applications of machine learning. The are many developed algorithms that can work in various environments and for different data distributions that perform this task with excellence. Classification algorithms, just like other machine learning algorithms have one thing in common: in order to operate on data, they must see the data. In the present world, where concerns about privacy, GDPR (General Data Protection Regulation), business confidentiality and security are growing bigger and bigger; this requirement to work directly on the original data might become, in some situations, a burden. In this paper, an approach to the classification of images that cannot be directly accessed during training has been made. It has been shown that one can train a deep neural network to create such a representation of the original data that i) without additional information, the original data cannot be restored, and ii) that this representation—called a masked form—can still be used for classification purposes. Moreover, it has been shown that classification of the masked data can be done using both classical and neural network-based classifiers.


Introduction
In the present world, the task of classification is common. In almost every aspect of our lives, the classification of our data becomes more useful and essential. To facilitate the machine learning algorithms' power and get an in-depth analysis of our data, we have to provide the data, perform the analysis, and interpret the results.
The machine learning sophisticated classification algorithms find in-data hidden features and then use them for discrimination. For this, the training data must be provided so that a model describing data can be constructed during the learning process [1].
Furthermore, while this might seem to be an obvious case, more and more often, such a requirement alone proves to be an obstacle. For various reasons, many companies that want to outsource data classification cannot provide virtually any training data. Such a situation can be caused by many reasons, such as corporate policy, security, GDPR, government regulations, legal constraints, etc.
One is faced with a dilemma of how to analyze the data without having direct access to it. The classical approach to data protection, i.e., encryption, is specifically designed to prevent that. The encryption techniques are purposefully designed in such a way that statistical analysis of the encoded form provides no clues about the original data [2]. Another approach is a blind one, i.e., to construct a classifier using the synthetic or publicly available data only, let the customer test it on his own and wait for the results to improve the classifier. Such a process being by nature iterative, is slow and due to constrained feedback of information leads to sub-optimal results.
What is needed is a mapping that, given the original data, would produce its masked form. The masked form would retain the necessary information from the original data, but in a form that is not interpretable by human and, without the inverse mapping, could not be used to reconstruct the original data. Such approach would allow to analyse the data that otherwise could not have been disclosed. Industries that could benefit from such a technique might include medicine, pharmaceutics, and sectors dealing with sensitive or confidential data. There may be many other possible applications as more and more businesses are aware of how much valuable knowledge might be mined from their data.
In this paper, such mapping and its inverse form are proposed using the adversarial autoencoder networks [3] with their latent layers having enforced distributions [4].
Encoded in enforced distributions, latent layers of an adversarial autoencoder can retain information needed for classification, while the output of the network remains as close to the input as desired.
To ensure that original data cannot be easily restored from its latent form, apart from the enforcement of the distribution, the size of the latent layer should be significantly smaller than the size of the input. In this way, the information in the latent layer not only has enforced distributions but it is also compressed in a lossy manner (the autoencoder's output, while being desirably close to the input, does not have to be equal to it).

Methodology
In this section, we formalize proposed data masking method and present in detail neural network architectures that have been used.

Data masking
To address the problem of keeping the data confidential, we propose a solution that uses two neural networks: an adversarial autoencoder and a classifier. First, we trained the adversarial autoencoder on the public image dataset. Then we used the encoder to generate datasets of masked images taken from other public datasets, together with corresponding labels. The masked dataset was then used to train another network to classify the masked representations of images. It has also been shown that for datasets of low complexity, the masked data classification can be done using classical, i.e., non-neural approach.
In our implementation of an adversarial autoencoder, we used two vectors for the latent layer, first, with Gaussian distribution of the data and the second one with a categorical one. These two vectors together form the masked version of the image presented at the input of the encoder.
The Gaussian and categorical distributions were purposefully selected for the latent vectors to facilitate the subsequent classification process. In this way, it is possible to partition the data in an unsupervised way using the categorical part of the latent space and organize each of these partitions internally in a Gaussian way. While this partitioning is fully unsupervised, it still organizes the elements according to their features.
The Gaussian distribution of the latent layer can be obtained using the standard variational autoencoder (VAE) [5]. Still, the use of the adversarial autoencoder allows for the latent layer to be composed of many parts, each with different distribution.
Adversarial autoencoder also produces a better representation of the manifold of the original data in the Gaussian space. In VAE, the Gaussian space contains empty areas that make classification harder as one does not know beforehand what kind of data might in future tasks reside within them [3].
Forcing the encoder to output latent vectors with Gaussian and categorical distributions [6] makes masked representation noise-like to human eyes while retaining information usable by the neural network-based or classical classifier (see Fig 1). In our research, we prove that the encoder can be used as universal data compression and masking tool that condenses data irreversibly when no proper decoder is provided while preserving information about the input features. Reconstruction of original data is possible only by running the matching decoder, so the decoder weights fulfill the function akin to the private key [2].

Implementation details
The system architecture corresponds to the two processing phases: masking and classification. Firstly, the image is preprocessed with the encoder part of the adversarial autoencoder, the only part of the model that needs to be saved for masking purposes. The output consists of vectors with Gaussian and categorical distributions. The concatenation of these two vectors constitutes the input for the classification task.
The experiments performed followed these steps: 1. Creating dataset A (image, category label) 2. Unsupervised training of the adversarial autoencoder on training image dataset A, to obtain a latent image representation; label data is not being used.
3. Creating dataset B (image, category label), disjoint with set A.

Creating a masked dataset M(B) using the encoder trained in
Step 2 (masked_image, category label). The architecture of the autoencoder is presented in Fig 2. Structure of the encoder and decoder parts are shown in S1 and S2 Tables. Discriminators, used in adversarial training, were omitted to simplify the schema, their structure can be found in Supporting information (S3 and S4 Tables). Input image takes masked form in the latent layer that is later an input for the decoder part. The encoder is constructed using three convolutional layers that extract image features and two linear layers, splitting output into the categorical and normal parts. In the input of the decoder, those two vectors are concatenated. The decoder has analogous but transposed convolutional layers as the encoder part.   We also compared the performance of masked image classification of that neural network with classical machine learning algorithms (see Fig 4). We choose the Random Forest [7] and AdaBoost [8] as they are top-performing classical multi-class classifiers [9].

Methods
All programming work was done using Python programming language [10]. Autoencoder and neural network classifier were implemented using PyTorch [11] and trained with the help of PyTorch-Ignite [12]. CIFAR and MNIST datasets were managed using PyTorch data loaders. For training classical classifiers, AdaBoost [8] and Random Forest [7], we used scikit-learn library [13]. Mathematical work, as well as matrix and vector manipulations, were done using Numpy [14]. For creation of figures we used Matplotlib [15], Pillow [16], and Draw.io. Plots presenting neural network training sessions were obtained with Tensorboard, part of Tensorflow library [17].

Experiments
In this section, we analyze the performance of our data masking method for two classification tasks experimentally. We present a description of used datasets and metrics, introduce a generic adversarial autoencoder baseline and show the ability to classify encoded data.

Datasets and evaluation metrics
In this study, we used three public datasets. We trained autoencoder with CIFAR-10 [18], which contains 60000 images grouped in 10 classes (6000 each). Labels, which were not used in adversarial training, were omitted.
Fist classification task. The classifier was tested on masked images from CIFAR-100 [18] which contains 60000 images divided into 100 classes (600 each), and on MNIST [19] dataset containing 60000 training images and 10000 test images, divided into ten classes.
In the case of CIFAR-100 [18] we tested the classification of masked images taken from 2, 3, and 4 randomly selected classes. To avoid biased results, for each category, the choice of classes was made ten times, and obtained results were averaged. Tests on CIFAR-100 [18] dataset were thus performed 30 times in total, ten times using two randomly selected classes, ten times using three randomly selected classes and finally, ten times using four randomly selected classes.
Second classification task. The MNIST case was used to show that for input data of low complexity, the classification of masked data can be successfully performed using classical, i.e., non-neural approach. Images from the MNIST dataset were transformed to RGB and resized to 32x32x3-i.e., the size of images in the CIFAR dataset.
CIFAR-100 is divided into train and test sets. There are 50k training images and 10k test images. We extracted the first 10k samples from a training set for validation purposes during the training of the classifier (see [4]). For classical machine learning approaches, there is no need to extract a validation set from a training set. Thus, we merged training and test sets to apply stratified split, resulting in a training to test images ratio of 8:2.

Adversarial training
Training adversarial autoencoder consists of two phases called reconstruction and regularization [3].
The first phase is the same as training simple autoencoder-encoder and decoder learn to reconstruct input image into the output image. The similarity of images is evaluated with the mean squared error (squared L2 norm) [4]. During training, the reconstruction error should decrease until it reaches the desired value (see Fig 5).
The second phase aims to shape the latent vector so that information has the given distribution. A discriminator learns to classify input, in terms of distribution, as real or fake. It gets random vector with values sampled from desired distribution with label real and encoder output labeled as fake. Based on discriminator feedback, the encoder trains to generate values with correct distribution, i.e., it is trained to produce output for which the discriminator would return real. For regularization purposes, we measured the Binary Cross Entropy [4] between the target and the output. During training, it is expected for generating loss to decrease and for discriminators loss to increase (encoder becomes so good that discriminator cannot correctly distinguish it from real i.e., sampled vector). The process of training those two networks are depicted in Figs 6 and 7.
All models were trained with Adam optimizer [20] with default hyperparameters (learning rate lr = 0.001 and coefficients betas = (0.9, 0.999)) with a mini-batch size 16. Choosing the size of the latent space is an important problem in autoencoder training. We have used 384long Gaussian distributed vector and 20-long categorically distributed vector. These values allowed to achieve a satisfactory quality of learning maintaining low consumption of the GPU memory as well as enforced compression of the input data in the latent layer.

(Encoder, decoder)-The matching pair
During the training of the autoencoder, the encoder and decoder parts are trained together. The training of the network starts with the weights initialized in a random way [21]. Also, during the training, the order of images in batches are randomized [4]. The final weights of the network are thus a product of many processes that, by design, are random in their nature. While it is theoretically possible that two heavily random processes may produce the same result, such a situation is very unlikely. Similarly, it is possible that randomly generated password will unlock access to password-protected data, especially if the password is short or directory-based. Here for the described encoder part of the network, the model has over 400 thousand trainable parameters. To test that each encoder-decoder pair that has been obtained from training is a matched pair, the following experiment has been conducted.
The autoencoder was trained four times using the same training dataset. From training, we have obtained four instances of the autoencoder. Let the enc j denote encoder transformation from j th instance of the autoencoder. Let also the dec j denote decoder transformation from j th instance of the autoencoder.
From Eqs 1-12 it is evident that that when encoder and decoder form a matched pair, the mean MSE error is much lower than in the case of unmatched pair. In fact, it is lower by two orders of magnitude. This together with results shown in Figs 8 and 9 and in S1-S6 Figs clearly shows that to decode a data in a masked form, one requires not just any decoder trained with given architecture but the decoder that was trained together with the encoder used for masking process.

Classification of masked data from the CIFAR-100 dataset
To investigate the information capacity of the masked image data, we performed classification tasks with a neural network classifier consisting of linear layers, as well as classical machine learning algorithms, namely AdaBoost and Random Forest.
In this experiment, the encoder, part of the adversarial autoencoder, was used on the CIFAR-100 dataset images to produce masked representation. Input data for all algorithms consisted of the Gaussian (1x384) and Categorical (1x20) output of the encoder, joined to create a vector of a size 1x404. Classification tasks were performed on all subsets created by subclassing the CIFAR-100 dataset, as described in the previous sections (2-, 3-and 4-class subsets were used).
Classification with classical machine learning algorithms. Data were fed into both classifiers, AdaBoost and RandomForest, with original labels associated with images. As for the algorithm, default hyperparameters were used, and models were trained using 100 estimators. After the training step, predictions were made on the test set. Results comprising accuracy, precision and recall are shown in S5-S7 Tables.
Classification with neural network. The neural network classifier task consists of three linear layers (S8 Table). The activation layers we have chosen are ReLU and Softmax (last layer). The size of the dense layers was chosen arbitrarily based on our experiments with different network configurations. At its input network sees the masked representation of the image, together with the original label in mini-batches of size 4. The model returns, as a prediction, a class an image supposedly belongs to. Model was trained with Adam optimizer with default hyperparameters (learning rate lr = 0.001 and betas = (0.9, 0.999)) with cross entropy as the loss function. In training, we used a test set disjoint with a validation set. Results calculated for a test set (accuracy, precision and recall) are gathered in S5-S7 Tables.  Tables 1, 2 and S5-S7 Tables indicate that neural network outperforms classical machine learning algorithms in the task of classification of the masked data.

Results shown in
Although results for 2-class cases are comparable (see Tables 1 and S5-S7 Tables), the accuracy of Random Forest and AdaBoost classifiers drops much faster, with increasing number of classes, than it is the case with a neural network.
While for 2 classes the accuracies are (0.852, 0.822, and 0.804) for neural network, Random Forest and AdaBoost, respectively, the drop in accuracy when considering four-class experiment is (0.083, 0.387, and 0.212). In the case of classical algorithms (Random Forest and Ada-Boost), this drop is by order of magnitude larger than in the case of the neural network-based classifier.
Our tests have shown no significant improvement in the accuracy of the predictions provided by classical algorithms when the number of estimators was increased.
Neural network-based classifier, even with four classes, achieves accuracy close to 80% on masked data.

Discussion and conclusions
In this study, we propose a novel data masking method using an adversarial autoencoder. Our experiments show it is possible to create a masked form of an image, visually not similar to the original, with retained hidden information sufficient for classification tasks. It has been shown that only the matching decoder is capable of turning masked representation into a form close to the original. This way, we have shown that it is possible to classify the image data without having direct access to it. The owner of the data using the encoding part of the described autoencoder can produce its masked form to provide it for classification. As long as the instance of the autoencoder used for masking remains undisclosed, the classification can be performed without the risk of disclosing the original data. Construction of the autoencoder guarantees that while the information in the latent layer is unreadable to the human, it is still present and can be used for classification and other tasks of choice. Future work: In future work, an attempt with larger images should be made. While the classification of images from the CIFAR set is proof that such technology can be used, this method should be extended to larger images for many practical approaches. There are many possible ways to approach the extension task, and they should be considered.
Also, the use of autoencoders with the latent layer size in a different ratio to the size of the input, than the one used in this paper, might be considered. A bigger latent layer might provide a better reconstruction of the input, which might provide more informational content for classification. On the other hand, a smaller latent layer would enforce more lossy compression, making the content of the latent layer even more protected from attempts of reverse reconstruction, as well as it would limit the memory demand during the training process.
Supporting information S1 File. The test patterns. ZIP file containing archived black and white patterns used for autoencoder testing. (ZIP) S1 Fig. Combinations of (encoder, decoder) pairs from various instances. Results of dec i (enc j (x)) where the original image x is shown in upper left corner. The x is properly recreated only if i == j. Use of test pattern shows that dec i (enc j (x)) for i 6 ¼ j has no similarity to x.