Shallow Encoder Deep Decoder (SEDD) Networks for Image Encryption and Decryption

2020 · arXiv

Abstract


This paper proposes the Shallow Encoder Deep Decoder (SEDD) framework. It consists of a simple shallow encoder neural network E for encryption and a complex deep decoder neural network D for decryption. E is kept simple so that encoding can be done on low-power, portable devices; in principle, E can be any nonlinear function that outputs an encoded vector. D is trained to decode the encodings using a dataset of image-encoding pairs obtained from E, and this training happens independently of E. Although E is a simple neural network, it still has thousands of random parameters, so its encodings would be practically impossible to crack without D. The structure may resemble an autoencoder, but the approach differs because D is trained completely independently of E. This paper therefore also explores empirically whether a deep neural network can learn to reconstruct the original data in any useful form from the output of a neural network or another nonlinear function. This question could have useful applications in cryptanalysis. Experiments demonstrate the potential of the framework, along with some limitations, through qualitative and quantitative evaluation of the images decoded by D.

1. Introduction

Cryptography is concerned with encoding sensitive data in a form that is unintelligible to any human or machine other than the intended party. That party holds the decoding mechanism needed to regenerate the original data from the encoding. Cryptanalysis, on the other hand, is used to breach cryptographic security systems and gain access to the contents of encrypted messages [1]. A primary research area in cryptography and cryptanalysis is encryption and decryption techniques for secure communication and code breaking. Most encryption protocols in commercial use today are built on n-bit keys (e.g. RSA, DES, 3DES, AES) or on hash functions (e.g. SHA). In this paper, we are concerned with images as the data to be encrypted at the source, sent securely, and decrypted at the destination [2].

Neural networks have been an area of active research as a means of both encryption and decryption for some time. Techniques such as Hopfield neural networks, chaotic time-delayed neural networks [3], autoencoder networks [4], [5], and generative and GAN-based networks [5], [6] have shown encouraging results. Autoencoders and generative networks are especially relevant here, as the proposed framework is inspired by them. However, this research has found little practical application, and conventional algorithms continue to dominate. One reason is that deep neural networks such as autoencoders are computationally expensive to run, since they involve many matrix operations [7]. This is especially limiting for discreet mobile security systems, for example:

- miniature cameras used in espionage;
- cameras on drones or other robotic systems;
- portable military and space communication devices;
- real-time applications in general.

The proposed framework attempts to solve this problem. For breaking encryption, on the other hand, deep learning models are especially well suited because of their inherent ability to learn nonlinear functions and abstract relations from a large number of labelled examples [8].

In the proposed Shallow Encoder Deep Decoder (SEDD) framework, the encoder E is made extremely simple and shallow. This considerably reduces the computational load of encryption, so that it can run in real time on low-power mobile ARM processors. The weights and biases of E are randomly initialized and are not trained to optimize any objective. The only task of E is to map the $n$-dimensional flattened image vector to an $m$-dimensional floating-point encoding vector, which is the output layer of E. It is not possible to reverse-engineer the image from the encoding without knowing the parameters and the overall structure of E. Furthermore, brute-force, statistical and differential attacks are impractical, because the network combines linear layers and nonlinear activations with thousands of randomly initialized parameters [9].

The decoder D is a generative deep neural network that takes an encoded vector as input and outputs an $n$-dimensional image vector. This vector is reshaped into an 8-bit RGB matrix and processed into an image. D is trained as a regression problem on a large number of image-encoding pairs obtained from E, learning to regenerate the original image from its encoding. D therefore does not depend on E for training in any way, provided a large dataset of encodings from E is available. D can be thought of as an adversarial network trying to learn the hidden relations by which E converts an image into its encoding [10].

2. Related Work

Considerable research has gone into deep learning models for cryptography and cryptanalysis in recent years. However, much of it pairs conventional encryption techniques with neural networks, where the networks boost or enhance the conventional technique rather than deep learning being the full focus.

Autoencoders have been developed for encoding and decoding data using neural networks [11]. They have given useful results in speech applications such as spectrogram coding [12], in image generation [13], and in denoising [14], [15]. Stacked autoencoders have also been used for encryption, but they have two drawbacks. Their encoders are computationally expensive, which limits their application, and they are less secure because the encoder and decoder are trained in tandem. On the other hand, these autoencoders are mostly lossless, and their decrypted images are of high quality [5]. Applications of autoencoders to cryptography in the way proposed by this paper, however, remain mostly unexplored.

Other deep learning approaches to encryption, such as chaotic Hopfield neural networks, generate binary sequences to mask plaintext [3]. The older but relevant "Analysis of Neural Cryptography" [16] studies a scheme based on mutually learning networks and shows that it is prone to attacks.

Outside cryptography, generative networks for images such as Plug & Play generative networks [17] and GANs (generative adversarial networks) [10], [18] can generate photorealistic images from random noise. SEDD networks build upon these generative networks but have no discriminator, at the cost of lower quality in the generated images.

3. Encoder

The SEDD framework consists of a shallow encoder network E, which is a feedforward perceptron with a single hidden layer. The 3 layers of the encoder are therefore:

- $E_{\text{in}}$ (input layer), which takes an 8-bit RGB image flattened into a one-dimensional vector;
- $E_{\text{h}}$ (hidden layer), which is kept small to reduce the number of parameters and hence the computation required to produce the output;
- $E_{\text{out}}$ (output layer), whose size equals the desired encoding size $m$.

Choosing a larger $m$ increases the complexity but also increases the number of features available to the decoder for training. The complexity of E is kept low because it must run on the device in real time. The weights and biases of E are randomly initialized, and the model is saved as such without any training. E therefore serves as a highly nonlinear function that mangles the input to encrypt it. The encodings from E are treated as the encrypted data, and it is not possible to recreate the images from the encodings without the decoder.
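As an illustration only, the encoder described above can be sketched in NumPy as follows. The ReLU hidden activation and sigmoid output follow the settings reported in Section 6. The Gaussian initialization, the default sizes and the function names `make_encoder` and `encode` are assumptions for this sketch, not the paper's implementation. Because E is never trained, its parameters are fully determined by the random draw, so in this sketch anyone who knows the seed can rebuild E.

```python
import numpy as np

def make_encoder(n, k=10, m=1024, seed=None):
    """Randomly initialise a single-hidden-layer encoder E; it is never trained."""
    rng = np.random.default_rng(seed)
    # Assumed initialisation: zero-mean Gaussian weights scaled by fan-in, small random biases.
    return {
        "W1": rng.normal(0.0, 1.0 / np.sqrt(n), size=(n, k)),
        "b1": rng.normal(0.0, 0.1, size=k),
        "W2": rng.normal(0.0, 1.0 / np.sqrt(k), size=(k, m)),
        "b2": rng.normal(0.0, 0.1, size=m),
    }

def encode(params, a):
    """Map a flattened image vector a (length n), or a batch (N, n), to m-dimensional encodings."""
    # Row-vector convention: a @ W1 with W1 of shape (n, k).
    hidden = np.maximum(0.0, a @ params["W1"] + params["b1"])                # ReLU hidden layer
    return 1.0 / (1.0 + np.exp(-(hidden @ params["W2"] + params["b2"])))    # sigmoid output layer
```

For a 150 x 150 RGB image, `make_encoder(n=150 * 150 * 3)` would map a 67,500-dimensional vector to a 1024-dimensional encoding. A batch of images of shape (N, n) is encoded in a single call.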

4. Decoder

The decoder D is the main workhorse of the SEDD framework and is a deep multi-layer neural network. Here, for simplicity, D is a multi-layer perceptron; more complex networks are discussed in Section 8. The input layer of D, $D_{\text{in}}$, is $m$-dimensional (the encoding size), as it takes the encoding vector as input. D is trained on image-encoding pairs obtained from E to recreate images from their encodings. Its output layer therefore contains $n$ units (the length of the flattened image), which are reshaped into an 8-bit RGB image. D runs on a machine with high computational power and is stored securely with the intended agent.

5. Algorithm

The encoder E contains 3 layers: the input layer $E_{\text{in}}$, the hidden layer $E_{\text{h}}$ and the output layer $E_{\text{out}}$. Let the 8-bit RGB image to be encoded have height $h$ and width $w$. The flattened image vector is then $n$-dimensional, which is the same as the size of $E_{\text{in}}$, where $n = h \times w \times 3$. Let this vector be $\mathbf{a}$.
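The flattening step, and its inverse used after decoding, can be sketched as below. Scaling pixel values to [0, 1] is an assumption: the paper mentions pre- and post-processing without specifying it. The helper names `flatten_image` and `unflatten_image` are hypothetical.

```python
import numpy as np

def flatten_image(img):
    """Flatten an h x w x 3 uint8 RGB image into a float vector of length n = 3hw, scaled to [0, 1]."""
    if img.dtype != np.uint8 or img.ndim != 3 or img.shape[2] != 3:
        raise ValueError("expected an h x w x 3 uint8 image")
    return img.reshape(-1).astype(np.float32) / 255.0

def unflatten_image(a, h, w):
    """Reshape a length-n vector back into an h x w x 3 uint8 image, rounding and clipping to 0..255."""
    img = np.clip(np.rint(np.asarray(a) * 255.0), 0, 255).astype(np.uint8)
    return img.reshape(h, w, 3)
```

For the 150 x 150 images used in Section 6, this gives $n = 67{,}500$.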

Figure 1: Sample image in dataset

$E_{\text{h}}$ has size $k$, which is kept small to make E computationally simple; a typical value of $k$ is 10. The size $m$ of the output layer $E_{\text{out}}$ is the desired size of the encoding vector $\mathbf{x}$. E's weights and biases are initialized randomly and saved right away. $\mathbf{x}$ is called the encoding or encrypted vector, i.e. the image has been converted into a vector of floating-point numbers.
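Written out, the forward pass of E is a single composition. The block below uses the column-vector convention, with $n$ the length of the flattened image $\mathbf{a}$, $k$ the hidden size, $m$ the encoding size, and $f$, $g$ the hidden and output activations (ReLU and sigmoid in the experiments of Section 6). The symbols $W_1$, $\mathbf{b}_1$, $W_2$, $\mathbf{b}_2$ and $\theta_E$ are introduced here for illustration.

```latex
\begin{aligned}
\mathbf{x} &= g\big(W_2\, f(W_1 \mathbf{a} + \mathbf{b}_1) + \mathbf{b}_2\big),
  \quad W_1 \in \mathbb{R}^{k \times n},\ \mathbf{b}_1 \in \mathbb{R}^{k},\
  W_2 \in \mathbb{R}^{m \times k},\ \mathbf{b}_2 \in \mathbb{R}^{m} \\
|\theta_E| &= nk + k + km + m
\end{aligned}
```

With $k = 10$, most of E's parameters sit in $W_1$, whose size grows linearly with the image size $n$.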

Decryption, i.e. decoding the encoded vector back into the original image, is the job of the decoder D. D is therefore a complex deep neural network with hidden layers $D_1, \ldots, D_L$. D takes the encoded vector $\mathbf{x}$ as input, so its input layer $D_{\text{in}}$ is $m$-dimensional (the encoding size). D is a generative network: it outputs a vector that is reshaped into an image. Its hidden layers therefore use the leaky rectified linear activation, which tends to give better results in such networks [18]. The hidden layers are regularized by adding dropout layers after the activations [19]. The output layer $D_{\text{out}}$ is $n$-dimensional, where $n = h \times w \times 3$ for an image of height $h$ and width $w$. The output vector is reshaped back into the matrix of an 8-bit RGB image.
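A minimal NumPy sketch of the decoder's forward pass follows. It assumes He-style Gaussian initialization, a linear output layer (the paper does not state the output activation), and inverted dropout applied only during training. `make_decoder` and `decode` are hypothetical names; in the paper, D is built and trained with Keras.

```python
import numpy as np

def leaky_relu(z, alpha=0.2):
    """Leaky rectified linear activation with negative slope alpha."""
    return np.where(z > 0, z, alpha * z)

def make_decoder(m, hidden, n, seed=None):
    """Create (W, b) pairs for an MLP m -> hidden[0] -> ... -> hidden[-1] -> n."""
    rng = np.random.default_rng(seed)
    sizes = [m, *hidden, n]
    return [(rng.normal(0.0, np.sqrt(2.0 / fan_in), size=(fan_in, fan_out)), np.zeros(fan_out))
            for fan_in, fan_out in zip(sizes[:-1], sizes[1:])]

def decode(layers, x, drop_rates=None, rng=None):
    """Forward pass of D; dropout is applied only when drop_rates is given (training mode)."""
    if drop_rates is not None and rng is None:
        rng = np.random.default_rng()
    act = x
    for i, (W, b) in enumerate(layers[:-1]):
        act = leaky_relu(act @ W + b)
        if drop_rates is not None:
            keep = 1.0 - drop_rates[i]
            # Inverted dropout: zero units at random and rescale so the expected activation is unchanged.
            act = act * (rng.random(act.shape) < keep) / keep
    W, b = layers[-1]
    return act @ W + b  # linear n-dimensional output, later reshaped to h x w x 3
```

At inference time, `decode(layers, x)` is deterministic. During training, passing `drop_rates=[0.3, 0.3, 0.2]` mirrors the dropout rates reported in Section 6.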

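D is trained on a dataset of image-encoding pairs obtained from E. If $\mathcal{A} = \{\mathbf{a}_1, \ldots, \mathbf{a}_N\}$ is the set of flattened images, the training dataset is $X = \{(\mathbf{x}_i, \mathbf{a}_i) \mid \mathbf{x}_i = E(\mathbf{a}_i),\ i = 1, \ldots, N\}$.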

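D is trained on the dataset $X$. In the forward pass for the $i$-th flattened image $\mathbf{a}_i$, its encoding $\mathbf{x}_i = E(\mathbf{a}_i)$ is fed into the input layer $D_{\text{in}}$. The output vector $\hat{\mathbf{a}}_i$ from the output layer $D_{\text{out}}$ is then reshaped into the image matrix $\hat{A}_i$. Stochastic gradient descent (SGD) is used to minimize the loss. The loss is the mean squared error (MSE), calculated by comparing $\hat{\mathbf{a}}_i$ (or $\hat{A}_i$) with $\mathbf{a}_i$ (or the original image matrix $A_i$). Training continues until the loss on the test set reaches a minimum. D would be ideal and lossless if $\hat{\mathbf{a}}_i$ were equal to $\mathbf{a}_i$, i.e. if the exact image could be extracted from its encoding.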
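Because D only ever sees pairs, building its training set needs nothing but black-box access to E. The hypothetical helper below treats the encoder as an arbitrary function `encoder_fn`, and the [0, 1] scaling of the targets is an assumption. It is paired with the MSE that D minimizes.

```python
import numpy as np

def build_pairs(images, encoder_fn):
    """Build the training set X: inputs x_i = E(a_i) and regression targets a_i."""
    # Targets: flattened images scaled to [0, 1] (assumed preprocessing), shape (N, n).
    A = np.stack([img.reshape(-1).astype(np.float32) / 255.0 for img in images])
    # Inputs: encodings from calling E as a black box; D never needs E's parameters. Shape (N, m).
    X = encoder_fn(A)
    return X, A

def mse(a_hat, a):
    """Mean squared error between decoded vectors a_hat and original vectors a."""
    return float(np.mean((np.asarray(a_hat, dtype=float) - np.asarray(a, dtype=float)) ** 2))
```

This is also why, as discussed in Section 7, the encoder itself must be protected: anyone who can call `encoder_fn` can generate as many training pairs as they like.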

E runs on the edge, i.e. on the device (or a companion device) on which the image is created or received. Once the encoding is obtained, the original image can be deleted. The algorithm is designed so that encryption can run on low-power portable devices, such as the ARM processor of a miniature camera or a Raspberry Pi-like computer. The encodings can then be transferred safely to the intended party, which holds the trained decoder. The decoded images are obtained by a forward pass (inference) through D.
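On the edge device, the only thing that needs to leave E is the encoding vector. The sketch below assumes the encoding is sent as little-endian float32 bytes; the paper does not specify a wire format. The helpers `pack_encoding` and `unpack_encoding` are hypothetical.

```python
import numpy as np

def pack_encoding(x):
    """Serialise an encoding vector as little-endian float32 bytes for transfer."""
    return np.asarray(x, dtype="<f4").tobytes()

def unpack_encoding(payload, m=1024):
    """Recover the m-dimensional encoding on the decoder side, checking the expected length."""
    x = np.frombuffer(payload, dtype="<f4")
    if x.size != m:
        raise ValueError(f"expected {m} floats, got {x.size}")
    return x
```

With 1024-dimensional encodings as in Section 6, each payload is 4,096 bytes regardless of image content.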

6. Experiments

TensorFlow 1.14 with Keras on Python 3.6.9 was used to implement the framework, which was trained on an NVIDIA GTX 1660 Ti GPU. The decoder was trained on a range of datasets, including MNIST [20], CIFAR-10 [21] and the Cat image dataset [22]. The implementation shown here is for the cat images; the Cat image dataset contains 12,500 training images of cats.

The encoder E is randomly initialized with a hidden layer of 10 units using ReLU activation. The output layer of E has sigmoid activation and a size of 1024, so the encoding size is 1024. There are just 15,774 parameters in E. The images in the cat dataset are resized to $h \times w$ RGB images with $h = w = 150$. The decoder D has 3 hidden layers of 512 units each with leaky rectified linear activation (alpha = 0.2). Each hidden layer is followed by a Dropout layer, with rates of 0.3, 0.3 and 0.2 respectively. The MSE loss is minimized during training of D using SGD. There are 36,727,212 trainable parameters in D.
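The reported sizes can be cross-checked by counting the weights and biases of fully connected layers, as in this sketch. Suppose D's input layer is itself a 1024-unit dense layer, followed by the three 512-unit hidden layers (an assumption). The count then reproduces the reported 36,727,212 exactly. Applying the same formula to E, with a 67,500-dimensional input, 10 hidden units and 1024 outputs, gives 686,274. That differs from the reported 15,774, so the reported figure may correspond to a different input size. The helper `dense_params` is hypothetical.

```python
def dense_params(sizes):
    """Total weights plus biases of a stack of fully connected layers with the given widths."""
    return sum(fan_in * fan_out + fan_out for fan_in, fan_out in zip(sizes[:-1], sizes[1:]))

n = 150 * 150 * 3                            # flattened 150 x 150 RGB image
encoder = [n, 10, 1024]                      # E: input -> 10 hidden units -> 1024-dim encoding
decoder = [1024, 1024, 512, 512, 512, n]     # assumed: 1024-unit dense input layer, then 3 x 512
print(dense_params(encoder), dense_params(decoder))
```

Almost all of D's parameters (34,627,500) are in the final 512-to-67,500 layer, so the output resolution dominates the decoder's size.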

Training for a large number of epochs (>20) quickly leads to overfitting, which artificially lowers the training MSE. Early stopping with a test-MSE threshold of 0.075 (a value derived empirically) is therefore applied. Under this constraint, D trains for 10 epochs and reaches a satisfactory loss minimum on both the training and test sets.
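The threshold-based early stopping described above can be sketched as a plain training loop. `train_until_threshold` is a hypothetical helper. `train_epoch` and `eval_test_mse` stand in for one pass of SGD over the training pairs and an evaluation on the test set. The cap of 20 epochs is an assumption based on the overfitting observation.

```python
def train_until_threshold(train_epoch, eval_test_mse, threshold=0.075, max_epochs=20):
    """Run training epochs until the test MSE drops below threshold; return the test-MSE history."""
    history = []
    for _ in range(max_epochs):
        train_epoch()              # one SGD pass over the training pairs
        mse = eval_test_mse()      # MSE of D on the held-out test pairs
        history.append(mse)
        if mse < threshold:
            break                  # stop before the model starts to overfit
    return history
```

In Keras, the same rule could be expressed as a custom `Callback` whose `on_epoch_end` sets `self.model.stop_training = True` once the monitored test loss drops below 0.075.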

Figure 2: Mean squared error optimization on the training and test sets for the decoder

Random test images decoded by the trained D show promising results. The images generated by D are very lossy and noisy, but they retain the general structure and major details of the original images. The principle is therefore validated; however, any practical use requires further work to improve the quality of the decoded images. It should be noted that these results required extensive hyperparameter optimization and some pre- and post-processing.

Figure 3: Visualization of sample decoded images from the model. Images on the left are original images, which were encrypted by the encoder into encoded vectors. Images on the right are the corresponding images generated by the decoder from those encoded vectors. The examples shown were manually selected as some of the better results on random test images.

7. Advantages & Disadvantages

This new cryptography framework has advantages and disadvantages relative to both conventional cryptography techniques and deep learning techniques. The major advantage is computational. As stated earlier, the encoder can run on most low-power portable devices, and the encodings are practically impossible to decrypt without the trained decoder. Another advantage is that the images are encoded into floating-point vectors of the same size, which adds further security and needs less storage.
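A back-of-the-envelope check of the storage claim for the 150 x 150 cat images of Section 6, assuming 8-bit pixels and encodings stored as 32-bit floats:

```latex
\begin{aligned}
\text{image: } & 150 \times 150 \times 3 \times 1\ \text{byte} = 67{,}500\ \text{bytes} \\
\text{encoding: } & 1024 \times 4\ \text{bytes} = 4{,}096\ \text{bytes} \\
\text{ratio: } & 67{,}500 / 4{,}096 \approx 16.5
\end{aligned}
```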

The major disadvantage concerns the encoder. The process itself is secure in the sense that the encodings can't be decrypted without the trained decoder. However, someone could train another decoder to decrypt the encodings if enough image-encoding pairs are available to them. In other words, if a device containing the encoder falls into an adversary's hands, they can generate a new dataset with that encoder and train another decoder on it. Therefore, while the data (the encodings) are safe, the encoder itself must be protected from falling into the wrong hands. Another disadvantage is information loss: the decryption process is lossy, and the decoded image is of much lower quality than the original image.

8. Conclusions and future work

The proposed framework is explained mathematically and tested experimentally, and the results are shown. This paper has demonstrated the ability of deep learning networks to decrypt encodings of data encrypted by nonlinear functions, suggesting that further research in this area can be useful. The framework can be further extended by:

- Improving the quality of the decoded images, which currently falls short of practical standards. This is because the focus of this paper is to establish and verify the principles rather than to achieve the best quality; another limitation was the available computational power. The decoder can be made deeper by adding more hidden layers of varying types. CNNs can be explored for both the encoder and the decoder [23], [24]. Sequential models such as RNNs and LSTMs [25], as well as ensemble networks, can increase the quality of the decoded images.

- Using more image-encoding pairs to train the decoder, which can reduce the MSE further and give better results. Introducing feature selection techniques can also help reduce the size of the encoded vectors and make them more representative of the images themselves [26].

- Incorporating conventional cryptographic techniques into the encoding process alongside the neural network. This can eliminate the major disadvantage stated above, although the framework would also inherit the disadvantages of those techniques.

References

[1] Schaefer, Edward. "An introduction to cryptography and cryptanalysis." California's Silicon Valley: Santa Clara University (2009).

[2] Stinson, Douglas R. Cryptography: theory and practice. Chapman and Hall/CRC, 2005.

[3] Yu, Wenwu, and Jinde Cao. "Cryptography based on delayed chaotic neural networks." Physics Letters A 356.4-5 (2006): 333-338.

[4] Lotfollahi, Mohammad, et al. "Deep packet: A novel approach for encrypted traffic classification using deep learning." Soft Computing (2017): 1-14.

[5] Hu, Fei, et al. "Batch Image Encryption Using Generated Deep Features Based on Stacked Autoencoder Network." Mathematical Problems in Engineering 2017 (2017).

[6] Hitaj, Briland, Giuseppe Ateniese, and Fernando Perez-Cruz. "Deep models under the GAN: information leakage from collaborative deep learning." Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security. ACM, 2017.

[7] Bengio, Yoshua, Ian Goodfellow, and Aaron Courville. Deep learning. Vol. 1. MIT press, 2017.

[8] Rumelhart, David E., Bernard Widrow, and Michael A. Lehr. "The basic ideas in neural networks." Communications of the ACM 37.3 (1994): 87-93.

[9] Tang, Yang, Zidong Wang, and Jian-an Fang. "Image encryption using chaotic coupled map lattices with time-varying delays." Communications in Nonlinear Science and Numerical Simulation 15.9 (2010): 2456-2468.

[10] Goodfellow, Ian, et al. "Generative adversarial nets." Advances in neural information processing systems. 2014.

[11] Ng, Andrew. "Sparse autoencoder." CS294A Lecture notes 72.2011 (2011): 1-19.

[12] Deng, Li, et al. "Binary coding of speech spectrograms using a deep auto-encoder." Eleventh Annual Conference of the International Speech Communication Association. 2010.

[13] Han, Junwei, et al. "Background prior-based salient object detection via deep reconstruction residual." IEEE Transactions on Circuits and Systems for Video Technology 25.8 (2014): 1309-1321.

[14] Vincent, Pascal, et al. "Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion." Journal of machine learning research 11.Dec (2010): 3371-3408.

[15] Vincent, Pascal, et al. "Extracting and composing robust features with denoising autoencoders." Proceedings of the 25th international conference on Machine learning. ACM, 2008.

[16] Klimov A., Mityagin A., Shamir A. (2002) Analysis of Neural Cryptography. In: Zheng Y. (eds) Advances in Cryptology – ASIACRYPT 2002. ASIACRYPT 2002. Lecture Notes in Computer Science, vol 2501. Springer, Berlin, Heidelberg

[17] Nguyen, Anh, et al. "Plug & play generative networks: Conditional iterative generation of images in latent space." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017.

[18] Radford, Alec, Luke Metz, and Soumith Chintala. "Unsupervised representation learning with deep convolutional generative adversarial networks." arXiv preprint arXiv:1511.06434 (2015).

[19] Srivastava, Nitish, et al. "Dropout: a simple way to prevent neural networks from overfitting." The journal of machine learning research 15.1 (2014): 1929-1958.

[20] LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. (1998). Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11), 2278–2324.

[21] Krizhevsky, A. and Hinton, G. (2009). Learning multiple layers of features from tiny images. Technical report, University of Toronto.

[22] Microsoft Research. (2013). Dogs vs. Cats [25,000 images of dogs and cats]. Retrieved from https://www.kaggle.com/c/dogs-vs-cats

[23] Krizhevsky, Alex, Ilya Sutskever, and Geoffrey E. Hinton. "Imagenet classification with deep convolutional neural networks." Advances in neural information processing systems. 2012.

[24] Yasrab, Robail, Naijie Gu, and Xiaoci Zhang. "An encoder-decoder based convolution neural network (CNN) for future advanced driver assistance system (ADAS)." Applied Sciences 7.4 (2017): 312.

[25] Toderici, George, et al. "Full resolution image compression with recurrent neural networks." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017.

[26] Catal, Cagatay, and Banu Diri. "Investigating the effect of dataset size, metrics sets, and feature selection techniques on software fault prediction problem." Information Sciences 179.8 (2009): 1040-1058.