This work has been supported by the UK’s EPSRC Centre for Doctoral Training in Intelligent Games and Game Intelligence (IGGI; grant EP/L015846/1).
[1] J. Lehman and K. O. Stanley, “Abandoning objectives: Evolution through the search for novelty alone,” Evolutionary Computation, vol. 19, no. 2, pp. 189–222, 2011.
[2] D. P. Kingma and M. Welling, “Auto-Encoding Variational Bayes,” arXiv preprint arXiv:1312.6114, 2013. [Online]. Available: http://arxiv.org/abs/1312.6114
[3] D. P. Kingma and P. Dhariwal, “Glow: Generative Flow with Invertible 1x1 Convolutions,” pp. 1–15, 2018. [Online]. Available: http://arxiv.org/abs/1807.03039
[4] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative Adversarial Nets,” Advances in Neural Information Processing Systems, pp. 2672–2680, 2014. [Online]. Available: http://papers.nips.cc/paper/5423-generative-adversarial-nets.pdf
[5] A. B. L. Larsen, S. K. Sønderby, H. Larochelle, and O. Winther, “Autoencoding beyond pixels using a learned similarity metric,” 2015. [Online]. Available: http://arxiv.org/abs/1512.09300
[6] Z. C. Lipton and S. Tripathi, “Precise Recovery of Latent Vectors from Generative Adversarial Networks,” arXiv preprint arXiv:1702.04782, 2017. [Online]. Available: http://arxiv.org/abs/1702.04782
[7] T. White, “Sampling Generative Networks,” arXiv preprint arXiv:1609.04468, 2016. [Online]. Available: http://arxiv.org/abs/1609.04468
[8] T. Karras, T. Aila, S. Laine, and J. Lehtinen, “Progressive Growing of GANs for Improved Quality, Stability, and Variation,” arXiv preprint arXiv:1710.10196, 2017. [Online]. Available: http://arxiv.org/abs/1710.10196
[9] J. D. Cook, “Willie Sutton and the multivariate normal distribution,” 2011. [Online]. Available: https://www.johndcook.com/blog/2011/09/01/multivariate-normal-shell/
We applied the approach described in this paper to a number of different models and architectures. However, the primary test case we refer to (and from which we show results) is a GAN (specifically, a Progressively Grown GAN [8]) trained on over 100,000 images scraped from the photo-sharing website Flickr. The dataset is very diverse and includes images tagged with: art, cosmos, everything, faith, flower, god, landscape, life, love, micro, macro, bacteria, mountains, nature, nebula, galaxy, ritual, sky, underwater, marinelife, waves, ocean, worship and more. We include three thousand images from each category and train the network with no classification labels. Given such a diverse dataset without any labels, the network is forced to organise its distribution based purely on aesthetics, without any semantic information. Thus, in this high-dimensional latent space, we find directions that allow us to morph seamlessly from swarms of bacteria to clouds of nebulae, oceanic waves to mountains, flowers to sunsets, blood cells to technical illustrations, etc. Most interestingly, we can perform these transformations across categories while maintaining overall composition and form.
We use the open-source non-linear video editor (NLE) Kdenlive (Figure 1) on Ubuntu. Unfortunately, this editor lacks support for exporting the industry-standard Edit Decision List (EDL) format. However, Kdenlive’s native project file format is XML-based. This allows us to write a Python-based parser that loads the project file, inspects the edits, retrieves the corresponding numpy z-sequences, and conforms them by performing the same edits and exporting a new z-sequence. At present the conform is very simple: it supports only basic operations such as trimming, cutting and joining, and does not include cross-fades or other more advanced features or transitions. Implementing such features is relatively straightforward and is left as future work (e.g. a cross-fade between two images in the NLE can be thought of as an interpolation between the two corresponding points in latent space).
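As a rough illustration of the conform step, the following is a minimal sketch and not the parser used in this work. It assumes a heavily simplified, MLT-style project layout in which each playlist `entry` carries a `producer` name and inclusive `in`/`out` frame numbers; real Kdenlive project files are considerably more involved. The names `PROJECT_XML`, `read_edits`, `conform`, `clipA` and `clipB` are hypothetical.

```python
import xml.etree.ElementTree as ET
import numpy as np

# Hypothetical, simplified stand-in for a project file (assumption: each
# <entry> lists a source clip and inclusive in/out frame numbers).
PROJECT_XML = """
<mlt>
  <playlist id="track0">
    <entry producer="clipA" in="10" out="19"/>
    <entry producer="clipB" in="0" out="4"/>
    <entry producer="clipA" in="40" out="44"/>
  </playlist>
</mlt>
"""

def read_edits(xml_text, playlist_id):
    """Return the edits on one track as (clip_name, first_frame, last_frame)."""
    root = ET.fromstring(xml_text.strip())
    edits = []
    for playlist in root.iter("playlist"):
        if playlist.get("id") != playlist_id:
            continue
        for entry in playlist.iter("entry"):
            edits.append((entry.get("producer"),
                          int(entry.get("in")),
                          int(entry.get("out"))))
    return edits

def conform(edits, z_sequences):
    """Apply the same trims and joins to the z-sequences (one row per frame)."""
    pieces = [z_sequences[name][first:last + 1] for name, first, last in edits]
    return np.concatenate(pieces, axis=0)

# Placeholder z-sequences: one 512-D latent vector per video frame.
rng = np.random.default_rng(0)
z_sequences = {
    "clipA": rng.standard_normal((100, 512)),
    "clipB": rng.standard_normal((50, 512)),
}

edits = read_edits(PROJECT_XML, "track0")
z_out = conform(edits, z_sequences)  # new z-sequence mirroring the video edit
```

Because every video frame corresponds to exactly one row of a z-sequence, trimming and joining clips reduces to slicing and concatenating arrays.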
Figure 1: An example project in Kdenlive
We use generative models with high-dimensional (512D) multivariate Gaussian-distributed latent spaces. Because these distributions are concentrated around the surface of a hypersphere [9], when we interpolate between points in this space we have to make sure that our trajectory stays within the distribution. A common solution is to use spherical instead of linear interpolation. However, this produces visible discontinuities in the movement of the output images due to sudden changes in speed and direction. Figures 2 and 3 show two different z trajectories, i.e. journeys in latent space, created by interpolating between a number of arbitrary keyframes. In both images, a single-pixel-wide vertical slice represents a single z vector, and time flows from left to right. Figure 2 visualises the results of spherical interpolation. We can see notch-like vertical artifacts that occur when the interpolation reaches its destination and we set a new target, creating a sudden change in speed and direction. To remedy this we introduce a simple physics-based system, the results of which can be seen in Figure 3. In the high-dimensional latent space we create a particle connected by damped springs to both the surface of the hypersphere and the next destination point. This ensures that the particle stays close to the distribution, but also moves without discontinuities at keyframes.
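The two interpolation schemes can be sketched as follows. This is an illustrative sketch under stated assumptions, not the exact system used in this work: the spring constants, damping coefficient, time step, semi-implicit Euler integrator, and the choice of sqrt(512) as the hypersphere radius are all assumptions, and the names `slerp` and `spring_trajectory` are hypothetical.

```python
import numpy as np

def slerp(a, b, t):
    """Spherical interpolation between latent vectors a and b, t in [0, 1]."""
    cos_omega = np.dot(a / np.linalg.norm(a), b / np.linalg.norm(b))
    omega = np.arccos(np.clip(cos_omega, -1.0, 1.0))
    sin_omega = np.sin(omega)
    if sin_omega < 1e-8:  # nearly parallel vectors: fall back to linear
        return (1.0 - t) * a + t * b
    return (np.sin((1.0 - t) * omega) * a + np.sin(t * omega) * b) / sin_omega

def spring_trajectory(keyframes, steps_per_key, radius,
                      k_target=4.0, k_shell=40.0, damping=6.0, dt=1.0 / 30.0):
    """Move a particle through the keyframes using two damped springs.

    One spring pulls the particle towards the current target keyframe, the
    other pulls it towards the nearest point on the hypersphere of the given
    radius. All constants are assumed values chosen for illustration.
    """
    z = keyframes[0].copy()
    v = np.zeros_like(z)
    out = []
    for target in keyframes[1:]:
        for _ in range(steps_per_key):
            shell_point = z / np.linalg.norm(z) * radius
            force = (k_target * (target - z)
                     + k_shell * (shell_point - z)
                     - damping * v)
            v = v + dt * force  # semi-implicit Euler: update velocity first
            z = z + dt * v      # then position, using the new velocity
            out.append(z.copy())
    return np.array(out)

dim = 512
rng = np.random.default_rng(0)
keyframes = rng.standard_normal((5, dim))  # arbitrary keyframes
radius = np.sqrt(dim)                      # approximate norm of a 512-D Gaussian sample
traj = spring_trajectory(keyframes, steps_per_key=60, radius=radius)
```

In this sketch, changing the target only changes the force acting on the particle, so its velocity carries over from one keyframe to the next, whereas spherical interpolation restarts with a new direction at every keyframe. Note that with these assumed constants the particle approaches each keyframe without necessarily reaching it before the next target is set.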
Figure 2: z sequence using spherical interpolation
Figure 3: z sequence using physical interpolation
As the network trains, the latent space changes with each training iteration, ideally to represent the data more efficiently and accurately. However, a noticeable part of the change across these iterations consists of transformations and shifts in space. E.g. what may be an area in latent space dedicated to ‘mountains’ at iteration 70K might become ‘flowers’ at iteration 80K, while ‘mountains’ slide over to what used to be ‘clouds’ (this is an exaggerated oversimplification). To investigate the effects of these transformations, we render the same z-sequence, decoding it with a number of different snapshots taken across successive training iterations (e.g. the last 28 snapshots spaced 1000 iterations apart), and we tile the outputs in a grid (e.g. 7x4) when saving a video. An example video can be seen at https://www.youtube.com/watch?v=DVsf0ooqFWE and Figures 5-14 show example frames.
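The tiling of per-snapshot outputs into a single frame can be sketched as below. This is a minimal sketch: `fake_decode` is a hypothetical placeholder for decoding a z-vector with the generator weights of a given snapshot, and the tiny 8x8 tile size is chosen only to keep the example small.

```python
import numpy as np

def tile_grid(images, cols, rows):
    """Arrange rows*cols images of shape (H, W, C) into one frame, row by row."""
    images = np.asarray(images)
    n, h, w, c = images.shape
    if n != cols * rows:
        raise ValueError("number of images must equal cols * rows")
    grid = images.reshape(rows, cols, h, w, c)
    grid = grid.transpose(0, 2, 1, 3, 4)  # -> (rows, H, cols, W, C)
    return grid.reshape(rows * h, cols * w, c)

def fake_decode(snapshot_index, z, size=8):
    """Hypothetical placeholder for a generator snapshot: z -> (size, size, 3) image."""
    rng = np.random.default_rng(snapshot_index)  # one fixed weight set per snapshot
    weights = rng.standard_normal((z.shape[0], size * size * 3))
    return np.tanh(z @ weights / np.sqrt(z.shape[0])).reshape(size, size, 3)

# Decode the same z-vector with 28 snapshots and tile the results 7 wide, 4 high.
z = np.random.default_rng(0).standard_normal(512)
tiles = [fake_decode(i, z) for i in range(28)]
frame = tile_grid(tiles, cols=7, rows=4)
```

Repeating this for every z-vector in a z-sequence yields the frames of a tiled video in which each tile follows the same latent trajectory through a different snapshot.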
Figure 4: An example frame where the same z-vector is decoded from 28 snapshots spaced 1000 training iterations apart.
Here, every tile within a frame is the same z-vector decoded from a different snapshot in time (i.e. training iteration). The small number in the top left of each tile (in both black and white) is the iteration number. We can see that in many cases the images are relatively similar, with slight variations. In other cases there are more radical shifts, where earlier snapshots hint at generating one type of image while later snapshots produce another for the same z-vector. Interestingly, even when the images are semantically very different, the overall form and composition are sometimes similar. E.g. in Figure 6 we can see that the space occupied by the current z-vector briefly gives way from mountains to flowers, yet the images maintain the valley-like shape.
When editing in the NLE, we work with these videos, which contain the tiled outputs from multiple snapshots. This gives us an overview of the aesthetic qualities of the different training iterations, and allows us to choose the most aesthetically desirable snapshot(s) to use for our final output.
Figure 5: An example frame where the same z-vector is decoded from 28 snapshots spaced 1000 training iterations apart.
Figure 6: An example frame where the same z-vector is decoded from 28 snapshots spaced 1000 training iterations apart.
Figure 7: An example frame where the same z-vector is decoded from 28 snapshots spaced 1000 training iterations apart.
Figure 8: An example frame where the same z-vector is decoded from 28 snapshots spaced 1000 training iterations apart.
Figure 9: An example frame where the same z-vector is decoded from 28 snapshots spaced 1000 training iterations apart.
Figure 10: An example frame where the same z-vector is decoded from 28 snapshots spaced 1000 training iterations apart.
Figure 11: An example frame where the same z-vector is decoded from 28 snapshots spaced 1000 training iterations apart.
Figure 12: An example frame where the same z-vector is decoded from 28 snapshots spaced 1000 training iterations apart.
Figure 13: An example frame where the same z-vector is decoded from 28 snapshots spaced 1000 training iterations apart.
Figure 14: An example frame where the same z-vector is decoded from 28 snapshots spaced 1000 training iterations apart.