Spherical images, which capture all possible directions (360° horizontally and 180° vertically), are used in various domains such as surveillance systems, the building industry, tourism, autonomous cars, and entertainment. They can capture the environment around the main subject and represent the space itself as a subject. Furthermore, when viewed through a head-mounted display, spherical images allow one to enjoy a scene in a more immersive manner. However, capturing spherical images is not an easy task, as doing so requires a specific panoramic camera or specific software that stitches together images taken from multiple directions. Therefore, it would be more convenient to generate a spherical image from a single normal-field-of-view (NFOV) image taken using a normal camera. Such generation would also considerably expand the usage scenarios that require plausibility rather than reproducibility; for example, the background of VR content could be created from a single concept photo, and generating the peripheral views of pictures taken in the past or of historic paintings would allow viewers to enjoy the content with an immersive feeling.
Figure 1. Symmetry types of spherical images in the SUN360 dataset [35]. Images are represented by equirectangular projection (see Fig. 2), and arrows in the circles correspond to the viewpoint transitions that do not change the appearance significantly.
However, two main challenges exist in generating a spherical image from a single NFOV image: one is generating an image that conforms to the spherical structure, and the other is controlling the high degree of freedom involved in generating a wide area, which includes all directions of a plausible spherical image. Because a spherical image cannot be uniquely determined from an NFOV image, the generation of various plausible scenes must be controlled.
Generating a spherical image from an NFOV image is related to the task of image completion, which predicts a whole image from a partial one. However, conventional image-completion methods [25, 16, 38] are not suitable for generating spherical images, because these methods are designed for planar NFOV images. Recently, a spherical-image-completion method using a single NFOV image was proposed [1]. This method can handle the spherical structure by performing rearrangement in an equirectangular projection for the input and employing dilated convolution; however, it cannot control the content of the generated regions. Therefore, unlike conventional research works, we handle the spherical structure and aim to control the aforementioned degree of freedom in order to obtain plausible variations of the desired spherical image.
Various factors control spherical-image generation. Among these factors, scene symmetry is a basic property of the global structure of spherical images. As pointed out in [35], the 360° view has specific symmetry types, which are characterized by whether specific geometric operations such as rotation and flip change the appearance significantly. Fig. 1 depicts the typical types of symmetry, namely rotational symmetry, plane symmetry, and asymmetry. When a spherical image possesses a certain level of symmetry, a correlation exists between specific areas of the image. On the basis of this observation, we control the scene-symmetry properties to improve the plausibility and diversity of the generated images. Furthermore, we control the symmetry intensity, as a perfectly symmetric image is sometimes unnatural.
The contributions of this paper can be summarized as follows.
• We propose a novel spherical-image-generation method using a single NFOV image; this method improves the plausibility of the generated images by leveraging scene symmetry.
• We design a new architecture that is able to estimate and control the symmetry of the image by using the circular shift and the flip of the hidden variables of convolutional neural networks (CNNs).
• We demonstrate that our proposed method can generate multiple spherical images, controlled from symmetric to asymmetric.
2.1. Image Completion
Various image-completion technologies have been proposed thus far in order to predict the missing regions of an image. Traditionally, several diffusion-based methods [3, 7] diffuse the information of the visible regions to the missing regions, and several patch-based methods [10, 5] complete the missing regions by matching, copying, and realigning using the visible regions. Both these methods assume that the missing regions contain information correlated with the visible regions.
Recently, generative models learned from large-scale datasets, such as the variational autoencoder (VAE) [21, 28] and generative adversarial networks (GANs) [13], have advanced significantly, and both have been adopted for image completion. Li et al. [25] directly generated the contents of missing regions on the basis of CNNs [12, 23] with a combination of a reconstruction loss, a semantic parsing loss, and two adversarial losses. Furthermore, Iizuka et al. [16] employed global and local context discriminators in the framework of adversarial learning to improve the naturalness and consistency of the completed regions.
While most image-completion methods produce only one result for each input, Zheng et al. [38] presented a method to generate multiple and diverse plausible solutions for image completion by using conditional VAEs (CVAEs) [30]; however, their method cannot explicitly control the generated contents. Therefore, unlike any of the previous methods, we propose a spherical-image-generation method that can explicitly control the symmetry.
2.2. Panoramic and Spherical Image Generation
Zhang et al. [37] proposed to extrapolate an NFOV image to a panoramic image, with the guidance of a panorama image. The method required a guide image belonging to the same scene category to which the input image belonged. Kimura et al. [19] proposed a peripheral-image-generation method based on pix2pix [17]; however, the field of view of the generated image was limited. Sumantri et al. [32] proposed a spherical-image-generation method based on pix2pixHD [34], which required a set of images captured from multiple directions as an input. Recently, Akimoto et al. [1] proposed a spherical-image-completion method using a single NFOV image. This method can handle the spherical structure by means of rearrangement in an equirectangular projection for the input and employing dilated convolution; however, it cannot control the content of the generated regions. Unlike these conventional research works, we handle the spherical structure and aim to control the aforementioned degree of freedom to obtain plausible variations of the desired spherical image.
2.3. CNN for Spherical Signals
Ordinary 2D CNNs are designed on the basis of translational equivariance on a plane; therefore, they cannot be applied directly to spherical images. One CNN-based approach for spherical signals is to apply planar CNNs in a projected plane, such as the equirectangular projection [31, 2] depicted in Fig. 2, tangent planes [33], or cube mapping [8]. Other approaches that do not use 2D planar CNNs, such as generalized FFT-based CNNs on the sphere [9, 11], distortion-aware sampling [24, 6], and graph convolution on a grid sampled on the sphere [18, 27], were also proposed.
Figure 2. Equirectangular projection. Point P on the 2D sphere is converted to point
Although the methods based on planar CNNs suffer from projection distortion, good spherical images were generated by using the equirectangular projection in [32, 1]. Furthermore, the equirectangular projection offers the advantage that symmetry operations around the gravity axis can be realized using the circular shift and the flip with low computational requirements. In this study, we therefore employ equirectangular-projection-based CNNs.
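To illustrate why these symmetry operations are cheap in the equirectangular projection, the following minimal NumPy sketch treats the image width as 360° of longitude. The function names and the toy panorama are hypothetical and are not taken from our implementation.

```python
import numpy as np

def rotate_about_gravity_axis(equi, degrees):
    """Rotate an equirectangular image (H, W, C) about the vertical axis.

    The width spans 360 degrees of longitude, so a rotation is just a
    circular shift of the columns.
    """
    width = equi.shape[1]
    shift = int(round(width * degrees / 360.0))
    return np.roll(equi, shift, axis=1)

def mirror_about_vertical_plane(equi):
    """Mirror the scene in the vertical plane through the horizontal image center."""
    return np.flip(equi, axis=1)

# Toy panorama: 4 rows, 8 columns, 1 channel; each column stores its own index.
equi = np.tile(np.arange(8).reshape(1, 8, 1), (4, 1, 1))

rot90 = rotate_about_gravity_axis(equi, 90)     # shift by 8 / 4 = 2 columns
rot360 = rotate_about_gravity_axis(equi, 360)   # full turn leaves the image unchanged
mirrored = mirror_about_vertical_plane(equi)
```

Both operations only reorder columns, so they introduce no interpolation error and no resampling cost.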
This section describes our proposed method, which generates a spherical image from a partial image (e.g., an NFOV image) by using a scene-symmetry parameter s. The term s corresponds to the intensity of C types of symmetry: rotational symmetry around the gravity axis and plane symmetry with respect to vertical planes (the gravity axis is aligned with the negative direction of the z-axis in Fig. 2). To generate diverse images conditioned on the partial image, we employ CVAEs [30] as a base framework and incorporate the scene-symmetry parameter as a latent variable. Furthermore, the probability density functions are represented by equirectangular-projection-based CNNs, and scene symmetry is implemented as a circular shift and a flip of the hidden variables. The following describes the details of our method.
3.1. CVAEs with Symmetry Parameters
First, we describe our probabilistic framework. To generate diverse spherical images conditioned on the partial image, we employ CVAEs [30] as a base framework. We obtain the conditional distribution of the spherical image given the partial image and s by maximizing the variational lower bound of the likelihood for the training data (s is not required), following which we sample a new spherical image for the given partial image and s from this distribution.
In CVAEs, the variational lower bound of the conditional log-likelihood of observing a partial image for a spherical image is expressed as Eq. (1). In this bound, the latent vector encodes the information of the whole image; q is the posterior importance sampling function; p denotes the conditional prior and the likelihood; each of these three distributions has its own neural-network parameters; KL(q||p) is the KL divergence between q and p; and the expectation is taken with respect to the probability distribution q. Each probability distribution is obtained by determining the parameters so as to maximize the variational lower bound.
Furthermore, we introduce the scene-symmetry parameter s as a part of the latent vector, and we let s be independent of the partial image. Thus, Eq. (1) can be rewritten as Eq. (2). The network parameters are determined so as to maximize this variational lower bound for the training data. We then obtain the targeted conditional distribution, Eq. (3), from which a spherical image is sampled for the given partial image and s.
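For orientation, the generic form of such a bound can be written as below. This is a sketch in assumed notation, not a reproduction of Eqs. (1) and (2): x stands for the spherical image, y for the partial image, and the subscripts ψ, φ, and θ for the parameters of the posterior, the prior, and the likelihood networks. The first line is the standard variational lower bound with the joint latent (z, s); the second line only uses the chain rule together with the stated independence of s from the partial image.

```latex
\log p(\mathbf{x} \mid \mathbf{y})
  \;\ge\;
  -\,\mathrm{KL}\bigl(q_\psi(\mathbf{z}, \mathbf{s} \mid \mathbf{x}, \mathbf{y})
      \,\big\|\, p_\phi(\mathbf{z}, \mathbf{s} \mid \mathbf{y})\bigr)
  + \mathbb{E}_{q_\psi(\mathbf{z}, \mathbf{s} \mid \mathbf{x}, \mathbf{y})}
      \bigl[\log p_\theta(\mathbf{x} \mid \mathbf{z}, \mathbf{s}, \mathbf{y})\bigr],
\\[4pt]
p_\phi(\mathbf{z}, \mathbf{s} \mid \mathbf{y})
  = p_\phi(\mathbf{z} \mid \mathbf{s}, \mathbf{y})\, p(\mathbf{s}).
```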
3.2. Likelihood for Reconstruction and Generation
We assign two kinds of functions to the log-likelihood term. The first function is the reconstruction likelihood, which is used to reconstruct the spherical image corresponding to the partial image in the training data. The second function is the generation likelihood, which is used to generate a sample that does not depend on a specific spherical image but follows the distribution of the training data (we allow these likelihood functions to deviate from a probability distribution). Thereafter, we maximize the combination of these two functions to generate samples that possess both plausibility and diversity. The combination of the two criteria, namely, the reconstruction and generation errors, is inspired by PICNet [38]. However, our method differs in the following important point: the scene-symmetry parameter is incorporated into the latent variables.
First, because it is difficult to maximize the reconstruction likelihood directly, we employ a decomposition approximation, Eq. (4). Using this approximation, the right-hand side of Eq. (2) is transformed into Eq. (5). For the generation likelihood, in contrast, we employ an approximation that removes the dependency on the spherical image, Eq. (6). Using this approximation, and because KL(q||p) takes its minimum value of 0 when q = p, the right-hand side of Eq. (2) is transformed into Eq. (7).
We combine both likelihoods and maximize an evaluation function, Eq. (8), that adds the two variational lower bounds with a mixture ratio. We determine the network parameters so as to maximize the sum of this evaluation function over the training data. Details of the probabilistic framework are described in Appendix A.
3.3. Probability Distribution Settings
The prior and posterior distributions for the latent variables z and s are defined in terms of the probability density function of a normal distribution, specified by a mean vector and a covariance matrix, and a delta function. The mean vector and the covariance matrix of the prior of s are hyperparameters. Furthermore, F denotes the function for image-feature extraction, and a set of estimator functions estimates the parameters of each distribution. In our implementation, the covariance estimators output diagonal matrices, indicating that the elements of z are conditionally independent in both the prior and the posterior.
In addition, the likelihood functions are defined using the following components: weighting factors; G, a function that generates a spherical image; D, a function that outputs confidences to discriminate real images for multiple partial regions; 1, a constant vector whose elements are all 1; and M, a function that extracts the image in the region corresponding to the partial image.
Figure 3. Circular padding. The circular padding eliminates the discontinuity between the left and right edges of the equirectangular image for convolution.
3.4. Network Structure
The functions are implemented as fully convolutional neural networks (CNNs). To eliminate the discontinuity between the left and right edges of the equirectangular image, we employ circular padding before each convolution layer. As depicted in Fig. 3, the circular padding copies
mod 2 columns from the left and right edges of the image to each opposite side in the equirectangular image, where k denotes the kernel size of the convolution. The implementation detail of each function is described in Section 4.1.
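The idea of circular padding can be sketched with NumPy's wrap-around padding. The padding width used here, `kernel_size // 2` columns per side, is an assumption (the usual "same" padding for an odd kernel) and may differ from the exact count used in our networks; the function name is hypothetical.

```python
import numpy as np

def circular_pad_width(feature, kernel_size):
    """Wrap-pad a feature map (H, W, C) along the width (longitude) axis.

    Assumption: kernel_size // 2 columns are copied to each side, so that a
    convolution with an odd kernel keeps the original width.
    """
    pad = kernel_size // 2
    # Only the width axis is padded; height and channels are left untouched.
    return np.pad(feature, ((0, 0), (pad, pad), (0, 0)), mode="wrap")

feature = np.arange(6, dtype=np.float32).reshape(1, 6, 1)  # columns 0..5
padded = circular_pad_width(feature, kernel_size=3)
```

After padding, the rightmost original column sits to the left of the first column and vice versa, so a convolution window that crosses the image seam sees the true neighbors on the sphere.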
Fig. 4 depicts the entire network structure of the proposed method. The input spherical image and the partial image are represented using equirectangular projections (see Fig. 2). The encoder F and the estimators calculate the latent variables z and s for the spherical image; in addition, a second latent variable is calculated for the partial image using the reparameterization trick [21]. Moreover, F outputs two partial features for the partial image, which are obtained from a hidden layer and the last layer, respectively. In addition, a second symmetry parameter is sampled from p(s). The decoder G generates two spherical images: one from the latent variable of the spherical image and the estimated s, and the other from the latent variable of the partial image and the sampled symmetry parameter. The scene symmetry s is estimated on the basis of the rotated and flipped differences of the hidden variables, and the hidden variables of G are circularly shifted, flipped, and copied depending on s, as described in Section 3.5. In this manner, the symmetry of the spherical image is encoded and decoded. The discriminators D output the confidences for the two generated images in order to discriminate real images. Thereafter, the loss is calculated according to Eq. (8).
3.5. Estimating and Controlling Scene Symmetry
We propose a novel method for estimating and controlling the scene symmetry of spherical images by using both a circular shift and a flip in the equirectangular projection. Because our networks are fully convolutional, the hidden variables retain the location information of the input equirectangular image. In the equirectangular projection, a
Figure 4. Structure of our proposed method for spherical-image generation. During training, the spherical image and the partial (NFOV) image
are used. During testing, a spherical image
is generated from a single
Figure 5. Circular shift. The circular shift is used for estimating and controlling the rotational symmetry.
Figure 6. Flip on the axis. This operation combines 0
circular shift. The flip is used for estimating and controlling the plane symmetry.
Here, the symmetry estimate is computed using a sigmoid function, a symmetric-transformation function consisting of the circular shift and flip, and trainable parameters.
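The following NumPy sketch shows one way such an estimate could be computed. It is illustrative only: the estimator form, sigmoid(bias - scale * mean |f - T_i(f)|), is an assumption that merely combines the ingredients named above (a difference to the transformed features, trainable parameters, and a sigmoid), and the column convention (the image center faces azimuth 0°) is also assumed. All function and parameter names are hypothetical.

```python
import numpy as np

def symmetric_transforms(f):
    """Return five transformed copies of a feature map f of shape (H, W, C).

    Rotations by 90, 180, and 270 degrees are circular shifts by W/4, W/2,
    and 3W/4 columns. A mirror in the vertical plane at azimuth a is a flip
    followed by a shift of W * a / 180 columns (assumed column layout).
    """
    w = f.shape[1]
    flipped = np.flip(f, axis=1)
    return [
        np.roll(f, w // 4, axis=1),        # 90-degree rotation
        np.roll(f, w // 2, axis=1),        # 180-degree rotation
        np.roll(f, 3 * w // 4, axis=1),    # 270-degree rotation
        flipped,                           # plane symmetry, 0-degree axis
        np.roll(flipped, w // 2, axis=1),  # plane symmetry, 90-degree axis
    ]

def estimate_symmetry(f, scale, bias):
    """Hypothetical estimator: sigmoid(bias - scale * mean |f - T_i(f)|).

    `scale` and `bias` stand in for the trainable parameters.
    """
    diffs = np.array([np.mean(np.abs(f - t)) for t in symmetric_transforms(f)])
    return 1.0 / (1.0 + np.exp(-(bias - scale * diffs)))

# A feature map whose columns repeat with period W/2 is 180-degree symmetric only.
f = np.tile(np.array([0.0, 1.0, 2.0, 3.0] * 2).reshape(1, 8, 1), (2, 1, 1))
s = estimate_symmetry(f, scale=4.0, bias=2.0)
```

In this toy case only the second element of `s` (the 180° rotation) is close to 1, because the other four transformations change the feature map.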
The decoder G controls the scene symmetry of the generated image. First, we define the symmetry-control function H, which takes the weighted linear sum of the symmetric transformations of the input feature f. In this function, a weight vector is applied through an elementwise product, and the quotient is also elementwise. The element of the weight vector corresponding to the position v on the sphere is defined using a concentration hyperparameter, the center position c of the input partial image on the sphere, and the inner product in 3D Euclidean space. That is, a high weight is applied at the position of the partial image, and the weight attenuates with distance from it.
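A minimal NumPy sketch of a symmetry-control function of this kind is given below. The concrete formulas are assumptions chosen to match the verbal description: the weight is taken as exp(kappa * <v, c>), and H is taken as the s-weighted sum of the weighted feature and its transformed copies, divided elementwise by the same sum applied to the weights. The function names, the grid convention, and the single 180° transformation in the example are hypothetical.

```python
import numpy as np

def sphere_points(h, w):
    """Unit vectors for the pixel centers of an (h, w) equirectangular grid."""
    lat = (0.5 - (np.arange(h) + 0.5) / h) * np.pi         # top row near +pi/2
    lon = ((np.arange(w) + 0.5) / w - 0.5) * 2.0 * np.pi   # columns from -pi to +pi
    lon, lat = np.meshgrid(lon, lat)
    return np.stack([np.cos(lat) * np.cos(lon),
                     np.cos(lat) * np.sin(lon),
                     np.sin(lat)], axis=-1)

def position_weight(h, w, center, kappa=3.0):
    """Assumed weight exp(kappa * <v, c>): largest at the partial-image center c."""
    return np.exp(kappa * sphere_points(h, w) @ center)[..., None]

def symmetry_control(f, s, weight, transforms):
    """Assumed form of H: weighted sum of f and its transformed copies,
    normalized elementwise by the same sum applied to the weights."""
    num = weight * f
    den = weight.copy()
    for s_i, transform in zip(s, transforms):
        num = num + s_i * transform(weight * f)
        den = den + s_i * transform(weight)
    return num / den

h, w = 4, 8
rng = np.random.default_rng(0)
f = rng.random((h, w, 2))
weight = position_weight(h, w, center=np.array([1.0, 0.0, 0.0]))
rot180 = lambda x: np.roll(x, w // 2, axis=1)

asym = symmetry_control(f, [0.0], weight, [rot180])  # s = 0: features unchanged
sym = symmetry_control(f, [1.0], weight, [rot180])   # s = 1: 180-degree symmetric
```

With an intermediate intensity such as s = 0.3, the output stays close to f near the partial image, where the weight is high, and is filled in mostly by the transformed copy on the opposite side.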
The scene symmetry is reflected in the generated image as follows. The decoder G takes as the input of the first layer and, subsequently, concatenates
to the output of the specified hidden layer in the channel dimension. Thereafter, G outputs the spherical image
.
3.6. Adversarial Learning
We employ adversarial learning combined with VAE, as used in [22, 4, 38]. First, we learn to maximize the evaluation function of Eq. (8). Subsequently, on the basis of LSGAN [26] for each mini-batch data, we learn D to minimize the following loss function:
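As a minimal sketch, assuming the standard least-squares targets of 1 for real images and 0 for generated images [26], a discriminator loss of this type can be written as follows. The function name is hypothetical, and the loss used in our method may contain separate terms for the reconstructed and generated images.

```python
import numpy as np

def lsgan_discriminator_loss(conf_real, conf_fake):
    """Least-squares GAN loss for D with targets 1 (real) and 0 (generated)."""
    conf_real = np.asarray(conf_real, dtype=np.float64)
    conf_fake = np.asarray(conf_fake, dtype=np.float64)
    return 0.5 * np.mean((conf_real - 1.0) ** 2) + 0.5 * np.mean(conf_fake ** 2)

# Confidences for four partial regions of one real and one generated image.
perfect = lsgan_discriminator_loss([1.0, 1.0, 1.0, 1.0], [0.0, 0.0, 0.0, 0.0])
undecided = lsgan_discriminator_loss([0.5, 0.5, 0.5, 0.5], [0.5, 0.5, 0.5, 0.5])
```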
3.7. Sampling-Generated Spherical Image
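After training, we sample a spherical image from the targeted conditional distribution in Eq. (3) for the given partial image and s by using the following two steps: (i) sampling z from the conditional prior in Eq. (10), and (ii) sampling the spherical image from the likelihood corresponding to Eq. (13). Because this likelihood function does not represent a probability distribution, we instead take the sample that maximizes it.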
We conducted experiments to verify the effectiveness of the proposed method. To that end, we used the SUN360 dataset [35], which includes various spherical images, both symmetric and asymmetric, of indoor and outdoor scenes. The data were divided into 50,000 images for training, 10,000 images for testing, and 5,000 images for validation. Each spherical image was an RGB image in the equirectangular format. Furthermore, a partial (NFOV) image was cropped from the spherical image with a field of view of 30° to 120° and an aspect ratio of 1:1; the viewpoint direction was set randomly on the sphere, and the cropped image was projected onto the equirectangular image, whose margin was filled with gray values.
4.1. Implementation Details
We trained the networks from scratch using Adam optimizer [20] with a fixed learning rate of and a mini-batch size of 8. During the optimization, the weighting factors of the likelihood were set to
and
; the priors for symmetry were set to
and
0.33; the concentration parameter
was set to 3.0; the mixture ratio of the two approximations was set to
. Furthermore, we considered C = 5 types of symmetry, namely, the 90°, 180°, and 270° rotational symmetries and the plane symmetries with respect to the 0° and 90° axes.
Our network consisted of a multi-layer ResBlock with circular padding (RBCP), which is a unit module of residual networks [14] with circular padding, as depicted in Fig. 8. The RBCP has three modes: (s) standard, (d) down-sampling, and (u) up-sampling, and the behavior of each of these is different, as depicted in Fig. 8. The structure of each function is as follows:
• F: two-layer RBCP(s) and five-layer RBCP(d).
• and
: one-layer RBCP(s).
• and
: seven-layer RBCP(s) and share the weights of the first six layers.
• : one-layer RBCP(d).
• G: one-layer RBCP(s) and five-layer RBCP(u).
• D: five-layer RBCP(d) and conv. with 1 output channel.
The encoder F outputs two features, which are the outputs of the fifth and final (seventh) layers, respectively. Furthermore, we fixed the size (height, width, and channel) of these features and of the latent variable. After the third layer of G and D, we employed self-attention [36] to harness the distant spatial context. Both discriminators D output the confidences for multiple partial regions. These configurations were matched to those of PICNet [38] for comparison.
Table 1. Comparison with each method.
Table 2. Comparison with each viewpoint and FOV.
4.2. Comparison with Related Works
We compare our method with pix2pix [17] and PICNet [38]. Pix2pix has been used for peripheral-image generation [19] and as a basis for spherical-image generation [1]. PICNet is a state-of-the-art image-completion method; its input comprises partial equirectangular images, similar to the proposed method. Additionally, circular padding is added before each convolution layer.
4.2.1 Qualitative Evaluation
Fig. 7 depicts examples of spherical images generated from the frontal 90° views by using each method. In our proposed method, the symmetry parameter s was set to five variations, namely, (h, h, h, l, l), (l, h, l, l, l), (l, l, l, h, l), (l, l, l, l, h), and (l, l, l, l, l), where h = 1.0 and l = 0.3, for the multiple-of-90° and 180° rotational symmetries, the plane symmetries with respect to the 0° and 90° axes, and asymmetry, respectively. In contrast, PICNet generates multiple samples using randomized latent variables. Although PICNet can generate a variety of images, it cannot explicitly control the generated contents. For example, unlike our method, none of the images generated using PICNet possesses a spatial structure that differs significantly from that of the others. More generation results are available in Appendix B.
4.2.2 Quantitative evaluation
We evaluated each method by generating 10,000 images from partial images with randomized viewpoints and FOVs for the test dataset and computing the Fréchet inception distance (FID) score [15], which measures the distance between the distribution of the ground truth and that of the generated images. The symmetry parameter of our proposed method was sampled from the prior in Eq. (12). The evaluation results are presented in Table 1. Our method achieved a better FID score than the other methods, indicating that it can generate plausible images by capturing the distribution of high-order features of the spherical images.
Figure 7. Qualitative comparison between spherical images generated using our method and those generated using previous works. Our method enables scene-symmetry control.
We also evaluated the images generated from fixed views, i.e., the front 30°, 60°, 90°, and 120° views and the top 90° and bottom 90° views. The generated samples are depicted in Fig. 9, and the evaluation metrics are listed in Table 2. We can generate spherical images satisfactorily from a wide FOV. However, the generation quality is poor for a narrow FOV and for the top and bottom viewpoints, as these views do not contain sufficient information about the entire image.
We now quantitatively validate the symmetry control. Here, we define a symmetry-evaluation metric (SEM) as the normalized autocorrelation between the regular features and the symmetric-transformed features (see Section 3.5) obtained from the fifth block of VGG16 [29] for each symmetry type. We set s_i to 0.00, 0.25, 0.50, 0.75, and 1.00 for the index i corresponding to the targeted symmetry type, and the other elements were fixed to a low symmetry intensity, i.e., 0.3. Fig. 10 depicts the SEM of the four symmetry types for the generated images (the SEM for the multiple-of-90° rotational symmetry is the average of the SEMs for the 90°, 180°, and 270° rotations). This shows that varying our symmetry parameter s allows the symmetry of the generated image to be controlled over a sufficiently wide range, compared with the test images. Compared with the rotational symmetry, the control range of the plane symmetry is narrower, as there are some infeasible cases in which the symmetry axis lies in an asymmetric partial image (e.g., the image in the eighth row and fifth column of Fig. 7). Furthermore, when the targeted symmetry parameter is 0.00, the SEM is slightly large because it is affected by the other symmetry parameters, which are fixed to 0.3.
To validate the plausibility of these generated images, we divided the test images into 2,048 images with higher SEM and 2,048 images with lower SEM for each of the four symmetry types, following which we measured the FID scores of the generated images for each subset, as depicted in Fig. 11. When s is sufficiently large, our method performs well for the highly symmetric images. Furthermore, whichever value is set for s, except for values of approximately 0, our method outperforms the baseline for both the highly and weakly symmetric images. Notably, our method outperformed the baseline even for the weakly symmetric images. The circular shift and flip not only control the symmetry but also allow convolutions to use distant positions of the spherical image (i.e., the 180° opposite side), leading to spherical-image generation that considers the consistency of the entire space.
Figure 8. Residual block with circular padding. There are three modes: (s) standard, (d) down-sampling, and (u) up-sampling. For example, during the standard mode, only the modules marked with (s) are used.
Figure 9. Generation results from various views for the same ground truth (the fourth image from the left in Fig. 7). Top: input images, Bottom: generated images.
Figure 10. Symmetry-evaluation metric of the four symmetric types for test images and generated images. The boxplot shows median, quartile, maximum, and minimum.
Figure 11. FID scores for high and low symmetric images.
Table 3. Comparison with each loss function.
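The symmetry-evaluation metric used in this section can be sketched as follows. This is a minimal illustration under two assumptions: "normalized" is read as division by the two L2 norms (i.e., cosine similarity), and small hand-made arrays stand in for the VGG16 features. The function name and the example arrays are hypothetical.

```python
import numpy as np

def symmetry_evaluation_metric(features, transform):
    """Normalized correlation between a feature map and its transformed copy.

    Assumption: the normalization divides by the L2 norms of both maps.
    """
    a = features.ravel()
    b = transform(features).ravel()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# 180-degree rotation as a circular shift by half the width.
rot180 = lambda x: np.roll(x, x.shape[1] // 2, axis=1)

# Features that repeat every half turn versus features present on one side only.
periodic = np.tile(np.array([1.0, 0.0, 1.0, 0.0]).reshape(1, 4, 1), (2, 1, 1))
one_sided = np.tile(np.array([1.0, 0.0, 0.0, 0.0]).reshape(1, 4, 1), (2, 1, 1))

sem_periodic = symmetry_evaluation_metric(periodic, rot180)
sem_one_sided = symmetry_evaluation_metric(one_sided, rot180)
```

Under this reading, a perfectly 180°-symmetric feature map scores 1, and a map whose content appears on one side only scores 0.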
4.3. Ablation Study
We compared the generation quality obtained with the evaluation function in Eq. (5), that in Eq. (7), and their combination in Eq. (8), as shown in Table 3. The combination is superior in terms of FID. With the function in Eq. (5), using the estimated s achieved a low L1 loss; however, the estimated s has little variance. Because it is conditioned on the input spherical image, the network was not trained for various s. In contrast, with the function in Eq. (7), the network learned with a wide range of s sampled from p(s); however, it did not learn with the s corresponding to the input spherical image. By combining the two functions, the network can be trained for both the symmetry of the input image and the various symmetries specified by a symmetry parameter. Additional ablation studies are described in Appendix C.
We proposed a novel method to generate spherical images from a single NFOV image by controlling scene symmetry. We incorporated scene symmetry into CVAEs as a latent variable, and the scene symmetry was implemented as a circular shift and a flip of the hidden variables of the neural networks. Furthermore, our experimental results showed that the proposed method can generate various plausible spherical images controlled from symmetric to asymmetric. Future work includes generating high-resolution images and controlling the local features of the generated images.
This work was partially supported by JST CREST Grant Number JPMJCR1403, and partially supported by JSPS KAKENHI Grant Number JP19H01115. We would like to thank Antonio Tejero de Pablos, Atsuhiro Noguchi, Li Yang, Sho Inayoshi, Takuhiro Kaneko, Wataru Kawai and Yusuke Kurose for helpful discussions.
[1] N. Akimoto, S. Kasai, M. Hayashi, and Y. Aoki. 360-degree image completion by two-stage conditional GANs. In ICIP, 2019.
[2] M. Assens, X. Giro-i Nieto, K. McGuinness, and N. E. O’Connor. Scanpath and saliency prediction on 360 degree images. In Signal Processing: Image Communication, 2018.
[3] C. Ballester, M. Bertalmio, V. Caselles, G. Sapiro, and J. Verdera. Filling-in by joint interpolation of vector fields and gray levels. In TIP 10(8), pages 1200–1211, 2001.
[4] J. Bao, D. Chen, F. Wen, H. Li, and G. Hua. CVAE-GAN: Fine-grained image generation through asymmetric training. In ICCV, 2017.
[5] C. Barnes, E. Shechtman, A. Finkelstein, and D. B. Goldman. PatchMatch: A randomized correspondence algorithm for structural image editing. In ToG, 2009.
[6] B. Coors, A. P. Condurache, and A. Geiger. SphereNet: Learning spherical representations for detection and classification in omnidirectional images. In ECCV, 2018.
[7] M. Bertalmio, G. Sapiro, V. Caselles, and C. Ballester. Image inpainting. In SIGGRAPH, 2000.
[8] H. T. Cheng, C. H. Chao, J. D. Dong, H. K. Wen, T. L. Liu, and M. Sun. Cube padding for weakly-supervised saliency prediction in 360 videos. In CVPR, 2018.
[9] T. S. Cohen, M. Geiger, J. Koehler, and M. Welling. Spherical CNNs. In ICLR, 2018.
[10] A. Criminisi, P. Perez, and K. Toyama. Object removal by exemplar-based inpainting. In CVPR, 2003.
[11] C. Esteves, C. Allen-Blanchette, A. Makadia, and K. Daniilidis. Learning SO(3) equivariant representations with spherical CNNs. In ECCV, 2018.
[12] K. Fukushima and S. Miyake. Neocognitron: A new algorithm for pattern recognition tolerant of deformations and shifts in position. In Pattern Recognition 15(6), pages 455–469, 1982.
[13] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In NeurIPS, 2014.
[14] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, 2015.
[15] M. Heusel, H. Ramsauer, T. Unterthiner, and B. Nessler. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In NeurIPS, 2017.
[16] S. Iizuka, E. Simo-Serra, and H. Ishikawa. Globally and locally consistent image completion. In SIGGRAPH, 2017.
[17] P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros. Image-to-image translation with conditional adversarial networks. In CVPR, 2017.
[18] R. Khasanova and P. Frossard. Graph-based classification of omnidirectional images. In ICCV Workshops, 2017.
[19] N. Kimura and J. Rekimoto. ExtVision: Augmentation of visual experiences with generation of context images for a peripheral vision using deep neural network. In CHI, 2018.
[20] D. P. Kingma and J. L. Ba. Adam: A method for stochastic optimization. In arXiv preprint arXiv:1412.6980, 2014.
[21] D. P. Kingma and M. Welling. Auto-encoding variational Bayes. In arXiv preprint arXiv:1312.6114, 2013.
[22] A. B. L. Larsen, S. K. Sonderby, H. Larochelle, and O. Winther. Autoencoding beyond pixels using a learned similarity metric. In arXiv preprint arXiv:1512.09300, 2015.
[23] Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel. Backpropagation applied to handwritten zip code recognition. In Neural Computation 1(4), pages 541–551, 1989.
[24] Y. K. Lee, J. Jeong, J. S. Yun, W. J. Cho, and K.-J. Yoon. SpherePHD: Applying CNNs on a spherical polyhedron representation of 360 degree images. In arXiv preprint arXiv:1811.08196, 2018.
[25] Y. Li, S. Liu, J. Yang, and M.-H. Yang. Generative face completion. In CVPR, 2017.
[26] X. Mao, Q. Li, H. Xie, R. Y. Lau, Z. Wang, and S. P. Smolley. Least squares generative adversarial networks. In ICCV, 2017.
[27] N. Perraudin, M. Defferrard, T. Kacprzak, and R. Sgier. DeepSphere: Efficient spherical convolutional neural network with HEALPix sampling for cosmological applications. In Astronomy and Computing 27, pages 130–146, 2019.
[28] D. J. Rezende, S. Mohamed, and D. Wierstra. Stochastic backpropagation and approximate inference in deep generative models. In ICML, 2014.
[29] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR, 2015.
[30] K. Sohn, H. Lee, and X. Yan. Learning structured output representation using deep conditional generative models. In NeurIPS, 2015.
[31] Y.-C. Su and K. Grauman. Learning spherical convolution for fast features from 360 imagery. In NeurIPS, 2017.
[32] J. S. Sumantri and I. K. Park. 360 panorama synthesis from a sparse set of images with unknown FOV. In arXiv preprint arXiv:1904.03326, 2019.
[33] K. Tateno, N. Navab, and F. Tombari. Distortion-aware convolutional filters for dense prediction in panoramic images. In ECCV, 2018.
[34] T.-C. Wang, M.-Y. Liu, J.-Y. Zhu, A. Tao, J. Kautz, and B. Catanzaro. High-resolution image synthesis and semantic manipulation with conditional GANs. In CVPR, 2018.
[35] J. Xiao, K. A. Ehinger, A. Oliva, and A. Torralba. Recognizing scene viewpoint using panoramic place representation. In CVPR, 2012.
[36] H. Zhang, I. Goodfellow, D. Metaxas, and A. Odena. Self-attention generative adversarial networks. In arXiv preprint arXiv:1805.08318, 2018.
[37] Y. Zhang, J. Xiao, J. Hays, and P. Tan. FrameBreak: Dramatic image extrapolation by guided shift-maps. In CVPR, 2013.
[38] C. Zheng, T.-J. Cham, and J. Cai. Pluralistic image completion. In CVPR, 2019.
In this section, we describe the details of the probabilistic framework of our method, which generates a spherical image from a partial image (e.g., an NFOV image) by using a scene-symmetry parameter s. We introduce s into the latent variables. Fig. 12 shows the assumed graphical model, which indicates the causal relationships between the variables. Under this assumption, the joint distribution is expressed as follows:
Therefore, the conditional probability of the spherical image is described as follows:
Figure 12. The graphical model indicates the causal relationship of a spherical image, a partial image, a latent variable, and a symmetry parameter.
Next, we introduce the variational lower bound of the conditional log-likelihood as follows:
where KL(q||p) denotes the KL divergence between q and p, and E_q[·] denotes the expected value under the probability distribution q. In the derivation, we applied Jensen's inequality in the second row and Eq. (20) in the third row.
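For orientation, the generic conditional-VAE lower bound has the shape sketched below. It is an illustration under assumed notation, not a reproduction of this appendix's numbered equations. The symbols $x$ (spherical image), $y$ (partial image), and $z$ (latent variable) are chosen here for illustration, $s$ is the symmetry parameter, and the conditioning of the prior $p(z \mid y, s)$ is an assumption about the graphical model.

```latex
\begin{align}
\log p(x \mid y, s)
  &= \log \int p(x \mid z, y, s)\, p(z \mid y, s)\, dz \\
  &= \log \mathbb{E}_{q(z \mid x, y, s)}\!\left[
       \frac{p(x \mid z, y, s)\, p(z \mid y, s)}{q(z \mid x, y, s)} \right] \\
  &\ge \mathbb{E}_{q(z \mid x, y, s)}\!\left[
       \log \frac{p(x \mid z, y, s)\, p(z \mid y, s)}{q(z \mid x, y, s)} \right] \\
  &= \mathbb{E}_{q(z \mid x, y, s)}\!\left[ \log p(x \mid z, y, s) \right]
     - \mathrm{KL}\!\left( q(z \mid x, y, s) \,\|\, p(z \mid y, s) \right)
\end{align}
```

In this sketch the inequality is Jensen's inequality applied to the concave logarithm, and the last line separates a reconstruction term from a KL regularizer.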
Because it is difficult to maximize the variational lower bound directly, we employed the following two types of decomposition approximations. We have
Our method can generate spherical images using the approximation in Eq. (4). The right-hand side of Eq. (21) can be transformed as follows:
On the other hand, using the approximation in Eq. (23), the right-hand side of Eq. (21) can be transformed as follows:
KL(q||p) takes the minimum value of 0 when q = p. Furthermore, Eq. (25) can be transformed as follows:
and
for
and
respectively. Thereafter, we get the reconstruction likelihood and the generation likelihood in the body of this paper.
B.1. Additional Examples
We show additional examples of generated spherical images in Fig. 14, Fig. 15, and Fig. 16. The viewpoints of the input images are randomized uniformly on the sphere, and the FOV is randomized uniformly from 30° to 120°. In our method, the symmetry parameter s was set to five variations, namely, (h, h, h, l, l), (l, h, l, l, l), (l, l, l, h, l), (l, l, l, l, h), and (l, l, l, l, l), where h = 1.0 and l = 0.3, for the multiple-of-90° and 180° rotational symmetries, the plane symmetries with respect to the 0° and 90° axes, and asymmetry, respectively. As shown in these figures, our method can generate spherical images of various symmetry types from a single NFOV image, even if the ground truth is asymmetric. Although it is difficult to reconstruct the ground truth, especially from an image with a viewpoint near a pole or with a narrow FOV, plausible spherical images are generated in most cases. Our method is useful for generating a plausible spherical image that has a desired symmetry type, including asymmetry.
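The experimental settings above can be written down compactly. The following is a minimal sketch, not the authors' code: the dictionary keys and the function name `sample_input_view` are labels invented here, and treating "uniformly on the sphere" as area-uniform sampling of longitude and latitude is an assumption.

```python
import math
import random

H, L = 1.0, 0.3  # high / low symmetry intensities stated in the text

# The five symmetry-parameter settings listed in the text, in the same order.
# The keys are descriptive labels chosen for this sketch only.
SYMMETRY_SETTINGS = {
    "rot_multiple_of_90": (H, H, H, L, L),
    "rot_180": (L, H, L, L, L),
    "plane_0": (L, L, L, H, L),
    "plane_90": (L, L, L, L, H),
    "asymmetric": (L, L, L, L, L),
}


def sample_input_view(rng):
    """Draw a viewing direction and an FOV (all in degrees) for one input image."""
    lon = rng.uniform(-180.0, 180.0)
    # Taking arcsin of a uniform variate makes the latitude uniform in area
    # (assumed interpretation of "uniformly on the sphere").
    lat = math.degrees(math.asin(rng.uniform(-1.0, 1.0)))
    fov = rng.uniform(30.0, 120.0)  # FOV range stated in the text
    return lon, lat, fov


if __name__ == "__main__":
    rng = random.Random(0)
    print(sample_input_view(rng))
    for name, s in SYMMETRY_SETTINGS.items():
        print(name, s)
```

Sampling the latitude through arcsin avoids over-representing viewpoints near the poles, which a uniform draw over the latitude angle would do.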
B.2. Transformation from Symmetry to Asymmetry
Fig. 17 shows samples that are controlled continuously from symmetric to asymmetric. We set the s_i to 0.00, 0.25, 0.50, 0.75, and 1.00 and
, where i is the index of the targeted symmetry type. The results indicate that our method can change the symmetry continuously. When
for 90° rotational symmetry, the quality of the generated image is low. This result corresponds to the FID score in Fig. 11 in the body. When
for 0° plane symmetry or 90° plane symmetry, other types of symmetry have an influence, as
. These results correspond to a higher SIM than
in Fig. 10 in the body, because other types of symmetry have a slight effect on SEM.
C.1. Circular Padding
Figure 13. Effect of the circular padding.
Table 4. Comparison with the symmetry control function.
Figure 13 indicates the difference in generated images depending on the presence or absence of the circular padding. The generated image is rotated 180° around the gravity axis so that the joined portion can be seen, and the left and right ends are displayed in the center. From the figure, it can be seen that the discontinuity at the left and right ends is eliminated by the circular padding.
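The two operations mentioned in this subsection can be illustrated on a plain array. This is a hedged sketch rather than the network's actual layers: it assumes an equirectangular image stored as a NumPy array of shape (height, width, channels), and the function names are hypothetical.

```python
import numpy as np


def circular_pad_width(image, pad):
    """Pad an (H, W, C) equirectangular image along the width by wrapping around.

    The columns added on the left are copies of the rightmost columns and
    vice versa, so a convolution sees the left and right ends as neighbors.
    """
    return np.pad(image, ((0, 0), (pad, pad), (0, 0)), mode="wrap")


def rotate_180_about_gravity_axis(image):
    """Shift the image by half its width, moving the left/right seam to the center."""
    return np.roll(image, image.shape[1] // 2, axis=1)


if __name__ == "__main__":
    img = np.arange(2 * 4 * 1).reshape(2, 4, 1)
    print(circular_pad_width(img, 1)[..., 0])
    print(rotate_180_about_gravity_axis(img)[..., 0])
```

Only the horizontal direction is wrapped, because the left and right edges of an equirectangular image are adjacent on the sphere while the top and bottom edges are not.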
C.2. Symmetry Control
We compare the generation quality with and without the symmetry control function. Note that the network without the symmetry control function is trained from scratch, which is different from setting s = 0 in the network trained with the symmetry control function. In addition, we compare the networks trained with and without the weight (in Eq. (16)) corresponding to the partial image region in the symmetry control function. Table 4 shows the FID for each setting. The results show that the symmetry control function is effective for improving plausibility, because it can generate symmetric images in the test data and allows convolutions to use distant positions of the spherical image even if the symmetry intensity is low. In addition, the weight corresponding to the partial image region is useful because it emphasizes information from highly reliable areas.
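For readers unfamiliar with the metric reported in Table 4, the standard Fréchet distance between two Gaussians, on which FID is based, can be sketched as follows. This is an illustrative implementation under assumptions: the feature means and covariances are taken as given (in practice FID computes them from Inception features, which is omitted here), and the function name is hypothetical.

```python
import numpy as np


def frechet_distance(mu1, sigma1, mu2, sigma2):
    """Frechet distance between N(mu1, sigma1) and N(mu2, sigma2).

    d^2 = |mu1 - mu2|^2 + Tr(sigma1 + sigma2 - 2 (sigma1 sigma2)^{1/2})
    """
    diff = mu1 - mu2
    # For positive semidefinite covariances, the trace of the matrix square
    # root of sigma1 @ sigma2 equals the sum of the square roots of its
    # eigenvalues, which are real and non-negative up to round-off.
    eigvals = np.linalg.eigvals(sigma1 @ sigma2)
    tr_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    return float(diff @ diff + np.trace(sigma1) + np.trace(sigma2) - 2.0 * tr_sqrt)


if __name__ == "__main__":
    eye = np.eye(2)
    print(frechet_distance(np.zeros(2), eye, np.array([1.0, 0.0]), eye))
```

A lower value means the feature statistics of the generated images are closer to those of the real images.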
Figure 14. Additional examples of spherical images generated using our method and previous works for natural scenes.
Figure 15. Additional examples of spherical images generated using our method and previous works for scenes including buildings.
Figure 16. Additional examples of spherical images generated using our method and previous works for indoor scenes.
Figure 17. The generated spherical images that are controlled continuously from symmetric to asymmetric.