Deep Coarse-to-fine Dense Light Field Reconstruction with Flexible Sampling and Geometry-aware Fusion

2019·Arxiv

Abstract

A densely-sampled light field (LF) is highly desirable in various applications, such as 3-D reconstruction, post-capture refocusing and virtual reality. However, it is costly to acquire such data. Although many computational methods have been proposed to reconstruct a densely-sampled LF from a sparsely-sampled one, they still suffer from either low reconstruction quality, low computational efficiency, or the restriction on the regularity of the sampling pattern. To this end, we propose a novel learning-based method, which accepts sparsely-sampled LFs with irregular structures, and produces densely-sampled LFs with arbitrary angular resolution accurately and efficiently. We also propose a simple yet effective method for optimizing the sampling pattern. Our proposed method, an end-to-end trainable network, reconstructs a densely-sampled LF in a coarse-to-fine manner. Specifically, the coarse sub-aperture image (SAI) synthesis module first explores the scene geometry from an unstructured sparsely-sampled LF and leverages it to independently synthesize novel SAIs, in which a confidence-based blending strategy is proposed to fuse the information from different input SAIs, giving an intermediate densely-sampled LF. Then, the efficient LF refinement module learns the angular relationship within the intermediate result to recover the LF parallax structure. Comprehensive experimental evaluations demonstrate the superiority of our method on both real-world and synthetic LF images when compared with state-of-the-art methods. In addition, we illustrate the benefits and advantages of the proposed approach when applied in various LF-based applications, including image-based rendering and depth estimation enhancement. The code is available at https://github.com/jingjin25/LFASR-FS-GAF.

1 INTRODUCTION

The light field (LF) is a high-dimensional function describing light rays through every point traveling in every direction in free space [1], [2]. This function was initially introduced for LF rendering, which is an attractive method for generating novel views from a given set of pre-acquired views. Unlike traditional image-based rendering (IBR) methods, LF rendering treats the captured images as samples of the LF function, and novel views can be generated in real time by re-sampling a slice from the function, during which no geometry information is required. To avoid ghosting effects, the LF is required to be densely sampled [3]. Densely-sampled LFs, which contain sufficient information, also facilitate a wide range of applications, such as accurate depth inference [4], [5], 3-D scene reconstruction [6] and post-capture refocusing [7]. In addition, with the rapid development of virtual reality technology, a densely-sampled LF becomes vital as it provides smooth angular parallax shift as well as natural focus details, which are important for a satisfying immersive viewing experience [8], [9], [10].

A densely-sampled LF is highly desirable but poses great challenges for acquisition. For example, LF images with high angular resolution can be captured using a camera array [11] for simultaneous sampling from different viewpoints or a computer-controlled gantry [12] for time-sequential sampling at different positions. However, the former is expensive and bulky, and the latter is limited to static scenes. The commercialization of hand-held LF cameras such as Lytro [13] and Raytrix [14] makes it convenient to acquire LF images. These cameras are cheaper and portable because they encode 4-D LF data onto a single 2-D sensor. However, due to limited sensor resolution, a trade-off between spatial and angular resolution exists.

Instead of relying on the development of hardware, many computational methods have been proposed for reconstructing a densely-sampled LF from a sparse one, which can be realized with low-cost commercial devices. Previous works [15], [16], [17], [18], [19], [20] either estimate disparity maps as auxiliary information, or use specific priors such as sparsity in a transform domain for dense reconstruction. With the recent development of deep learning solutions for visual modeling, some learning-based methods [21], [22], [23], [24] have been proposed. However, most of the existing methods require the input sub-aperture images (SAIs) to be sampled with a specific or regular pattern, which raises difficulties for practical acquisition. Moreover, since the scene geometry is modeled only implicitly and insufficiently in these methods, the aliasing problem becomes serious in the reconstructed images when the input LF is extremely undersampled, i.e., the samples have large disparities.

As a preliminary work [25], we proposed a learning-based model for densely-sampled LF reconstruction. The reconstruction of all novel SAIs is performed in one forward pass, during which the intrinsic LF structural information among them is fully explored. See more details in Section 2.2. Although this method can produce impressive and state-of-the-art results on extensive real-world images captured by the Lytro Illum camera, the performance degradation caused by sparse sampling and the problem of non-flexibility still exist. In this paper, built upon [25], we provide several distinct improvements, enabling flexible and accurate reconstruction of a densely-sampled LF from sparse sampling. We inherit the coarse-to-fine framework in [25]. That is, the proposed model consists of two modules, namely the coarse SAI synthesis and the efficient LF refinement. Specifically, the coarse SAI synthesis module independently synthesizes novel SAIs using geometry-based warping, where we take sampling with large disparities and arbitrary patterns into consideration. We also propose a novel confidence-based strategy for handling the occluded regions when blending the warped images from different viewpoints. We further refine the coarse results by exploiting all the intermediate SAIs with efficient pseudo 4-D filters. Such a refinement module is capable of improving the reconstruction quality by utilizing the intrinsic LF parallax structure.

In summary, the main contributions of this paper are as follows:

• we propose an end-to-end learning-based method for the reconstruction of densely-sampled LFs from sparsely-sampled LFs. Our method maintains high reconstruction quality when the sampling disparity increases, and improves the generality by enabling flexible input positions as well as flexible output angular resolution. We also propose effective strategies for handling occlusions and preserving the LF parallax structure;

• we investigate how the sampling pattern affects the reconstruction quality, and propose a simple yet effective method for optimizing the sampling pattern;

• we design various and extensive experiments to evaluate and analyze our method as well as those under comparison comprehensively; and

• we demonstrate and discuss the benefits of the proposed approach to LF-based downstream applications.

The rest of this paper is organized as follows. Sec. 2 comprehensively reviews existing methods for view synthesis and densely-sampled LF reconstruction. Sec. 3 presents the proposed approach and investigates the optimization for sampling patterns. In Sec. 4, extensive experiments are carried out to evaluate the performance of the proposed approach. The benefits of the proposed approach to practical LF-based applications are validated and discussed in Sec. 5. Finally, Sec. 6 concludes this paper.

2 RELATED WORK

2.1 View Synthesis

View synthesis, taking one or more views as inputs to render novel views, is a long-standing problem in the field of computer graphics and computer vision. Most algorithms leverage the scene geometry information for view synthesis, that is, they extract/learn the global/local geometry from the input viewpoints and use the resulting geometry information to warp the input views, followed by blending for novel view rendering [26], [27]. However, the forward warping operation typically leads to a hole-filling problem in occlusion areas. Flynn et al. [28] proposed to project input views to a set of depth planes and learn the weights to average the color of each plane. This method needs to learn specific geometry for different target viewpoints. To overcome this shortcoming, some methods based on 3-D scene representation were proposed. Penner et al. [29] presented a soft 3-D representation by preserving depth uncertainty. Tulsiani et al. [30] modeled the 3-D structure of the scene by learning to predict a layer-based representation, which represents multiple ordered depths per pixel along with color values. Zhou et al. [31] proposed to use multi-plane images where each plane encodes color and transparency maps. Through these methods, novel views at varying positions can be rendered by simply forward projecting their corresponding representations. Besides, many methods aim at reconstructing 3-D scenes and synthesizing novel views from a single image (e.g., [32], [33], [34], [35]). However, these methods are still limited to simple and non-photorealistic synthetic objects.

2.2 LF Reconstruction

LF rendering needs densely-sampled LFs as inputs. In what follows, we only focus on the methods that reconstruct a densely-sampled LF from a sparsely-sampled one. Available solutions can be roughly classified into two categories: non-learning based methods and learning based methods.

Non-learning based methods. Many traditional solutions that were originally adopted for natural image processing, such as the Gaussian model and sparse representation, have been explored for LF processing tasks. Among them, Mitra et al. [16] modeled LF patches using a Gaussian mixture model to address many LF processing tasks. Although it can achieve promising results to a certain extent, it is not robust against noise. Shi et al. [18] explored sparsity in the continuous Fourier domain to reconstruct densely-sampled LFs from a small set of samples. Vagharshakyan et al. [20] proposed an approach using the sparse representation of epipolar-plane images (EPIs) in the shearlet transform domain. These methods require the sparsely-sampled LF to be sampled on a regular grid. Moreover, some methods explore compressive LF photography. Marwah et al. [17] proposed a compressive LF camera architecture which allows LF reconstruction based on overcomplete dictionaries. To reduce the computational cost of dictionary learning, Kamal et al. [36] exploited a joint tensor low-rank and sparse prior for compressive reconstruction. These methods were specifically designed for coded LF acquisition.

Fig. 1: The flowchart of the proposed method for reconstructing a densely-sampled LF from a sparsely- and arbitrarily-sampled LF with K SAIs. Our proposed model consists of two phases, i.e., the coarse SAI synthesis and the efficient LF refinement.

Many works leverage explicit depth information for LF reconstruction. Zhang et al. [19] proposed a depth-assisted phase-based synthesis strategy for a micro-baseline stereo pair. Patch-based synthesis methods were presented by Zhang et al. [37], in which the center SAI is decomposed into different depth layers and LF editing is performed on all layers. However, this method has limited performance for view synthesis, especially for complex scenes. Some works were developed based on the idea of warping given SAIs to novel SAIs guided by an estimated disparity map. Wanner and Goldluecke [4] formulated the SAI synthesis problem as an energy minimization problem with a total variation prior, where the disparity map is obtained through global optimization with a structure tensor computed on the 2-D EPI slices. This approach treats disparity estimation as a step separate from view synthesis, which makes the reconstruction quality heavily dependent on the accuracy of the estimated disparity maps. Although subsequent research [5], [15], [38] has shown significantly better disparity estimation, ghosting and tearing effects are still present.

Learning-based methods. With the great success of deep convolutional neural networks in the field of image processing [39], [40], [41], [42], many learning-based methods have been proposed for densely-sampled LF reconstruction. Yoon et al. [21] jointly super-resolved the LF image in both spatial and angular domain using a network that closely resembles the model proposed in [43]. Their approach is limited to scale 2 angular super-resolution and cannot flexibly adapt to sparsely-sampled LF inputs. Following the idea of single image super-resolution, Wu et al. [23], [44] proposed an LF reconstruction method which focuses on recovering the high frequency details of the bicubic up-sampled EPIs. In these methods, a blur-deblur scheme was proposed to address the information asymmetry problem caused by sparse angular sampling. Based on the observation that an EPI shows clear structure when sheared with the disparity value, Wu et al. [24] proposed to fuse a set of sheared EPIs for LF reconstruction. Wang et al. [45] also proposed a method based on EPIs, which applies 3-D convolutional layers to recover the details on horizontal and vertical EPIs sequentially. However, since each EPI is a 2-D slice of the 4-D LF, the accessible spatial and angular information of these EPI-based models is severely restricted. Moreover, for these models, novel SAIs must be synthesized horizontally or vertically in 2-D angular domain, resulting in accumulated errors. Yeung et al. [25] proposed an end-to-end network for densely-sampled LF reconstruction. By exploring the relationships between SAIs with pseudo 4-D filters, this method achieves state-of-the-art performance over a large number of real-world scenes captured by the Lytro camera.

In addition, depth information is also utilized in some learning-based methods for LF reconstruction. Srinivasan et al. [46] proposed to synthesize a 4-D LF image from a 2-D RGB image based on estimated 4-D ray depth. However, this method requires a large training dataset and only works on simple scenes since the information contained in a single 2-D image is extremely limited. Kalantari et al. [22] proposed to synthesize novel SAIs with two sequential networks that perform depth estimation and color prediction successively. Although this method achieves good performance on LF images captured by the Lytro camera, the depth estimation and color prediction modules are implemented in a straightforward manner, which leaves room for improvement. Jin et al. [47] also proposed to make use of the geometry information to handle LF images with large disparities.

3 THE PROPOSED APPROACH

3.1 4-D LF and Problem Formulation

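A 4-D LF can be represented with the two-plane parameterization, which uniquely describes the propagation direction of a light ray via two points on two parallel planes, i.e., the angular plane (u, v) and the spatial plane (x, y). Let $\mathcal{I} \in \mathbb{R}^{H \times W \times M \times N}$ denote a densely-sampled LF containing $M \times N$ SAIs of spatial dimension $H \times W$, which are sampled on the angular plane with a regular 2-D grid of size $M \times N$. Let $\mathcal{U}$ be the set of 2-D angular coordinates of the SAIs in $\mathcal{I}$, i.e., $\mathcal{U} = \{\mathbf{u} \mid \mathbf{u} = (u, v),\ 1 \le u \le M,\ 1 \le v \le N\}$. The SAI at $\mathbf{u}$ is denoted as $I_{\mathbf{u}}$. Let $\mathcal{S}$ denote a sparsely-sampled LF with K SAIs, $\mathcal{P}$ be the set of the 2-D angular coordinates of the SAIs in $\mathcal{S}$, i.e., $\mathcal{P} = \{\mathbf{p}_k \mid k = 1, \dots, K\}$, and $I_{\mathbf{p}_k}$ be an SAI in $\mathcal{S}$ located at $\mathbf{p}_k$. Moreover, the SAIs of a sparsely-sampled LF are assumed to be arbitrarily sampled from a certain densely-sampled LF, i.e., $\mathcal{S} \subset \mathcal{I}$ and $\mathcal{P} \subset \mathcal{U}$. The unsampled SAIs, which belong to $\mathcal{I}$ but do not appear in $\mathcal{S}$, are denoted by $\mathcal{N} = \mathcal{I} \setminus \mathcal{S}$, with the operator $\setminus$ returning the difference between two sets.

Our goal is to learn $\hat{\mathcal{N}}$ as close to $\mathcal{N}$ as possible based on $\mathcal{S}$ such that a densely-sampled LF denoted by $\hat{\mathcal{I}}$ can be reconstructed, together with $\mathcal{S}$. This problem can be implicitly formulated as:

$$\hat{\mathcal{I}} = \hat{\mathcal{N}} \cup \mathcal{S} = f(\mathcal{S}) \cup \mathcal{S}, \tag{1}$$

where f denotes the mapping function to be learnt, and $\cup$ is the operator to combine two sets.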

3.2 Overview of the Proposed Method

SAIs in $\mathcal{I}$ are correlated with each other, which reveals the LF parallax structure. Specifically, under the Lambertian assumption and in the absence of occlusions, the relationship between two SAIs $I_{\mathbf{u}}$ and $I_{\mathbf{u}'}$ of $\mathcal{I}$ at angular positions $\mathbf{u}$ and $\mathbf{u}'$ can be expressed as

$$I_{\mathbf{u}}(\mathbf{x}) = I_{\mathbf{u}'}\big(\mathbf{x} + d(\mathbf{u}' - \mathbf{u})\big), \tag{2}$$

where $\mathbf{x} = (x, y)$ is the spatial coordinate, and d is the disparity at the pixel $I_{\mathbf{u}}(\mathbf{x})$. Being aware of this unique characteristic as well as the great success of deep learning, we propose a learning-based approach to explore the LF parallax structure for densely-sampled LF reconstruction, i.e., constructing a deep network to learn f, as shown in Fig. 1. Our approach consists of two modules, namely the coarse SAI synthesis network $f_S$ and the LF refinement network $f_R$, which predict the unsampled SAIs in a coarse-to-fine manner. To be specific, by explicitly learning the scene geometry from the input SAIs of the sparsely-sampled LF $\mathcal{S}$, the coarse SAI synthesis network individually generates novel SAIs, giving an intermediate densely-sampled LF denoted as $\tilde{\mathcal{I}}$:

$$\tilde{\mathcal{I}} = f_S(\mathcal{S}) \cup \mathcal{S}. \tag{3}$$

The independent synthesis of the novel SAIs greatly saves computational time and memory usage during the testing stage. Then, the efficient refinement network learns residuals for $\tilde{\mathcal{I}}$ by exploring the complementary information between the SAIs to recover the LF parallax structure, leading to the final output $\hat{\mathcal{I}}$:

$$\hat{\mathcal{I}} = \tilde{\mathcal{I}} + f_R(\tilde{\mathcal{I}}). \tag{4}$$

By characterizing the sparsely- and densely-sampled LFs, our approach improves the flexibility and accuracy of the reconstruction of a densely-sampled LF. Specifically, our approach has the following characteristics:

• it overcomes the aliasing problem caused by sparse sampling, making it possible to take sparsely-sampled LFs with different angular sampling rates as inputs;

• it enables SAIs with arbitrary angular sampling patterns to be used as inputs, which brings more flexibility for densely-sampled LF reconstruction. Moreover, we further investigate how to optimize the sampling patterns to improve reconstruction quality;

• beyond the aforementioned goal, our method can produce densely-sampled LFs with user-defined angular resolution, making it more flexible for densely-sampled LF reconstruction in various scenes; and

• it is able to accurately recover the valuable LF parallax structure, which is crucial for various applications based on a densely-sampled LF.

In the following, the details of the proposed approach are presented step-by-step.

3.3 Coarse SAI Synthesis

This module aims at independently synthesizing the intermediate novel SAI at each target angular position $\mathbf{u}$, denoted by $\tilde{I}_{\mathbf{u}}$, from the K input SAIs $\{I_{\mathbf{p}_k}\}_{k=1}^{K}$, which is formulated as

$$\tilde{I}_{\mathbf{u}} = f_S\big(\{I_{\mathbf{p}_k}\}_{k=1}^{K}, \mathbf{u}\big). \tag{5}$$

To handle inputs with large disparities, we utilize the geometry information explicitly for novel SAI synthesis. That is, we learn the disparity map at $\mathbf{u}$ from the input SAIs and synthesize the target SAI via backward warping. To deal with the challenge posed by irregular sampling patterns, we construct the disparity estimation network by learning correspondence from the plane-sweep volumes (PSVs) [48]. We also propose a new strategy for blending the warped images, which is able to alleviate the artifacts around occlusion boundaries caused by warping. To this end, this module consists of three steps: PSV construction, disparity estimation, and warping and blending.

PSV construction. A naive way of disparity estimation is to directly extract features from the input SAIs using sequential convolutional layers. However, for randomly-sampled SAI inputs, i.e., when the angular position set P always varies, it is difficult to properly provide the network with indicators w.r.t. the sampling and target positions, making the prediction unreliable (see results in Fig. 8). Instead, we use PSVs for disparity estimation. A PSV with respect to a target position $\mathbf{u}$ is constructed by backward warping, i.e., reprojecting each input SAI $I_{\mathbf{p}_k}$ to $\mathbf{u}$ with respect to a set of disparity planes {d}, resulting in a set of warped images $V = \{W_{\mathbf{p}_k}^{d}\}$:

$$W_{\mathbf{p}_k}^{d}(\mathbf{x}) = I_{\mathbf{p}_k}\big(\mathbf{x} + d(\mathbf{p}_k - \mathbf{u})\big). \tag{6}$$

In this way, the arbitrary sampling positions of input SAIs as well as the target position for synthesis are encoded into the PSVs during its construction.

The disparity inference from a PSV is based on the principle of photo-consistency. However, in occlusion areas or on non-Lambertian surfaces, the relationships between the matching patches of different SAIs are complicated. We propose to feed the whole PSV into the disparity estimation network, which differs from the approach in [22], where simple hand-crafted features such as the mean and standard deviation of the PSV across disparity planes are used. With the convolutional network's powerful ability to learn representations, we are able to accurately estimate the disparity maps in challenging regions from the rich information provided by the PSVs.
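The PSV construction described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the authors' implementation: images are single-channel lists of rows, angular positions are `(u, v)` tuples with `u` paired with the spatial `x` axis and `v` with `y`, the warp follows the convention W(x) = I_p(x + d (p - u)), and out-of-range samples are clamped to the border. The helper names `sample_bilinear`, `backward_warp` and `build_psv` are hypothetical.

```python
import math

def sample_bilinear(img, y, x):
    """Bilinearly sample a 2-D image (list of rows) at a real-valued (y, x).

    Out-of-range coordinates are clamped to the border (an assumption)."""
    h, w = len(img), len(img[0])
    y = min(max(y, 0.0), h - 1.0)
    x = min(max(x, 0.0), w - 1.0)
    y0, x0 = int(math.floor(y)), int(math.floor(x))
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    wy, wx = y - y0, x - x0
    top = (1.0 - wx) * img[y0][x0] + wx * img[y0][x1]
    bottom = (1.0 - wx) * img[y1][x0] + wx * img[y1][x1]
    return (1.0 - wy) * top + wy * bottom

def backward_warp(src, src_pos, tgt_pos, disparity):
    """Warp the SAI at src_pos to tgt_pos for one disparity plane:
    W(x) = I_p(x + d * (p - u))."""
    h, w = len(src), len(src[0])
    du = src_pos[0] - tgt_pos[0]  # angular offset paired with spatial x
    dv = src_pos[1] - tgt_pos[1]  # angular offset paired with spatial y
    return [[sample_bilinear(src, y + disparity * dv, x + disparity * du)
             for x in range(w)] for y in range(h)]

def build_psv(sais, positions, tgt_pos, disparities):
    """Return psv[i][k]: the k-th input SAI warped to tgt_pos on plane i."""
    return [[backward_warp(img, pos, tgt_pos, d)
             for img, pos in zip(sais, positions)]
            for d in disparities]
```

On the correct disparity plane the warped inputs agree with each other (photo-consistency), while on a wrong plane they differ, which is the cue a disparity estimator can learn from. Note that the sampling and target positions enter only through the offsets `du` and `dv`, so irregular patterns need no special handling.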

Disparity estimation. The disparity estimation network is designed to predict a disparity map at the target position based on V. The network consists of a cost calculator to learn the matching cost for each disparity plane, and an estimator to predict the disparity value.

For the cost calculator, several convolutional layers are applied to each disparity plane using shared weights. For a typical disparity plane d, features measuring the similarity and diversity between images warped from different input SAIs are extracted from the warped images on that plane. We use kernel size to obtain a relatively large receptive field and set the number of channels in the final layer as 4 in the cost calculator. For the disparity estimator, all features from each disparity plane are concatenated together. Then sequential convolutional layers are used to predict the disparity value. Instead of selecting the disparity value with a minimum cost from the predefined disparity set, we let the network learn the disparity value, so that the number of predefined disparity planes, as well as the width of the network (i.e., the channel number), can be reduced. The number of channels in the hidden layers of the estimator is set to 200 at the first layer, and then gradually decreased to 64, 32, 16 and finally 1 to output a disparity map.

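Warping and blending. The novel SAI at the target position $\mathbf{u}$ can be synthesized by warping the input SAIs using the predicted disparity map $D$. Specifically, the image $W_k$ obtained by warping the k-th input SAI $I_{\mathbf{p}_k}$ to the target position can be expressed as

$$W_k(\mathbf{x}) = I_{\mathbf{p}_k}\big(\mathbf{x} + D(\mathbf{x})(\mathbf{p}_k - \mathbf{u})\big). \tag{7}$$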

Since the input SAIs contain valuable information of the scene from different viewpoints, they will contribute to the target SAI in different areas. The warped images inevitably show artifacts around occlusion boundaries, and locations of the artifacts vary among different source SAIs. Direct combination of the images warped from different viewpoints by simple average or convolutional layers trained with the loss [22] will produce blurry effects, especially when the input SAIs have large disparities. Therefore, we propose a blending strategy to fuse the images warped from different input SAIs to generate the novel SAI by using adaptive dense confidence maps. Specifically, the confidence maps are learned to indicate the pixel-wise accuracy of the images warped from different input SAIs. Then it is expected that the more accurate regions can be selected to form the synthesized SAIs. This strategy properly handles the occlusion problem after warping and preserves clear textures in the synthesized novel SAI (see details in 4.3).

The K confidence maps corresponding to the K input SAIs, along with the disparity maps, are predicted by the final layer of the disparity estimation network. This is feasible because the network has learnt the relationships between the input SAIs and implicitly modeled their relationships to the target SAI. Then the blending can be formulated as:

$$\tilde{I}_{\mathbf{u}} = \sum_{k=1}^{K} C_k \odot W_k, \tag{8}$$

where $\tilde{I}_{\mathbf{u}}$ is the synthesized novel SAI, $W_k$ is the image warped from the k-th input SAI, $C_k$ is the confidence map for the k-th input SAI, and $\odot$ is the element-wise multiplication operator.
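The confidence-based blending can be illustrated with a small pure-Python sketch. It assumes single-channel images stored as lists of rows and computes the pixel-wise sum of confidence-weighted warped images. The text does not state how the confidence maps are normalized, so the per-pixel softmax across the K views in `softmax_over_views` is an assumption made here for illustration, and both function names are hypothetical.

```python
import math

def softmax_over_views(scores):
    """Turn K raw score maps into K confidence maps that sum to one at
    every pixel (this normalization is an assumption, not from the paper)."""
    k, h, w = len(scores), len(scores[0]), len(scores[0][0])
    conf = [[[0.0] * w for _ in range(h)] for _ in range(k)]
    for y in range(h):
        for x in range(w):
            m = max(scores[i][y][x] for i in range(k))  # for numerical stability
            e = [math.exp(scores[i][y][x] - m) for i in range(k)]
            s = sum(e)
            for i in range(k):
                conf[i][y][x] = e[i] / s
    return conf

def blend(warped, confidences):
    """Pixel-wise sum over views of confidence * warped image."""
    h, w = len(warped[0]), len(warped[0][0])
    out = [[0.0] * w for _ in range(h)]
    for img, conf in zip(warped, confidences):
        for y in range(h):
            for x in range(w):
                out[y][x] += conf[y][x] * img[y][x]
    return out
```

With one-hot confidences, each output pixel is copied from a single source view, which is how occluded regions in one warped image can be filled from another view instead of being averaged into a blur.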

3.4 Efficient LF Refinement

In the coarse SAI synthesis phase, novel SAIs are independently synthesized, and the LF parallax structure among them is not well taken into account, resulting in possible photometric inconsistencies between SAIs in the intermediate LF image. Therefore, an efficient refinement network is designed to further exploit the structure of the intermediate LF, which is expected to recover the photo-consistency and further improve the reconstruction quality of the densely-sampled LF. Since the goal is to correct possible flaws that are inconsistent across SAIs while preserving high-frequency textures, residual learning is used in this module. In summary, we first exploit the LF parallax structure of the intermediate LF and then reconstruct residual maps for it, as formulated in Eq. (4).

The LF parallax structure. To exploit the LF parallax structure within the intermediate LF, 4-D convolution is a straightforward choice. However, the computational cost required by 4-D convolution is very high. Instead, pseudo filters or separable filters, which reduce model complexity by approximating a high-dimensional filter with a combination of filters of lower dimensions, have been applied to solve different computer vision problems, such as image structure extraction [49], 3-D rendering [50] and video frame interpolation [51]. This idea has recently been adopted in [52] for LF material classification and in [53] for LF spatial super-resolution, which verifies that pseudo 4-D filters can achieve comparable performance to 4-D filters.

Therefore, we adopt the pseudo 4-D filter, which approximates a single 4-D filtering step with two 2-D filters. Specifically, the intermediate feature maps are reshaped between the stack of spatial feature maps and the stack of angular ones so that the convolution is performed alternately on the spatial and angular domains. Such a design significantly reduces the computation required by a 4-D convolution, while it is still capable of effectively extracting both spatial and angular information from the LF image.
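To make the alternating spatial-angular filtering concrete, the following pure-Python sketch applies one 2-D filter over the spatial axes of every SAI and then one 2-D filter over the angular axes at every pixel. It is a simplification under stated assumptions: a single channel, fixed kernels instead of learned ones, zero padding, no nonlinearity, and an LF stored as nested lists indexed `lf[u][v][y][x]`. All function names are hypothetical. For 3-tap filters, a full 4-D kernel has 3^4 = 81 weights, whereas the two 2-D kernels together have 2 × 3^2 = 18.

```python
def conv2d_same(img, kernel):
    """Zero-padded 2-D correlation that keeps the input size (odd kernel)."""
    h, w = len(img), len(img[0])
    kh, kw = len(kernel), len(kernel[0])
    oy, ox = kh // 2, kw // 2
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            acc = 0.0
            for i in range(kh):
                for j in range(kw):
                    yy, xx = y + i - oy, x + j - ox
                    if 0 <= yy < h and 0 <= xx < w:
                        acc += kernel[i][j] * img[yy][xx]
            out[y][x] = acc
    return out

def spatial_pass(lf, kernel):
    """Filter each SAI lf[u][v] over its (y, x) axes."""
    return [[conv2d_same(sai, kernel) for sai in row] for row in lf]

def angular_pass(lf, kernel):
    """Filter, for every pixel (y, x), the 2-D slice over the (u, v) axes."""
    nu, nv, h, w = len(lf), len(lf[0]), len(lf[0][0]), len(lf[0][0][0])
    out = [[[[0.0] * w for _ in range(h)] for _ in range(nv)] for _ in range(nu)]
    for y in range(h):
        for x in range(w):
            ang = [[lf[u][v][y][x] for v in range(nv)] for u in range(nu)]
            filt = conv2d_same(ang, kernel)
            for u in range(nu):
                for v in range(nv):
                    out[u][v][y][x] = filt[u][v]
    return out

def pseudo_4d(lf, k_spatial, k_angular):
    """One pseudo 4-D step: spatial filtering followed by angular filtering."""
    return angular_pass(spatial_pass(lf, k_spatial), k_angular)
```

Because neither pass depends on the angular grid size, the same kernels apply to LFs with any number of SAIs, which is consistent with the flexible output angular resolution discussed in the text.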

Residual reconstruction. After exploring the relationship along the angular dimensions, the residual maps are reconstructed separately for each SAI in the intermediate LF image. Several layers of 2-D spatial convolution are applied to learn a residual map from the extracted spatial-angular deep features for each SAI. Here each SAI is processed independently for two reasons. First, we believe the previous spatial-angular convolutions are capable of exploiting the LF parallax structure. Second and more importantly, in this way, we can build a fully-convolutional network on both spatial and angular dimensions, such that flexible output angular resolution is achieved. Finally, the reconstructed residual map is added to the previously synthesized intermediate LF image to give the final reconstructed LF.

3.5 The Loss Function

Fig. 2: Illustration of the relationship between the minimum distance of the sampling patterns and the reconstruction quality tested on the HCI dataset. The blue dots denote the patterns generated randomly. The green dots and their annotations correspond to the patterns in Fig. 3. The results of the optimized patterns by our method are highlighted as red stars.

Fig. 3: Illustration of different sampling patterns. From top to bottom are sampling patterns with 4, 3 and 2 input SAIs, respectively. (f), (l) and (r) depict the optimized sampling patterns by our algorithm for the tasks and , respectively.

All modules in our approach are differentiable, leading to an end-to-end trainable network. The loss function for training the network consists of three parts. The first part provides supervision for the intermediate LF by calculating the absolute error between the intermediate SAIs $\tilde{I}_{\mathbf{u}}$ and the ground-truth ones $I_{\mathbf{u}}$, i.e.,

$$\ell_c = \sum_{\mathbf{u}} \big\| \tilde{I}_{\mathbf{u}} - I_{\mathbf{u}} \big\|_1. \tag{9}$$

To promote smoothness of the predicted ray disparity, we penalize the $\ell_1$ norm of the second-order gradients [54], denoted as $\ell_s$:

$$\ell_s = \sum_{\mathbf{u}} \Big( \big\| \nabla_x^2 D_{\mathbf{u}} \big\|_1 + \big\| \nabla_y^2 D_{\mathbf{u}} \big\|_1 \Big), \tag{10}$$

where $\nabla_x^2$ and $\nabla_y^2$ are the second-order gradients for the spatial domain of the disparity map $D_{\mathbf{u}}$. Finally, the output reconstructed SAIs $\hat{I}_{\mathbf{u}}$ are optimized by minimizing the absolute error as:

$$\ell_f = \sum_{\mathbf{u}} \big\| \hat{I}_{\mathbf{u}} - I_{\mathbf{u}} \big\|_1. \tag{11}$$

Thus, our final objective is written as

$$\ell = \lambda_1 \ell_c + \lambda_2 \ell_s + \lambda_3 \ell_f, \tag{12}$$

where $\lambda_1$, $\lambda_2$ and $\lambda_3$ are the weights for the reconstruction accuracy and the disparity smoothness, which are empirically set to 1, 0.001 and 1, respectively.
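The three-part training objective of this section (absolute error on the intermediate LF, second-order smoothness of the disparity, absolute error on the final LF) can be sketched for a single view as follows. This is an illustration under assumptions rather than the training code: images and disparity maps are lists of rows, errors are averaged rather than summed, the second-order gradients are central finite differences on interior pixels, and the weights 1, 0.001 and 1 are assigned to the terms in the order in which the text introduces them. The function names are hypothetical.

```python
def l1_loss(pred, target):
    """Mean absolute error between two images of the same size."""
    total, n = 0.0, 0
    for row_p, row_t in zip(pred, target):
        for p, t in zip(row_p, row_t):
            total += abs(p - t)
            n += 1
    return total / n

def smoothness_loss(disp):
    """Mean absolute second-order finite difference along x and y."""
    h, w = len(disp), len(disp[0])
    total, n = 0.0, 0
    for y in range(h):
        for x in range(w):
            if 0 < x < w - 1:  # second difference along x
                total += abs(disp[y][x - 1] - 2.0 * disp[y][x] + disp[y][x + 1])
                n += 1
            if 0 < y < h - 1:  # second difference along y
                total += abs(disp[y - 1][x] - 2.0 * disp[y][x] + disp[y + 1][x])
                n += 1
    return total / n if n else 0.0

def total_loss(coarse, refined, target, disp, weights=(1.0, 0.001, 1.0)):
    """Weighted sum of coarse error, disparity smoothness and final error."""
    w_c, w_s, w_f = weights
    return (w_c * l1_loss(coarse, target)
            + w_s * smoothness_loss(disp)
            + w_f * l1_loss(refined, target))
```

A second-order penalty leaves planar (linearly varying) disparity unpenalized, so slanted surfaces are not flattened the way a first-order penalty would flatten them.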

3.6 Optimized Sampling Pattern

Optimizing the sampling pattern for densely-sampled LF reconstruction is a valuable topic, which could further exploit the full potential of the reconstruction algorithm, and improve the reconstruction quality using as few hardware resources as possible. Additionally, optimizing the sampling pattern may be beneficial to its application in LF compression (see more details in Sec. 6). In this section, we first investigate how the sampling pattern affects the reconstruction qualitatively and experimentally, then we propose a simple yet effective method for optimizing the sampling pattern tailored to our reconstruction model.

Intuitively, the reconstruction quality is influenced by how thoroughly the scene has been recorded by the sparsely-sampled input. Since most foreground objects can be completely captured from different viewpoints, the occluded regions are the critical challenge. There are several factors that affect the amount of information that could be captured with LF over the occluded areas. One of the factors is the overall distance between the novel SAIs and the sampled SAIs. That is, SAIs nearby can provide more references for novel SAI reconstruction compared to those far away. Additionally, sampling patterns with SAIs distributed at more diverse locations along the horizontal and vertical directions are better than their counterparts with less variation, as the former sees more occluded regions. Finally, this issue should be related to the scene content. Factors such as the geometry complexity between objects can play an important role.

Fig. 4: Visual comparisons of different methods on the synthesized center SAI for the task (fixed models). Selected regions have been zoomed in for better comparison. It is recommended to view this figure by zooming in.

We experimentally investigated the effect of the sampling pattern on reconstruction quality. First, we define a metric, namely the minimum distance, which is the average of the angular Euclidean distances of all novel SAIs to their nearest input SAI in the 2-D sampling grid. We then conducted the following experiments, in which we randomly selected some sampling patterns for and dense reconstruction, respectively, then fitted the relationship between their minimum distance and their reconstruction quality with a second-degree polynomial. Fig. 2 illustrates the results, where we can see that as the minimum distance of the sampling pattern increases, the corresponding reconstruction quality generally decreases. Moreover, the sampling patterns corresponding to the green dots are illustrated in Fig. 3. It can be seen that patterns with smaller variation along the horizontal or vertical direction always stay below the fitted curve (e.g., with close values of the minimum distance, the sampling pattern 4(b) performs better than 4(c), and similar scenarios can be found between 3(l) and 3(i), and 2(q) and 2(n)), which indicates that the divergence is indeed a factor influencing the reconstruction quality.
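The minimum distance metric defined above is straightforward to compute. The sketch below assumes that angular positions are zero-based integer `(row, column)` tuples on a regular grid; the function name `minimum_distance` is hypothetical.

```python
import math

def minimum_distance(grid_size, sampled):
    """Average, over all novel (unsampled) angular positions, of the
    Euclidean distance to the nearest sampled position."""
    rows, cols = grid_size
    sampled = set(sampled)
    novel = [(u, v) for u in range(rows) for v in range(cols)
             if (u, v) not in sampled]
    return sum(min(math.dist(n, s) for s in sampled) for n in novel) / len(novel)
```

For example, on a 3 × 3 grid, sampling only the center gives (4·1 + 4·√2)/8 ≈ 1.207, while sampling the four corners gives (4·1 + √2)/5 ≈ 1.083.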

TABLE 1: Comparison of attributes for densely-sampled LF reconstruction algorithms, where flexible input means whether the method is feasible for an arbitrary sampling pattern, and flexible output means whether the method can produce densely-sampled LFs with flexible angular resolution.

TABLE 2: Quantitative comparisons (PSNR/SSIM) of the proposed approach with the state-of-the-art ones under task . The input sparsely-sampled LFs are sampled at the four corners during both training and test.

Columns: Test set | Disparity | Vagharshakyan et al. [20] | Wu et al. [23] | Wu et al. [24] | Wang et al. [45] | Kalantari et al. [22] | Yeung et al. [25] | Ours (fixed)

Based on the above observations, we propose a simple yet effective strategy for optimizing the sampling pattern, which is formulated as:

$$\min_{\mathcal{P}} \sum_{l} \sum_{k=1}^{K} a_{l,k} \left\| \mathbf{u}_l - \mathbf{p}_k \right\|_2, \tag{13}$$

where $\mathbf{u}_l$ is the angular position of the l-th novel SAI, $\mathbf{p}_k$ is that of the k-th sampled SAI, and $a_{l,k}$ is the (l, k)-th entry of the indicator matrix $A$, which indicates whether the k-th sampled SAI is the nearest one among all samples to the l-th novel SAI. We first find a solution of the optimization problem in Eq. (13) using the deterministic annealing based method [55], [56]. As the solution varies with initialization, we repeat the algorithm 5 times with random initialization and select the solution producing the minimum objective value. In addition, as the resulting optimized positions may not be located on the grid, we consider the divergence along both horizontal and vertical directions to round the solutions. In this way, we can obtain the optimized sampling patterns as depicted in Fig. 3(f), 3(l) and 3(r). As demonstrated in Fig. 2, the sampling patterns produced by our algorithm achieve the highest quantitative reconstruction quality among all compared patterns, which indicates the effectiveness of our sampling pattern selection algorithm. Furthermore, we experimentally verified the effectiveness of the proposed strategy for optimizing the sampling pattern on LFs with different scene content; see Sec. 4.3. Note that flexible and optimized sampling is not applicable to the micro-lens-based LF camera with a fixed optical sampling pattern.
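For small grids, the idea behind Eq. (13) can be checked by exhaustive search. The sketch below deliberately swaps in brute-force enumeration for the deterministic annealing used in the paper, so it is only a sanity-check tool and not the authors' optimizer. It assumes the objective is the sum of unsquared Euclidean distances from each novel position to its nearest sampled position, and it breaks ties in favor of patterns that cover more distinct rows and columns, which is one possible reading of the "divergence" criterion. Both function names are hypothetical.

```python
import itertools
import math

def objective(grid, pattern):
    """Sum over novel positions of the distance to the nearest sampled one."""
    return sum(min(math.dist(p, s) for s in pattern)
               for p in grid if p not in pattern)

def best_pattern(rows, cols, k):
    """Exhaustively search all k-subsets of a rows x cols angular grid."""
    grid = [(u, v) for u in range(rows) for v in range(cols)]

    def key(pattern):
        # Tie-break: prefer more distinct rows and columns (assumed criterion).
        spread = len({u for u, _ in pattern}) + len({v for _, v in pattern})
        return (round(objective(grid, pattern), 9), -spread)

    return min(itertools.combinations(grid, k), key=key)
```

The number of candidate patterns grows combinatorially with the grid size and K, which is one reason a continuous relaxation followed by rounding, as described in the text, is attractive for larger grids.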

4 EXPERIMENTAL RESULTS

4.1 Datasets and Implementation Details

Both synthetic LF images from the 4-D LF benchmarks [57], [58] and real-world LF images captured with a Lytro Illum camera, provided by the Stanford Lytro LF Archive [59] and Kalantari et al. [22], were used for training and testing. Specifically, 20 synthetic images and 100 real-world images were used for training. For testing, we used 9 synthetic LF images, including 4 from the HCI [57] dataset and 5 from the HCI old [58] dataset, and 3 real-world datasets containing 70 LF images captured with a Lytro Illum camera, namely 30scenes [22], Occlusions [59] and Reflective [59]. These datasets cover several important factors in evaluating LF reconstruction methods. Specifically, the synthetic datasets contain high-resolution textures, which measure the ability to maintain high-frequency details. The real-world datasets evaluate the performance of different methods under natural illumination and practical camera distortion. Moreover, the HCI dataset contains LF images with large disparities, which tests the robustness to sparser sampling. The Occlusions and Reflective datasets focus on challenging scenes in which the assumption of photo-consistency is not guaranteed.

During training, patches of spatial size were randomly cropped, and the batch size was set to 1 due to limited memory. Moreover, we adopted the ADAM [60] optimizer with and . The learning rate was initialized as and halved when the loss stopped decreasing. The spatial resolution of the model output was kept unchanged at with zero padding. We implemented the model with PyTorch. The code will be publicly available.

TABLE 3: Quantitative comparisons of the proposed approach with Kalantari et al. [22] on the reconstruction with arbitrary sampling patterns under task . Sampling patterns (a), (c) and (f) (depicted in Fig. 3) are used for comparison.

TABLE 4: Quantitative comparisons of the proposed approach with Kalantari et al. [22] on the reconstruction with arbitrary sampling patterns under task . Sampling patterns (g), (j) and (l) (depicted in Fig. 3) are used for comparison.

TABLE 5: Quantitative comparisons of the proposed approach with Kalantari et al. [22] on the reconstruction with arbitrary sampling patterns under task . Sampling patterns (m), (p) and (r) (depicted in Fig. 3) are used for comparison.

4.2 Comparison with State-of-the-Art Methods

Besides our preliminary work, Yeung et al. [25], we also compared with 5 state-of-the-art learning-based methods that were specifically designed for densely-sampled LF reconstruction, i.e., Vagharshakyan et al. [20], Wu et al. [23], Wu et al. [24], Wang et al. [45], and Kalantari et al. [22]. Table 1 compares these algorithms in terms of whether they are learning-based, whether they are geometry-based, whether they accept arbitrary input patterns, and whether they can produce reconstructions with flexible angular resolution. We conducted the following experiments for comparison:

• as 5 out of the 6 methods under comparison, i.e., Vagharshakyan et al. [20], Wu et al. [23], Wu et al. [24], Wang et al. [45], and Yeung et al. [25], are unable to handle inputs with flexible and irregular sampling patterns, we first designed the experiment , in which the same fixed sampling pattern was used during both training and testing, such that all compared methods can be evaluated. We name our method Ours (fixed) under this training setting. See subsection 1);

• as both Ours and Kalantari et al. [22] can accept flexible and irregular sampling patterns, we designed the experiments and , in which sparsely-sampled LFs, each containing K SAIs with arbitrary positions and structures, were fed into the network during training, and some of the patterns illustrated in Fig. 3 were used during testing. Here we considered three cases, i.e., K = 2, 3, 4. See subsection 2); and

• we also evaluated the running time for different methods. See subsection 3).

Fig. 5: Visual comparisons of different methods on the synthesized center SAI for the task (flexible models). Selected regions have been zoomed in for better comparison. It is recommended to view this figure by zooming in.

1) Comparison on the reconstruction with fixed input sampling patterns.

This comparison was performed over the task , which attempts to reconstruct a densely-sampled LF with SAIs from a sparsely-sampled LF with SAIs distributed regularly. Here the SAIs of the sparsely-sampled LF are located at the four corners of the densely-sampled LF to be reconstructed, as shown in Fig. 3(a). We used the average PSNR and SSIM over all synthesized novel SAIs to quantitatively measure the quality of the reconstructed densely-sampled LFs, and the corresponding results are listed in Table 2. It can be observed that:

• the performance of all methods decreases when the disparity between input SAIs increases;

• EPI-based methods, including Vagharshakyan et al. [20], Wu et al. [23], Wu et al. [24], and Wang et al. [45], are inferior to others. The possible reason is that only 2 rows or columns of pixels are available during the reconstruction of each EPI, making it difficult to recover the intermediate linear structures without modeling the 2-D spatial structure, especially when the scenes are complicated. Among them, Wu et al. [24] performs relatively better, as depth information is utilized as guidance;

• Kalantari et al. [22] achieves good results on real-world datasets, which indicates the effectiveness of geometry-based warping. However, it fails on the HCI dataset with larger disparities. The reason is that Kalantari et al. [22] uses hand-crafted features to estimate the disparity and simple convolutional layers to combine the warped images, which makes it difficult to build long-distance connections between SAIs with large disparities;

• Yeung et al. [25] achieves the best results on the real-world datasets, indicating that the pseudo 4-D filters effectively explore the spatial and angular relationships between input SAIs. However, this method also does not work well on the HCI dataset, because it entirely relies on deep regression for novel view synthesis, which indicates the importance of explicit geometric modeling for the reconstruction based on sparse sampling; and

• our approach achieves the highest PSNR/SSIM on the HCI and HCI old datasets, and performance comparable to that of Yeung et al. [25] on the 30scenes, Occlusions and Reflective datasets, showing the advantages of the proposed framework.
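The averaging behind the PSNR values reported in Table 2 can be sketched as follows. This is a minimal Python illustration, not the evaluation script used for the paper: it assumes intensities normalized to [0, 1], represents each SAI as a flat list of floats, and omits SSIM. The function names and the dictionary layout keyed by angular position are hypothetical.

```python
import math


def psnr(reference, estimate, peak=1.0):
    """PSNR in dB between two equally sized images given as flat lists of floats."""
    mse = sum((r - e) ** 2 for r, e in zip(reference, estimate)) / len(reference)
    return math.inf if mse == 0 else 10.0 * math.log10(peak ** 2 / mse)


def mean_psnr_over_novel(gt_views, rec_views, input_positions):
    """Average PSNR over the synthesized SAIs only.

    gt_views and rec_views map an angular position (u, v) to a flat image;
    positions listed in input_positions are the given inputs and are skipped.
    """
    novel = [pos for pos in gt_views if pos not in input_positions]
    return sum(psnr(gt_views[p], rec_views[p]) for p in novel) / len(novel)
```

Skipping the input positions matters because the inputs are reproduced exactly and would otherwise inflate the average.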

We also visually compared the reconstruction results of different algorithms, as shown in Fig. 4. It can be observed that Wu et al. [23], Wu et al. [24] and Wang et al. [45] fail to recover delicate structures, such as the leaves and the textures on the wall, while Kalantari et al. [22] and Yeung et al. [25] struggle with large disparities. In contrast, our approach produces accurate estimations, which are closer to the ground-truth ones.

Moreover, the most valuable information of LF images is the LF parallax structure, which implicitly represents the scene geometry. In Figs. 4 and 5, we visualized the EPIs of the reconstructed LF images to compare how well different reconstruction methods preserve the LF parallax structure. It can be seen that the EPIs of Ours (fixed) preserve clearer linear structures, which are closer to the ground-truth ones. The advantage of our method in preserving the LF parallax structure is also quantitatively and qualitatively demonstrated in Sec. 5.2, where the depth maps estimated from the LFs reconstructed by different methods are compared [61]. Finally, we provided supplementary videos with animations of the reconstructed LFs to visually evaluate the view consistency (see https://github.com/jingjin25/LFASR-FS-GAF).

2) Comparison on the reconstruction with flexible input sampling patterns.

We performed comparisons over random input positions with Kalantari et al. [22] and our approach. During training, the input SAIs were selected at random positions, and the input patterns illustrated in Fig. 3 were used for testing. We report the quantitative results of task and in Tables 3, 4 and 5, respectively. It can be observed that our method improves the PSNR by around 4 dB on synthetic datasets and around 0.4-1 dB on real-world datasets.

To visually compare the outputs of Kalantari et al. [22] with those of our method, we calculated the error maps of the reconstructed center SAI under task , as shown in Fig. 5. The results further demonstrate the advantages of our proposed approach. As shown in the results on synthetic data in Fig. 5 (see the first row), basic textures are severely blurred or distorted in the SAI reconstructed by Kalantari et al. [22] when the inputs have large disparities, while our method can reconstruct most of the high-frequency details. For real-world LF reconstruction in Fig. 5 (see the second row), Kalantari et al. [22] produces artifacts near the boundaries of the foreground objects, while fine edges and small objects are well preserved in the results of our method.

3) Comparison of the running time.

We compared the running time (in seconds) of different methods for reconstructing a densely-sampled LF, and Table 6 lists the results. All methods were tested on a desktop with an Intel i7-8700 CPU @ 3.70GHz, 32 GB RAM and an NVIDIA GeForce RTX 2080 Ti. From Table 6, it can be observed that our approach, taking only about 0.8 seconds to generate a novel SAI, is much faster than the other methods except Wang et al. [45] and Yeung et al. [25]. Although Wang et al. [45] and Yeung et al. [25] are faster, our approach is superior in terms of reconstruction quality and angular flexibility.

4.3 Ablation study

In this section, we experimentally validated the effectiveness of our view sampling optimization strategy on LFs with different scene content, as well as the effectiveness of three components of our network, including the disparity estimation module, the blending strategy and the refinement module.

Fig. 6: The images and sampling patterns used to investigate the effectiveness of the optimized sampling patterns on LFs with different scene content. 12 different scenes are manually selected. The optimized sampling pattern (l) obtained by our method is compared with 6 neighboring patterns (1)-(6).

Fig. 7: Illustration of the effectiveness of the optimized sampling patterns on LFs with different scene content. The selected LF scenes and sampling patterns are illustrated in Fig. 6. The red pentagrams mark the highest PSNR achieved with the optimized sampling pattern by our method, and red dots mark the highest PSNR achieved with other patterns.

1) The effectiveness of the optimization strategy for the sampling pattern on different scene content.

As there is no metric to quantify the complexity of scene content, we manually selected images covering different scenes and captured with different camera settings. As shown in Fig. 6, the selected images vary in geometric complexity (e.g., 30scenes 2 and 30scenes 3), object category (e.g., Occlusions 1 and Occlusions 3), camera parameters (e.g., Occlusions 3 and Occlusions 4), and data acquisition method (e.g., HCI and 30scenes). Six sampling patterns neighboring the one optimized by our strategy were used for comparison, as illustrated in the bottom row of Fig. 6. The PSNR of the LFs reconstructed from inputs with different sampling patterns on LFs with different scene content is plotted in Fig. 7. It can be seen that although the PSNR values follow different trends as the sampling pattern changes, the highest PSNR is achieved with the sampling pattern selected by our method in most cases (9 out of 12). In addition, our selected sampling pattern achieves a PSNR comparable to the highest one even when it is not optimal. Although the selected images cannot cover all scenarios, this experiment shows that the proposed optimization strategy is applicable in most of the cases we tested.

TABLE 6: Comparisons of the running time (in seconds) of different methods for reconstructing a densely-sampled LF.

Fig. 8: Visual comparisons of the intermediate by-product disparity maps estimated by directly applying convolutional layers to the input SAIs, Kalantari et al. [22] and our network. Kalantari et al. [22] (inter) denotes the modified network of Kalantari et al. [22] with an intermediate supervision for the warped images using ground-truth targets.

Fig. 9: Demonstration of the effectiveness of our blending strategy. The estimated disparity map, the zoom-in of the images warped from the input SAIs, the learned confidence maps and the blended images are presented.

2) The effectiveness of the disparity estimation module.

In our approach, the disparity maps are estimated by constructing PSVs, which are fed into the subsequent network. Alternatives include applying convolutional layers directly to the input SAIs, or extracting hand-crafted features from PSVs as the input of a network [22]. To validate the advantages of our disparity estimation module, we visually compared the by-product disparity maps estimated in these three ways. Note that when the network of Kalantari et al. [22] is trained using the code provided by the authors, the estimated disparity maps for the HCI dataset are nearly all zeros. We believe the reason is that the only objective of the network is to optimize the final reconstruction by applying a loss function to the last refinement module. For LF datasets with large disparities, such a loss cannot be efficiently back-propagated to the disparity values through the warping operators. Therefore, we modified their source code and re-trained their network by adding an intermediate supervision for the warped images using ground-truth targets, denoted as Kalantari et al. [22] (inter). The estimated disparity maps then become reasonable. In addition, the average PSNR of its final reconstructions on the HCI dataset is also improved by around 0.3 dB. As shown in Fig. 8, our method produces disparity maps with far fewer errors in both the background and occlusion boundaries.

3) The effectiveness of the blending strategy.

The blending strategy in our approach is designed to address the occlusion issues during the fusion of the images warped from different input SAIs. To validate the effectiveness of the proposed blending strategy, the intermediate results before and after blending are visualized in Fig. 9. It can be observed that the errors around occlusion boundaries in the intermediate images warped from different source SAIs are closely related to the location of the source SAIs, and appear in different positions. The learned confidence maps are able to indicate these error areas in each warped image, and provide guidance for the fusion of the warped images. Blending under the guidance of the confidence maps helps to remove these errors, while the correct regions of each warped image are preserved.

Moreover, to demonstrate the advantage of the proposed blending strategy, we quantitatively compared the blended results of our method and the method used in [22]. First, we removed the refinement module from our model, such that the remaining view synthesis network consists of disparity estimation, warping and the confidence-based blending. We denote this model as Ours conf blend. Then, we replaced the confidence-based blending with the blending strategy used in [22], i.e., using convolutional layers to directly combine the warped images. This new model, denoted as Ours cnn blend, was trained using the same datasets as ours. In this way, the only difference between these two models is the blending mechanism. We compared their reconstruction quality, and the results are listed in Table 8, where it can be seen that Ours conf blend achieves higher PSNR/SSIM values than Ours cnn blend, validating the advantage of our confidence-based blending strategy.

TABLE 8: Effectiveness verification of the confidence-based fusion compared with blending using convolutional layers in [22]. The PSNR/SSIM values are provided for comparisons. 4(a) and 4(f) are two sampling patterns for the task depicted in Fig. 3.

TABLE 7: Effectiveness verification of the refinement module in our approach. We compare the reconstruction quality of the LF images generated by our method without the refinement module and the LF images by our method with all modules under tasks and over HCI and 30scenes.

HCI 35.60/0.954 36.54/0.961 37.33/0.965 38.68/0.971
30scenes 40.12/0.979 41.18/0.982 41.57/0.983 42.83/0.986

HCI 35.39/0.953 36.38/0.960 37.15/0.963 38.43/0.970
30scenes 39.77/0.977 40.65/0.981 41.49/0.983 42.57/0.986

4) The effectiveness of the refinement module.

To demonstrate the effectiveness of the refinement module, we quantitatively compared the quality of the LF images generated by our method without the refinement module and the LF images generated by our method with all modules, and Table 7 lists the results. It can be seen that the refinement provides around 1 dB PSNR improvement, which indicates that the refinement module can efficiently exploit the complementary information between the synthesized SAIs and improve the intermediate LF images.
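To make the confidence-based blending discussed in item 3) above more concrete, the following Python sketch fuses K warped images with per-pixel confidence weights. It is only an illustration under stated assumptions: the confidences here are plain numbers supplied by the caller, whereas in our network they are learned, and normalizing them with a softmax across the K warped images is an assumption of this sketch rather than a description of the implementation. Images are flat lists of floats, and the function names are hypothetical.

```python
import math


def blend_pixel(values, confidences):
    """Fuse K warped intensities of one pixel using softmax-normalized confidences."""
    peak = max(confidences)  # subtract the maximum for numerical stability
    weights = [math.exp(c - peak) for c in confidences]
    total = sum(weights)
    return sum(w * v for w, v in zip(weights, values)) / total


def blend_images(warped, confidence_maps):
    """Blend K warped images (flat lists of equal length) into one fused image."""
    return [
        blend_pixel(vals, confs)
        for vals, confs in zip(zip(*warped), zip(*confidence_maps))
    ]
```

With equal confidences the result reduces to a plain average, while a pixel that one source SAI sees occluded can be suppressed by giving that warped image a low confidence at that location.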

5 APPLICATIONS

In this section, we will discuss two applications, which will benefit from our accurate, flexible and efficient method for the reconstruction of densely-sampled LFs.

5.1 Image-based rendering (IBR)

IBR aims at generating novel views from a set of captured images. A comprehensive review of IBR can be found in [63]. Among IBR techniques, LF rendering is attractive because novel views can be generated by straightforward interpolation without any geometric information, such that real-time rendering can be achieved. To produce novel views without ghosting artifacts, LF rendering requires the LF to be densely sampled, with disparities between neighboring views of less than 1 pixel [3]. Therefore, for a sparsely-sampled LF that does not meet the sampling requirement, our method can reconstruct a densely-sampled LF with the desired angular resolution to enable subsequent LF rendering. More generally, as our method is capable of generating novel views at arbitrary viewpoints from a set of sparsely-sampled SAIs, it can realize IBR directly.
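The sampling requirement above translates into simple arithmetic for the angular resolution needed along one axis. The Python sketch below assumes that the disparity between two views scales linearly with their baseline, which holds for views evenly spaced on a plane; the function names are hypothetical.

```python
import math


def min_segments(disparity_px):
    """Smallest number n of equal sub-intervals such that disparity_px / n < 1 pixel."""
    return math.floor(abs(disparity_px)) + 1


def views_per_axis(disparity_px):
    """Number of views along one angular axis, including the two input views at the ends."""
    return min_segments(disparity_px) + 1
```

For instance, two input views with a maximum disparity of 3.5 pixels need 4 sub-intervals, i.e., 5 views along that axis, so that neighboring views differ by 0.875 pixels.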

To validate the effectiveness of our approach on the IBR application, we compared dense reconstruction under different sampling baselines for different output angular resolutions. Specifically, we compared the performance of different algorithms when reconstructing densely-sampled LFs from corner SAIs sampled at a grid, and when reconstructing densely-sampled LFs from corner SAIs sampled at a grid, on the HCI dataset. As the ground-truth images are unavailable, we visually compared the center SAIs of the reconstructed LF images. Moreover, to compare the ability to preserve the LF parallax structure, horizontal and vertical EPIs are presented. Fig. 10 shows the results, and it can be observed that our method can produce novel SAIs with sharp textures and construct EPIs with clear linear structures, even when the input sampling baselines are relatively large.
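For readers unfamiliar with EPIs, the following Python sketch shows how horizontal and vertical EPIs can be sliced from a 4-D LF. The indexing convention `lf[s][t][y][x]` (vertical view index, horizontal view index, pixel row, pixel column), the nested-list representation and the function names are assumptions made for this illustration.

```python
def horizontal_epi(lf, s, y):
    """Fix the angular row s and the pixel row y.

    Each row of the returned EPI is one horizontal view index t,
    and each column is a pixel column x.
    """
    return [list(lf[s][t][y]) for t in range(len(lf[s]))]


def vertical_epi(lf, t, x):
    """Fix the angular column t and the pixel column x.

    Each row of the returned EPI is one vertical view index s,
    and each column is a pixel row y.
    """
    height = len(lf[0][0])
    return [[lf[s][t][yy][x] for yy in range(height)] for s in range(len(lf))]
```

A scene point traces a line in such a slice whose slope depends on its disparity, which is why clear linear structures in the EPIs indicate a well-preserved parallax structure.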

Fig. 10: Visual comparisons on LF reconstruction with flexible output angular resolution. We present the results of reconstruction from 4 corner SAIs of a sampling grid (top), and the results of reconstruction from 4 corner SAIs of a sampling grid (bottom). The center SAIs of the LF images reconstructed by different algorithms are presented. Horizontal and vertical EPIs corresponding to the colored lines are shown below the center SAI, and regions with obvious artifacts or blurring are highlighted with yellow boxes. It is recommended to view this figure by zooming in.

Fig. 11: Visual comparisons of the depth estimation results (as the depth is inversely proportional to the disparity, we do not make a distinction between them). The center SAIs of the LF images and the disparity maps estimated from the ground-truth densely-sampled LFs, the sparsely-sampled LFs, and the densely-sampled LFs reconstructed by different algorithms are presented from left to right. It is recommended to view this figure by zooming in.

5.2 Depth estimation enhancement

The value of an LF image lies in the implicitly encoded scene geometry information. By finding correspondences in different SAIs, depth maps can be estimated from the LF images. A densely-sampled LF leads to more accurate and more robust depth inference, as matching points can be detected more easily and occlusion problems can be alleviated by multiple viewpoints. Therefore, the proposed method can be used to enhance LF depth estimation.

Here, we present the depth maps estimated from sparsely-sampled LF images as well as those estimated from densely-sampled LF images reconstructed by different algorithms. The state-of-the-art depth estimation algorithm [5] was applied, and Fig. 11 shows the results. It can be observed that the reconstructed densely-sampled LFs enable better estimations than the sparsely-sampled ones, and the depth maps from our method are more accurate than those from others, especially in regions with fine details and occlusion boundaries. Additionally, the high accuracy of the estimated depth maps further validates the advantage of our method in preserving the LF parallax structure.

TABLE 9: Quantitative comparisons (100 × MSE) of the depth estimated from the ground-truth densely-sampled light fields, the sparsely-sampled light fields, and the densely-sampled light fields reconstructed by different methods. The three numbers from left to right are the results of [5]/[62]/[15]. The lower, the better. The best and second best results of different reconstruction methods under an identical depth estimation algorithm are highlighted in red and blue, respectively.

TABLE 10: Quantitative comparisons (Bad Pixel Ratios with threshold 0.07) of the depth estimated from the ground-truth densely-sampled light fields, the sparsely-sampled light fields, and the densely-sampled light fields reconstructed by different methods. The three numbers from left to right are the results of [5]/[62]/[15]. The lower, the better. The best and second best results of different reconstruction methods under an identical depth estimation algorithm are highlighted in red and blue, respectively.

Moreover, we provided quantitative comparisons of the depth maps estimated from different reconstructions. For a reliable evaluation, 3 widely-used and robust depth estimation methods, i.e., [5], [15], [62], were used to avoid possible errors or possible adaptation to a particular reconstruction method. The mean square error (MSE) and Bad Pixel Ratio (BPR) between the estimated depth map and its ground truth were used to measure the accuracy. BPR measures the percentage of pixels in the estimated depth map with an error larger than the threshold. Tables 9 and 10 list the results, where it can be seen that, when evaluated with different depth estimation algorithms, the MSE and BPR of Ours (fixed) are the lowest or second lowest in most cases compared with other methods, especially on the LF image with a large disparity, i.e., StillLife. In particular, the MSE values of Ours (fixed) are even lower than those of the depth maps estimated from the ground-truth densely-sampled LFs in some cases. There are two possible reasons: first, no depth estimation method can guarantee perfectly accurate estimations, and the adopted methods sometimes adapt better to the LFs reconstructed by our method; second, considerable noise is present in the raw LF images [57], and this noise might be suppressed to some extent by our reconstruction algorithm.
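The two depth metrics can be written down compactly. The Python sketch below is an illustration rather than the evaluation code used for Tables 9 and 10: it treats disparity maps as flat lists of floats, reads "100MSE" as the MSE multiplied by 100, and uses the threshold 0.07 stated in the caption of Table 10 as the default. The function names are hypothetical.

```python
def mse_x100(estimate, ground_truth):
    """100 times the mean squared error between two disparity maps (flat lists)."""
    n = len(ground_truth)
    return 100.0 * sum((e - g) ** 2 for e, g in zip(estimate, ground_truth)) / n


def bad_pixel_ratio(estimate, ground_truth, threshold=0.07):
    """Percentage of pixels whose absolute error is larger than the threshold."""
    bad = sum(abs(e - g) > threshold for e, g in zip(estimate, ground_truth))
    return 100.0 * bad / len(ground_truth)
```

The two metrics are complementary: MSE is dominated by a few large errors, whereas BPR counts how many pixels are wrong regardless of how far off they are.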

6 CONCLUSION AND FUTURE WORK

We have presented a novel learning-based algorithm for the reconstruction of densely-sampled LFs from sparsely-sampled ones. Owing to the deep, effective and comprehensive modeling of the unique LF parallax structure, including the geometry-based SAI synthesis based on position-aware PSVs, the adaptive blending strategy and the efficient LF refinement network, our method overcomes the obstacles posed by arbitrary sampling patterns and sparse sampling, not only achieving over 4 dB improvement on synthetic data and 1 dB improvement on real-world data, but also better preserving the valuable LF parallax structure, compared with state-of-the-art methods. Besides, we proposed a simple yet effective algorithm to optimize the sparse sampling pattern for better reconstruction quality. Last but not least, the potential of our method to improve subsequent LF-based applications has been validated and discussed.

For the sampling pattern optimization, we have built a scene content-independent strategy, which only considers the overall distance between the novel views and the sampled ones and the distribution divergence of the sampling. In fact, the optimal sampling pattern should vary with the scene content, such as the geometric complexity and textural information. In our future work, we plan to predict the scene content-dependent optimized sampling pattern via a CNN trained with ground-truth optimal sampling patterns, which can be obtained via an exhaustive search.

Second, we evaluated the quality of reconstructed LFs through the average of SAI-wise PSNR/SSIM and estimated depth maps. However, these metrics can only evaluate the LF image on the spatial dimension, or evaluate the angular consistency indirectly. It is thus highly desirable to build a standard and effective metric for evaluating the quality of 4-D LFs directly. Researchers from the image quality assessment field have started paying attention to this issue [61], [64], [65].

Another interesting line of future work is exploring the potential of the proposed framework on LF data compression. The huge data size of LF images poses great challenges to both data storage and transmission. In [66], an LF image is partitioned into key SAIs and non-key SAIs, and non-key SAIs are compensated by the reconstruction from key SAIs. Only the key SAIs and residual of non-key SAIs are encoded. Our framework adapting to flexible inputs can be naturally utilized to optimize the combination of key SAIs so that the reconstruction quality of non-key SAIs can be improved using the same number of key SAIs, and likewise the compression performance. Moreover, our experimental results have demonstrated that using the optimized sampling patterns, the number of key SAIs can be reduced without penalizing the reconstruction performance, which means the encoding bits of key SAIs can be saved. In the future, we will comprehensively study how the sampling pattern and the number of input views affect the compression performance, and experimentally verify the application of the proposed framework on LF compression.

ACKNOWLEDGEMENT

We thank the authors of [45] for sharing their source code.

REFERENCES

[1] M. Levoy and P. Hanrahan, “Light field rendering,” in Proceedings of the 23rd Annual Conference on Computer Graphics and Interactive Techniques, 1996, pp. 31–42.

[2] S. J. Gortler, R. Grzeszczuk, R. Szeliski, and M. F. Cohen, “The lumigraph,” in Proceedings of the 23rd Annual Conference on Computer Graphics and Interactive Techniques, 1996, pp. 43–54.

[3] J.-X. Chai, X. Tong, S.-C. Chan, and H.-Y. Shum, “Plenoptic sampling,” in SIGGRAPH, 2000, pp. 307–318.

[4] S. Wanner and B. Goldluecke, “Variational light field analysis for disparity estimation and super-resolution,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 36, no. 3, pp. 606–619, 2014.

[5] J. Chen, J. Hou, Y. Ni, and L.-P. Chau, “Accurate light field depth estimation with superpixel regularization over partially occluded regions,” IEEE Transactions on Image Processing, vol. 27, no. 10, pp. 4889–4900, 2018.

[6] C. Kim, H. Zimmer, Y. Pritch, A. Sorkine-Hornung, and M. Gross, “Scene reconstruction from high spatio-angular resolution light fields,” ACM Transactions on Graphics, vol. 32, no. 4, pp. 73:1–73:12, 2013.

[7] J. Fiss, B. Curless, and R. Szeliski, “Refocusing plenoptic images using depth-adaptive splatting,” in IEEE International Conference on Computational Photography (ICCP), 2014, pp. 1–9.

[8] J. Yu, “A light-field journey to virtual reality,” IEEE MultiMedia, vol. 24, no. 2, pp. 104–112, 2017.

[9] F.-C. Huang, K. Chen, and G. Wetzstein, “The light field stereoscope: Immersive computer graphics via factored near-eye light field displays with focus cues,” ACM Transactions on Graphics, vol. 34, no. 4, pp. 60:1–60:12, 2015.

[10] R. S. Overbeck, D. Erickson, D. Evangelakos, M. Pharr, and P. Debevec, “A system for acquiring, processing, and rendering panoramic light field stills for virtual reality,” ACM Transactions on Graphics, vol. 37, no. 6, pp. 197:1–197:15, 2018.

[11] B. Wilburn, N. Joshi, V. Vaish, E.-V. Talvala, E. Antunez, A. Barth, A. Adams, M. Horowitz, and M. Levoy, “High performance imaging using large camera arrays,” ACM Transactions on Graphics, vol. 24, no. 3, pp. 765–776, 2005.

[12] “The (New) Stanford Light Field Archive,” http://lightfield.stanford.edu/acq.html, [Online].

[13] “Lytro illum,” https://www.lytro.com/, [Online].

[14] “Raytrix,” https://www.raytrix.de/, [Online].

[15] T.-C. Wang, A. A. Efros, and R. Ramamoorthi, “Occlusion-aware depth estimation using light-field cameras,” in IEEE International Conference on Computer Vision (ICCV), 2015, pp. 3487–3495.

[16] K. Mitra and A. Veeraraghavan, “Light field denoising, light field superresolution and stereo camera based refocussing using a gmm light field patch prior,” in IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2012, pp. 22–28.

[17] K. Marwah, G. Wetzstein, Y. Bando, and R. Raskar, “Compressive light field photography using overcomplete dictionaries and optimized projections,” ACM Transactions on Graphics, vol. 32, no. 4, pp. 46:1–46:12, 2013.

[18] L. Shi, H. Hassanieh, A. Davis, D. Katabi, and F. Durand, “Light field reconstruction using sparsity in the continuous fourier domain,” ACM Transactions on Graphics, vol. 34, no. 1, pp. 12:1–12:13, 2014.

[19] Z. Zhang, Y. Liu, and Q. Dai, “Light field from micro-baseline image pair,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015, pp. 3800–3809.

[20] S. Vagharshakyan, R. Bregovic, and A. Gotchev, “Light field reconstruction using shearlet transform,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 40, no. 1, pp. 133–147, 2018.

[21] Y. Yoon, H.-G. Jeon, D. Yoo, J.-Y. Lee, and I. So Kweon, “Learning a deep convolutional network for light-field image super-resolution,” in IEEE International Conference on Computer Vision Workshops (ICCVW), 2015, pp. 24–32.

[22] N. K. Kalantari, T.-C. Wang, and R. Ramamoorthi, “Learning-based view synthesis for light field cameras,” ACM Transactions on Graphics, vol. 35, no. 6, pp. 193:1–193:10, 2016.

[23] G. Wu, Y. Liu, L. Fang, Q. Dai, and T. Chai, “Light field reconstruction using convolutional network on epi and extended applications,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 41, no. 7, pp. 1681–1694, 2019.

[24] G. Wu, Y. Liu, Q. Dai, and T. Chai, “Learning sheared epi structure for light field reconstruction,” IEEE Transactions on Image Processing, vol. 28, no. 7, pp. 3261–3273, 2019.

[25] W. F. H. Yeung, J. Hou, J. Chen, Y. Ying Chung, and X. Chen, “Fast light field reconstruction with deep coarse-to-fine modeling of spatial-angular clues,” in European Conference on Computer Vision (ECCV), 2018, pp. 137–152.

[26] S. E. Chen and L. Williams, “View interpolation for image synthesis,” in Proceedings of the 20th Annual Conference on Computer Graphics and Interactive Techniques (SIGGRAPH ’93), 1993, pp. 279–288.

[27] L. McMillan and G. Bishop, “Plenoptic modeling: An image-based rendering system,” in Proceedings of the 22nd Annual Conference on Computer Graphics and Interactive Techniques (SIGGRAPH ’95), 1995, pp. 39–46.

[28] J. Flynn, I. Neulander, J. Philbin, and N. Snavely, “Deepstereo: Learning to predict new views from the world’s imagery,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.

[29] E. Penner and L. Zhang, “Soft 3d reconstruction for view synthesis,” ACM Transactions on Graphics, vol. 36, no. 6, pp. 235:1–235:11, 2017.

[30] S. Tulsiani, R. Tucker, and N. Snavely, “Layer-structured 3d scene inference via view synthesis,” in European Conference on Computer Vision (ECCV), 2018, pp. 302–317.

[31] T. Zhou, R. Tucker, J. Flynn, G. Fyffe, and N. Snavely, “Stereo magnification: Learning view synthesis using multiplane images,” ACM Transactions on Graphics, vol. 37, no. 4, pp. 65:1–65:12, 2018.

[32] T. Zhou, S. Tulsiani, W. Sun, J. Malik, and A. A. Efros, “View synthesis by appearance flow,” in European Conference on Computer Vision (ECCV), 2016, pp. 286–301.

[33] S. Tulsiani, T. Zhou, A. A. Efros, and J. Malik, “Multi-view supervision for single-view reconstruction via differentiable ray consistency,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 2626–2634.

[34] E. Park, J. Yang, E. Yumer, D. Ceylan, and A. C. Berg, “Transformation-grounded image generation network for novel 3d view synthesis,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 3500–3509.

[35] M. Tatarchenko, A. Dosovitskiy, and T. Brox, “Multi-view 3d models from single images with a convolutional network,” in European Conference on Computer Vision (ECCV), 2016, pp. 322–337.

[36] M. Hosseini Kamal, B. Heshmat, R. Raskar, P. Vandergheynst, and G. Wetzstein, “Tensor low-rank and sparse light field photography,” Computer Vision and Image Understanding, vol. 145, no. C, pp. 172–181, 2016.

[37] F. Zhang, J. Wang, E. Shechtman, Z. Zhou, J. Shi, and S. Hu, “Plenopatch: Patch-based plenoptic image manipulation,” IEEE Transactions on Visualization and Computer Graphics, vol. 23, no. 5, pp. 1561–1573, 2017.

[38] H.-G. Jeon, J. Park, G. Choe, J. Park, Y. Bok, Y.-W. Tai, and I. So Kweon, “Accurate depth map estimation from a lenslet light field camera,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015, pp. 1547–1555.

[39] C. Dong, C. C. Loy, K. He, and X. Tang, “Image super-resolution using deep convolutional networks,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 38, no. 2, pp. 295–307, 2016.

[40] W. Shi, J. Caballero, F. Huszár, J. Totz, A. P. Aitken, R. Bishop, D. Rueckert, and Z. Wang, “Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 1874–1883.

[41] J. Kim, J. Kwon Lee, and K. Mu Lee, “Accurate image super-resolution using very deep convolutional networks,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 1646–1654.

[42] W.-S. Lai, J.-B. Huang, N. Ahuja, and M.-H. Yang, “Deep laplacian pyramid networks for fast and accurate super-resolution,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 624–632.

[43] C. Dong, C. C. Loy, K. He, and X. Tang, “Learning a deep convolutional network for image super-resolution,” in European Conference on Computer Vision (ECCV), 2014, pp. 184–199.

[44] G. Wu, M. Zhao, L. Wang, Q. Dai, T. Chai, and Y. Liu, “Light field reconstruction using deep convolutional network on epi,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 1638–1646.

[45] Y. Wang, F. Liu, Z. Wang, G. Hou, Z. Sun, and T. Tan, “End-to-end view synthesis for light field imaging with pseudo 4dcnn,” in European Conference on Computer Vision (ECCV), 2018, pp. 333–348.

[46] P. P. Srinivasan, T. Wang, A. Sreelal, R. Ramamoorthi, and R. Ng, “Learning to synthesize a 4d rgbd light field from a single image,” in IEEE International Conference on Computer Vision (ICCV), 2017, pp. 2243–2251.

[47] J. Jin, J. Hou, H. Yuan, and S. Kwong, “Learning light field angular super-resolution via a geometry-aware network,” in AAAI, 2020, pp. 11 141–11 148.

[48] R. T. Collins, “A space-sweep approach to true multi-image matching,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 1996, pp. 358–363.

[49] R. Rigamonti, A. Sironi, V. Lepetit, and P. Fua, “Learning separable filters,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2013, pp. 2754–2761.

[50] L.-Q. Yan, S. U. Mehta, R. Ramamoorthi, and F. Durand, “Fast 4d sheared filtering for interactive rendering of distribution effects,” ACM Transactions on Graphics, vol. 35, no. 1, pp. 7:1–7:13, 2015.

[51] S. Niklaus, L. Mai, and F. Liu, “Video frame interpolation via adaptive separable convolution,” in IEEE International Conference on Computer Vision (ICCV), 2017, pp. 261–270.

[52] T.-C. Wang, J.-Y. Zhu, E. Hiroaki, M. Chandraker, A. A. Efros, and R. Ramamoorthi, “A 4d light-field dataset and cnn architectures for material recognition,” in European Conference on Computer Vision (ECCV), 2016, pp. 121–138.

[53] H. W. F. Yeung, J. Hou, X. Chen, J. Chen, Z. Chen, and Y. Y. Chung, “Light field spatial super-resolution using deep efficient spatial-angular separable convolution,” IEEE Transactions on Image Processing, vol. 28, no. 5, pp. 2319–2330, 2019.

[54] S. Vijayanarasimhan, S. Ricco, C. Schmid, R. Sukthankar, and K. Fragkiadaki, “Sfm-net: Learning of structure and motion from video,” arXiv preprint arXiv:1704.07804, 2017.

[55] J. Hou, L.-P. Chau, N. Magnenat-Thalmann, and Y. He, “Human motion capture data tailored transform coding,” IEEE Transactions on Visualization and Computer Graphics, vol. 21, no. 7, pp. 848–859, 2015.

[56] S. Lloyd, “Least squares quantization in pcm,” IEEE Transactions on Information Theory, vol. 28, no. 2, pp. 129–137, 1982.

[57] K. Honauer, O. Johannsen, D. Kondermann, and B. Goldluecke, “A dataset and evaluation methodology for depth estimation on 4d light fields,” in Asian Conference on Computer Vision (ACCV), 2016, pp. 19–34.

[58] S. Wanner, S. Meister, and B. Goldluecke, “Datasets and benchmarks for densely sampled 4d light fields,” in VMV, 2013, pp. 225–226.

[59] A. S. Raj, M. Lowney, R. Shah, and G. Wetzstein, “Stanford lytro light field archive,” http://lightfields.stanford.edu/LF2016.html, [Online].

[60] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.

[61] P. Paudyal, F. Battisti, and M. Carli, “Reduced reference quality assessment of light field images,” IEEE Transactions on Broadcasting, vol. 65, no. 1, pp. 152–165, 2019.

[62] S. Zhang, H. Sheng, C. Li, J. Zhang, and Z. Xiong, “Robust depth estimation for light field via spinning parallelogram operator,” Computer Vision and Image Understanding, vol. 145, pp. 148–159, 2016.

[63] C. Zhang and T. Chen, “A survey on image-based rendering: representation, sampling and compression,” Signal Processing: Image Communication, vol. 19, no. 1, pp. 1–28, 2004.

[64] V. Kiran Adhikarla, M. Vinkler, D. Sumin, R. K. Mantiuk, K. Myszkowski, H.-P. Seidel, and P. Didyk, “Towards a quality metric for dense light fields,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 58–67.

[65] Y. Tian, H. Zeng, J. Hou, J. Chen, and K.-K. Ma, “Light field image quality assessment via the light field coherence,” IEEE Transactions on Image Processing, vol. 29, pp. 7945–7956, 2020.

[66] J. Hou, J. Chen, and L.-P. Chau, “Light field image compression based on bi-level view compensation with rate-distortion optimization,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 29, no. 2, pp. 517–530, 2018.
