Understanding the surrounding 3-dimensional (3D) environment is an essential perception task for numerous robotic applications including manipulation, exploration, and navigation [1–4]. Robots may rely on accurate depth estimation of a scene to avoid obstacles and manipulate objects. For depth estimation, we typically use depth sensors such as stereo cameras, structured-light depth sensors, and time-of-flight sensors, but these are usually expensive compared to a single RGB camera. Researchers have been working on depth estimation with a single monocular RGB camera, but its accuracy is not yet high enough: no popular depth sensor on the market relies on monocular depth estimation. Almost all prior works on monocular depth estimation assume passive sensing, in which the camera motion is uncontrollable. Can we obtain accurate depth estimation with a single RGB camera combined with active sensing?
Most high-quality depth sensors are built upon the principles of stereo matching and time of flight, rather than on a monocular camera. Stereo cameras are equipped with two color cameras displaced horizontally so that corresponding pixels in the two cameras lie on the same horizontal lines. Stereo matching estimates disparity maps that encode the pixel offsets between corresponding pixels in the stereo images [5]. Some depth sensors, such as the first generation of Kinect, utilize structured light in infrared images to ease the stereo matching process, but projecting infrared speckle patterns requires high power consumption. A time-of-flight sensor such as LiDAR measures the time of flight of light between the sensor and the object to infer the depth values, and also has high power consumption when emitting light. Compared to a single RGB camera, all these existing depth sensors are expensive because they need more cameras or projectors, and they consume more power. In this paper, we show that we can obtain high-quality depth with a single camera with active sensing.
The authors are with the Hong Kong University of Science and Technology, Hong Kong SAR, China. W. Yuan (weihao.yuan@connect.ust.hk) and R. Fan (eeruifan@ust.hk) are with the Department of Electronic and Computer Engineering. M. Y. Wang (mywang@ust.hk) is with the Department of Mechanical and Aerospace Engineering and the Department of Electronic and Computer Engineering. Q. Chen (cqf@ust.hk) is with the Department of Computer Science and Engineering and the Department of Electronic and Computer Engineering.
Fig. 1: Multiscopic vision system. The camera is moved under control in an active perception system such that the captured images are co-planar and have the same parallax. The point P on a pyramid cannot be perceived from the left view but can be seen from other viewpoints.
If a camera can be controlled actively (with a robotic arm), can we obtain a high-quality 3D understanding of the scene by capturing multiple images at different specified locations? With such an active sensing strategy, we have nearly perfect camera pose estimation for all the captured images, and more constraints can be enforced in the reconstructed 3D model. Both the magnitude and direction of the pixel disparities can be controlled such that we can search for correspondences easily. In numerous industrial environments, a color camera is usually installed on moving agents, such as autonomous ground vehicles (AGVs) and robot arms, that can control the camera movement.
In this paper, we study active perception with a single camera for depth estimation by taking multiple images at specified camera poses. We refer to the problem of depth estimation with multiple images captured at aligned camera locations as multiscopic vision, as an analog to stereo vision with two horizontally aligned images. Inspired by the principle of stereo vision that depth estimation with two perfectly aligned images is relatively easier than with two images with arbitrary unknown camera poses, we believe capturing multiple images with aligned camera locations can bring benefits in obtaining more accurate and comprehensive depth estimation.
We design an active perception system which uses a monocular camera mounted on a robot arm to produce a series of images surrounding a center image. These images are highly regular and form a super stereo framework, multiscopic vision, as shown in Fig. 1. We command the robot arm to move the camera within the image plane so that all images are co-planar. Then the search for pixel correspondences can be conducted along a fixed line direction only. If we further move the camera along the horizontal or vertical axis, the disparity will only be along the horizontal or vertical direction. And if the camera is moved by the same distance for every surrounding image, the disparity of each pixel relative to the center image should be the same, which is a strong regularization for computing an accurate disparity map.
The multiscopic vision system brings clear benefits to depth estimation when compared to multiview stereo (MVS) and stereo matching. Compared with MVS, which performs stereo matching between pairs of images, our system can easily aggregate multiple cost volumes because all the captured images are aligned horizontally or vertically. For stereo matching methods, finding pixel correspondences is challenging because occlusion, reflection, illumination changes, and lack of texture can easily disturb the matching. In a multiscopic system, depth estimation is much more robust because multiple cost volumes are available and can be easily combined.
Our experiments show that multiscopic vision with multiple aligned images generates much more accurate depth estimation than stereo matching methods with only two images. Furthermore, the depth map produced by multiscopic vision contains few occluded pixels because each pixel in the central image is likely to appear in at least one of the surrounding images, as shown in Fig. 1.
Our main contributions concerning monocular active perception for multiscopic vision are summarized as follows:
1) We design an active perception system that captures multiscopic images with arbitrary baselines using a monocular camera mounted on an eye-in-hand robot arm.
2) Our proposed multiscopic vision system produces more accurate depth maps by utilizing multiple images in a co-planar, same-parallax structure. Also, the computed depth maps are nearly occlusion-free.
3) We evaluate and validate our multiscopic vision system on a public benchmark dataset as well as our real-world applications.
We first review prior works related to stereo vision with active perception and then camera array systems for capturing multiple images. Afterward, we will discuss stereo vision systems with a monocular camera.
Active perception is widely employed in robotic applications such as exploration and manipulation [7, 8]. Active movement can assist in the localization of the manipulated objects under occlusions [9] or explore an unseen environment better [10]. For stereo vision, since the baseline is critical for correspondence matching, some works about actively adjusting the baseline were proposed [11, 12]. A linear slider was used in [12] to change the baseline of two stereo cameras such that the baseline could be adaptive to the distance between the camera and the environment. This enables better 3D reconstruction of different scenes.
Another approach for depth estimation is based on camera arrays, in which many cameras are arranged in an array [13–15]. The baseline can then be changed by choosing camera pairs in different positions, and the cost volumes can be constructed by fusing redundant images to solve the partial invisibility problem [14, 15]. The surfaces occluded for one camera can be reconstructed with the assistance of other cameras. However, a camera array with multiple cameras is bulky and expensive. Another difficulty is the rectification of different cameras.
To take advantage of identical camera parameters, some stereo vision systems use a single camera to perform depth estimation. By analyzing the optical structure, Adelson and Wang proposed a single-lens stereo system with a plenoptic camera that could produce photos from different viewpoints [16]. These captured images could then be used as stereo images for depth estimation. However, the stereo baseline is usually limited to the size of the lens aperture. Similar works using plates or mirrors to guide the light were proposed to obtain virtual stereo images. These optical designs also introduce complex optical uncertainty and geometric calculation [17–20].
None of these prior works exploits the strong regularization provided by multiple images captured with an active perception system. In contrast, we use a low-cost monocular camera to capture images at horizontally or vertically aligned camera positions. The cost volumes for stereo matching between the reference image and the surrounding images can be easily combined to form a robust cost volume for depth estimation.
We start by illustrating how active perception can work for stereo matching with a monocular camera and present the stereo matching algorithms related to our multiscopic vision system.
A. Active Stereo Perception
Our multiscopic vision system is capable of capturing multiple images that are combined to reconstruct the 3D scene. If we only use two of these images, the depth estimation problem would degenerate to a stereo matching problem. We begin with how a monocular camera can be applied to stereo matching problems.
Fig. 2: A stereo image pair captured using the proposed active perception system. The baseline between these two images is 20 mm.
First, a monocular camera is installed on the end of a robot arm so the camera can be moved freely. Then we program the robot to move the camera along the horizontal axis of the image plane and two images are taken, as shown in Fig. 2. Unlike a binocular stereo camera, the image pairs can be captured with arbitrary baselines in this active perception system. Different baselines can be applied for different purposes: accurate depth estimation of distant objects may require large baselines while stereo matching is easier with smaller baselines (less occlusion and smaller disparity).
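The baseline trade-off described above can be made concrete with the standard rectified pinhole relation Z = fB/d, which the text relies on but does not state. The following is a minimal sketch under that assumption; the function names and the focal length `focal_px` (in pixels) are hypothetical and not taken from the system described here.

```python
def depth_from_disparity(disparity_px, baseline_mm, focal_px):
    """Depth (mm) of a point under the rectified pinhole model Z = f * B / d."""
    if disparity_px <= 0:
        raise ValueError("disparity must be positive")
    return focal_px * baseline_mm / disparity_px


def depth_resolution(depth_mm, baseline_mm, focal_px, disparity_step_px=1.0):
    """Approximate depth change (mm) caused by one disparity step at a given depth.

    Obtained by differentiating Z = f * B / d, which gives |dZ| = Z^2 / (f * B) * |dd|.
    """
    return depth_mm ** 2 * disparity_step_px / (focal_px * baseline_mm)
```

Under these assumptions, doubling the baseline halves the depth change per disparity step at a fixed depth, which is why distant objects benefit from larger baselines, while the disparity search range grows proportionally.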
Researchers have proposed several classic stereo matching methods such as naive block matching, dynamic programming [21], semi-global matching [22], belief propagation [23], graph cuts [24], and matching with convolutional neural networks [25]. Specifically, we study the naive block matching method (BM), the graph cuts method (GC), and the deep learning-based method (MC-CNN) [25].
B. Block Matching
Naive block matching is a simple and straightforward stereo matching method. This method minimizes the matching error between two blocks in the left image and the right image. To find the most similar block, we need to check all possible blocks in the same row from the minimum disparity to the maximum allowable disparity. The sum of absolute differences (SAD) is often used to measure the similarity between two blocks. For a pixel (u, v) in the left image, its SAD cost with block size $(2\rho+1) \times (2\rho+1)$ and disparity d can be calculated as
$$c_{SAD}(u, v, d) = \frac{1}{(2\rho+1)^2} \sum_{x=u-\rho}^{u+\rho} \sum_{y=v-\rho}^{v+\rho} \left| I_l(x, y) - I_r(x-d, y) \right|, \tag{1}$$
where $c_{SAD}(u, v, d)$ is the cost at point (u, v) with disparity d, $\rho$ is the radius of the block, $I_l(x, y)$ is the intensity of the pixel at (x, y) in the left image and $I_r(x-d, y)$ denotes the intensity of pixel (x − d, y) in the right image. The center of the reference block is (u, v) and the total number of pixels within this block is $(2\rho+1)^2$.
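The SAD cost described above can be sketched as a cost volume over all candidate disparities. This is a minimal illustration rather than the authors' implementation: the function name, the NumPy box filter, the edge padding, and the division by the block size (which does not change which disparity wins) are assumptions, and `max_disp` is assumed to be smaller than the image width.

```python
import numpy as np


def sad_cost_volume(left, right, max_disp, radius):
    """Return cost[d, v, u]: mean absolute difference between the block centred
    at (u, v) in `left` and the block centred at (u - d, v) in `right`."""
    left = left.astype(np.float64)
    right = right.astype(np.float64)
    h, w = left.shape
    k = 2 * radius + 1
    cost = np.full((max_disp + 1, h, w), np.inf)
    for d in range(max_disp + 1):
        # Per-pixel absolute difference; columns u < d have no counterpart.
        diff = np.full((h, w), np.nan)
        diff[:, d:] = np.abs(left[:, d:] - right[:, : w - d])
        # Average the differences over every (2*radius+1)^2 block.
        padded = np.pad(diff, radius, mode="edge")
        windows = np.lib.stride_tricks.sliding_window_view(padded, (k, k))
        block_mean = windows.mean(axis=(2, 3))
        # Blocks touching columns without a counterpart get an infinite cost.
        cost[d] = np.where(np.isnan(block_mean), np.inf, block_mean)
    return cost
```

Taking `cost.argmin(axis=0)` on the result gives the integer disparity with the lowest cost at each pixel.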
For the naive block matching algorithm, we simply apply the Winner-Take-All (WTA) strategy to select the correspondence with the lowest SAD cost. To improve the continuity of the results, we perform subpixel enhancement on the discrete disparity:
$$d_s = d - \frac{c(u, v, d+1) - c(u, v, d-1)}{2\left(c(u, v, d+1) - 2c(u, v, d) + c(u, v, d-1)\right)}, \tag{2}$$
where d is the integer disparity, $d_s$ is the subpixel disparity, and c is a cost volume such as $c_{SAD}$.
Fig. 3: The disparity maps obtained by two stereo matching algorithms, displayed in Jet colormap. (a) is using naive stereo block matching and (b) is using stereo graph cuts. Note that the disparities on occluded regions are not estimated accurately, and the stereo matching on the metal tabletop is not accurate due to reflection.
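The WTA selection and subpixel enhancement can be sketched as follows. The refinement used here fits a parabola through the costs at d − 1, d, and d + 1, which is a common choice and an assumption about the exact formula; the function name and the `cost[d, v, u]` layout are also illustrative.

```python
import numpy as np


def wta_subpixel(cost):
    """Winner-take-all disparity from cost[d, v, u], refined by a parabola fit."""
    n_disp = cost.shape[0]
    d = cost.argmin(axis=0)                  # integer disparity per pixel
    disp = d.astype(np.float64)
    # Refinement needs both neighbours d-1 and d+1 inside the search range.
    inner = (d > 0) & (d < n_disp - 1)
    v, u = np.nonzero(inner)
    di = d[v, u]
    c_minus = cost[di - 1, v, u]
    c_zero = cost[di, v, u]
    c_plus = cost[di + 1, v, u]
    denom = 2.0 * (c_plus - 2.0 * c_zero + c_minus)
    # Skip flat or invalid neighbourhoods, where the parabola has no minimum.
    ok = np.isfinite(denom) & (denom > 0)
    disp[v[ok], u[ok]] -= (c_plus[ok] - c_minus[ok]) / denom[ok]
    return disp
```

When the cost is exactly quadratic around its minimum, this fit recovers the continuous minimum; on real cost volumes it only smooths the staircase effect of integer disparities.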
Applying the naive block matching algorithm to image pairs captured by our active perception system, we get a noisy disparity map as displayed in Fig. 3(a). Also, the depths in occluded regions and on the reflective tabletop are not reconstructed correctly.
C. Graph Cuts
Graph cuts is one of the most popular global optimization methods for stereo matching. It is a process that assigns a label (i.e., disparity) to each pixel in the reference image such that the energy is minimized. Both our stereo graph cuts and multiscopic graph cuts are based on the graph cuts stereo matching algorithm by Kolmogorov and Zabih [24].
In our graph cuts optimization, the energy is composed of four terms, defined as
$$E = E_{data} + E_{occlusion} + E_{smoothness} + E_{uniqueness}. \tag{3}$$
Data term is used to evaluate the similarity of two image patches. Note that our images may not be perfectly aligned due to the limited precision of the robot arm movement, so the epipolar line may deviate slightly from the horizontal or vertical direction. To compensate for this, we use an improved Birchfield and Tomasi's (BT) dissimilarity for the data term [24, 26]:
$$c_{BT}(p, q) = \max\left\{0,\; I_l(p) - I_r^{max}(q),\; I_r^{min}(q) - I_l(p)\right\}, \tag{4}$$
where p is a pixel in the left image, q is its candidate match in the right image, and $I_r^{min}(q)$ and $I_r^{max}(q)$ are respectively the smallest and largest values on the subpixel neighborhood around pixel q in the right image. For a pixel q in the right image:
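$$I_r^{min}(q) = \min\left\{I_r^{-}(q),\, I_r(q),\, I_r^{+}(q)\right\}, \qquad I_r^{max}(q) = \max\left\{I_r^{-}(q),\, I_r(q),\, I_r^{+}(q)\right\}, \tag{5}$$
where $I_r^{\pm}(q) = \frac{1}{2}\left(I_r(q) + I_r(q \pm (0, 1))\right)$ are the intensities interpolated half a pixel above and below q. Therefore the stereo matching for correspondence is actually performed between the row half a pixel higher and the row half a pixel lower.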
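A BT-style data cost for a single disparity can be sketched as below. This is an illustration only: the function name is hypothetical, the half-pixel intensities are obtained by averaging neighboring pixels, and the direction of the neighborhood is exposed as the `axis` parameter because the choice between adjacent rows and adjacent columns is an assumption here.

```python
import numpy as np


def bt_cost(ref, other, d, axis=0):
    """BT-style cost between ref[v, u] and other[v, u - d] for one disparity d.

    `axis` selects the direction of the half-pixel neighbourhood
    (0: adjacent rows, 1: adjacent columns); this choice is an assumption.
    """
    ref = ref.astype(np.float64)
    o = np.moveaxis(other.astype(np.float64), axis, 0)
    prev = np.concatenate([o[:1], o[:-1]])   # neighbour one pixel before (edge repeated)
    nxt = np.concatenate([o[1:], o[-1:]])    # neighbour one pixel after (edge repeated)
    half_prev, half_next = 0.5 * (o + prev), 0.5 * (o + nxt)
    # Smallest and largest intensity on the half-pixel neighbourhood.
    i_min = np.moveaxis(np.minimum(o, np.minimum(half_prev, half_next)), 0, axis)
    i_max = np.moveaxis(np.maximum(o, np.maximum(half_prev, half_next)), 0, axis)
    h, w = ref.shape
    cost = np.full((h, w), np.inf)           # columns u < d have no counterpart
    r = ref[:, d:]
    cost[:, d:] = np.maximum(
        0.0, np.maximum(r - i_max[:, : w - d], i_min[:, : w - d] - r)
    )
    return cost
```

The cost is zero whenever the reference intensity falls inside the interval spanned by the neighborhood, which is what makes the measure tolerant to small misalignments along the chosen axis.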
Occlusion term is used to maximize the number of matches. To encourage the disparity assignment in graph cuts optimization, any inactive pixel without assignment is penalized by energy K.
Fig. 4: Five images captured using our active perception system from different viewpoints. The parallaxes between adjacent views are the same.
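Smoothness term encourages assigning the same disparity to adjacent pixels, especially those with similar colors. Thus if two adjacent pixels $p, p'$ in the left image have different disparity assignments corresponding to pixels $q, q'$ in the right image, a penalty $V(p, p')$ would be added as:
$$V(p, p') = \begin{cases} \lambda_1 \, \Delta d, & \max\left(|I_l(p) - I_l(p')|,\, |I_r(q) - I_r(q')|\right) < \tau_c, \\ \lambda_2 \, \Delta d, & \text{otherwise}, \end{cases} \qquad \Delta d = \min\left(|d_p - d_{p'}|,\, \tau_d\right), \tag{6}$$
where $\tau_c$ is a threshold to evaluate the color similarity, $\lambda_1, \lambda_2$ are penalty constants for similar and dissimilar pixels, and $\Delta d$ is the disparity difference truncated at a threshold $\tau_d$.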
Uniqueness term enforces the uniqueness of pixel correspondences. In other words, for a pixel in the left image, we do not allow two pixels in the right image to match it simultaneously. Such an assignment is penalized by an infinite energy.
We use the graph cuts optimization to minimize the energy E. To suppress the discontinuous disparity artifacts, input images are enlarged by a factor of 4 before the graph cuts optimization. As shown in Fig. 3(b), the resulting disparity map with graph cuts contains less noise, but artifacts in occluded regions and on the reflective tabletop still persist.
We first introduce our multiscopic vision system with active perception to capture axis-aligned images and then propose multiscopic matching algorithms for robust depth estimation.
A. Multiscopic Active Perception
In our multiscopic vision system presented in Fig. 1, we can move a monocular camera to the left and to the right along the horizontal axis, and up and down along the vertical axis. We capture one center image and four axis-aligned images with the same baseline in the left, right, bottom, and top views, as displayed in Fig. 4. The baseline between the center image and one neighboring image is 20 millimeters. With the center image as the reference, the other four images can jointly contribute to the disparity estimation. Besides, each point seen in the center image is very likely to be contained in at least one of the other four images. For example, the point P in Fig. 1 cannot be observed from the left view but can be perceived completely from other views.
Fig. 5: A multiscopic structure with three images is formed by moving a camera horizontally along the image plane with the same distance. Thus there are three images captured from the left view, the center view, and the right view. The gray optical axes are perpendicular to the image planes in blue. The points O are optical centers. A point P in 3D space is projected onto the image plane at different times, corresponding to three pixels in 2D images.
B. Multiscopic Block Matching
The images in multiscopic vision are taken with parallel optical axes and co-planar image planes, as illustrated in Fig. 5. Since the baselines for the four surrounding images are the same, the disparity of a pixel between the center image and any surrounding image should be the same. Note that the correspondences between the center image and another image are on the same row or column due to the horizontal or vertical movement of the camera, as demonstrated in Fig. 5. Considering a multiscopic system with three images as an example, a point P in 3D space is projected onto the camera image planes as three image pixels $p_l$, $p_c$, $p_r$. The disparity $d_l$ between $p_l$ and $p_c$ and the disparity $d_r$ between $p_c$ and $p_r$ are the same.
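The equal-disparity property can be checked numerically with an ideal pinhole model. The sketch below is an illustration under stated assumptions: three cameras with parallel optical axes placed at −B, 0, and +B along the horizontal axis, a hypothetical focal length `focal_px` in pixels, and function names chosen for this example.

```python
def project_u(point_xyz, camera_x, focal_px, cx=0.0):
    """Horizontal pixel coordinate of a 3D point for a pinhole camera whose
    optical centre is at (camera_x, 0, 0) and whose optical axis is +Z."""
    x, _, z = point_xyz
    return focal_px * (x - camera_x) / z + cx


def multiscopic_disparities(point_xyz, baseline, focal_px):
    """Disparities (left to centre, centre to right) for cameras at -B, 0, +B."""
    u_l = project_u(point_xyz, -baseline, focal_px)
    u_c = project_u(point_xyz, 0.0, focal_px)
    u_r = project_u(point_xyz, +baseline, focal_px)
    return u_l - u_c, u_c - u_r
```

Both returned values equal f·B/Z in this model, independent of the horizontal position of the point, which is the constraint that the multiscopic matching relies on.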
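In the real-world application, our multiscopic vision system takes five images, as shown in Fig. 4. The data term is composed of four parts, each for one surrounding image:
$$\begin{aligned}
c_r(u, v, d) &= \frac{1}{(2\rho+1)^2} \sum_{x=u-\rho}^{u+\rho} \sum_{y=v-\rho}^{v+\rho} \left| I_c(x, y) - I_r(x-d, y) \right|, \\
c_l(u, v, d) &= \frac{1}{(2\rho+1)^2} \sum_{x=u-\rho}^{u+\rho} \sum_{y=v-\rho}^{v+\rho} \left| I_c(x, y) - I_l(x+d, y) \right|, \\
c_t(u, v, d) &= \frac{1}{(2\rho+1)^2} \sum_{x=u-\rho}^{u+\rho} \sum_{y=v-\rho}^{v+\rho} \left| I_c(x, y) - I_t(x, y+d) \right|, \\
c_b(u, v, d) &= \frac{1}{(2\rho+1)^2} \sum_{x=u-\rho}^{u+\rho} \sum_{y=v-\rho}^{v+\rho} \left| I_c(x, y) - I_b(x, y-d) \right|,
\end{aligned} \tag{7}$$
where $I_c$ is the center image, $\rho$ is the block radius, and $I_r, I_l, I_t, I_b$ denote the images taken from the right, left, top, and bottom views respectively.
Fig. 6: The disparity results of various multiscopic algorithms. Block matching with mean, minimum and heuristic SAD cost fusion produces disparity maps (a), (b), (c) respectively and (d) is using multiscopic graph cuts with heuristic fusion. Note that the metal table surface in the bottom left corner and bottom right corner is reconstructed well now.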
Then fusing these four parts to form the final data term is crucial. One naive idea is to take the average,
$$c_{mean} = \frac{1}{4}\left(c_r + c_l + c_t + c_b\right), \tag{8}$$
where $c_r, c_l, c_t, c_b$ are the costs with respect to the right, left, top, and bottom images. The visual result of using $c_{mean}$, shown in Fig. 6(a), indicates that it does remove much noise and reconstructs the reflective tabletop better, but the result is still affected by occlusion areas. For the center image, some regions cannot be seen in some surrounding images. For instance, the region to the left of the toy cannot be seen in the right image. Thus the cost $c_r$ for this region would be large and may affect the overall data term $c_{mean}$. Therefore we consider another fusion strategy by choosing the smallest one when combining the four parts:
$$c_{min} = \min\left(c_r, c_l, c_t, c_b\right). \tag{9}$$
The visual result with $c_{min}$ is presented in Fig. 6(b). We can see that the occlusion region is reconstructed clearly but the noise is persistent in some areas. To overcome this, we design a heuristic fusion strategy. First we sort the four costs at each pixel and keep the three smallest costs $c_1 \le c_2 \le c_3$ ($c_1$ is the smallest). Then we remove $c_3$, the second largest of the four costs, if it is much larger than the other two:
which leads to a cleaner disparity map as shown in Fig. 6(c).
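The three fusion strategies discussed above can be sketched side by side. This is a hedged illustration, not the authors' rule: the function name, the `ratio` threshold that decides when the third-smallest cost counts as "much larger", and the use of an average over the retained costs are all assumptions made for this example.

```python
import numpy as np


def fuse_costs(costs, mode="heuristic", ratio=2.0):
    """Fuse per-view costs of shape (4, ...) into one cost of shape (...).

    `ratio` is a hypothetical threshold: the third-smallest cost is dropped
    when it exceeds `ratio` times the mean of the two smallest costs.
    """
    costs = np.asarray(costs, dtype=np.float64)
    if mode == "mean":
        return costs.mean(axis=0)
    if mode == "min":
        return costs.min(axis=0)
    if mode != "heuristic":
        raise ValueError(f"unknown mode: {mode}")
    # Keep the three smallest costs per pixel, in ascending order.
    c1, c2, c3 = np.sort(costs, axis=0)[:3]
    mean2 = 0.5 * (c1 + c2)
    mean3 = (c1 + c2 + c3) / 3.0
    # Drop c3 where it looks like an outlier (e.g. an occluded view).
    return np.where(c3 > ratio * mean2, mean2, mean3)
```

The largest of the four costs is always discarded in the heuristic mode, so a single occluded view cannot dominate, while averaging the remaining costs keeps some of the noise suppression of the mean rule.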
C. Multiscopic Graph Cuts
The optimization using graph cuts in multiscopic vision is similar to the two-frame stereo matching, except for modifications to the data term and the smoothness term.
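Data term now is also an integration of four parts:
$$\begin{aligned}
c_{BT}^{r}(u, v, d) &= \max\left\{0,\; I_c(u, v) - I_r^{max}(u-d, v),\; I_r^{min}(u-d, v) - I_c(u, v)\right\}, \\
c_{BT}^{l}(u, v, d) &= \max\left\{0,\; I_c(u, v) - I_l^{max}(u+d, v),\; I_l^{min}(u+d, v) - I_c(u, v)\right\}, \\
c_{BT}^{t}(u, v, d) &= \max\left\{0,\; I_c(u, v) - I_t^{max}(u, v+d),\; I_t^{min}(u, v+d) - I_c(u, v)\right\}, \\
c_{BT}^{b}(u, v, d) &= \max\left\{0,\; I_c(u, v) - I_b^{max}(u, v-d),\; I_b^{min}(u, v-d) - I_c(u, v)\right\},
\end{aligned} \tag{11}$$
where $I_c$ is the center image and $I_i^{min}, I_i^{max}$, $i \in \{r, l, t, b\}$, are the smallest and largest values on the subpixel neighborhood in the right, left, top, bottom image respectively. These four costs are then merged using the same heuristic rule to get the fused cost $c_{BT}$.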
Smoothness term now should take the color and disparity discontinuity in every image into account. If two adjacent pixels $p, p'$ in the center image have different disparity assignments, then the penalty should be
(12)
where $q_i, q_i'$, $i \in \{r, l, t, b\}$, are the corresponding pixels of $p, p'$ in the right, left, top, bottom images respectively. With the new energy, the result of the multiscopic graph cuts with the same hyper-parameters is displayed in Fig. 6(d). Compared with the stereo graph cuts displayed in Fig. 3(b), the occluded parts and the reflective tabletop are reconstructed much better and the noise is better suppressed.
In this section, we present the details of our system setup and experiments. The quantitative evaluation on the Middlebury Stereo Dataset and the qualitative test on real robot experiments are demonstrated to compare the multiscopic vision with the two-frame stereo matching.
Fig. 7: The camera is mounted on the end of a robot arm and moved horizontally and vertically to take pictures from different views.
Fig. 8: The disparity estimation results of different algorithms for two sets of images, Aloe and Lampshade from Middlebury Stereo Datasets. The first image is the reference RGB image, i.e., the left image for stereo algorithms and the center image for multiscopic algorithms. Two images are used for stereo algorithms and three images are used for multiscopic algorithms. BM denotes block matching and GC denotes graph cuts.
A. System Setup
To build the multiscopic vision system with active perception, we mount a monocular camera on the end of a robot arm, as displayed in Fig. 7. The sensor we use is an ordinary USB video camera with Sony IMX322 inside, whose resolution is . The robot arm is UR10, a collaborative industrial robot whose repeatability is
mm. UR10 has six rotating joints, so the end has 6 degrees of freedom. Thus the camera can move freely with any pose.
To capture a series of images with multiscopic structure, we command the UR10 to move the camera in its image plane, generating a series of co-planar images. After every movement of the same distance, we take one picture of the environment. Thus we can take as many images as needed, and each of these images has the same parallax with respect to its adjacent images. For example, we can take 9 images in 3 rows and 3 columns, which form a multiscopic array. Also, we can adjust the baseline as needed. For the sake of simplicity, we use five images in the real robot experiments to estimate disparity, as demonstrated in Fig. 10.
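The capture layout described above can be sketched as a list of in-plane camera offsets relative to the center view. The function names are hypothetical, and the sketch deliberately stops at the offsets: converting them into robot poses depends on the hand-eye calibration and the robot interface, neither of which is specified here.

```python
def multiscopic_offsets(rows, cols, baseline_mm):
    """Camera offsets (dx, dy) in mm within the image plane, relative to the
    centre view, for a rows x cols grid with equal spacing between views."""
    if rows % 2 == 0 or cols % 2 == 0:
        raise ValueError("rows and cols must be odd so that a centre view exists")
    return [
        ((c - cols // 2) * baseline_mm, (r - rows // 2) * baseline_mm)
        for r in range(rows)
        for c in range(cols)
    ]


def cross_offsets(baseline_mm):
    """Centre view plus the four axis-aligned views of a five-image setup."""
    b = baseline_mm
    return [(0.0, 0.0), (-b, 0.0), (b, 0.0), (0.0, -b), (0.0, b)]
```

For instance, `multiscopic_offsets(3, 3, 20)` enumerates a 3 × 3 array row by row with the center view in the middle, and `cross_offsets(20)` lists the five-image layout with a 20 mm baseline.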
To evaluate the performance of our multiscopic vision system, we conduct a quantitative evaluation on the Middlebury Stereo Dataset 2006 [27], which contains calibrated and rectified image sequences for depth estimation. We use three adjacent images to compute the disparity in our multiscopic vision system. Note that we can use our active perception system to capture more images and do multiscopic matching with five or even more images.
B. Evaluation on Middlebury
The images in the Middlebury Stereo Dataset are well calibrated and rectified, so the dataset can quantitatively show the improvement of multiscopic matching without the influence of image calibration error. Since this dataset only contains images captured along the horizontal direction, we choose only three images, view 0, view 1, and view 2, as the left image, center image, and right image for the multiscopic vision system. The baseline between view 1 and view 0 or view 2 is 40 mm. Thus with the center image as reference, there are two cost volumes to be combined. One is between the left image and the center image, and the other one is between the right image and the center image. Because these images are rectified, the two costs can be fused directly by taking the smaller one according to Equ. 9.
Fig. 9: The visual comparison between stereo matching graph cuts (I) and multiscopic graph cuts (II) in noisy, occluded and reflective areas.
We randomly choose two sets of images from the Middlebury Stereo Dataset, Aloe and Lampshade, and present their reconstruction results using stereo block matching, stereo graph cuts, multiscopic block matching and multiscopic graph cuts in Fig. 8 without any post-processing. The maximum searching disparity for Aloe and Lampshade is set to 60 and the minimum is set to 1. The resolution of these two sets of images are around and the block size for block matching is set to 11. The occlusion penalty K is set to 10 and the smoothness parameters
are set to 9, 3, 8, 5 respectively. We can visually see from these two sets of results that using three images to do the matching can reduce the noise and reconstruct the occluded parts better.
Fig. 10: The disparity estimation results of different algorithms for a reflective workpiece (zoom in). The multiscopic algorithms use 5 images to do the correspondence searching.
TABLE I: Matching Results of Middlebury Datasets
For the Middlebury Dataset, we also include a baseline called MC-CNN [25] that measures the similarity of image patches with convolutional neural networks, and applies cross-based cost aggregation and semi-global matching. The pre-trained accurate Middlebury network model of MC-CNN is used in our experiments. For multiscopic matching, we fuse the cost volume between view 1 and view 0 and the cost volume between view 1 and view 2 according to the minimum rule.
Then we use five metrics to evaluate the matching results, summarized in TABLE I. RMS is the root-mean-square error, AvgErr is the average absolute error, Bad0.5 is the percentage of "bad" pixels whose error is greater than 0.5, and Bad1 and Bad2 use thresholds of 1 and 2 respectively. It can be seen from these five metrics that the multiscopic framework improves the correspondence matching considerably even with only three images. Averaged over 21 image sequences, the average absolute error decreases by 58.2% and the root-mean-square error by 50.2%. Even in the worst case there is around 15% improvement.
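The five metrics can be computed as in the following sketch. The function name and the `valid` mask are assumptions: which pixels are included in the evaluation (for example, all pixels or only those with known ground truth) is not stated above, so the mask defaults to pixels with finite ground-truth disparity.

```python
import numpy as np


def disparity_metrics(pred, gt, valid=None):
    """RMS, AvgErr and Bad-x percentages between predicted and true disparity."""
    pred = np.asarray(pred, dtype=np.float64)
    gt = np.asarray(gt, dtype=np.float64)
    if valid is None:
        valid = np.isfinite(gt)              # assumed definition of valid pixels
    err = np.abs(pred[valid] - gt[valid])
    return {
        "RMS": float(np.sqrt(np.mean(err ** 2))),
        "AvgErr": float(np.mean(err)),
        "Bad0.5": float(100.0 * np.mean(err > 0.5)),
        "Bad1": float(100.0 * np.mean(err > 1.0)),
        "Bad2": float(100.0 * np.mean(err > 2.0)),
    }
```

A relative improvement such as the percentages quoted above would then be `(stereo - multiscopic) / stereo` for the metric of interest, averaged over the image sets.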
C. Real Robot Experiments
The images captured by our system are not perfectly calibrated and rectified, so there is more noise in the correspondence matching. In our experiments on real robots, we first capture one center image and then capture four surrounding images from the left, right, top and bottom views. The first example, a toy, is presented in Fig. 3 and Fig. 6 and another example is presented in Fig. 10.
The four costs are fused according to the heuristic rule in equation (10). The maximum and minimum searching disparities are the same for these two image sets and are set to 70 and 1. Because the alignment of these data is not perfect, there is more mismatching and noise. Thus the block size for block matching is set to 17 and the occlusion penalty K is set to 25 to encourage the correspondence matching. Other parameters are the same as for the Middlebury datasets.
The disparity maps in Fig. 6 and Fig. 10 clearly show that the multiscopic vision system reduces much of the noise in textureless areas, occluded parts, and reflective regions. The disparity estimation of the reflective metal tabletop is noisy in stereo matching but looks accurate in multiscopic matching, as displayed in Fig. 9(c). Also, reflective metal workpieces, which are common in industrial environments, can be reconstructed much better.
In this work, we propose an active perception framework for multiscopic vision. A camera mounted on the end of a robot arm is controlled to move in the image plane and take multiple pictures with the same parallax. Both the magnitude and direction of the pixel disparities are under control such that we can search for correspondences easily. Depth reconstruction with five-frame multiscopic vision is presented in real-world robot experiments. We extend stereo matching algorithms to multiscopic algorithms by fusing four cost volumes between the center frame and the surrounding frames, so the outliers in the estimated disparity map can be effectively suppressed. The evaluation on the Middlebury Stereo Dataset and the real robot experiments show that a more accurate disparity map can be obtained with our multiscopic vision system. The average absolute error is decreased by 50.2% from stereo matching to multiscopic vision. The noise is significantly reduced on occluded areas and reflective surfaces.
We hope our work with multiscopic vision can inspire more subsequent works in depth estimation and robotic applications. In the future, we can explore the fusion of multiple cost volumes with convolutional neural networks. Also, we can study different image layouts in the multiscopic vision system.
[1] J. Biswas and M. Veloso, “Depth camera based indoor mobile robot localization and navigation,” in IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2012, pp. 1697–1702.
[2] S. B. Goldberg, M. W. Maimone, and L. Matthies, “Stereo vision and rover navigation software for planetary exploration,” in Proceedings, IEEE Aerospace Conference, vol. 5. IEEE, 2002, pp. 5–5.
[3] H. Ye, Y. Chen, and M. Liu, “Tightly coupled 3d lidar inertial odometry and mapping,” in IEEE International Conference on Robotics and Automation (ICRA). IEEE, To appear, 2019.
[4] W. Yuan, K. Hang, H. Song, D. Kragic, M. Y. Wang, and J. A. Stork, “Reinforcement learning in topology-based representation for human body movement with whole arm manipulation,” in IEEE International Conference on Robotics and Automation (ICRA). IEEE, To appear, 2019.
[5] D. Scharstein and R. Szeliski, “A taxonomy and evaluation of dense two-frame stereo correspondence algorithms,” International Journal of Computer Vision, vol. 47, no. 1-3, pp. 7–42, 2002.
[6] R. Szeliski, Computer vision: algorithms and applications. Springer Science & Business Media, 2010.
[7] S. Chen, Y. Li, and N. M. Kwok, “Active vision in robotic systems: A survey of recent developments,” The International Journal of Robotics Research, vol. 30, no. 11, pp. 1343–1377, 2011.
[8] R. Bajcsy, Y. Aloimonos, and J. K. Tsotsos, “Revisiting active perception,” Autonomous Robots, vol. 42, no. 2, pp. 177–196, 2018.
[9] G. Kahn, P. Sujan, S. Patil, S. Bopardikar, J. Ryde, K. Goldberg, and P. Abbeel, “Active exploration using trajectory optimization for robotic grasping in the presence of occlusions,” in IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2015, pp. 4783–4790.
[10] S. Isler, R. Sabzevari, J. Delmerico, and D. Scaramuzza, “An information gain formulation for active volumetric 3d reconstruction,” in IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2016, pp. 3477–3484.
[11] W. Klarquist and A. Bovik, “Adaptive variable baseline stereo for vergence control,” in IEEE International Conference on Robotics and Automation (ICRA), vol. 3. IEEE, 1997, pp. 1952–1959.
[12] Y. Nakabo, T. Mukai, Y. Hattori, Y. Takeuchi, and N. Ohnishi, “Variable baseline stereo tracking vision system using high-speed linear slider,” in IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2005, pp. 1567–1572.
[13] B. Wilburn, N. Joshi, V. Vaish, E.-V. Talvala, E. Antunez, A. Barth, A. Adams, M. Horowitz, and M. Levoy, “High performance imaging using large camera arrays,” in ACM Transactions on Graphics (TOG), vol. 24, no. 3. ACM, 2005, pp. 765–776.
[14] V. Vaish, M. Levoy, R. Szeliski, C. L. Zitnick, and S. B. Kang, “Reconstructing occluded surfaces using synthetic apertures: Stereo, focus and robust measures,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), vol. 2. IEEE, 2006, pp. 2331–2338.
[15] M. Maitre, Y. Shinagawa, and M. N. Do, “Symmetric multi-view stereo reconstruction from planar camera arrays,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2008, pp. 1–8.
[16] E. H. Adelson and J. Y. A. Wang, “Single lens stereo with a plenoptic camera,” IEEE Transactions on Pattern Analysis and Machine Intelligence, no. 2, pp. 99–106, 1992.
[17] S. A. Nene and S. K. Nayar, “Stereo with mirrors,” in International Conference on Computer Vision (ICCV). IEEE, 1998, pp. 1087–1094.
[18] C. Gao and N. Ahuja, “Single camera stereo using planar parallel plate,” in Proceedings of the 17th International Conference on Pattern Recognition, 2004., vol. 4. IEEE, 2004, pp. 108–111.
[19] J. Gluckman and S. K. Nayar, “Rectified catadioptric stereo sensors,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 24, no. 2, pp. 224–236, 2002.
[20] S. Hu, Y. Matsumoto, T. Takaki, and I. Ishii, “Monocular stereo measurement using high-speed catadioptric tracking,” Sensors, vol. 17, no. 8, p. 1839, 2017.
[21] S. Forstmann, Y. Kanou, J. Ohya, S. Thuering, and A. Schmitt, “Real-time stereo by using dynamic programming,” in 2004 Conference on Computer Vision and Pattern Recognition Workshop. IEEE, 2004, pp. 29–29.
[22] H. Hirschmuller, “Stereo processing by semiglobal matching and mutual information,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 30, no. 2, pp. 328–341, 2008.
[23] J. Sun, N.-N. Zheng, and H.-Y. Shum, “Stereo matching using belief propagation,” IEEE Transactions on Pattern Analysis and Machine Intelligence, no. 7, pp. 787–800, 2003.
[24] V. Kolmogorov, P. Monasse, and P. Tan, “Kolmogorov and Zabih’s graph cuts stereo matching algorithm,” Image Processing On Line, vol. 4, pp. 220–251, 2014.
[25] J. Zbontar and Y. LeCun, “Stereo matching by training a convolutional neural network to compare image patches,” Journal of Machine Learning Research, vol. 17, pp. 1–32, 2016.
[26] S. Birchfield and C. Tomasi, “A pixel dissimilarity measure that is insensitive to image sampling,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 20, no. 4, pp. 401–406, 1998.
[27] H. Hirschmuller and D. Scharstein, “Evaluation of cost functions for stereo matching,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2007, pp. 1–8.