Multi-View Photometric Stereo: A Robust Solution and Benchmark Dataset for Spatially Varying Isotropic Materials

2020·Arxiv

Abstract

We present a method to capture both 3D shape and spatially varying reflectance with a multi-view photometric stereo (MVPS) technique that works for general isotropic materials. Our algorithm is suitable for perspective cameras and nearby point light sources. Our data capture setup is simple: it consists of only a digital camera, some LED lights, and an optional automatic turntable. From a single viewpoint, we use a set of photometric stereo images to identify surface points with the same distance to the camera. We collect this information from multiple viewpoints and combine it with structure-from-motion to obtain a precise reconstruction of the complete 3D shape. The spatially varying isotropic bidirectional reflectance distribution function (BRDF) is captured by simultaneously inferring a set of basis BRDFs and their mixing weights at each surface point. In experiments, we demonstrate our algorithm with two different setups: a studio setup for highest precision and a desktop setup for best usability. According to our experiments, under the studio setting, the captured shapes are accurate to 0.5 millimeters and the captured reflectance has a relative root-mean-square error (RMSE) of 9%. We also quantitatively evaluate state-of-the-art MVPS methods on a newly collected benchmark dataset, which is publicly available to inspire future research.

Index Terms—Photometric Stereo, Isotropy, 3D Reconstruction, BRDF Capture.

I. INTRODUCTION

Classical photometric stereo algorithms [52] estimate a per-pixel normal map from a set of images taken by a fixed camera under different lighting conditions. While they can reconstruct high-frequency geometric details, they often cause low-frequency shape distortions at a coarse scale [27]. In comparison, multi-view photometric stereo (MVPS) algorithms [10] integrate the results of photometric stereo from multiple different viewpoints. This approach can correct the low-frequency shape distortion by multi-view geometry constraints and also inherits the advantage of capturing fine details from the original photometric stereo algorithms. On the other hand, conventional photometric stereo methods typically make the following assumptions to facilitate normal estimation: (i) the camera is orthographic; (ii) the surface material is Lambertian with no cast shadows; (iii) lighting is directional (e.g., distant point light sources).

M. Li is with the College of Computer Science and Technology and C. Diao is with the Cultural Heritage Institute and School of Art and Archeology, Zhejiang University, China.

B. Shi is with National Engineering Laboratory for Video Technology, Department of Computer Science and Technology, Peking University, Beijing, China.

P. Tan is with the School of Computing Science, Simon Fraser University, Canada.

However, the real world often presents much more complicated settings with perspective cameras, nearby light sources, and more complicated materials. It is thus extremely difficult to simultaneously recover the unknown material and object shape, even under known lighting conditions. To address this challenge, sophisticated hardware such as light stages [6], coaxial lights [13], and near-field light stages [17] have been designed. Though these methods achieve highly accurate results, the setups are expensive and complicated. We design a method with a simple low-cost setup so that it can be adopted more widely. Our setup only contains a digital camera and some LED lights. We design sophisticated calibration and reconstruction algorithms to address these difficulties. Compared with the advanced setups, e.g., the coaxial lights in [13], our method achieves lower but still useful accuracy (0.5 millimeters vs. 50 microns). We hope our lightweight solution can enable casual users to perform high-quality appearance capture in the future.

The conference version of this paper [58] relaxes the Lambertian material assumption of MVPS to deal with general spatially varying isotropic materials. Specifically, we exploit reflectance symmetries such as isotropy and reciprocity to deal with those general materials. According to [28], isotropy allows us to identify ‘iso-depth contours’, i.e., pixels corresponding to surface points of equal distances to the image plane, from photometric stereo images. We collect iso-depth contours from multiple viewpoints to reconstruct the complete 3D shape. Specifically, we first apply structure-from-motion [9] to reconstruct a sparse set of 3D points. We then propagate the depths of these 3D points along iso-depth contours in each viewpoint. Each propagation generates additional 3D points, whose depths can be further propagated in a different viewpoint. A surprisingly small number of 3D points (about two hundred) can be propagated to reconstruct the complete 3D shape (about two hundred thousand points). Once the shape is fixed, we use the same set of input images to infer the spatially varying materials. We model reflectance by the Bidirectional Reflectance Distribution Function (BRDF) and assume the BRDF at each surface point is a linear combination of a few basis isotropic BRDFs, each of which is a 3D discrete table to handle general materials. The basis BRDFs and mixing weights at each point are iteratively estimated by the ACLS method [20].

ing). This extension allows us to build a compact desktop scanner about the size of a microwave oven for appearance capture, where the object is only 400 mm away from the camera and LED lights. To handle perspective cameras, we divide the image plane using a 2D grid, where each grid cell acts as the image of an orthographic camera, so that the original iso-depth contours can be evaluated safely. We also introduce a calibration method that uses a simple white board to calibrate the 3D position of each LED light and its radiance towards different directions. This sophisticated lighting model allows us to eliminate the undesired non-uniform lighting of nearby LED lights and so simplify the photometric stereo problem. Finally, to evaluate multi-view photometric stereo algorithms, we build the ‘DiLiGenT-MV’ dataset, which is a multi-view extension of the ‘DiLiGenT’ dataset in [44] for benchmarking photometric stereo algorithms. This new dataset contains images of 5 objects with complex BRDFs. The images are taken from 20 viewpoints, and in each viewpoint, 96 calibrated point light sources are used. The ‘ground truth’ shape is available for quantitative evaluation. ‘DiLiGenT-MV’ can be used to evaluate multi-view stereo methods (e.g., [41], [46]) under complex materials and lighting, or to evaluate conventional single-view photometric stereo algorithms (e.g., [2], [45]) by treating each viewpoint independently. We quantitatively evaluate recent multi-view photometric stereo algorithms to further understand their pros and cons so as to encourage further research on unsolved issues. Our main contributions are threefold:

• We propose a multi-view photometric stereo technique to work with general spatially varying isotropic materials, which allows faithful appearance capture (i.e., shape + BRDF capture).

• We relax the assumptions of an orthographic camera and distant lighting to build a simple desktop capture setup, which enables casual users to perform high-quality appearance capture.

• We present the ‘DiLiGenT-MV’ dataset with objects of complex materials and ‘ground truth’ shapes for benchmarking multi-view photometric stereo methods.

II. RELATED WORK

A. Image-based Modeling

These methods reconstruct a 3D shape and a ‘texture map’ to model objects from images. The methods in [5], [23] are two recent representative methods. Texture color at each surface point is decided according to its image projections. However, a texture map is often insufficient to represent general non-Lambertian materials.

B. Shape Scanning and Reflectance Fitting

To obtain precise 3D shape, laser scanners and structured-light patterns were used in [22], [38], and [56]. Based on a precise 3D reconstruction, parametric reflectance functions can be fitted at each surface point according to the image observations, as in [39] and [21]. These methods require precise registration between images and 3D shapes. Since different sensors are used for shape and reflectance capture, this registration is difficult and often causes artifacts in misaligned regions. Some methods [1], [27] combine reflectance recovered from photometric stereo and shape recovered from structured-light, where registration is relatively simple. However, they need to capture images under both structured-light and varying directional light at each viewpoint, which is tedious and requires a more complicated setup than ours.

C. Photometric Appearance Capture

Our method belongs to the photometric approaches that capture both shape and reflectance from the same set of images. Most previous methods, e.g., [7], [10], [24], assumed specific parametric BRDF models such as Lambert’s or Ward’s model [51]. The performance of these methods degrades when the real objects have different reflectance from the assumed model.

Some other methods employed a sophisticated hardware setup to achieve high-quality results. Ma et al. [26] and Ghosh et al. [6] used a light stage where the intensity of each LED on the stage was precisely controlled. Holroyd et al. [13] required specialized coaxial lights. This requirement of expensive and complicated hardware limits their wide application. Recently, a few algorithms [2], [12] were proposed for appearance capture by exploiting various reflectance symmetries that are valid for a broader class of objects. However, the method in [12] required up to a thousand input images at each viewpoint and [2] relied on fragile optimization. Tan et al. [50] and Chandraker et al. [4] both recovered iso-contours of depth and gradient magnitude for isotropic surfaces. Additional user interactions or boundary conditions are required to recover the 3D shape. A recent work [31] developed an uncalibrated multi-view photometric stereo method to reconstruct meshes with fine geometric details by estimating a displacement map in the 2D texture domain. However, this work does not deal with surface reflectance. Along the direction of uncalibrated methods, Lu et al. [25] recovered surface normals for isotropic surfaces. But as evaluated in [44], this method requires many more input images per viewpoint and produces a larger error.

Only a few works have dealt with perspective cameras and near-lighting effects so far. Under these settings, the photometric stereo problem becomes nonlinear even with the basic Lambertian model. Extra information is often incorporated to solve this challenge. Tigo et al. [11] employed a sparse depth map to simplify the computation and Xie et al. [53] overcame the difficulty by utilizing a mesh deformation technique. There are also methods to deal with the near-light effects of point lights, e.g., [11], [15], [29], [30]. However, these works are all based on Lambert’s model and only recover a normal map. Recently, differential photometric stereo methods [33], [34] were proposed to handle the perspective and near-light effects by using nonlinear PDEs, which leads to complex optimization for normal and depth estimation.

The work closest to our method is [2]. Both methods are built upon the reflectance symmetry embedded in ‘isotropic pairs’ introduced in [49]. There are three key differences between our method and [2]. First, we reconstruct a complete 3D shape rather than a single-view normal map. Second, we combine multi-view geometry and photometric cues to avoid fragile iterative optimization of shape and reflectance. Third, our method works with general tri-variant isotropic BRDFs under perspective projection, while [2] assumed bi-variant BRDFs and orthographic projection to simplify the optimization.

Our work is also related to BRDF acquisition methods such as [36], [55]. These methods are only applicable to near-flat surfaces where the surface normals are known beforehand. Our method can be considered as a generalization of these methods to non-planar surfaces.

D. Datasets for 3D reconstruction

The first common benchmark, the Middlebury dataset, was proposed by Seitz et al. [42] for evaluating multi-view stereo on equal grounds; it contains only two scenes with Lambertian surfaces. Later, Strecha et al. [48] proposed a new MVS benchmark dataset including 6 outdoor scenes with up to 30 higher-resolution images and ‘ground truth’ shapes captured by a laser scanner. This dataset covers well-textured scenes, though the online benchmark service is no longer available. To compensate for the lack of diversity in [42], [48], Jensen et al. [16] published a number of real-world objects, which are still limited in the variety of scenes and viewpoints. Knapitsch et al. [19] and Schöps et al. [41] provided the latest challenging datasets for indoor and outdoor scenes with high-resolution video data and ‘ground truth’ measurements obtained with a laser scanner, focusing on evaluating binocular stereo, multi-view stereo, or structure-from-motion. Yet, their objects share the same limitation as those in earlier datasets, i.e., lacking diversity in both surface reflectance and viewpoints.

Existing multi-view stereo datasets are limited in the variation of lighting conditions and challenging BRDFs, which is the key issue in photometric stereo. Shi et al. [44] proposed the ‘DiLiGenT’ dataset for single-view photometric stereo. We extend it to the multi-view setup and release the ‘ground truth’ 3D shape as well as camera calibration information to facilitate future research on MVPS.

III. SHAPE AND REFLECTANCE RECONSTRUCTION

We provide a block diagram of our system in Figure 1. We capture images from multiple viewpoints, and at each viewpoint, we capture photometric stereo images under different lighting conditions. Our algorithm robustly identifies iso-depth contours from these images at each viewpoint. In parallel, we apply a structure-from-motion algorithm [9] to images from different viewpoints to reconstruct a sparse set of 3D points. We then derive a complete 3D shape by propagating the depths of these points along the dense iso-depth contours. This initial shape is further refined according to the method described in [27]. Once the shape is fixed, we estimate a set of basis isotropic BRDFs and their mixing weights at each surface point by the ACLS method [20] to model the surface reflectance. In the following, we first describe the method for an orthographic camera and distant lighting in Section III-A and Section III-B. We then relax the distant lighting and orthographic camera assumptions in Section III-D and Section III-E respectively.

A. Basic Iso-depth Contour Estimation

Assuming an orthographic camera and directional lighting, Alldrin and Kriegman [28] observed that isotropy allows almost trivial estimation of iso-depth contours in the absence of global illumination effects such as shadows and inter-reflections. We generalize this algorithm to make it more robust on real data than the naïve approach described in [28]. Specifically, we relax the assumption about lighting (i.e., precisely located on a view-centered circle as in [28]) and propose a method to enhance robustness to global illumination effects.

When the light moves on a view-centered circle, the plane spanned by the viewing direction and the surface normal direction of an isotropic surface point can be recovered precisely according to the symmetry of the observed pixel intensity profile. In the camera local coordinate system, where the z-axis is aligned with the viewing direction, this plane gives the azimuth angle of the surface normal, which is the angle between the x-axis and the projection of the normal in the xy-plane. For easier reference, we refer to this direction of a projected surface normal as the azimuth direction in this paper. The details of the azimuth direction computation are in Section 4.1 of the conference version of this paper [58].

Tracing Contours. Once an azimuth direction is computed at each pixel, we proceed to generate iso-depth contours. Starting from every pixel, we iteratively trace along the two directions perpendicular to the azimuth direction with a step of 0.1 pixel. Specifically, suppose the estimated azimuth angle is $\theta$ at a pixel $x$. We trace along the two 2D directions

$$d_{+} = (-\sin\theta, \cos\theta), \qquad d_{-} = (\sin\theta, -\cos\theta)$$

to $x_{+} = x + 0.1\,d_{+}$ and $x_{-} = x + 0.1\,d_{-}$. We then replace $d_{+}$ and $d_{-}$ according to the azimuth angles of $x_{+}$ and $x_{-}$ respectively and continue to trace. We stop tracing when the maximum number of iterations is reached (500 in our experiments). Pixels on one traced curve should have the same distance to the image plane. To avoid tracing across discontinuous surface points, we use the method described in the ‘NPR camera’ [35] to identify depth discontinuities. Further, we define a confidence measure for these traced contours as the inverse of the maximum curvature along them. Intuitively, smoother contours with relatively small curvature are more reliable.
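The tracing step can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `azimuth_at` is a hypothetical callback that returns the azimuth angle (in radians) at a sub-pixel location, or `None` outside the object; depth-discontinuity handling and the confidence measure are omitted.

```python
import math

def trace_contour(start, azimuth_at, step=0.1, max_iters=500):
    """Trace a curve through `start` that stays perpendicular to the azimuth direction.

    `azimuth_at(x, y)` is a hypothetical callback (not from the paper) returning
    the azimuth angle in radians, or None where no estimate is available.
    """
    def walk(sign):
        x, y = start
        points = []
        for _ in range(max_iters):
            theta = azimuth_at(x, y)
            if theta is None:
                break
            # (-sin t, cos t) is perpendicular to the azimuth direction (cos t, sin t);
            # `sign` selects which of the two opposite directions is followed.
            x += step * sign * -math.sin(theta)
            y += step * sign * math.cos(theta)
            points.append((x, y))
        return points

    backward = walk(-1.0)
    forward = walk(+1.0)
    return backward[::-1] + [tuple(start)] + forward
```

For a sphere-like surface whose projected normals point radially outwards, `trace_contour((10.0, 0.0), lambda x, y: math.atan2(y, x))` follows a circle around the origin, with a small outward drift caused by the fixed-step integration.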

B. Multi-View Depth Propagation

Fig. 1. System pipeline. We recover iso-depth contours from photometric stereo images and recover a sparse 3D point cloud by structure-from-motion. In the figure showing iso-depth contours, the gray intensity encodes the estimated azimuth angles, and the colored curves are iso-depth contours. We then propagate the depths of these 3D points along the iso-depth contours to recover the complete 3D shape. Once the shape is fixed, we estimate the spatially varying BRDF from the original input images.

Fig. 2. We propagate the depth of x to the iso-depth contour segment that passes through its projection in the i-th view. This propagation generates new 3D points, e.g., y, whose depths in other images can also be propagated along their corresponding iso-depth contours.

A standard structure-from-motion algorithm such as [23], [47] can reconstruct a set of sparse 3D points on the object. We capture experiment objects on a turntable with a checkerboard pattern to ensure sufficient feature matching for textureless examples. Since structure-from-motion algorithms could be affected by moving highlights, we compute a median image at each viewpoint by taking the median intensity of each pixel and use these images for feature matching. Reconstructed 3D points are combined with the traced iso-depth contours to recover the complete 3D shape.
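The per-viewpoint median image is simple to compute. The following is a minimal sketch, assuming grayscale images stored as nested lists of equal size (a representation chosen here for illustration only):

```python
from statistics import median

def median_image(images):
    """Per-pixel median over a stack of equally sized grayscale images.

    `images` is a list of 2D nested lists, one per lighting condition
    (an illustrative data layout, not the one used by the authors).
    """
    height, width = len(images[0]), len(images[0][0])
    return [[median(img[r][c] for img in images) for c in range(width)]
            for r in range(height)]
```

Because a moving highlight affects a given pixel in only a few of the lighting conditions, the median suppresses it while keeping the underlying texture for feature matching.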

Depth Propagation. As illustrated in Figure 2, given a reconstructed 3D point x, we project it to all images where it is visible. Suppose an iso-depth contour $C$ goes through its projection in the i-th image. We perform a depth propagation to assign the depth of x to all pixels on $C$ (if the depth of a pixel on $C$ is already known, we keep it unchanged). This propagation generates new 3D points, whose depths can be propagated in other images too. We begin with a sparse set of 3D points P reconstructed by structure-from-motion. Depth propagation with P in all images generates a large set of new 3D points $P'$. We then replace P by $P'$ and apply depth propagation iteratively. We keep iterating until $P'$ is empty. Note that when dealing with perspective cameras in Section III-E, we take each sub-divided cell as an individual orthographic camera.
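The basic iteration can be sketched as follows. This is a toy abstraction, not the authors' implementation: `contour_of` and `reproject` are hypothetical callbacks standing in for the traced contours and the camera geometry, and the point sorting and visibility check described next are left out.

```python
def propagate_depths(seeds, contour_of, reproject):
    """Iteratively spread depths along iso-depth contours across views.

    seeds:      dict {(view, pixel): depth} from structure-from-motion.
    contour_of: hypothetical callback (view, pixel) -> pixels on the contour
                through that pixel in the same view.
    reproject:  hypothetical callback (view, pixel, depth) ->
                iterable of ((other_view, other_pixel), other_depth).
    """
    depth = dict(seeds)
    frontier = dict(seeds)              # plays the role of the set P
    while frontier:
        new_points = {}                 # points generated in this round
        for (view, pixel), d in frontier.items():
            for q in contour_of(view, pixel):
                if (view, q) in depth:  # keep already-known depths unchanged
                    continue
                depth[(view, q)] = d
                # The new 3D point can seed propagation in other views.
                for key, d_other in reproject(view, q, d):
                    if key not in depth:
                        depth[key] = d_other
                        new_points[key] = d_other
        frontier = new_points
    return depth
```

The loop terminates because every round only adds pixels that had no depth before, and the number of pixels is finite.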

Direct application of the algorithm described above will generate poor results. There are a few important issues which must be addressed for robust 3D reconstruction.

Point Sorting. We sort all points in P according to the confidence of their associated iso-depth contours. Note that if a point is visible in K different views, it is repeated K times in P and each repetition is associated with an iso-depth contour in one view. At each iteration, we only select the half of the points in P with the highest confidence for depth propagation. We then remove those selected points, and insert the newly generated points into the sorted set P for the next iteration.

Visibility Check. We should not propagate the depth of a 3D point in an image where it is invisible. However, the visibility information is missing for 3D points generated by propagation. So we apply a consistency check when propagating the depth of a 3D point x to a contour C. We check pixels on C one by one, starting from the projection of x to the two ends of C. If a pixel p fails the check, we truncate C at p, and only assign the depth of x to pixels on the truncated contour. If the updated contour is too short (less than 5 pixels in our implementation), we do not propagate.
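The truncation rule can be illustrated with a short sketch. Here `is_consistent` is a hypothetical predicate standing in for the consistency test, and the contour is simply a list of pixels with the projection of x at index `start`.

```python
def truncate_contour(contour, start, is_consistent, min_length=5):
    """Keep the run of consistent pixels around index `start`.

    Walks from the projection of the 3D point towards both ends of the
    contour and cuts at the first pixel that fails `is_consistent`
    (a hypothetical predicate). Returns [] when fewer than `min_length`
    pixels survive, meaning no propagation is performed.
    """
    lo = start
    while lo - 1 >= 0 and is_consistent(contour[lo - 1]):
        lo -= 1
    hi = start
    while hi + 1 < len(contour) and is_consistent(contour[hi + 1]):
        hi += 1
    kept = contour[lo:hi + 1]
    return kept if len(kept) >= min_length else []
```

For example, with a 12-pixel contour, `start=6`, and pixels 2 and 10 failing the check, the surviving segment is pixels 3 to 9.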

To evaluate consistency at a pixel p, we assign it the depth of x to determine its 3D position. We then use the surface normal of x to select the L (L = 7 in our implementation) most fronto-parallel views where x is visible. We assume p is visible in all these L images and check the consistency of the azimuth angles at its projections. The azimuth angles at corresponding pixels in two different views uniquely decide a 3D normal direction. If different combinations of these L views all lead to consistent 3D normals (the angle between any two normals is within T degrees), we consider p as consistent. Otherwise, we discard the one view that leads to the largest number of inconsistent normals and check consistency with the remaining views iteratively. We consider p consistent if it is consistent over at least 3 views. Otherwise, it is inconsistent. For each consistent 3D point, we set its normal as the mean of all consistent normals. In our implementation, we begin with T = 3, and relax it by 1.3 times whenever the set of newly generated points is empty, until T > 15.
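One way to see why two azimuth angles determine a normal: in each view, the azimuth angle defines a plane containing the viewing axis and the normal, and the normal lies on the intersection of the two planes. The sketch below assumes each camera is described by a rotation matrix whose rows are the camera x, y, and z axes in world coordinates (a convention chosen here, not taken from the paper), and resolves the sign using the azimuth direction of the first view.

```python
import math

def cross(a, b):
    return [a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0]]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def azimuth_direction(R, theta):
    """World-space direction of the projected normal for azimuth angle theta."""
    x_axis, y_axis, _ = R
    return [math.cos(theta) * x_axis[i] + math.sin(theta) * y_axis[i]
            for i in range(3)]

def normal_from_two_azimuths(R1, theta1, R2, theta2):
    a1 = azimuth_direction(R1, theta1)
    a2 = azimuth_direction(R2, theta2)
    m1 = cross(R1[2], a1)          # normal of the plane {view axis 1, a1}
    m2 = cross(R2[2], a2)          # normal of the plane {view axis 2, a2}
    n = cross(m1, m2)              # the surface normal lies in both planes
    length = math.sqrt(dot(n, n))
    if length < 1e-12:
        raise ValueError("the two azimuth planes coincide")
    n = [c / length for c in n]
    # Orient n so that its projection in view 1 points along the azimuth direction.
    return n if dot(n, a1) >= 0 else [-c for c in n]
```

Normals computed from different pairs of views can then be compared against the angular threshold T.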

We record the number of consistent views for each 3D point when inserting it into the set of newly generated points. Points are first sorted by the number of consistent views in descending order. Those with the same number of consistent views are sorted by the confidence of their contours.

C. Shape Optimization

After depth propagation, we have a set of 3D points, each with a normal direction estimated. We apply the Poisson surface reconstruction [18] to these points to obtain a triangulated surface. This surface is further optimized according to [27] by fusing the 3D point positions and their normal directions.

D. Near Light Effects

In practice, we often use LED lights as light sources, which are nearby point lights leading to different illumination directions at different pixels. The lighting intensity is non-uniform for an LED light too. An LED light has different emission radiance towards different directions, which further falls off along those directions. We address these problems to make the algorithm more practical.

Lighting Directions. For the setups in our experiments, we calibrate the precise 3D positions of the light sources to compute spatially variant lighting directions at each pixel. Calibration details are provided in Section V-B. After calibrating light source positions, we take the average depth of an object (computed from the reconstructed sparse 3D points in Section III-B) to estimate an approximate 3D position of each pixel. The lighting directions at each pixel are then computed according to the 3D positions of that pixel and the light sources.
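As a small illustration of the per-pixel lighting direction, the sketch below assumes an orthographic camera whose pixel coordinates have already been converted to the same metric units as the light position (an assumption made for brevity), and places every pixel at the average object depth.

```python
import math

def lighting_direction(pixel_xy, mean_depth, light_pos):
    """Unit vector from the approximate surface point towards a point light.

    pixel_xy:   (x, y) of the pixel in metric camera coordinates (assumed).
    mean_depth: average object depth from the sparse 3D points.
    light_pos:  calibrated 3D position of the light source.
    """
    point = (pixel_xy[0], pixel_xy[1], mean_depth)
    d = [light_pos[i] - point[i] for i in range(3)]
    norm = math.sqrt(sum(c * c for c in d))
    return [c / norm for c in d]
```

With a light at (30, 40, 0) and a mean depth of 120, the pixel at (0, 0) sees the direction (3/13, 4/13, -12/13), while the pixel at (30, 40) sees (0, 0, -1), which shows how strongly the direction can vary across the image for a nearby light.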

We interpolate observations under lighting directions lying on a view-centered circle, and compute the azimuth angle from these interpolated observations. More details of this process are in Section 4.1 of the conference version of this paper [58].

Lighting Intensities. An LED light often has non-uniform intensity towards different orientations, which needs to be calibrated for high-precision reconstruction. Furthermore, the lighting intensity is inversely proportional to the square of the distance the light travels. We calibrate a 3D field of lighting intensities for each LED light, which is referred to as the lighting intensity volume in the rest of this paper. Calibration details are provided in Section V-B. After calibration, we normalize all observations to the same lighting intensity, i.e., we divide each observed pixel intensity by the lighting intensity at its 3D position. However, we have no knowledge of the accurate 3D position at this stage, so we assume the observations lie on the same plane with an approximately constant depth.

Fig. 3. Left: an image plane I divided into an array of rectangular cells, each of which is considered as an orthographic camera. Right: black axes are the coordinates of the original perspective camera. Red and blue coordinates are those of the two orthographic cameras indicated by the red and blue frames in the left image.

Fig. 4. Definition of $(\theta_h, \theta_d, \phi_d)$, with which an isotropic BRDF is defined as a 3D lookup table.

E. Perspective Camera Effects

When the camera is nearby, we need to consider the perspective effects of camera projection. In other words, every pixel should have a different viewing direction. To address this problem, we divide the image plane to a 2D array of cells, typically , as shown in the left of Figure 3. Each cell can be considered as an individual camera with much smaller field-of-view (FoV), and this camera can be well approximated as an orthographic camera.

The extrinsic parameters of these sub-divided orthographic cameras can be easily computed by rotating the original perspective camera. Specifically, as indicated in the right of Figure 3, the principal axis $Z'$ of a sub-divided orthographic camera is simply the view ray passing through its cell center. Suppose the principal axis of the original perspective camera is $Z$. We define a rotation matrix $R$, which is the minimum rotation that rotates $Z$ to $Z'$. We apply $R$ to the original camera axes $X$, $Y$, $Z$ to obtain the axes of the orthographic camera. The iso-depth contours in each sub-divided orthographic camera are in different image planes, as the $X'$ and $Y'$ axes are different for different sub-dividing cells.

Note that the observed image of a sub-divided orthographic camera is generated by applying a homography to the original image. This homography can be easily computed from the intrinsic parameters and the rotation matrix.
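The rotation and homography can be sketched as follows. The paper does not spell out its formulas, so this example makes two assumptions: the minimum rotation is built with Rodrigues' formula, and the homography is the standard pure-rotation form $K R^{\top} K^{-1}$ for a pinhole camera with focal length `f` and principal point `(cx, cy)`. All function names are illustrative.

```python
import math

def normalize(v):
    n = math.sqrt(sum(c * c for c in v))
    return [c / n for c in v]

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def matvec(A, v):
    return [sum(A[i][k] * v[k] for k in range(3)) for i in range(3)]

def transpose(A):
    return [list(row) for row in zip(*A)]

def min_rotation(a, b):
    """Smallest rotation taking unit vector a to unit vector b (requires a != -b)."""
    v = [a[1] * b[2] - a[2] * b[1],
         a[2] * b[0] - a[0] * b[2],
         a[0] * b[1] - a[1] * b[0]]
    c = sum(x * y for x, y in zip(a, b))
    vx = [[0.0, -v[2], v[1]], [v[2], 0.0, -v[0]], [-v[1], v[0], 0.0]]
    vx2 = matmul(vx, vx)
    return [[(1.0 if i == j else 0.0) + vx[i][j] + vx2[i][j] / (1.0 + c)
             for j in range(3)] for i in range(3)]

def cell_camera(f, cx, cy, u, v):
    """Rotation, principal axis, and homography for the cell centered at pixel (u, v)."""
    K = [[f, 0.0, cx], [0.0, f, cy], [0.0, 0.0, 1.0]]
    K_inv = [[1.0 / f, 0.0, -cx / f], [0.0, 1.0 / f, -cy / f], [0.0, 0.0, 1.0]]
    z_new = normalize(matvec(K_inv, [u, v, 1.0]))   # view ray through the cell center
    R = min_rotation([0.0, 0.0, 1.0], z_new)        # rotates Z onto the new axis
    H = matmul(matmul(K, transpose(R)), K_inv)      # assumed pure-rotation homography
    return R, z_new, H
```

Under these assumptions the homography maps the cell center to the principal point of the rotated camera, so each cell is viewed along its own central ray.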

F. Reflectance Capture

We assume the surface reflectance can be represented by a linear combination of several (K = 2) basis isotropic BRDFs. Once the 3D shape is reconstructed, we follow [20] to estimate the basis BRDFs and their mixing weights at each point on the surface. We consider the general tri-variant isotropic BRDF, which is a function of $(\theta_h, \theta_d, \phi_d)$ as shown in Figure 4. We discretize $\theta_h$, $\theta_d$, and $\phi_d$ into 90, 2, and 5 bins respectively, all in the interval $[0, \pi/2]$. Please refer to [37] for a justification of choosing this interval. Hence, a BRDF is represented as a vector by concatenating its values at these bins.
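To make the vector representation concrete, the sketch below maps three BRDF angles to a position in a 90 x 2 x 5 = 900-element vector. It assumes the common half-angle parameterization with angles named `theta_h`, `theta_d`, and `phi_d`, uniform bins, and a shared upper bound of pi/2; the bin ordering is an arbitrary choice made here.

```python
import math

BINS = (90, 2, 5)          # bins for theta_h, theta_d, phi_d
UPPER = math.pi / 2        # assumed common upper bound of all three angles

def brdf_bin_index(theta_h, theta_d, phi_d, bins=BINS, upper=UPPER):
    """Flattened index of the (theta_h, theta_d, phi_d) bin in a 900-vector."""
    index = 0
    for angle, count in zip((theta_h, theta_d, phi_d), bins):
        # Uniform binning; the upper bound itself falls into the last bin.
        i = min(int(angle / upper * count), count - 1)
        index = index * count + i
    return index
```

Each observation of a surface point under one light then contributes one entry of that point's row in the observation matrix, at the column given by this index.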

We build an observation matrix V, and factorize it into a matrix of mixing weights W and a matrix of basis BRDFs H as

$$V_{N \times M} = W_{N \times K}\, H_{K \times M}.$$

Here, M = 900 is the dimension of a BRDF and N is the number of 3D points. Each row of V represents the observed BRDF of a surface point. In constructing the matrix V, we avoid pixels observed from slanted viewing directions (the angle between viewing direction and surface normal is larger than 40 degrees in our implementation), where a small shape reconstruction error can cause a big change in their projected image positions. V contains missing elements because of incomplete observation. We apply the Alternating Constrained Least Squares (ACLS) algorithm [20] to iteratively compute the rows of W and columns of H.
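The alternating structure of such a factorization with missing data can be illustrated with a simplified stand-in. The sketch below is not the ACLS algorithm of [20]: it alternates plain least-squares fits over the observed entries for K = 2, clamps negative values to zero, and adds a small ridge term for numerical stability. The function names and the `mask` layout are assumptions of this example.

```python
def fit_pair(basis, values, observed, ridge=1e-6):
    """Least-squares x (2 entries, clamped at 0) with values[k] ~ x . basis[k]
    over the entries where observed[k] is true."""
    a00 = a11 = ridge
    a01 = b0 = b1 = 0.0
    for (u0, u1), v, seen in zip(basis, values, observed):
        if seen:
            a00 += u0 * u0
            a01 += u0 * u1
            a11 += u1 * u1
            b0 += u0 * v
            b1 += u1 * v
    det = a00 * a11 - a01 * a01          # positive thanks to the ridge term
    x0 = (b0 * a11 - b1 * a01) / det
    x1 = (a00 * b1 - a01 * b0) / det
    return [max(x0, 0.0), max(x1, 0.0)]

def factorize(V, mask, W0, iters=20):
    """Approximate V (N x M) by W (N x 2) times H (2 x M) on observed entries."""
    n_rows, n_cols = len(V), len(V[0])
    W = [list(w) for w in W0]
    H = [[0.0] * n_cols for _ in range(2)]
    for _ in range(iters):
        for j in range(n_cols):          # update column j of H with W fixed
            column = [V[i][j] for i in range(n_rows)]
            seen = [mask[i][j] for i in range(n_rows)]
            H[0][j], H[1][j] = fit_pair(W, column, seen)
        basis = list(zip(H[0], H[1]))
        for i in range(n_rows):          # update row i of W with H fixed
            W[i] = fit_pair(basis, V[i], mask[i])
    return W, H
```

Entries of V that were never observed are simply skipped in both updates, and the product of the recovered W and H fills them in.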

To further improve reflectance capture accuracy, we first compute H from a subset of precisely reconstructed 3D points, whose reconstructed normals from different combinations of azimuth angles are consistent within 1.5 degrees. We then fix H and compute W at all surface points.

IV. MULTI-VIEW PHOTOMETRIC STEREO DATASET

In this section, we introduce the ‘DiLiGenT-MV’ dataset. It includes five objects: BEAR, BUDDHA, COW, POT2, and READING as shown in Figure 5. None of the existing multi-view benchmark datasets are suitable for evaluating our method since they have various limitations: simple reflectance [19], [41], a limited number of viewpoints [42], or inadequate lighting variations [16]. The ‘DiLiGenT’ dataset in [44] captures objects with complex reflectances under large illumination changes, but it only has a single viewpoint for each object. This motivates us to extend ‘DiLiGenT’ to multiple viewpoints. Our dataset is captured under the same configuration as ‘DiLiGenT’ except that we provide images illuminated by 96 different lights from 20 different viewpoints. We release the normal maps from all 20 viewpoints as well as the scanned 3D shape. Please refer to [44] for details of image capture procedure, lighting and camera calibration, scanning and alignment of ‘ground truth’ shape. The data is available for download at: https://sites.google.com/site/photometricstereodata/.

V. DATA CAPTURE SETUPS

We build two different data capture setups: a studio scanner setup and a desktop scanner setup. The studio and desktop scanner setups use automatically blinking LED lights synchronized with a video camera to speedup data capture. The studio and desktop scanner setups are shown in the left and middle of Figure 6 respectively. In the studio and desktop setups, the testing object is about and 400 millimeters away from the camera respectively. Therefore, we have to consider perspective camera effects (as in Section III-E) and near light effects (as in Section III-D), especially for the desktop setup.

The studio setup uses a PointGrey Grasshopper camera, which captures linear images at about resolution.

Fig. 5. Example images of five objects from different views (4 out of 20) in ‘DiLiGenT-MV’. From top to bottom: BEAR, BUDDHA, COW, POT2, READING.

The desktop setup uses a cheaper linear industry camera at resolution. We capture images viewpoint by viewpoint. After capturing images at one viewpoint, we manually rotate the object to capture the next viewpoint. The desktop scanner setup further uses an automatic turntable to automate this rotation, making the whole data capturing process automatic.

A. Capture Setups

A Studio Scanner Setup. As shown in the left of Figure 6, 72 LEDs are uniformly distributed on two concentric circles of diameter 400 and 600 millimeters respectively. A video camera is mounted at the center of these circles, facing the direction perpendicular to the board. The camera is synchronized with the LED lights such that at each video frame, there is only one light on. At each viewpoint, we captured 30 images with different lighting directions in 12 seconds (at 2.5 fps). (Please refer to Section VI-B for a justification of the number of images per viewpoint.)

A Desktop Scanner Setup. We also build a desktop scanner with a ring of LED lights as shown in the middle of Figure 6. The design is almost the same as that of the studio setup, with a smaller footprint. The diameters of the two LED circles are 150 and 300 millimeters respectively. The object is usually 400 millimeters away from the camera, such that effects of perspective cameras and point light sources must be considered. After capturing images at one viewpoint, the turntable will rotate 10 degrees to capture the next viewpoint.

Fig. 6. Left: the studio scanner setup. Middle: the desktop scanner setup. Our devices consist of a video camera, two circles of LED lights, and an automatic turntable (for the desktop setup). The object to camera distance is millimeters and 400 millimeters for the studio and desktop setup respectively. We model the perspective camera projection and near point lights for the desktop setup. Right: we can calibrate one parameter to determine 3D position of the LED lights with known radius of the underlying circles.

Fig. 7. Top view of the calibration setup. We capture a diffuse board at several known positions (black lines) to calibrate camera vignetting and lighting intensity. Some additional boards (blue lines) are used to calibrate the reference angle of the first LED light.

Fig. 8. Color coded lighting intensity captured by our method. Left: lighting intensity at a plane which is about 1000 millimeters away from the light. Right: lighting intensity of a plane 500 millimeters away.

B. Calibration

We assume an orthographic camera model for the studio setup to simplify the computation, and apply the sub-division scheme in Section III-E to model perspective camera effects only for the desktop scanner setup.

Lighting Directions. For the studio and desktop scanner setups, since the LEDs are uniformly distributed on circles with known radius, we only need to calibrate one parameter to determine their precise 3D positions: the reference angle of the first LED light, as shown in Figure 6. To calibrate this reference angle, we capture a diffuse board at some slanted positions (indicated as blue lines in Figure 7) and compute the azimuth angle of the board’s normal direction. The computed azimuth angle should equal the sum of the true azimuth angle and the reference angle. The true azimuth angle can be computed separately, by computing the 3D position of the board from a checkerboard calibration pattern. Hence, we can obtain the reference angle by subtracting the true azimuth angle from the initial estimated azimuth angle. When computing azimuth angles, we used a Delaunay-triangulation-based interpolation method as introduced in the conference version [58]. In the following, we describe the details of calibrating intensities of light sources for the studio and desktop scanners.
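The one-parameter calibration described above can be illustrated with a minimal sketch. It assumes the LEDs lie in the plane z = 0 with the camera at the origin; the function names and the number of LEDs per circle are hypothetical and not taken from the paper.

```python
import numpy as np

def wrap_angle(a):
    """Wrap an angle in radians to the interval [-pi, pi)."""
    return (a + np.pi) % (2.0 * np.pi) - np.pi

def reference_angle(estimated_azimuth, true_azimuth):
    """Reference angle of the first LED: estimated azimuth minus true azimuth."""
    return wrap_angle(estimated_azimuth - true_azimuth)

def led_positions(radius_mm, num_leds, ref_angle):
    """3D positions of LEDs spaced uniformly on a circle of known radius.

    Assumption: the circle lies in the plane z = 0 and is centered on the camera.
    """
    k = np.arange(num_leds)
    ang = ref_angle + 2.0 * np.pi * k / num_leds
    return np.stack(
        [radius_mm * np.cos(ang), radius_mm * np.sin(ang), np.zeros(num_leds)],
        axis=1,
    )
```

With several slanted boards, one estimate of the reference angle is obtained per board; how these estimates are combined is not specified here.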

Lighting Intensities. To calibrate lighting intensities, we capture a diffuse board roughly parallel to the image plane at multiple depths as shown in Figure 7, where the black lines indicate the board positions in a bird’s-eye view. The nearest board and the farthest board enclose the lighting intensity volume. A checkerboard calibration pattern is printed at the four corners of the board, such that its 3D position can be computed. Assume the board is Lambertian with unit albedo (we used the X-Rite white balance board in our experiments). At each point, the observed pixel intensity $I$ should be $I = V\,\mathbf{n}^\top\mathbf{l}$, where $\mathbf{l}$, $\mathbf{n}$ are the local lighting and normal directions, and $V$ is the light intensity. Hence, we can capture $V$ at each point on the board as $V = I / (\mathbf{n}^\top\mathbf{l})$. We linearly interpolate these captured values to obtain the result in a continuous 3D volume.
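As a rough illustration of this step, the sketch below recovers the light intensity on a unit-albedo Lambertian board by dividing the observed intensity by the shading term, and then interpolates linearly between two board depths. The function names, the array layout, and the restriction to two boards sampled on the same pixel grid are assumptions made for the example.

```python
import numpy as np

def light_intensity(intensity, points, normal, light_pos, min_shading=1e-6):
    """Per-point light intensity V = I / (n . l) on a unit-albedo Lambertian board.

    intensity: (N,) observed pixel intensities
    points:    (N, 3) 3D positions of the board points
    normal:    (3,) board normal
    light_pos: (3,) position of the LED (a near point light)
    """
    intensity = np.asarray(intensity, dtype=float)
    points = np.asarray(points, dtype=float)
    normal = np.asarray(normal, dtype=float)
    to_light = np.asarray(light_pos, dtype=float)[None, :] - points
    l = to_light / np.linalg.norm(to_light, axis=1, keepdims=True)
    shading = l @ (normal / np.linalg.norm(normal))
    V = np.full(intensity.shape, np.nan)
    valid = shading > min_shading  # skip grazing or back-facing points
    V[valid] = intensity[valid] / shading[valid]
    return V

def interpolate_between_boards(V_near, V_far, z_near, z_far, z):
    """Linear interpolation of V along depth between two calibrated boards."""
    t = (z - z_near) / (z_far - z_near)
    return (1.0 - t) * V_near + t * V_far
```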

We empirically find that, when the distance between the object and the light is around 1000 millimeters (e.g., the studio setup in Section VI), we only need to compute the mean lighting intensity of each LED light for all pixels. This can be seen from the uniform intensity distribution from the left of Figure 8. When the distance is around 400 millimeters (e.g., the desktop scanner setup in Section VI), we have to record a different lighting intensity for each pixel for an LED. This is evident from the right of Figure 8.

VI. EXPERIMENTS

A. Evaluation Using Data Captured by Our Setups

Fig. 9. Results from the different data capture setups. (a) one of the input images; (b) the recovered shape rendered with uniform diffuse shading; (c) a rendering with the recovered reflectance model from the same viewpoint and lighting condition as the image in (a); (d) the color-coded shape error (in millimeters) compared to laser-scanned ‘ground truth’. The 1st-2nd rows are from the studio setup and the 3rd-6th rows are from the desktop setup.

TABLE I SHAPE RECONSTRUCTION ERRORS OF THE OBJECTS SHOWN IN FIGURE 9 (IN MILLIMETERS).

In our experiments, the 3D points obtained from the structure-from-motion algorithm were often noisy. We only keep points with reprojection error less than 0.5 pixels. Typically, about 200 initial points are obtained for each example. Our system can also easily incorporate manual intervention in the form of matched feature points to handle textureless regions. To provide a ‘ground truth’ validation, all experimental objects were scanned using a Rexcan III industrial scanner, which is accurate to 10 microns. Our results are registered with the scanned shapes using the iterative closest point (ICP) algorithm [3].
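The reprojection-error filter can be sketched as follows. This is a simplified, single-view illustration under a pinhole model; the function names are hypothetical, and a real pipeline would typically aggregate the error over all views in which a point is observed.

```python
import numpy as np

def reprojection_errors(P, X, x_obs):
    """Pixel distance between observed and reprojected points.

    P:     (3, 4) camera projection matrix
    X:     (N, 3) reconstructed 3D points
    x_obs: (N, 2) observed pixel coordinates
    """
    Xh = np.hstack([X, np.ones((len(X), 1))])  # homogeneous coordinates
    proj = Xh @ P.T
    x = proj[:, :2] / proj[:, 2:3]
    return np.linalg.norm(x - x_obs, axis=1)

def keep_reliable(P, X, x_obs, max_err_px=0.5):
    """Keep only the 3D points whose reprojection error is below the threshold."""
    err = reprojection_errors(P, X, x_obs)
    return X[err < max_err_px]
```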

1) Quantitative Shape Reconstruction Errors: Some examples are provided in Figure 9 with a sample input image shown in (a) for each example. From top to bottom, we refer to these examples as ‘Buddha-S’, ‘Teapot2-S’, ‘Teapot3-D’, ‘Gourd-D’, ‘Cat2-D’, and ‘Pig-D’ respectively for future reference, where the suffixes -S and -D stand for the studio and desktop setups respectively. To better visualize the recovered shape, we render it with uniform diffuse shading in (b). Most of the geometry details are successfully captured by our method, as exemplified by the wrinkles of the ‘Buddha-S’ clothes. Figure 9 (c) is a rendering according to the captured reflectance from the same viewpoint and lighting condition as the input image in (a). The rendered image closely resembles the input image, indicating high accuracy in both geometry and reflectance. To provide a quantitative evaluation on shape capture, we visualize the shape reconstruction error (measured in millimeters) in (d). Typically, larger errors are associated with concavities, which have fewer image observations due to occlusion and are also affected by stronger inter-reflections.

The ‘Buddha-S’ and ‘Teapot2-S’ examples were captured with the studio scanner setup. The ‘Buddha-S’ example contains many discontinuities at clothes folds and large concavities at the shoulder. The polished wooden ‘Buddha-S’ has a focused and strong highlight, while the clay ‘Teapot2-S’ has a soft and extended highlight. Our method consistently performed well on both of them. The ‘Teapot2-S’ example has a relatively larger error at one side, mainly due to the imprecise SfM reconstruction in that area. We captured four examples with the desktop scanner, including ‘Teapot3-D’, ‘Gourd-D’, ‘Cat2-D’, and ‘Pig-D’. In particular, the ‘Cat2-D’ and ‘Pig-D’ examples are captured with the desktop scanner setup spanning about 30 degrees FoV in the camera, with significant perspective camera effects. They also present a variety of different materials, where the ‘Cat2-D’ has a focused highlight and the ‘Pig-D’ has a softer and more extended highlight.

Quantitative shape errors for all examples in Figure 9 are summarized in Table I. The studio setup achieves the best accuracy of around 0.5 millimeters mean error, but requires a large and bulky capture setup. The desktop setup balances data capture flexibility and shape accuracy.

2) Intermediate Results: Figure 10 shows some intermediate results at different stages of depth propagation for the examples ‘Buddha-S’, ‘Teapot2-S’, and ‘Pig-D’. Results shown in Figure 10 (a) are initial 3D points obtained from structure-from-motion. These points are the initial seeds for depth propagation. We choose a tight threshold to get rid of unreliable points. Results shown in (b) are the 3D points after the 1st iteration of depth propagation, where a large portion of the surface has been reconstructed. (c) and (d) are the 3D points after half and all of the propagation iterations respectively. The black points in (b) and (c) are those with a surface normal facing away from the camera. Typically, it takes 5-6 iterations of depth propagation to cover the complete surface. It is interesting to see that our method can propagate dense depth from a very small number of seed points in (a). Rapid shape changes make the propagation slower, as the iso-depth contours tend to break at shape discontinuities. For example, half of the propagation iterations are spent covering the mouth of ‘Buddha-S’ and the flower decoration on ‘Teapot2-S’. Results shown in (e) are the initial surface meshes after Poisson surface reconstruction. Some shape errors are visible, e.g., on the face of the ‘Buddha-S’ example, which are due to the drifting errors of depth propagation. Results shown in (f) are the final optimized shapes by the method in [27]. The surfaces become clearly smoother after optimization.

Figure 11 visualizes the BRDF mixture weights and the basis BRDFs for the ‘Buddha-S’, ‘Pig-D’, and ‘Teapot2-S’ examples. The red and green channels are the normalized mixture weights of the first and second basis BRDFs. Each basis BRDF is applied to render a sphere under front lighting for reference. Most of our examples consist of a shiny and a less shiny basis BRDF. This can be seen clearly from the ‘Buddha-S’ example.
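A visualization of this kind could be produced along the following lines. This is only a sketch for two basis BRDFs; the function name and the per-pixel array layout are assumptions, not the authors' code.

```python
import numpy as np

def weights_to_rgb(weights):
    """Map per-pixel mixing weights of two basis BRDFs to a color image.

    weights: (H, W, 2) non-negative mixing weights.
    Returns an (H, W, 3) image whose red and green channels hold the
    normalized weights of the first and second basis BRDF.
    """
    w = np.clip(np.asarray(weights, dtype=float), 0.0, None)
    total = w.sum(axis=-1, keepdims=True)
    # Pixels with zero total weight (e.g., background) stay black.
    norm = np.divide(w, total, out=np.zeros_like(w), where=total > 0)
    rgb = np.zeros(w.shape[:2] + (3,))
    rgb[..., 0] = norm[..., 0]
    rgb[..., 1] = norm[..., 1]
    return rgb
```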

B. Performance Analysis of Our Method

1) Number of Images at Each Viewpoint: We evaluate the accuracy of captured shape with different numbers of input images from each viewpoint. A similar analysis on BRDF capturing can be found in the conference version [58].

Fig. 10. Intermediate results of depth propagation. (a) initial 3D points reconstructed by structure-from-motion; (b) 3D points obtained by the first iteration of depth propagation; (c) and (d) 3D points obtained after the middle and last iteration of depth propagation; (e) mesh after Poisson surface reconstruction from the points in (d); (f) final shapes after jointly optimizing shape and normal.

Fig. 11. The normalized BRDF mixture weights are visualized in different color channels. The corresponding basis BRDFs are used to render a sphere on the side.

Fig. 12. The mean shape error does not change significantly with different numbers of input images per viewpoint.

Figure 12 shows the mean shape error (in millimeters) with different numbers of input images in each viewpoint, starting with 10 images per view. We always chose an equal number of uniformly distributed lights on both the outer and inner circles, i.e., starting with 5 LED lights for each circle. This is because our Fourier series fitting requires at least 5 LEDs from each viewpoint. In most of the examples, the mean shape error does not change significantly for different numbers of input images. The errors of ‘Teapot3-D’ and ‘Gourd-D’ drop slightly when the number of input images per viewpoint increases to 20. ‘Cat2-D’ has the largest error, most likely because it spans the largest FoV to the camera.
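To illustrate why a minimum number of lights per circle arises, the sketch below fits a truncated Fourier series to intensity samples taken at known LED angles by linear least squares. The choice of order 2 (five unknown coefficients, which would match the five-LED minimum) is our assumption for illustration; the exact form of the fitted series is defined elsewhere in the paper.

```python
import numpy as np

def fit_fourier(angles, values, order=2):
    """Least-squares fit of a truncated Fourier series.

    Model: a0 + sum_k (a_k cos(k t) + b_k sin(k t)), k = 1..order,
    which has 2 * order + 1 unknowns.
    Returns coefficients ordered as [a0, a1, b1, a2, b2, ...].
    """
    angles = np.asarray(angles, dtype=float)
    values = np.asarray(values, dtype=float)
    cols = [np.ones_like(angles)]
    for k in range(1, order + 1):
        cols += [np.cos(k * angles), np.sin(k * angles)]
    A = np.stack(cols, axis=1)
    if len(angles) < A.shape[1]:
        raise ValueError("need at least 2 * order + 1 samples")
    coeffs, *_ = np.linalg.lstsq(A, values, rcond=None)
    return coeffs
```

With exactly five uniformly spaced samples the order-2 system is square and the fit interpolates the samples; additional LEDs turn it into an over-determined fit that averages out noise.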

2) Perspective vs Orthographic Camera: To evaluate our sub-dividing method for perspective cameras as presented in Section III-E, we test the ‘Pig-D’ and ‘Cat2-D’ examples with and without sub-dividing the image plane. Figure 13 shows the error in estimated azimuth angles. As we can see, the ‘perspective camera’ model produces more accurate results. It reduces the mean azimuth angle error from 12.1 degrees to 9.9 degrees for the ‘Pig-D’ example, and from 19.5 degrees to 12.6 degrees for the ‘Cat2-D’ example. Figure 14 shows the shape reconstruction errors. Similarly, the ‘perspective camera’ model reduces the mean shape reconstruction error from 0.75 millimeters to 0.57 millimeters for the ‘Pig-D’ example, and from 1.66 millimeters to 0.76 millimeters for the ‘Cat2-D’ example. This comparison demonstrates the effectiveness of the simple sub-dividing approach, which helps to reduce the setup size to fit a desktop. Note that the shape error in Figure 14 is much smoother than the azimuth angle error in Figure 13, because the shape is reconstructed by fusing azimuth angle information from many viewpoints, and the later Poisson surface reconstruction and shape optimization [27] both help to smooth out the noise.

Fig. 13. The color coded azimuth angle errors with and without the sub-dividing method presented in Section III-E. The 1st and 2nd rows are the results of the ‘Pig-D’ and ‘Cat2-D’ examples respectively. On the left is the result obtained without sub-dividing (‘orthographic camera’), and on the right is the one with sub-dividing (‘perspective camera’). It is clear that the sub-dividing approach can significantly reduce errors in azimuth angles.

Fig. 14. The color coded shape reconstruction errors without the sub-dividing method (‘orthographic camera’) in Section III-E for the ‘Pig-D’ and ‘Cat2-D’ examples. Please refer to Figure 9 for their shape errors with sub-dividing (‘perspective camera’).

We further analyze the mean shape error with different numbers of sub-divisions to examine our algorithm for perspective cameras. The left of Figure 15 shows the mean shape error of the ‘Pig-D’, ‘Cat2-D’, and ‘Buddha-S’ examples with different numbers of sub-divisions. The shape error of ‘Cat2-D’ is significantly reduced by this sub-division scheme, while the error of ‘Pig-D’ is only mildly decreased. To understand this, we further plot the mean shape error against the horizontal FoV of each sub-divided window in the right of Figure 15. The ‘Buddha-S’ example always spans a FoV of less than 10 degrees, and thus sub-division is unnecessary. The ‘Pig-D’ example spans a 25-degree FoV. Thus, a sub-division will be sufficient. The ‘Cat2-D’ example has a much larger FoV and a sub-division is desired to achieve the best results. Empirically, we find that the orthographic camera assumption can be safely made when the FoV is less than 10 degrees.
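The relationship between the total FoV, the number of sub-divided windows, and the FoV per window can be sketched with a pinhole model. The equal-width split of the image plane and the function names below are assumptions for illustration; only the 10-degree limit comes from the empirical finding above, and the actual sub-division scheme is the one in Section III-E.

```python
import math

def max_window_fov_deg(total_fov_deg, num_windows):
    """Largest horizontal FoV among equal-width windows of the image plane."""
    t = math.tan(math.radians(total_fov_deg) / 2.0)
    # Window edges on the normalized image plane (focal length 1).
    edges = [-t + 2.0 * t * i / num_windows for i in range(num_windows + 1)]
    return max(
        math.degrees(math.atan(b) - math.atan(a))
        for a, b in zip(edges[:-1], edges[1:])
    )

def min_windows(total_fov_deg, limit_deg=10.0, max_windows=64):
    """Smallest number of windows so that every window stays below the limit."""
    for k in range(1, max_windows + 1):
        if max_window_fov_deg(total_fov_deg, k) < limit_deg:
            return k
    raise ValueError("limit not reachable within max_windows")
```

Because equal-width windows near the image center subtend the largest angle, the central window determines the result.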

Fig. 15. Left: mean shape error with different numbers of sub-division to the image plane to model perspective cameras. Right: mean shape error with different horizontal field-of-view (FoV) per sub-divided window.

3) Runtime Efficiency: Our implementation is not optimized for speed. We ran all experiments on a computer with 24 GB RAM and an 8-core 3.0 GHz CPU. At each viewpoint, our Matlab code computes azimuth angles in 1 minute, and traces iso-depth contours in 1.5 minutes. Depth propagation takes 16 minutes (for 40 viewpoints), and the final shape optimization takes 1 minute. It takes about 15 minutes to compute the basis BRDFs from 5,000 samples with ACLS. Our output mesh typically has about 150,000 points with an average spacing of 0.095 millimeters. It takes another 45 minutes to compute their BRDF mixing weights. Many of the steps involved, including azimuth angle computation, iso-depth contour tracing, and BRDF mixing weight computation, can be easily parallelized.

C. Evaluation on ‘DiLiGenT’ and ‘DiLiGenT-MV’

We first compare our method with the representative single-view photometric stereo methods, IA14 [14], ST14 [45], HS15 [8], SH17 [43], and ZK19 [57] on ‘DiLiGenT’ for normal estimation accuracy. Then we evaluate our method and another representative MVPS method PJ16 [32] on ‘DiLiGenT-MV’ for both normal and shape accuracy. For all evaluated methods, we use the parameters provided in the original codes or suggested by the original papers. We use all 96 images from all 20 viewpoints to evaluate different methods, with the exception of PJ16 [32], whose executable fails when fed more than 10 images per viewpoint. Since the FoV of the camera used for both datasets is less than 10 degrees, we employ the ‘orthographic’ algorithm for the following evaluation according to Section VI-B.

1) Quantitative Normal Accuracy on ‘DiLiGenT’: We choose BEAR, BUDDHA, COW, POT2, and READING from the ‘DiLiGenT’ dataset to validate our method. We projected the final results of the multi-view methods to get normals at different viewpoints and compare them with those of the single-view methods. Results of the single-view photometric stereo methods are taken from the online benchmark results on ‘DiLiGenT’.

Multi-view methods (i.e., our method and PJ16 [32]) generate better results than single-view methods on most objects except for BUDDHA, as they have access to more views to overcome the difficulties in photometric stereo, e.g., correcting the low-frequency shape distortions. The BUDDHA example is an exception because of the strong inter-reflections at wrinkles, which are not modeled in these two evaluated multi-view methods. This explains why some more sophisticated single-view algorithms produce better results. Our method outperforms other methods on the challenging example COW in Figure 16, because the non-Lambertian material here satisfies the isotropy assumption well as discussed in [44].
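For reference, the mean normal angular error used in comparisons of this kind is commonly computed as below. This is a generic sketch; the function name and the optional foreground mask are assumptions, not details taken from the benchmark code.

```python
import numpy as np

def mean_angular_error_deg(n_est, n_gt, mask=None):
    """Mean angle (in degrees) between estimated and ground-truth normals.

    n_est, n_gt: (..., 3) normal maps; they are normalized here.
    mask:        optional boolean array selecting the pixels to evaluate.
    """
    n_est = np.asarray(n_est, dtype=float)
    n_gt = np.asarray(n_gt, dtype=float)
    n_est = n_est / np.linalg.norm(n_est, axis=-1, keepdims=True)
    n_gt = n_gt / np.linalg.norm(n_gt, axis=-1, keepdims=True)
    # Clip to guard arccos against rounding slightly outside [-1, 1].
    cos = np.clip(np.sum(n_est * n_gt, axis=-1), -1.0, 1.0)
    err = np.degrees(np.arccos(cos))
    if mask is not None:
        err = err[mask]
    return float(err.mean())
```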

2) Quantitative Shape and Normal Accuracy on ‘DiLiGenT-MV’: We employ GIPUMA [46] to generate initial points and only keep points with reprojection error less than 0.5 pixels. At very large textureless regions, we also manually establish some corresponding points to facilitate further processing. Note that our method can be initialized by a small set of discrete 3D points, while PJ16 [32] needs an initial mesh as initialization. Therefore, we apply the Poisson surface reconstruction on the initial 3D points to generate a mesh for PJ16 [32], and feed the initial 3D points to our method for initialization. The initial meshes are shown in Figure 17 (a). Figure 17 (b) and (c) are the final results by PJ16 [32] and our method respectively. The mean and median shape errors are shown in Table III. Our method reduces the average mean error across all examples from 1.04 mm to 0.52 mm (a drop of 50%), and the average median error from 0.82 mm to 0.50 mm (a drop of about 40%). As we can tell from Figure 17, PJ16 [32] often fails on textureless regions with non-Lambertian reflectances, such as the ear of COW or the book of READING, where the initial mesh is quite unreliable. In comparison, our method is much more robust in these regions.
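Mean and median shape errors of the kind reported in Table III can be computed, for example, from nearest-point distances between the registered reconstruction and samples of the ground-truth surface. The sketch below assumes this nearest-point measure and a brute-force search; the exact error definition used for the table is not restated in this section.

```python
import numpy as np

def shape_errors(recon_pts, gt_pts, chunk=1024):
    """Mean and median nearest-point distance from reconstruction to ground truth.

    recon_pts: (N, 3) reconstructed points, already registered (e.g., by ICP)
    gt_pts:    (M, 3) points sampled from the ground-truth surface
    """
    recon = np.asarray(recon_pts, dtype=float)
    gt = np.asarray(gt_pts, dtype=float)
    d = np.empty(len(recon))
    # Process in chunks to bound the size of the pairwise distance matrix.
    for s in range(0, len(recon), chunk):
        diff = recon[s:s + chunk, None, :] - gt[None, :, :]
        d[s:s + chunk] = np.sqrt((diff ** 2).sum(axis=-1)).min(axis=1)
    return float(d.mean()), float(np.median(d))
```

For meshes with on the order of 150,000 points, a spatial index such as a k-d tree would be a more practical choice than this brute-force search.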

For further evaluation, we also compute the normal angular errors for each viewpoint of both methods in Figure 18. As illustrated in Figure 18, our method outperforms PJ16 [32] overall, since our method is based on isotropic reflectance, which is a more realistic assumption for real objects. Figure 18 further shows that the initial 3D points strongly affect the final normals (e.g., COW) and that large errors appear in occluded and concave areas (e.g., BUDDHA), owing to missing image observations and strong inter-reflections. These observations are consistent with those from the mesh estimation.

VII. DISCUSSION

We propose a multi-view photometric stereo method to capture both the shape and reflectance of real objects. Our method is general and works with spatially varying isotropic BRDFs. It involves a simple hardware setup of a video camera and some LEDs. The captured 3D shape is accurate to 0.5 millimeters and the reflectance has an RMSE as low as 9%. We also quantitatively evaluate state-of-the-art MVPS using a newly collected benchmark dataset ‘DiLiGenT-MV’, which is publicly available for inspiring future research.

Our method has a few limitations. First, our method cannot model anisotropic materials. It also cannot handle translucent objects and mirror surfaces. Second, our ring-light capture setup contains only two circles of LEDs. Hence, we only capture the BRDF of a point with two different values. As a result, during reflectance capturing, we can only discretize to two levels, and cannot capture Fresnel effects faithfully. We could increase the number of circles of LED lights, or fit parametric Fresnel terms [40] to solve this problem. Third, the calibration of our system still requires some skills for amateur users. An interesting direction is to extend the compressive sensing framework in [54] to strategically plan the illumination. Further enhancing its robustness to noisy 3D points from SfM is another interesting direction for future research. Last, it is still challenging for MVS and MVPS methods to reconstruct high-quality 3D points from the textureless or highly non-Lambertian surfaces of ‘DiLiGenT-MV’.

VIII. ACKNOWLEDGEMENT

We thank Todd Zickler, Kyros Kutulakos, and Stephen Lin for their helpful discussions and suggestions. We also thank Sai-Kit Yeung and Zhipeng Mo for helping to build the ‘DiLiGenT-MV’ dataset. This work is partially supported by the National Natural Science Foundation of China under Grant No. 61872012 and 61425025, the Key Research and Development Program of Zhejiang Province of China under Grant No. 2018C03051, and Principles Research for the Conservation of Heritage Sites Grant S17-176000-016. Changyu Diao is supported by the Key Scientific Research Base for Digital Conservation of Cave Temples in Zhejiang University, State Administration for Cultural Heritage of China. Ping Tan is supported by the Canada NSERC Discovery Grant No. 611664.

REFERENCES

[1] D. Aliaga and Y. Xu. Photogeometric structured light: A self-calibrating and multi-viewpoint framework for accurate 3d modeling. In Proc. of Computer Vision and Pattern Recognition, 2008. 2

[2] N. Alldrin, T. Zickler, and D. Kriegman. Photometric stereo with non-parametric and spatially-varying reflectance. In Proc. of Computer Vision and Pattern Recognition, 2008. 2, 3

[3] P. J. Besl and N. D. McKay. A method for registration of 3-d shapes. IEEE Transactions on Pattern Analysis and Machine Intelligence, 14:239–256, 1992. 9

[4] M. Chandraker, J. Bai, and R. Ramamoorthi. A theory of differential photometric stereo for unknown brdfs. In Proc. of Computer Vision and Pattern Recognition, 2011. 2

[5] Y. Furukawa and J. Ponce. Accurate, dense, and robust multiview stereopsis. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32:1362–1376, 2010. 2

[6] A. Ghosh, T. Chen, P. Peers, C. A. Wilson, and P. Debevec. Estimating specular roughness and anisotropy from second order spherical gradient illumination. Computer Graphics Forum, 28, 2009. 1, 2

[7] D. B. Goldman, B. Curless, A. Hertzmann, and S. M. Seitz. Shape and spatially-varying brdfs from photometric stereo. In Proc. of International Conference on Computer Vision, pages 341–348, 2005. 2

[8] T.-Q. Han and H.-L. Shen. Photometric stereo for general brdfs via reflection sparsity modeling. IEEE Transactions on Image Processing, 24(12):4888–4903, 2015. 11, 13

[9] R. Hartley and A. Zisserman. Multiple View Geometry in Computer Vision. Cambridge University Press, New York, NY, USA, 2 edition, 2003. 1, 3

[10] C. Hernandez, G. Vogiatzis, and R. Cipolla. Multiview photometric stereo. IEEE Transactions on Pattern Analysis and Machine Intelligence, 30:548–554, 2008. 1, 2

[11] T. Higo, Y. Matsushita, N. Joshi, and K. Ikeuchi. A hand-held photometric stereo camera for 3-d modeling. In Proc. of International Conference on Computer Vision, pages 1234–1241. IEEE, 2009. 2

[12] M. Holroyd, J. Lawrence, G. Humphreys, and T. Zickler. A photometric approach for estimating normals and tangents. ACM Transactions on Graphics, 27, 2008. 2

[13] M. Holroyd, J. Lawrence, and T. Zickler. A coaxial optical scanner for synchronous acquisition of 3d geometry and surface reflectance. ACM Transactions on Graphics, 2010. 1, 2

[14] S. Ikehata, D. Wipf, Y. Matsushita, and K. Aizawa. Photometric stereo using sparse bayesian regression for general diffuse surfaces. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(9):1816–1831, 2014. 11, 13

TABLE II MEAN NORMAL ANGULAR ERRORS (IN DEGREE) ON ‘DILIGENT’.

Fig. 16. Normal error maps of our method on ‘DiLiGenT-MV’. The error range is [0, 45] in degree.

TABLE III MEAN AND MEDIAN SHAPE ERRORS (IN MM) OF [32] AND OURS ON ‘DILIGENT-MV’.

[15] Y. Iwahori, H. Sugie, and N. Ishii. Reconstructing shape from shading images under point light source illumination. In Proc. of International Conference on Pattern Recognition, volume 1, pages 83–87. IEEE, 1990. 2

[16] R. Jensen, A. Dahl, G. Vogiatzis, E. Tola, and H. Aanæs. Large scale multi-view stereopsis evaluation. In Proc. of Computer Vision and Pattern Recognition, pages 406–413, 2014. 3, 6

[17] K. Kang, C. Xie, C. He, M. Yi, M. Gu, Z. Chen, K. Zhou, and H. Wu. Learning efficient illumination multiplexing for joint capture of reflectance and shape. ACM Transactions on Graphics, 38(6):165, 2019. 1

[18] M. Kazhdan, M. Bolitho, and H. Hoppe. Poisson surface reconstruction. In Proc. of Eurographics Symposium on Geometry Processing, pages 61–70, 2006. 5

[19] A. Knapitsch, J. Park, Q.-Y. Zhou, and V. Koltun. Tanks and temples: Benchmarking large-scale scene reconstruction. ACM Transactions on Graphics, 36(4):78, 2017. 3, 6

[20] J. Lawrence, A. Ben-Artzi, C. DeCoro, W. Matusik, H. Pfister, R. Ramamoorthi, and S. Rusinkiewicz. Inverse shade trees for non-parametric material representation and editing. ACM Transactions on Graphics, 25:735–745, July 2006. 1, 3, 5, 6

[21] H. Lensch, J. Kautz, M. Goesele, W. Heidrich, and H.-P. Seidel. Image-based reconstruction of spatial appearance and geometric detail. ACM Transactions on Graphics, 22:234–257, 2003. 2

[22] M. Levoy, K. Pulli, B. Curless, S. Rusinkiewicz, D. Koller, L. Pereira, M. Ginzton, S. Anderson, J. Davis, J. Ginsberg, J. Shade, and D. Fulk. The digital michelangelo project: 3d scanning of large statues. In Proc. of SIGGRAPH, pages 131–144, 2000. 2

[23] M. Lhuillier and L. Quan. A quasi-dense approach to surface reconstruction from uncalibrated images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27:418–433, 2005. 2, 3

[24] J. Lim, J. Ho, M.-H. Yang, and D. Kriegman. Passive photometric stereo from motion. In Proc. of International Conference on Computer Vision, pages 1635–1642, 2005. 2

[25] F. Lu, Y. Matsushita, I. Sato, T. Okabe, and Y. Sato. Uncalibrated photometric stereo for unknown isotropic reflectances. In Proc. of Computer Vision and Pattern Recognition, 2013. 2

[26] W.-C. Ma, T. Hawkins, P. Peers, C.-F. Chabert, M. Weiss, and P. Debevec. Rapid acquisition of specular and diffuse normal maps from polarized spherical gradient illumination. In Proc. of Eurographics Symposium on Geometry Processing, 2007. 2

[27] D. Nehab, S. Rusinkiewicz, J. Davis, and R. Ramamoorthi. Efficiently combining positions and normals for precise 3d geometry. ACM Transactions on Graphics, 24:536–543, 2005. 1, 2, 3, 5, 9, 11

[28] N. Alldrin and D. Kriegman. Toward reconstructing surfaces with arbitrary isotropic reflectance: A stratified photometric stereo approach. In Proc. of International Conference on Computer Vision, 2007. 1, 3

[29] T. Papadhimitri and P. Favaro. A new perspective on uncalibrated photometric stereo. In Proc. of Computer Vision and Pattern Recognition, pages 1474–1481, 2013. 2

[30] T. Papadhimitri and P. Favaro. Uncalibrated near-light photometric stereo. 2014. 2

[31] J. Park, S. N. Sinha, Y. Matsushita, Y.-W. Tai, and I. S. Kweon. Multiview photometric stereo using planar mesh parameterization. In Proc. of International Conference on Computer Vision, 2013. 2

[32] J. Park, S. N. Sinha, Y. Matsushita, Y.-W. Tai, and I. S. Kweon. Robust multiview photometric stereo using planar mesh parameterization. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(8):1591–1604, 2016. 11, 12, 13, 14, 15

[33] Y. Quéau, B. Durix, T. Wu, D. Cremers, F. Lauze, and J.-D. Durou. LED-based photometric stereo: Modeling, calibration and numerical solution. Journal of Mathematical Imaging and Vision, 60(3):313–340, 2018. 2

[34] Y. Quéau, T. Wu, F. Lauze, J.-D. Durou, and D. Cremers. A non-convex variational approach to photometric stereo under inaccurate lighting. In Proc. of Computer Vision and Pattern Recognition, 2017. 2

[35] R. Raskar, K.-H. Tan, R. Feris, J. Yu, and M. Turk. Non-photorealistic camera: depth edge detection and stylized rendering using multi-flash imaging. ACM Transactions on Graphics, 23:679–688, August 2004. 3

[36] P. Ren, J. Wang, J. Snyder, X. Tong, and B. Guo. Pocket reflectometry. ACM Transactions on Graphics, 30(4), 2011. 3

[37] F. Romeiro and T. Zickler. Inferring reflectance under real-world illumination. Technical Report TR-10-10, Harvard School of Engineering and Applied Sciences, 2010. 5

[38] S. Rusinkiewicz, O. Hall-Holt, and M. Levoy. Real-time 3d model acquisition. ACM Transactions on Graphics, 21:438–446, 2002. 2

[39] Y. Sato, M. D. Wheeler, and K. Ikeuchi. Object shape and reflectance modeling from observation. In Proc. of SIGGRAPH, pages 379–387, 1997. 2

[40] C. Schlick. An inexpensive BRDF model for physically-based rendering. Computer Graphics Forum, 13(3):233–246, 1994. 12

[41] T. Schöps, J. L. Schönberger, S. Galliani, T. Sattler, K. Schindler, M. Pollefeys, and A. Geiger. A multi-view stereo benchmark with high-resolution images and multi-camera videos. In Proc. of Computer Vision and Pattern Recognition, pages 3260–3269, 2017. 2, 3, 6

[42] S. M. Seitz, B. Curless, J. Diebel, D. Scharstein, and R. Szeliski. A comparison and evaluation of multi-view stereo reconstruction algorithms. In Proc. of Computer Vision and Pattern Recognition, volume 1, pages 519–528. IEEE, 2006. 3, 6

Fig. 17. Reconstructed shapes by PJ16 [32] (b) and our method (c) given the initial mesh from (a).

[43] H.-L. Shen, T.-Q. Han, and C. Li. Efficient photometric stereo using kernel regression. IEEE Transactions on Image Processing, 26(1):439–451, 2017. 11, 13

[44] B. Shi, Z. Mo, Z. Wu, D. Duan, S. Yeung, and P. Tan. A benchmark dataset and evaluation for non-lambertian and uncalibrated photometric stereo. IEEE Transactions on Pattern Analysis and Machine Intelligence, 41(2):271–284, 2019. 2, 3, 6, 12

[45] B. Shi, P. Tan, Y. Matsushita, and K. Ikeuchi. Bi-polynomial modeling of low-frequency reflectances. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(6):1078–1091, 2014. 2, 11, 13

[46] S. Galliani, K. Lasinger, and K. Schindler. Massively parallel multiview stereopsis by surface normal diffusion. In Proc. of International Conference on Computer Vision, 2015. 2, 12

[47] N. Snavely, S. M. Seitz, and R. Szeliski. Photo tourism: exploring photo collections in 3d. ACM Transactions on Graphics, 25(3):835–846, July 2006. 3

[48] C. Strecha, W. Von Hansen, L. Van Gool, P. Fua, and U. Thoennessen. On benchmarking camera calibration and multi-view stereo for high resolution imagery. In Proc. of Computer Vision and Pattern Recognition, pages 1–8. IEEE, 2008. 3

[49] P. Tan, S. P. Mallick, L. Quan, D. Kriegman, and T. Zickler. Isotropy, reciprocity and the generalized bas-relief ambiguity. In Proc. of Computer Vision and Pattern Recognition, 2007. 2

[50] P. Tan, L. Quan, and T. Zickler. The geometry of reflectance symmetries. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33:2506–2520, 2011. 2

[51] G. J. Ward. Measuring and modeling anisotropic reflection. In Proc. of SIGGRAPH, pages 265–272, 1992. 2

[52] R. J. Woodham. Photometric stereo: A reflectance map technique for determining surface orientation from image intensity. In Image Understanding Systems and Industrial Applications I, volume 155, pages 136–144. International Society for Optics and Photonics, 1979. 1, 13

[53] W. Xie, C. Dai, and C. C. Wang. Photometric stereo with near point lighting: A solution by mesh deformation. In Proc. of Computer Vision and Pattern Recognition, pages 4585–4593, 2015. 2

[54] W. Yang, Y. Ji, H. Lin, Y. Yang, S. B. Kang, and J. Yu. Ambient occlusion via compressive visibility estimation. In Proc. of Computer Vision and Pattern Recognition, 2015. 12

[55] Y. Dong, J. Wang, X. Tong, J. Snyder, Y. Lan, M. Ben-Ezra, and B. Guo. Manifold bootstrapping for SVBRDF capture. ACM Transactions on Graphics, 29(4), 2010. 3

[56] L. Zhang, N. Snavely, B. Curless, and S. M. Seitz. Spacetime faces: high resolution capture for modeling and animation. ACM Transactions on Graphics, 23:548–558, 2004. 2

[57] Q. Zheng, A. Kumar, B. Shi, and G. Pan. Numerical reflectance compensation for non-lambertian photometric stereo. IEEE Transactions on Image Processing, 2019. 11, 13

[58] Z. Zhou, Z. Wu, and P. Tan. Multi-view photometric stereo with spatially varying isotropic materials. In Proc. of Computer Vision and Pattern Recognition, 2013. 1, 3, 5, 7, 9

Min Li received the B.E. degree from Northeastern University in China in 2011. She joined the school of computer science and technology at Zhejiang University in 2011, where she is currently a Ph.D. candidate under the supervision of Prof. Duanqing Xu. She was a visiting student at Simon Fraser University supervised by Prof. Ping Tan from 2015 to 2016. Her research interests include photometric methods in computer vision, reflectance and illumination modeling, and 3D reconstruction.

Zhenglong Zhou received his B.S. degree from Shanghai Jiao Tong University, China, in 2009 and Ph.D. degree from National University of Singapore in 2014, supervised by Prof. Ping Tan. He is currently a technical artist at Changyou. Before that, he worked at Giant and 360 as a software engineer in China. He is interested in computer vision and graphics engine development.

Zhe Wu received the B.E. degree from Tsinghua University in China in 2010 and the Ph.D. degree in computer vision from National University of Singapore in 2015. He is currently a vision engineer at DJI Innovations developing autonomous navigation systems for drones.

Fig. 18. Normal angular error statistics of our method (top row) and PJ16 [32] (bottom row) on ‘DiLiGenT-MV’. Each subplot shows the results of one evaluated method for all views of one object; the X-axis is the view ID, and the Y-axis is the angular error (in degrees). The statistics of the angular errors over all pixels of each normal map are displayed as a box-and-whisker plot: the red dot indicates the mean value, the black dot is the median, the bounds of the blue box indicate the first and third quartile values, and the ends of the vertical blue line indicate the minimum and maximum errors.

Boxin Shi is currently a Boya Young Fellow Assistant Professor at Peking University, where he leads the Camera Intelligence Group. Before joining PKU, he did postdoctoral research at MIT Media Lab, Singapore University of Technology and Design, and Nanyang Technological University from 2013 to 2016, and worked as a Researcher at the National Institute of Advanced Industrial Science and Technology from 2016 to 2017. He received the B.E. degree from Beijing University of Posts and Telecommunications in 2007, the M.E. degree from Peking University in 2010, and the Ph.D. degree from the University of Tokyo in 2013. He won the Best Paper Runner-up award at the International Conference on Computational Photography 2015. He has served as an Area Chair for ACCV 2018 and BMVC 2019.

Changyu Diao is currently an associate professor at Zhejiang University, where he leads the Cultural Heritage Digital Research group. He received his B.S., M.S., and Ph.D. degrees in Computer Science and Technology from Zhejiang University, China, in 2000, 2003, and 2008, respectively. He joined the Cultural Heritage Institute of Zhejiang University in 2010. He has been engaged in cultural heritage digitization research for over ten years, including 3D digitization, information management, information processing and analysis, virtual exhibition, and digital museums.

Ping Tan is an associate professor with the School of Computing Science at Simon Fraser University (SFU). Before that, he was an associate professor at National University of Singapore (NUS). He obtained his Ph.D. degree from the Hong Kong University of Science and Technology (HKUST) in 2007, and his Master's and Bachelor's degrees from Shanghai Jiao Tong University (SJTU), China, in 2003 and 2000, respectively. His research interests include computer vision, computer graphics, and robotics. He has served as an editorial board member of the IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), International Journal of Computer Vision (IJCV), Computer Graphics Forum (CGF), and Machine Vision and Applications (MVA), and as an area chair for CVPR, SIGGRAPH, SIGGRAPH Asia, and IROS.