Geometric Proxies for Live RGB-D Stream Enhancement and Consolidation

2020·Arxiv

Abstract

We propose a geometric superstructure for unified real-time processing of RGB-D data. Modern RGB-D sensors are widely used for indoor 3D capture, with applications ranging from modeling to robotics, through augmented reality. Nevertheless, their use is limited by their low resolution, with frames often corrupted with noise, missing data and temporal inconsistencies. Our approach consists in generating and updating through time a single set of compact local statistics parameterized over detected geometric proxies, which are fed from raw RGB-D data. Our proxies provide several processing primitives, which improve the quality of the RGB-D stream on the fly or lighten further operations. Experimental results confirm that our lightweight analysis framework copes well with embedded execution as well as moderate memory and computational capabilities compared to state-of-the-art methods. Processing RGB-D data with our proxies allows noise and temporal flickering removal, hole filling and resampling. As a substitute for the observed scene, our proxies can additionally be applied to compression and scene reconstruction. We present experiments performed with our framework in indoor scenes of different natures within a recent open RGB-D dataset.

1 INTRODUCTION

The real-time RGB-D stream output of modern commodity depth cameras can feed a growing set of end applications, from human-computer interaction and augmented reality to industrial design. Although such devices are constantly improving, the limited quality of their stream still restricts their impact. This mostly originates in the low resolution of the frames and the inherent noise, incompleteness and temporal inconsistency stemming from single-view capture.

We introduce a new multi-shape geometric superstructure to improve RGB-D streams on the fly by analyzing them. A sparse set of detected 3D primitive shapes is parameterized to record statistics extracted from the stream and forms a structure that we call proxies. This superstructure substitutes the RGB-D data and approximates the geometry of the scene. We define a geometric framework using the time-evolving statistics stored in our proxies and based on their consistent spatial support, which can then be seen as geometric "scaffolding". Its purpose is to improve the RGB-D stream on the fly by reinforcing features, removing noise and outliers or filling missing parts, under the memory-limited and real-time embedded constraints of mobile capture in indoor environments (Figure 1).

We designed such a lightweight geometric superstructure to be stable through time and space, which gives priors to apply several signal-inspired processing primitives to the RGB-D frames. They include filtering to remove noise and temporal flickering, hole filling or resampling (section 4). This allows structuring the data and can simplify or lighten subsequent operations, e.g. tracking and mapping, automated navigation, measurement, data transmission, rendering or physical simulation. While our primary goal is the enhancement of the RGB-D data stream, our framework can additionally be applied to compression (section 5.3.2) and scene reconstruction (section 6), as the proxy structure is a representation of the observed scene.

In practice, our system takes a raw RGB-D stream as input to build and update a set of geometric proxies on the fly. It outputs an enhanced stream together with a reconstruction of regularly-shaped (e.g., planar) areas in the observed scene. A selection of proxies based on the current RGB-D frame can be used for lightened transmission of the data. Proxies can be used as priors for triangulation and fast depth data meshing, with applications to rendering or simulation. Contrary to previous approaches, which mostly rely on a full volumetric reconstruction to consolidate data, our approach is lightweight, with a moderate memory footprint and a transparent interfacing to any higher level RGB-D pipeline.

In particular, our contributions are:

• a stable and lightweight geometric superstructure for RGB-D data, unified for multiple shapes (section 3.1.2);

• construction and updating methods which are spatially and temporally consistent (section 3.1.1);

• a set of compact statistics that record local information from RGB-D samples (section 3.2);

• a collection of RGB-D enhancement methods based on our structure (section 4) which run on the fly to generate a global scene reconstruction (section 6).

This paper extends our preliminary work [1], including:

• generalization to cylinders and spheres (section 3.1.2)

• additional cell-wise color statistics (section 3.2.2)

• new experiments on synthetic scenes (section 5.1).

Fig. 1. Overview of our geometric proxy framework. From a stream of 2.5D RGB-D frames (left), proxies are built on the fly and updated over time (bottom, detailed in Figure 3) and used to apply different real-time processing primitives to the incoming RGB-D frames (top). The system outputs an enhanced data stream and a geometric regular model of the observed scene (right).

2 RELATED WORK

2.1 RGB-D Data Improvement

Filtering. Depth maps can be denoised using spatial filters e.g., Gaussian, median, bilateral [2], adaptive or anisotropic [3] filters, often refined through time, with the resulting enhanced stream potentially used for a full 3D reconstruction [4]. Other methods include non-local means, bilateral filters with time, Kalman filters, over-segmentation and region growing. Wu et al. [5] present a shape-from-shading method using the color component to improve the geometry, which allows adding details to the low quality input depth. They show applications of their method to improve volumetric reconstruction on multiple small scale and close range scenes.

Depth maps can be upsampled using cross bilateral filters such as joint bilateral upsampling [6] or weighted mode filtering [7]. Such methods are particularly useful to recover sharp depth regions boundaries and enforce depth-based segmentation.

Hole Filling. Range limits and high noise levels of depth sensing often create holes in RGB-D data. Given the material of observed objects and the type of technology used, e.g. time of flight, light coding or stereo vision, some surfaces are harder to detect. The orientation of the surface with regard to the sensor and the perturbations due to light sources can also lower the quality in certain areas. In order to fill these holes in the depth component, one can use the same spatial filters as those used for denoising, or morphological filters [3]. Inpainting methods [8], over-segmentation or multi-scale processing are also used to fill holes for e.g., depth image-based rendering (DIBR) under close viewing conditions.

Shape-based Depth Processing. A set of 3D shapes offers a faithful yet lightweight approximation for many indoor environments. Surprisingly, only a few methods have used shape proxies as priors to process 2.5D data, with in particular Schnabel et al. [9] who detect limits of shapes to fill in holes in static 3D point clouds. Fast sampling plane filtering [10] detects and merges planar patches in static indoor scenes. The detected planes allow filtering the planar surfaces of the input point cloud, however the primitives seem quite sensitive to the depth sensor noise and lack spatial consistency.

2.2 Simple Primitive Shape Detection

Methods that build high level models of captured 3D data, composed of simple geometric primitive shapes, are mostly based on RANSAC [11], the Hough transform [12] or region growing algorithms. In our embedded, real-time, memory-limited context, we take inspiration from the RANSAC-based method of Schnabel et al. [13] for its time and memory efficiency, by repeating shape detection through time to acquire a consistent model and cope with the stochastic nature of RANSAC.

Their efficient RANSAC implementation gives stochastic improvements to the critical steps of the algorithm in terms of complexity. For a regular RANSAC-based shape detection, minimal sets of three oriented points would be randomly picked a fixed and large number of times. Then, the shape parameters are estimated from this minimal set and inliers of the estimated shape are computed. The shape with the highest score is kept, its inliers are removed from the point cloud and the algorithm is run again on the remaining data. Schnabel et al. replace the fixed number of loops with a stochastic condition to stop looking for shapes in the dataset, based on the number of detected shapes and number of randomly picked minimal sets. In addition, instead of searching the full point cloud for inliers of a given shape, they estimate this count in a random subset of the dataset and extrapolate it to the full point cloud. Other modifications allow improving the quality of detected shapes with a localized sampling and specific postprocessing.
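To make the regular RANSAC loop described above concrete, the following minimal sketch detects a single plane with a fixed number of iterations. It is an illustration under simplifying assumptions, not the efficient variant of Schnabel et al.: point normals are ignored, the score is the plain inlier count, and the names `plane_from_points` and `ransac_plane` as well as the default iteration count and threshold are hypothetical.

```python
import math
import random

def plane_from_points(a, b, c):
    """Return (unit normal n, offset d) such that n.p = d on the plane,
    or None when the three points are collinear."""
    ab = [b[i] - a[i] for i in range(3)]
    ac = [c[i] - a[i] for i in range(3)]
    n = [ab[1] * ac[2] - ab[2] * ac[1],
         ab[2] * ac[0] - ab[0] * ac[2],
         ab[0] * ac[1] - ab[1] * ac[0]]
    norm = math.sqrt(sum(x * x for x in n))
    if norm < 1e-12:
        return None
    n = [x / norm for x in n]
    return n, sum(n[i] * a[i] for i in range(3))

def ransac_plane(points, iterations=200, threshold=0.01, rng=None):
    """Regular RANSAC: sample minimal sets a fixed number of times and keep
    the candidate plane supported by the most inliers."""
    rng = rng or random.Random(0)
    best_plane, best_inliers = None, []
    for _ in range(iterations):
        candidate = plane_from_points(*rng.sample(points, 3))
        if candidate is None:
            continue  # degenerate minimal set
        n, d = candidate
        inliers = [i for i, p in enumerate(points)
                   if abs(sum(n[k] * p[k] for k in range(3)) - d) <= threshold]
        if len(inliers) > len(best_inliers):
            best_plane, best_inliers = candidate, inliers
    return best_plane, best_inliers
```

Detecting several shapes would repeat this call after removing the returned inliers from the point cloud, as described in the text.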

Our framework is not attached to a particular shape detection method, and other algorithms such as point clustering [14] or agglomerative hierarchical clustering [15], could be used. For a complete overview of methods to detect simple geometric primitive shapes in captured 3D data, we refer the reader to a recent survey [16].

2.3 RGB-D Stream Reconstruction

Dense SLAM. Online dense simultaneous localization and mapping (SLAM) methods accumulate points within a map of the environment, while continuously localizing the sensor in this map. Recent dense SLAM systems include KDP SLAM [17] or ORB-SLAM2 [18]. Point-based fusion [19] is also used to accumulate points without the need of a full volumetric representation.

In our implementation, we use RGB-D SLAM [20] which leverages point features detected in the color component of the RGB-D frame to estimate camera motion. After detecting and matching SIFT, SURF or ORB features in subsequent color images, their 3D positions in both frames are computed using the depth component. Using these matching 3D points, a robust RANSAC-based estimation of the motion matrix allows discarding false positive matches. Sets of three matching points are randomly picked and the matrix transforming a set in the first frame into the second set is computed using a least squares method. Inliers of the transformation are estimated using their 3D position and orientation and the one giving the most inliers is kept. It is important to note that any existing method or device that localizes an RGB-D camera in its environment can be used instead.

Volumetric Depth Fusion. Depth fusion builds a volumetric representation of an input scene by accumulating depth observations into a voxel grid to update values of signed distance to the nearest model surface. Scene reconstruction methods using volumetric fusion became popular with KinectFusion [4], the first online reconstruction method based on the consumer-grade Kinect depth sensor. Recent optimizations include VoxelHashing [21] for efficiency and BundleFusion [22] for accuracy. However, the need for a voxel grid representing the space leads to high memory requirements.

Offline Surface Regularization. Several methods have been developed to include geometric primitives in the SLAM system, either to smooth and improve the reconstruction [23] or improve the localization of the sensor [24]. A recent offline method [25] makes use of planes to estimate the geometry of a room in order to remove furniture and model the lighting of the environment. This allows the user to re-light and re-furnish the room as desired. Some recent algorithms make use of planes to smooth and complete the data within a depth fusion volume, such as methods by Zhang et al. [26] or Dzitsiuk et al. [27]. Offline improvement methods have been developed based on the volumetric representation of the scene, such as 3DLite [28] that builds a planar model of the observed scene and optimizes it to achieve a high quality texturing of the surfaces.

3 GEOMETRIC PROXIES

Basically, our model represents RGB-D data which is often seen and consistent through frames and space, hence revealing the dominant structural elements in the scene. To do so, it takes the form of a geometric superstructure, made of multiple shape proxies, all equipped with a local frame, bounds and, within the bounds, a regular 2D grid of rich statistics, mapped on the shape and gathered from the RGB-D data. Figure 2 gives visual insight into a geometric proxy. A proxy can have the shape of a plane, a cylinder or a sphere. Our implementation is based on the efficient RANSAC shape detection method by Schnabel et al. [13].

3.1 Building Geometric Proxies

3.1.1 Shape Detection and Tracking

We build geometric proxies on the fly and update them through time using solely incoming raw RGB-D frames from the live stream. More precisely, for each new RGB-D image (color and depth), we run the procedure described in Figure 3 and shown in Figure 4 on one specific example.

The initial depth filtering is based on a bilateral convolution [2] of the depth map using a Gaussian kernel associated with a range check. This allows discarding points further than a depth threshold from the current point, which could create artificial depth values if taken into account. In our experiments, we choose to set this threshold to 20cm, which allows filtering together parts of the same object, while ignoring the influence of unrelated objects. We then estimate the normal field through the simple computation of the depth gradient at each pixel, due to the embedded processing constraint, using the sensor topology as domain.
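As an illustration of the range-checked filtering described above, the sketch below smooths a depth map with a Gaussian kernel while skipping neighbors whose depth differs from the center pixel by more than the 20cm threshold. The function name `filter_depth`, the kernel radius and the value of `sigma` are assumptions made for this example, and normal estimation is not covered.

```python
import math

def filter_depth(depth, radius=2, sigma=1.5, max_diff=0.20):
    """Gaussian smoothing of a depth map (in meters, 0 = missing) that ignores
    missing samples and neighbors further than max_diff from the center depth."""
    h, w = len(depth), len(depth[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            center = depth[y][x]
            if center <= 0.0:
                continue  # holes stay holes at this stage
            acc = weight_sum = 0.0
            for dy in range(-radius, radius + 1):
                for dx in range(-radius, radius + 1):
                    yy, xx = y + dy, x + dx
                    if not (0 <= yy < h and 0 <= xx < w):
                        continue
                    z = depth[yy][xx]
                    if z <= 0.0 or abs(z - center) > max_diff:
                        continue  # range check: unrelated surface or hole
                    wgt = math.exp(-(dx * dx + dy * dy) / (2.0 * sigma * sigma))
                    acc += wgt * z
                    weight_sum += wgt
            out[y][x] = acc / weight_sum  # the center always contributes
    return out
```

With this range check, a depth discontinuity larger than the threshold is left sharp, since samples on either side never mix.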

The estimation of the camera motion from the previous frame is inspired by RGB-D SLAM, introduced by Endres et al. [20], using point features from the color images. However, any egomotion estimation algorithm can be used at this step, as all we need is the values of the six degrees of freedom localizing an RGB-D camera in its environment. Examples of such algorithms are given in section 2.3.

In order to keep or discard previously detected proxies, we define a voting scheme where samples of the current frame which are inliers of a given previous proxy cast their vote to this proxy and are marked. Then, the per-proxy vote count in the new frame indicates whether the proxy is preserved or discarded. Preserved proxies are updated with the current frame, hence see their parameters refined and occupancy statistics updated with new inliers. Discarded proxies are placed in probation state for near-future recheck with new incoming frames, and purged if discarded for too long. However, in order to avoid losing information on non-observed but important parts of the scene, we do not purge proxies that have been seen a large number of times, which stay in probation instead.
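The keep, probation and purge logic above can be sketched as a small per-proxy state update. The three thresholds below are hypothetical placeholders, since no numeric values are given in the text, and `ProxyState` and `update_state` are illustrative names.

```python
from dataclasses import dataclass

# Hypothetical thresholds: the text does not give numeric values for these.
MIN_VOTES = 50        # votes needed to preserve a proxy in the current frame
MAX_PROBATION = 30    # frames a discarded proxy may remain in probation
WELL_OBSERVED = 200   # proxies seen at least this often are never purged

@dataclass
class ProxyState:
    times_seen: int = 0
    probation_frames: int = 0
    purged: bool = False

def update_state(state: ProxyState, votes: int) -> ProxyState:
    """Apply the voting outcome of one frame to a tracked proxy."""
    if votes >= MIN_VOTES:
        # Preserved: the proxy leaves probation, if it was in it.
        state.times_seen += 1
        state.probation_frames = 0
    else:
        # Discarded for this frame: stays in probation for later recheck.
        state.probation_frames += 1
        if (state.probation_frames > MAX_PROBATION
                and state.times_seen < WELL_OBSERVED):
            state.purged = True
    return state
```

A proxy that has been observed many times therefore remains in probation indefinitely when it leaves the field of view, instead of being purged.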

When new proxies have been detected, similar ones are merged together in order to avoid modeling different parts of geometric surfaces with multiple proxy instances. The proxy is then initialized with a bounding rectangle and a local frame computed to be aligned with the scene orientation, based on the Manhattan world assumption [29] (more details in the supplemental material, section S.2.3). Using the global scene axes to compute the local proxy frame leads to a fixed resolution and spatial consistency for the grid of all proxies and allows efficient recovery and fusion.

Fig. 2. Proxy model. Built upon a shape in 3D space (e.g., a plane), our proxy model is made of a local frame, bounds and a grid of cells containing statistics. These statistics are a collection of mean and variance values representing a smoothed local histogram, as well as quantized color information, all gathered from the RGB-D data (section 3.2). Activated cells are the ones containing inliers from many frames.

Fig. 3. Overview of proxy structure construction. This procedure is run for each new RGB-D image, made of a color image and a depth map. The proxy update allows accumulating samples in the superstructure to compute statistics, as detailed in section 3.2. Quantities shown in the diagram: the low-pass filtered version of the depth map; the camera motion matrix; the normal vectors associated with the filtered depth; the proxies detected at the current frame; the candidate proxies; the proxies at a given frame; and the low-pass filtered depth points without inliers from existing proxies.

Finally, initial occupancy statistics are computed by recovering the coordinates of the cell in the proxy grid corresponding to each inlier. In order to take into account the point of view when modeling the scene, grid cell coordinates for an inlier depth point are computed by projecting it upon the detected shape following the direction between the camera and the point. This projection is illustrated in Figure S.3 of the supplemental material. As a last step, proxies are transformed from local depth frame into global 3D space in order to be tracked in the next frames.

3.1.2 Spatial Extent Modeling

In order to model the observed elements as faithfully as possible, we need to take into account the extent of these elements at the surface of the shape. In addition, we wish to store local information on parts of the shape to avoid smoothing small details, which requires quantization of the shape surface. When proxy shapes are detected, the shape is parameterized based on its local frame in order to define a consistent grid, fixed with relation to the actual object. In particular, densely modeling local geometry implies requirements of minimum cell distortion for uniform representation of local information and efficient unfolding to 2D space. The parameterization from 3D space at the surface of the shape to 2D space is described below for planes and revolution surfaces such as cylinders and spheres.

Planes. We define local axes in the space of a plane based on the reference orientation of the world, as detailed in the supplemental material (section S.2.3). This leads to a consistent orientation of the grid of all proxies, where the two local proxy axes both belong to the plane. This local frame allows us to compute a consistent parameterization of the shape for any input image. In particular, a 3D point P belonging to a plane of origin point C has coordinates (u, v) in the local frame of the proxy given by the projections of the vector from C to P onto these two local axes.
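For the planar case, the two steps involved (projection along the camera ray, then expression in the local frame) can be sketched as follows. This is a minimal illustration assuming unit-length, orthogonal in-plane axes; the function names and argument layout are hypothetical.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def project_on_plane(p, camera, origin, normal):
    """Intersect the ray going from the camera through p with the plane
    defined by (origin, normal). Returns None if the ray is parallel to it."""
    direction = [pi - ci for pi, ci in zip(p, camera)]
    denom = dot(direction, normal)
    if abs(denom) < 1e-12:
        return None
    t = dot([oi - ci for oi, ci in zip(origin, camera)], normal) / denom
    return [ci + t * di for ci, di in zip(camera, direction)]

def plane_coordinates(q, origin, x_axis, y_axis):
    """(u, v) coordinates of a point q of the plane, as dot products of the
    vector from the plane origin to q with the two in-plane local axes."""
    rel = [qi - oi for qi, oi in zip(q, origin)]
    return dot(rel, x_axis), dot(rel, y_axis)
```

For example, with the camera at the world origin and a proxy plane z = 2, the sample (0.5, 0.25, 1.0) is projected along its viewing ray before its (u, v) coordinates are read in the local frame.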

Revolution Shapes. For the shapes of revolution such as cylinders and spheres, we want to maintain a unified representation allowing the use of the same 2D operators as for the plane. In that regard, we define a parameterization for both of these shapes, also based on their local axes. Figure 5 shows the parameterization of the revolution shapes.

The cylinder proxy has two local axes orthogonal to its direction axis, allowing the use of cylindrical coordinates. Hence, these two local axes define the angular position around the cylinder, while the direction axis defines the height of a point along the cylinder. On a cylinder of origin point C and radius r, the local shape coordinates of a 3D point P are thus its angular position around the direction axis and its height along this axis.
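One plausible way to write these cylindrical coordinates is given below. It is a hedged reconstruction from the description above rather than the original formula: the symbols x and y (the two local axes orthogonal to the cylinder), a (the direction axis) and the scaling of the angle by r to obtain an arc length in meters are all assumptions.

```latex
\begin{aligned}
u &= r \,\operatorname{atan2}\big( (P - C)\cdot \mathbf{y},\; (P - C)\cdot \mathbf{x} \big), \\
v &= (P - C)\cdot \mathbf{a}.
\end{aligned}
```

Expressing u as an arc length would keep both coordinates in metric units, which is convenient when the grid is discretized with a fixed cell size.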

Fig. 4. Example of proxy structure construction. Top: the input RGB-D frame is converted into a raw point cloud to which a low-pass filter is applied, followed by normal estimation. Right: RANSAC-based shape detection [13] is applied and used as a basis for the construction or the update of our proxies, following the structure shown in Figure 2. Bottom: accumulated proxy cells are visualized with colors weighted by their occupancy probabilities: darker cells have a low probability and represent low confidence areas, whereas brighter cells represent areas with high confidence. Proxies are then used to generate enhanced RGB-D frames, as detailed in section 4. The whole process runs on the fly.

Fig. 5. Proxy grid parameterization for cylinders and spheres: (a) cylindrical parameterization, (b) spherical parameterization, (c) octahedral parameterization, (d) octahedron, (e) flattened octahedron. Cylinders are parameterized with the cylindrical coordinates (a). As parameterizing a sphere with spherical coordinates creates heavy distortion at the poles (b), we choose to use the octahedral parameterization (c), which reduces stretch between grid cells. This parameterization defined by Praun and Hoppe [30] uses an octahedron (d) to reduce stretch at the surface of the sphere. The corresponding 2D grid (e) fully contains the extent of the 3D shape.

For the sphere, the straightforward parameterization would be based on spherical coordinates. Two local axes of the sphere proxy are defined in that sense, where they allow computing the azimuthal angle. In that parameterization, the polar angle is computed from the zenith direction. However, as we can see in Figure 5 (b), the use of spherical coordinates implies strong distortion at the poles.

As we want to discretize the shape extent to locally represent the surface with statistics, it would be more meaningful to have a grid with similar cell sizes, while keeping a fast and efficient unfolding to 2D space. In consequence, we choose to parameterize a sphere with an octahedron, as defined by Praun and Hoppe [30], which strongly reduces stretch and conveniently unfolds to a square with no space lost.

On a sphere of center point C and radius r parameterized as an octahedron, the local shape coordinates of a 3D point P are obtained by mapping the direction from C to P onto the octahedron and unfolding the octahedron to the 2D square.
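The sketch below shows one common form of the octahedral mapping, in which the upper hemisphere lands in the inner diamond of the square and the lower hemisphere is folded into its corners. It is given as an assumption for illustration and is not necessarily the exact layout used with the parameterization of Praun and Hoppe [30]; the function name `octahedral_uv` is hypothetical.

```python
import math

def octahedral_uv(p, center):
    """Map the direction center -> p to (u, v) in [-1, 1] x [-1, 1]."""
    d = [pi - ci for pi, ci in zip(p, center)]
    l1 = abs(d[0]) + abs(d[1]) + abs(d[2])
    if l1 == 0.0:
        raise ValueError("p coincides with the sphere center")
    # Central projection of the direction onto the octahedron |x|+|y|+|z| = 1.
    x, y, z = (c / l1 for c in d)
    if z >= 0.0:
        return x, y  # upper half: inner diamond of the square
    # Lower half: fold each triangular face outward into a corner.
    return (math.copysign(1.0 - abs(y), x), math.copysign(1.0 - abs(x), y))
```

Since the radius only rescales the direction, it does not appear in the mapping; a metric grid would scale (u, v) by a factor depending on r.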

Extent Discretization. From the 2D local coordinates of each shape as described above, we define a fixed-size grid on top of the shape surface. We simply discretize the coordinate values (u, v) of a given point at the surface of the shape using a fixed cell size. We set the size to 5cm x 5cm, which corresponds to about four times the area of a depth pixel at a typical distance of 8 meters from the sensor. The area of a pixel at a given depth z follows from the field of view of the sensor and its resolution, here res = (320, 240). Hence, this size ensures a minimum sampling of proxy cells by depth points even at far capture distances. In practice, the real case capture distance will not go beyond 5 to 6 meters due to the size of indoor rooms and limitations of the capture device. In consequence, the potential number of depth points visiting a cell in each frame is guaranteed to be more than four.
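The relation between pixel footprint and cell size can be checked with a short computation. In the sketch below, the resolution and the 5cm cell size come from the text, while the field of view `FOV_DEG` is an assumed value typical of consumer depth sensors and is not taken from the source; the pinhole footprint formula and the function names are likewise illustrative.

```python
import math

# Assumed sensor field of view in degrees (NOT from the source).
FOV_DEG = (58.0, 45.0)
RES = (320, 240)    # depth resolution stated in the text
CELL_SIZE = 0.05    # 5 cm cells, as stated in the text

def pixel_area(z, fov_deg=FOV_DEG, res=RES):
    """Approximate area (m^2) covered by one depth pixel at depth z (m)."""
    width = 2.0 * z * math.tan(math.radians(fov_deg[0]) / 2.0) / res[0]
    height = 2.0 * z * math.tan(math.radians(fov_deg[1]) / 2.0) / res[1]
    return width * height

def cell_index(u, v, cell_size=CELL_SIZE):
    """Integer grid cell containing the local coordinates (u, v) in meters."""
    return math.floor(u / cell_size), math.floor(v / cell_size)

# Ratio between the area of a cell and the footprint of a pixel at 8 m.
ratio_at_8m = CELL_SIZE ** 2 / pixel_area(8.0)
```

Because the pixel footprint grows with the square of the depth, the number of depth points per cell increases quickly at the shorter distances encountered indoors.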

Cells are activated when their visitation percentage over the recent frames (the last 100 frames in our experiments) is greater than a threshold (25% in practice). Once activated, a cell stays so until the end of the processing. We consider a cell as visited as soon as it admits one inlier data point i.e., a data point located within a threshold distance to the cell under a projection in the direction from the sensor origin to the point. This activation threshold allows modeling the actual geometry of the observed scene, while discarding outlier observations due to the low quality of the sensor.
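The activation rule can be sketched with a sliding window of visit flags. The window length and threshold follow the values given above; the class name `CellActivation` and the choice to divide by the full window length (so that a cell cannot activate from a few early frames) are assumptions.

```python
from collections import deque

class CellActivation:
    """Tracks visits over a sliding window; activation is permanent."""

    def __init__(self, window=100, threshold=0.25):
        self.history = deque(maxlen=window)  # oldest flags drop out automatically
        self.window = window
        self.threshold = threshold
        self.active = False

    def update(self, visited: bool) -> bool:
        """Record whether the cell received an inlier in the current frame."""
        self.history.append(1 if visited else 0)
        # Assumption: the ratio is taken over the full window length.
        if sum(self.history) / self.window > self.threshold:
            self.active = True  # never reset once reached
        return self.active
```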

3.2 Proxy Statistics

3.2.1 Accumulation of Depth Samples

As shown in section 3.1.1, we track 3D shapes in time and space to maintain a consistent geometric representation of the surroundings. This temporal and spatial consistency gives information on local geometry at its fixed location in the real world. Hence, we leverage the 3D sampling of this geometry as provided by the depth sensors, to refine our model with time. At each new frame, shape tracking provides a list of 3D sample points belonging to a shape. These 3D points bear geometric and appearance information such as distance to the shape, orientation deviation, curvature and color. The definition of our local extent model at the shape surface, as explained in section 3.1.2, allows accumulating this information into a unified geometric representation in shape space.

The first use of these accumulated depth samples at the shape surface is to compute global shape statistics. At each frame, the new set of inliers is used to compute shape parameters which are averaged with the previous parameters to get more consistent shapes. In addition, the standard deviation of shape parameters gives insight on their distribution and can be seen as a confidence value on the shape.
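One simple way to maintain such an average and standard deviation without storing past frames is Welford's online update, sketched below for a single scalar parameter (for example a plane offset or a cylinder radius). This is an illustrative choice, not necessarily the update used in our implementation, and `RunningStat` is a hypothetical name.

```python
import math

class RunningStat:
    """Online mean and standard deviation of one scalar shape parameter."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0
        self._m2 = 0.0  # sum of squared deviations from the running mean

    def add(self, value):
        """Fold in the parameter estimated from the inliers of a new frame."""
        self.count += 1
        delta = value - self.mean
        self.mean += delta / self.count
        self._m2 += delta * (value - self.mean)

    @property
    def std(self):
        """Population standard deviation, usable as a confidence indicator."""
        return math.sqrt(self._m2 / self.count) if self.count else 0.0
```

Vector parameters such as normals or axis directions would need one instance per component and a renormalization of the averaged vector.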

3.2.2 Local Statistics

The consistent grid defined at the surface of a shape based on a shape-specific parameterization was designed to gather statistics on local parts of the shape, in order to model small details. In each cell of the grid, a probability of occupancy is computed to get knowledge of the visitation rate of this cell. This allows understanding whether a cell models actual geometry in the scene, or if it just corresponds to noise or flickering parts and should be ignored. A real data example of statistics of occupancy and distance stored at the surface of shapes is shown as supplemental material (Figure S.2).

Shape Distance. Each cell of the grid includes a statistical model of the depth values gathered from the RGB-D data. These statistics are composed of a collection of mean and variance values of the distance to the shape. This local distribution represents a smoothed local histogram [31] made of Gaussian kernels. Using such a compressed histogram representation allows recording samples into a compact but faithful model made of a short list of normal distribution parameters, as well as smoothing out outlier depth values. Each inlier p, at distance d(p) to the proxy, contributes to this histogram as a Gaussian kernel.

This compressed model stores the distribution of the distances of shape inliers to the proxy and makes it possible to estimate the diversity of the values within each cell by counting the number of modes in the distribution, which appear naturally when building the histogram. If it has a single mode, then all values are similar and the surface of the proxy within the cell is most likely flat. If the distribution has two or more modes, then the values belong to different groups and the cell likely overlaps a salient area of the surface. A close-up example of flat and salient proxy areas is shown as supplemental material (Figure S.1), with representation of the associated smooth histograms and distribution modes.
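To illustrate how a short list of Gaussian parameters can expose the number of modes, the sketch below keeps a few (count, mean, variance accumulator) triplets per cell and merges each new distance into the nearest one. This is a simplified stand-in for the smoothed local histogram, not its actual definition: the merge distance, the minimum share used to ignore outlier modes and the class name `DistanceModes` are all assumptions.

```python
class DistanceModes:
    """Compact per-cell model: a short list of modes [count, mean, m2]."""

    def __init__(self, merge_distance=0.02):
        self.merge_distance = merge_distance  # hypothetical value, in meters
        self.modes = []

    def add(self, distance):
        """Merge a sample into the nearest mode, or open a new mode."""
        nearest = min(self.modes, key=lambda m: abs(distance - m[1]), default=None)
        if nearest is None or abs(distance - nearest[1]) > self.merge_distance:
            self.modes.append([1, distance, 0.0])
            return
        nearest[0] += 1
        delta = distance - nearest[1]
        nearest[1] += delta / nearest[0]
        nearest[2] += delta * (distance - nearest[1])

    def mode_count(self, min_share=0.1):
        """Number of modes holding at least min_share of all samples."""
        total = sum(m[0] for m in self.modes)
        return sum(1 for m in self.modes if total and m[0] / total >= min_share)

    def is_flat(self):
        return self.mode_count() == 1
```

A cell receiving distances around a single value is reported as flat, while a cell straddling a step in the surface yields two significant modes.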

Color. In order to accumulate faithful color information within our coarse grid, we define color samples at higher resolution than the cells. To do so, we take inspiration from the mesh colors methodology [32] which defines discrete color positions on faces and edges of triangle meshes. In our framework, each cell of the grid contains a fixed color sub-resolution grid which is updated by each RGB-D sample point falling into the cell. Figure 6 details the update of color in the grid from an RGB-D sample.

For each RGB-D sample falling within the bounds of a given proxy cell, we compute a neighborhood n of color positions which will be updated with this sample’s color value. In order to update as many color positions as are covered by a depth point, we compute an influence radius of the point based on its distance to the sensor. The topology of depth sensors implies that the further the depth point, the larger the size of the image pixel when unprojected to 3D. Hence, the influence radius is computed as the size of this depth pixel in 3D and discretized to get the size of the color point neighborhood at the surface of the shape.

For a 3D point sample of depth z, with a width of grid cells defined as w and a color point resolution defined by its power of two r, the point influence radius and the discrete color neighborhood radius n are given by Equation 5.

Then, for each color point within the computed neighborhood n, including points belonging to neighboring cells, the average color is updated with the point’s color. In order to take into account the uncertainty of information brought by depth samples far from the camera, we weight the color contribution with both the distance from the color point to the projected depth point and the color neighborhood size, which itself is computed from the point depth, as shown in Equation 5. Here, weighting the color inversely to the point depth allows further points to give a less important contribution to the color model. For each neighboring color point, the color contribution weight of a given depth point is defined in Equation 6. In practice, the standard deviation factor is set empirically to give a good balance between consistency and sharpness of the textures.

Fig. 6. Updating color points of a proxy cell from the color component of an RGB-D point sample. An RGB-D point sample projected on the 2D proxy grid (red square) falls within a discretized color point (light green). w is the cell width in meters. Here, color points are defined with r = 2. Radii of influence area (dotted light red) and color point neighborhood n (dotted light blue) are computed from the point depth (see Equation 5). All color points within the computed neighborhood n are updated with a weight depending on the distance to the point and the size of the influence area (see Equation 6).

4 RGB-D STREAM PROCESSING

Our first use of the consistent geometric proxy structure is as a support for filtering operators with the goal of:

• removing noise to smooth data while keeping small details and local saliency;

• filling holes due to specular elements or unseen parts;

• resampling the incoming point cloud at any desired resolution at the surface of shape proxies.

4.1 Filtering

While projecting the sensor’s data points onto their associated proxy would allow removing the acquisition noise and quantization errors due to the sensor, this would lead to the flattening of all shape inliers. In order to minimize the loss of details on the shape surfaces while keeping a lightweight data structure, we instead use the geometric proxies as a simple collaborative filter model.

To that end, we designed a custom filter to leverage the smoothed local histograms stored in each cell of the proxies. As explained in section 3.2.2, the number of detected modes allows distinguishing flat areas of the proxy surface from salient ones. For flat cells whose distribution has a single mode, we project the depth points on the shape along the direction between the camera and the point. We offset the points by the average distance to the shape only if it is above the noise threshold at the corresponding distance to the camera (see details on the noise threshold in the supplemental material, section S.2.2). This allows smoothing surface areas that are exactly on the shape while keeping flat areas offset from the shape as they are in the scene. For cells whose distribution has two or more modes, we do not perform any projection in order to keep the saliency and details of the surface.

Equation 7 details the smoothed local histogram-based filtering of an inlier p into its filtered position, where p belongs to a cell c characterized by its number of modes and its average distance to the proxy, given a noise threshold.

The function proj(p) represents the projection of p on the proxy along the camera direction, which can also be seen as the intersection between the proxy shape and the camera ray passing through the pixel that generated point p. The function norm(p) represents the normal vector on the shape surface at the projected point location. An illustration of the filtering process on a planar example is shown as supplemental material (Figure S.3).
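The filtering rule described in this section can be sketched for a planar proxy as follows. The code is a reading of the prose under stated assumptions rather than a transcription of Equation 7: the offset is applied along the unit normal, the average distance is taken as signed, and the function name `filter_inlier` and its parameters are hypothetical.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def filter_inlier(p, camera, origin, normal, mode_count,
                  mean_distance, noise_threshold):
    """Histogram-guided filtering of one inlier p of a planar proxy.
    normal must be a unit vector; mean_distance is the signed average
    distance of the cell samples to the plane along that normal."""
    if mode_count != 1:
        return list(p)  # salient cell: keep the point untouched
    direction = [pi - ci for pi, ci in zip(p, camera)]
    denom = dot(direction, normal)
    if abs(denom) < 1e-12:
        return list(p)  # viewing ray parallel to the plane
    # proj(p): intersection of the camera ray through p with the plane.
    t = dot([oi - ci for oi, ci in zip(origin, camera)], normal) / denom
    projected = [ci + t * di for ci, di in zip(camera, direction)]
    if abs(mean_distance) > noise_threshold:
        # Flat area offset from the shape: restore its average offset.
        projected = [qi + mean_distance * ni
                     for qi, ni in zip(projected, normal)]
    return projected
```

Points in single-mode cells thus end up either on the shape or on a surface parallel to it, while points in multi-mode cells keep their measured position.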

In addition, the proxy can also be used as a high level range space for cross bilateral filtering [6], where inliers of different proxies will not be processed together.

Temporal Flickering Removal. Based on time-evolving data points, the proxies consolidate the stable geometry of the scene by accumulating observations from multiple frames. Averaging those observations over time removes temporal flickering, after a few frames only.

4.2 Hole Filling

Missing data in depth is often due to specular and transparent surfaces such as glass or screens. With our framework, observed data is reinforced over multiple frames from the support of stable proxies, augmenting the current frame with samples from previous ones. In practice, the depth data that is often seen in incoming frames creates activated cells with sufficient occupancy probability to survive within the model even when samples for these cells are missing.

This hole filling, stemming naturally from the proxy structure, is completed by two additional steps. First, the extent of the proxies is extrapolated to the intersection of adjacent proxies – this is particularly useful to complete unseen areas under furniture for example. Second, we perform a morphological closing [33] on the grid of cells, with a square structural element having a fixed side of seven cells. This corresponds to closing holes of maximum 35cm by 35cm, which allows filling missing data due to small specular surfaces, e.g. computer screens or glass-door cabinets, while keeping larger openings such as windows or doors.
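The closing step can be sketched on a boolean occupancy grid as a dilation followed by an erosion with the same square element. This is a plain-Python illustration and not the authors' implementation. The function names are hypothetical, and treating out-of-grid cells as ignored at the borders is an assumption.

```python
def _window(grid, i, j, r):
    """Yield the values of the in-grid cells in the (2r+1)x(2r+1) square around (i, j)."""
    h, w = len(grid), len(grid[0])
    for y in range(max(0, i - r), min(h, i + r + 1)):
        for x in range(max(0, j - r), min(w, j + r + 1)):
            yield grid[y][x]


def dilate(grid, r):
    """A cell becomes active if any cell in its window is active."""
    return [[any(_window(grid, i, j, r)) for j in range(len(grid[0]))]
            for i in range(len(grid))]


def erode(grid, r):
    """A cell stays active only if every in-grid cell in its window is active."""
    return [[all(_window(grid, i, j, r)) for j in range(len(grid[0]))]
            for i in range(len(grid))]


def close_grid(grid, side=7):
    """Morphological closing with a square structuring element of `side` cells."""
    r = side // 2
    return erode(dilate(grid, r), r)


# A 3x3 hole in a fully active 15x15 grid is filled by the closing.
grid = [[not (6 <= i <= 8 and 6 <= j <= 8) for j in range(15)] for i in range(15)]
closed = close_grid(grid)
```

With this formulation, a hole is filled whenever the square element cannot fit entirely inside it, while larger openings are restored to their original extent by the erosion.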

4.3 Resampling

Our proxies can be used to super-sample RGB-D streams on the fly. The low definition geometric component of raw frames can be enriched by the higher resolution information structured in the proxies. Our 3D structure can guide the process using both its color and geometric components, whose smoothness and stability appear naturally with the accumulation of samples. In addition, the local nature of statistics stored in the proxy cells allows generating high resolution data while keeping geometrical details and salient areas and avoid over-smoothing. The sub-resolution color component of the proxies can be used to assist the sampling process to recover even more detail. This results in high definition RGB-D data with controllable point density on the surface of the shapes.
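To make the notion of controllable point density concrete, the sketch below generates k x k points inside one cell of a planar proxy and lifts them by the stored cell offset. The local frame (`origin`, `u`, `v`, `n`), the parameter names and the uniform sub-cell pattern are assumptions made for illustration; the text does not specify the sampling pattern.

```python
def sample_cell(origin, u, v, n, cell_size, i, j, offset, k):
    """Return k*k 3D points regularly spread over cell (i, j) of a planar proxy.

    origin, u, v, n: proxy local frame (3-tuples), with u and v spanning the plane
    cell_size: side of a cell in meters; offset: average distance to the shape
    """
    points = []
    for a in range(k):
        for b in range(k):
            s = (i + (a + 0.5) / k) * cell_size   # coordinate along u
            t = (j + (b + 0.5) / k) * cell_size   # coordinate along v
            points.append(tuple(
                o + s * ui + t * vi + offset * ni
                for o, ui, vi, ni in zip(origin, u, v, n)
            ))
    return points


# 2x2 samples in cell (0, 0) of a 5 cm grid lying in the z = 0 plane.
pts = sample_cell((0, 0, 0), (1, 0, 0), (0, 1, 0), (0, 0, 1), 0.05, 0, 0, 0.01, 2)
```

Increasing `k` raises the output density without touching the stored statistics, since every sample of a flat cell reuses the same offset.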

5 EXPERIMENTS

5.1 Validation on Synthetic Data

We define a synthetic data processing framework, in order to validate RGB-D data modeling with our geometric proxies while testing all kinds of shapes, even when not commonly seen in human-made environments. We use the Blender tool to generate a synthetic RGB-D image set from known 3D objects of planar, cylindrical and spherical shapes, along with ground truth camera poses. Figure 7 shows the proxies generated from this artificial dataset, along with a ground truth rendering at the same position. Results from different generated scenes are shown as supplemental material (section S.4).

In particular, processing data generated from synthetic models allows comparing texture information between proxy-generated and ground truth images from 3D models. Qualitative comparison of texture images used to generate the synthetic data and computed with the color point model of our proxies, is available as supplemental material (Figure S.5). This figure shows that our proxies allow re-parameterization of the texture information for the revolution surfaces, from bin packing for the cylinder or spherical for the sphere, into more meaningful and less distorted cylindrical or octahedral solutions.

5.2 Building Proxies

Our proxies are implemented through hardware and software components. The hardware setup is made of a computer with Intel Core i7 at 3.5GHz and 10GB memory. No GPU is used. The software setup has a client-server architecture, where the server runs within an embedded environment with low computational power and limited memory to trigger the sensor and process the data. The client’s graphical user interface allows controlling the processing parameters and getting a real time feedback of the stream. A limited range of intuitive parameters allows the user to control the trade-off between quality of the output and performance of the processing. In order to achieve a better quality and efficiency when building proxies, a few minor optimizations have been implemented, detailed in the supplemental material (section S.2).

5.2.1 Dataset

We run all of our experiments on the 3DLite [28] dataset, containing 10 scenes acquired with a Structure sensor in the form of RGB-D image sequences. This choice was motivated by the availability of ground truth poses along with the visual data, as well as result meshes and performance metrics provided from processing with both BundleFusion [22] and 3DLite [28], with which we compare our method in section 6.1.

5.2.2 Quantitative Results

For the 10 scenes of the dataset, our method generates between 40 and 90 proxies per scene. Up to 130K proxy cells per scene allow modeling the input data. Geometric statistics on the generated proxies are available in Table 1 for all processed scenes. The accuracy of the proxy representation can be quantitatively assessed through the PSNR values in Table 2. In addition, the fast convergence of the proxy statistics is shown in Figure S.7 of the supplemental material, where the variation over time of the average distance to shape falls below 0.5mm after about 30 accumulated samples only.

TABLE 1 Statistics on the geometric proxies

Total number of proxies and cells for each scene of the 3DLite [28] dataset. The average number of observed proxies and cells per frame (/ fr.) is given.

The time required to build and update geometric proxies using our current implementation is around 150 ms for an input depth image of 320x240 pixels. A detailed graph presenting the breakdown of the processing time for all steps is available as supplemental material (Figure S.8).

5.3 Stream Processing

5.3.1 Live Enhancement

Figure 8 shows examples of data improvement using the processing modules of our framework. Experiments show that the proxies are particularly efficient at removing noise over walls and floors while keeping salient parts, and help reduce holes due to unseen areas, specular areas such as lights or glass, and low confidence areas such as distant points. Resampling the point cloud allows recovering structure if the sensor did not give enough data samples, e.g. on lateral surfaces. Timings for hole filling operations are given in the supplemental material (Figure S.8).

Fig. 7. Simple shapes modeling from synthetic data. From a set of RGB-D images (left) rendered along the pink transparent camera path, proxies are built for planar, cylindrical and spherical synthetic elements. Both geometry and texture of dominant elements are recovered at lower resolution than the original while keeping a good visual quality (middle). For these simple shapes, a coarse proxy grid resolution allows recovering most of the geometry (right). More results on generated scenes are shown as supplemental material (section S.4).

5.3.2 Compression

Compression of the input data is achieved by using directly the proxies as a compressed, lightweight geometric substitute to the huge amount of depth data carried in the stream, avoiding storing uncertain and highly noisy depth regions, while still being able to upsample back to high resolution depth using a bilateral upsampling. In particular, this is convenient to broadcast captures of indoor scenes where planar regions are frequent.

Substituting the geometric proxies for the RGB-D stream provides a simple yet effective lossy compression scheme for transmission, with the practical side effect of removing many outliers. Our efficient data structure leads to good compression ratios while keeping high peak signal-to-noise ratio (PSNR) and being fast for compression and decompression. Table 2 and Table 3 show evaluation metrics of the compression using proxies.

The proxies are designed with a unified parameterized 2D structure and are stored as simple grids of statistics with a local frame and bounding rectangle. As such, the compressed structure itself, i.e. the proxies, can benefit from image-based compression schemes such as JPEG [34] for offline export and storage, for which we report compression ratios and PSNR values as supplemental material (Table S.1). The JPEG export and load procedure for all proxies of a scene takes an average of about 40 ms.

In addition to the bandwidth saving, the compressed proxy representation enables smooth super-sampling of the geometric data, where the output point cloud density over proxy surfaces can be increased as desired. The geometric surface parameterization of each proxy offers a suitable domain for point upsampling operators, while a similar approach performed directly on the RGB-D stream is blind to the scene structure.

TABLE 2 Proxy compression metrics for all processed scenes

Compression ratios (frame-wise and scene global) are based on the raw size of a 320x240 depth map and the size of the proxies without the outliers. The peak signal-to-noise ratio (PSNR) is computed using the average root mean square error (RMSE) between raw depth points and proxy-filtered ones. These metrics do not account for the loss of outliers of detected shapes (25% of samples on average). We compare our compression performance to a state-of-the-art method based on H.264 [35] with a quality profile of 50.
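The two metrics of Table 2 can be sketched as follows. The peak depth value used for the PSNR and the 16-bit storage of raw depth pixels are not stated in the text, so both are assumptions of this example, as are the function names.

```python
import math


def psnr_db(rmse, peak):
    """Peak signal-to-noise ratio in dB from an RMSE and a peak signal value.

    `peak` and `rmse` must share the same unit (e.g. meters); the peak depth
    used in the paper is not given here, so it is left as a parameter.
    """
    return 20.0 * math.log10(peak / rmse)


def compression_ratio(raw_bytes, compressed_bytes):
    """Ratio between the raw frame size and the size of its proxy substitute."""
    return raw_bytes / compressed_bytes


# Raw size of one 320x240 depth map, ASSUMING 2 bytes per depth pixel.
RAW_DEPTH_BYTES = 320 * 240 * 2

ratio = compression_ratio(RAW_DEPTH_BYTES, 15360)   # hypothetical proxy size
quality = psnr_db(rmse=0.01, peak=10.0)             # hypothetical RMSE and peak
```

Because the PSNR depends on the chosen peak value, reported figures are comparable only between methods evaluated with the same convention.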

TABLE 3 Proxy compression timings for all processed scenes

Timings are reported in milliseconds to process a single RGB-D frame. The compression time is the building of proxies averaged over all frames, while the decompression time is the generation of a depth map by applying visibility tests to the proxies. The frame-wise compression and decompression times for the method based on H.264 [35] were computed from the full frame set timing. While our Proxies add little overhead for construction, they are more than twice as fast as H.264 for decompression, allowing live display on low end platforms.

Fig. 8. Data improvement using geometric proxies. Raw RGB-D and proxy-improved data showing results of real time filtering, hole filling and resampling. Hole filled and resampled point clouds are generated by sampling the proxy surface with respectively 2x2 and 4x4 points per cell. Blue surrounded areas highlight a region where improvement using the proxies is significant compared to the low quality input RGB-D frames.

6 RGB-D DATA CONSOLIDATION

6.1 Proxy-based Reconstruction

While being lightweight and fast to compute, the proxies represent a superstructure modeling the dominant regular geometric elements of indoor scenes. In addition to being used to filter the input point cloud and generate an enhanced RGB-D stream as output, proxies themselves are a way to consolidate the RGB-D frames. Hence, meshing the proxy cells leads to a lightened organized structure and aggregating all proxies in a global space allows reconstructing a high quality surface model of the observed scene, generated on the fly.

Meshing. In practice, our meshing process iterates over all active cells of a given proxy to connect adjacent cells. However, revolution surfaces such as cylinders and spheres have a periodic nature, and cells at opposite limits of the parameterized domain are actually adjacent in the Euclidean domain. Hence, to avoid discontinuity of the proxy surface mesh, we designed a closing methodology through a range check where cells at extremities of the 2D domain are connected to cells at the other extremities. In particular, we can see these connections in the lower part of the octahedral sphere in Figure 5 (c). The average time required to mesh the full proxy set is below 10ms, as shown in Figure S.8 of the supplemental material.
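The sketch below illustrates the idea on a grid that is periodic along one axis only, as for the angular direction of a cylinder. It emits a quad when four adjacent cells are active and a triangle when three are, following the convention stated in the caption of Figure 10. The modulo wrap-around stands in for the range check mentioned above, and the function name and data layout are assumptions; the octahedral sphere needs a more involved adjacency that is not covered here.

```python
def mesh_cells(active, n_u, n_v, periodic_u=False):
    """Connect adjacent active cells of an n_u x n_v grid into faces.

    active: set of (i, j) indices of activated cells
    periodic_u: if True, column n_u - 1 is also connected back to column 0
    Returns a list of faces, each a tuple of 3 or 4 cell indices.
    """
    faces = []
    columns = range(n_u) if periodic_u else range(n_u - 1)
    for i in columns:
        i_next = (i + 1) % n_u            # wraps to 0 only when periodic_u is True
        for j in range(n_v - 1):
            corners = [(i, j), (i_next, j), (i_next, j + 1), (i, j + 1)]
            present = [c for c in corners if c in active]
            if len(present) >= 3:         # quad if 4 corners, triangle if 3
                faces.append(tuple(present))
    return faces


all_cells = {(i, j) for i in range(4) for j in range(2)}
open_strip = mesh_cells(all_cells, 4, 2)                      # open surface
closed_strip = mesh_cells(all_cells, 4, 2, periodic_u=True)   # seam is closed
```

The periodic variant yields one extra column of faces, which is the seam that would otherwise appear as a visible gap on a cylinder.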

Texture Completion. Extrapolating grid cells in the space of the proxies, as detailed in section 4.2, allows recovering unseen geometry but not appearance. Hence, we apply a recent deep learning based image inpainting method [36] to generate meaningful pixel values at locations where the RGB-D sensor did not provide color information. In the 2D space of the proxies, exported textures have the structure of regular images, allowing the direct application of such off the shelf inpainting tools, as shown in Figure 9.

In the following, we compare the performance and quality of scene reconstruction using our proxies to state-of-the-art methods BundleFusion [22] and 3DLite [28]. As in the previous section, we run all of our experiments on the 3DLite dataset composed of 10 RGB-D scenes captured with a Structure sensor (section 5.2.1).

6.2 Qualitative Results

Figure 10 presents the reconstructed geometric models based on the corresponding proxies. As we can see, most large planar surfaces such as walls and floors are modeled with a single proxy instance. At scene scale, the relatively low color resolution of our proxy is sufficient to identify most elements of the scene. The figure also compares reconstructions with Proxies, BundleFusion and 3DLite from a global point of view. Qualitatively, proxy meshes are similar to BundleFusion while offering the structural regularity of 3DLite models. A close-up compared view is available as supplemental material (Figure S.9).

6.3 Quantitative Results

Table 4 presents performance and quality metrics for the 10 scenes of the dataset. The accuracy of the reconstruction using proxies can be quantitatively assessed and compared to 3DLite through the values of RMSE with BundleFusion as a reference. These metrics show that the lightweight and simple structure of the proxies leads to better performance both in timing and memory consumption, while keeping a quality comparable to that of state-of-the-art methods.

With their low runtime and memory needs, our proxies offer a lighter alternative to most recent reconstruction methods characterized by volumetric or deep learning approaches, which have high requirements in computation costs and memory consumption. The generic format and implementation of the proxies avoid the need for tedious platform-specific tuning and make them well suited for embedded operation and modern mobile applications. In addition to the fact that our proxies are built and updated on the fly, the processing does not run on a GPU and requires far less memory than modern embedded devices offer.

Fig. 9. Example of an inpainted proxy texture image from the apt scene. Left: Full inpainted proxy texture image. Right: Same image with overlays showing inpainted and displayed pixels. The red outline shows the limit between acquired color information (inside) and inpainted content (outside). The blue outline shows the existing proxy cells that are actually displayed in 3D. The green overlay is the intersection of red and blue areas and corresponds to inpainted content that is actually displayed in 3D. These green areas are geometric parts that were not observed by the camera and were activated during the hole filling step. We can see that inpainted content at the borders of the image is not very meaningful, but gets better when surrounded by acquired data. This figure shows that the failure of inpainting to predict good colors at border locations is not an issue for the proxies, as most used inpainted content corresponds to filled holes surrounded by acquired colors that seem to help generating better colors.

[Figure 10 panel labels: Proxies (textured), 3DLite, BundleFusion, Proxies (random colors)]

Fig. 10. Left: examples of reconstructed scenes of the 3DLite dataset using our geometric proxies. Each proxy is shown with a different random color and textures generated from the color points stored in the proxies (see section 3.2.2). The meshes are made of quads when four activated cells are adjacent, and triangles otherwise. Right: qualitative comparison of reconstruction using our Proxies, BundleFusion [22] and 3DLite [28]. BundleFusion meshes, while being the most accurate and containing lots of details, are a heavy model and keep some noise and outliers from the incoming data. In addition, they lack knowledge of the scene structure and global geometry which prevents filling missing data in unobserved areas. 3DLite meshes have geometric details smoothed out because of the strong planar regularization. However, the textures are sharper and appearance details have a better quality. With a good balance between the two, proxy meshes are aware of the scene structure and composed of meaningful geometric elements. The local nature of the grid and its accumulated statistics retain saliency details at the surface of the shapes.

7 CONCLUSION

We introduced a new unified geometric framework for real-time processing of RGB-D streams, based on a superstructure of proxies built from spatially and temporally consistent shape instances.

By tracking geometric shapes over time and space, we define a shape-wise accumulation structure to record information brought by RGB-D samples. The use of surface shapes instead of a volumetric structure lightens the modeling while keeping a faithful representation of most elements seen in human-made environments. The statistics stored in the proxies maintain knowledge of the local geometry and appearance of the observed scene to keep as much detail as possible with the lowest hardware cost.

We leverage the compact, lightweight and consistent spatio-temporal support of these geometric proxies within a set of processing primitives designed to enhance RGB-D data or lighten subsequent operations. Our implementation runs at interactive rates on mobile platforms and allows fast enhancement and transmission of the captured data. Our structure can be meshed and used as a model of the observed scene, generated on the fly. Compared to BundleFusion and 3DLite, reconstruction using our proxies provides a good balance between processing time, memory consumption and approximation quality.

TABLE 4 Quantitative comparison of proxy-based scene reconstruction

The metrics are compared between BundleFusion (BF), 3DLite (3DL) and our Proxies. The processing time for one frame is averaged over all frames in the scene, and computed from the reported full scene timing for 3DLite. The memory consumption is the maximum used memory during processing. The root mean square error (RMSE) is computed using the Metro tool [37] between the BundleFusion mesh, taken as reference, and the 3DLite and Proxy meshes. We can see that the model generated by the proxies is similar in size and quality to the one generated by 3DLite, i.e. orders of magnitude lighter than the BundleFusion mesh, while requiring much less computing resources.

Performance Improvements. Our current proxy model stores statistics on a uniform (yet sparse) grid, which could be improved using a sparse adaptive structure such as random access trees [38]. The multi-scale nature of indoor scenes is still to be considered within our geometric proxies. We could generate knowledge of global and local scene geometry at multiple levels, whose analysis could reveal scene information such as similarities or clusters. While we use OpenMP to improve performance of shape processing, we could develop a finer parallel implementation to achieve a higher processing rate on embedded platforms.

Towards More Faithful Proxies. Our proxies are continuously made more accurate by averaging observations in RGB-D frames, but they are highly sensitive to the quality of the camera motion estimate. Introducing the shape proxies into the camera motion estimation using e.g., their parameters or geometric statistics, could give more robustness and stability to the shape tracking.

Geometric proxies model most structural components of indoor scenes, e.g. planar ones. However, our current implementation simply ignores non-regular shaped elements, such as most small objects. In order to complete this missing information, we could define a sparse voxel grid to locally model the few remaining depth samples.

Our proxy grid recovers unseen geometry at the surface of the shapes but sometimes fills data at locations of open doors or windows, which could be improved by modeling observed empty space or studying object semantics.

Surface Reconstruction. We could imagine using our proxy shapes and their accumulated data as regularization priors to be integrated into existing surface reconstruction frameworks. We could obtain a smoother model by applying Poisson surface reconstruction [39] on a regularized point cloud (preliminary results shown as supplemental material, Figure S.10). One could also use directly the geometric proxy information as a smoothing operator to regularize the mesh within the reconstruction process.

ACKNOWLEDGMENTS

REFERENCES

[1] A. Kaiser, J. A. Ybanez Zepeda, and T. Boubekeur, “Proxy Clouds for Live RGB-D Stream Processing and Consolidation,” in Proceedings of the European Conference on Computer Vision (ECCV), September 2018, pp. 252–268.

[2] C. Tomasi and R. Manduchi, “Bilateral filtering for gray and color images,” in Sixth International Conference on Computer Vision. IEEE, 1998, pp. 839–846.

[3] S. Liu, C. Chen, and N. Kehtarnavaz, “A Computationally Efficient Denoising and Hole-filling Method for Depth Image Enhancement,” in SPIE Conference on Real-Time Image and Video Processing, N. Kehtarnavaz and M. F. Carlsohn, Eds. SPIE, 2016.

[4] R. A. Newcombe, S. Izadi, O. Hilliges, D. Molyneaux, D. Kim, A. J. Davison, P. Kohli, J. Shotton, S. Hodges, and A. Fitzgibbon, “KinectFusion: Real-time Dense Surface Mapping and Tracking,” in IEEE International Symposium on Mixed and Augmented Reality (ISMAR). IEEE, October 2011, pp. 127–136.

[5] C. Wu, M. Zollhöfer, M. Nießner, M. Stamminger, S. Izadi, and C. Theobalt, “Real-time shading-based refinement for consumer depth cameras,” ACM Transactions on Graphics (TOG), vol. 33, no. 6, p. 200, 2014.

[6] J. Kopf, M. F. Cohen, D. Lischinski, and M. Uyttendaele, “Joint bilateral upsampling,” ACM Transactions on Graphics (ToG), vol. 26, no. 3, p. 96, 2007.

[7] D. Min, J. Lu, and M. N. Do, “Depth video enhancement based on weighted mode filtering,” IEEE Transactions on Image Processing, vol. 21, no. 3, pp. 1176–1190, 2012.

[8] R. Liu, Z. Deng, L. Yi, Z. Huang, D. Cao, M. Xu, and R. Jia, “Hole-filling Based on Disparity Map and Inpainting for Depth-Image-Based Rendering,” International Journal of Hybrid Information Technology, vol. 9, no. 5, pp. 145–164, 2016.

[9] R. Schnabel, P. Degener, and R. Klein, “Completion and reconstruction with primitive shapes,” Computer Graphics Forum, vol. 28, no. 2, pp. 503–512, March 2009.

[10] J. Biswas and M. Veloso, “Planar polygon extraction and merging from depth images,” in IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2012, pp. 3859–3864.

[11] M. A. Fischler and R. C. Bolles, “Random Sample Consensus: A Paradigm for Model Fitting with Applications to Image Analysis and Automated Cartography,” Communications of the ACM, vol. 24, no. 6, pp. 381–395, June 1981.

[12] R. Hulik, M. Spanel, P. Smrz, and Z. Materna, “Continuous Plane Detection in Point-cloud Data based on 3D Hough Transform,” Journal of visual communication and image representation, vol. 25, no. 1, pp. 86–97, January 2014.

[13] R. Schnabel, R. Wahl, and R. Klein, “Efficient RANSAC for Point-Cloud Shape Detection,” Computer Graphics Forum, vol. 26, no. 2, pp. 214–226, June 2007.

[14] D. Holz, S. Holzer, R. B. Rusu, and S. Behnke, “Real-time plane segmentation using RGB-D cameras,” in Robot Soccer World Cup XV, T. Roefer, N. M. Mayer, J. Savage, and U. Saranli, Eds. Springer, Berlin, Heidelberg, 2011, pp. 306–317.

[15] C. Feng, Y. Taguchi, and V. R. Kamat, “Fast plane extraction in organized point clouds using agglomerative hierarchical clustering,” in IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2014, pp. 6218–6225.

[16] A. Kaiser, J. A. Ybanez Zepeda, and T. Boubekeur, “A Survey of Simple Geometric Primitives Detection Methods for Captured 3D Data,” Computer Graphics Forum, vol. 38, no. 1, pp. 167–196, February 2019.

[17] M. Hsiao, E. Westman, G. Zhang, and M. Kaess, “Keyframe-based dense planar SLAM,” in IEEE International Conference on Robotics and Automation (ICRA). IEEE, May 2017, pp. 5110–5117.

[18] R. Mur-Artal and J. D. Tardós, “Orb-slam2: An open-source slam system for monocular, stereo, and rgb-d cameras,” IEEE Transactions on Robotics, vol. 33, no. 5, pp. 1255–1262, 2017.

[19] M. Keller, D. Lefloch, M. Lambers, S. Izadi, T. Weyrich, and A. Kolb, “Real-time 3D Reconstruction in Dynamic Scenes using Point-based Fusion,” in International Conference on 3D Vision (3DV). IEEE, June 2013, pp. 1–8.

[20] F. Endres, J. Hess, J. Sturm, D. Cremers, and W. Burgard, “3D Mapping with an RGB-D Camera,” IEEE Transactions on Robotics, vol. 30, no. 1, pp. 177–187, 2014.

[21] M. Nießner, M. Zollhöfer, S. Izadi, and M. Stamminger, “Real-Time 3D Reconstruction at Scale Using Voxel Hashing,” ACM Transactions on Graphics (ToG), vol. 32, no. 6, p. 169, 2013.

[22] A. Dai, M. Nießner, M. Zollhöfer, S. Izadi, and C. Theobalt, “Bundlefusion: Real-time globally consistent 3d reconstruction using on-the-fly surface reintegration,” ACM Transactions on Graphics (TOG), vol. 36, no. 3, p. 24, 2017.

[23] R. F. Salas-Moreno, B. Glocker, P. H. Kelly, and A. J. Davison, “Dense planar SLAM,” in IEEE International Symposium on Mixed and Augmented Reality. IEEE, September 2014, pp. 157–164.

[24] M. Kaess, “Simultaneous Localization and Mapping with Infinite Planes,” IEEE International Conference on Robotics and Automation (ICRA), pp. 4605–4611, May 2015.

[25] E. Zhang, M. F. Cohen, and B. Curless, “Emptying, Refurnishing, and Relighting Indoor Spaces,” ACM Transactions on Graphics (TOG), vol. 35, no. 6, p. 174, 2016.

[26] Y. Zhang, W. Xu, Y. Tong, and K. Zhou, “Online Structure Analysis for Real-time Indoor Scene Reconstruction,” ACM Transactions on Graphics (TOG), vol. 34, no. 5, p. 159, November 2015.

[27] M. Dzitsiuk, J. Sturm, R. Maier, L. Ma, and D. Cremers, “De-noising, Stabilizing and Completing 3D Reconstructions On-the-go using Plane Priors,” in IEEE International Conference on Robotics and Automation (ICRA). IEEE, May 2017, pp. 3976–3983.

[28] J. Huang, A. Dai, L. Guibas, and M. Niessner, “3DLite: Towards Commodity 3D Scanning for Content Creation,” ACM Transactions on Graphics (TOG), vol. 36, no. 6, p. 203, 2017.

[29] J. M. Coughlan and A. L. Yuille, “Manhattan world: Compass direction from a single image by bayesian inference,” in Proceedings of the Seventh IEEE International Conference on Computer Vision, vol. 2. IEEE, 1999, pp. 941–947.

[30] E. Praun and H. Hoppe, “Spherical Parametrization and Remeshing,” in ACM Transactions on Graphics (TOG), vol. 22. ACM, 2003, pp. 340–349.

[31] M. Kass and J. Solomon, “Smoothed Local Histogram Filters,” ACM Transactions on Graphics (TOG), vol. 29, no. 4, p. 100, 2010.

[32] C. Yuksel, J. Keyser, and D. H. House, “Mesh Colors,” ACM Transactions on Graphics, vol. 29, no. 2, pp. 15:1–15:11, 2010.

[33] J. Serra, Image analysis and mathematical morphology. Academic Press, Inc., 1983.

[34] G. K. Wallace, “The JPEG still picture compression standard,” IEEE Transactions on Consumer Electronics, vol. 38, no. 1, pp. 18–34, 1992.

[35] F. Nenci, L. Spinello, and C. Stachniss, “Effective compression of range data streams for remote robot operations using H.264,” in IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, September 2014, pp. 3794–3799.

[36] J. Yu, Z. Lin, J. Yang, X. Shen, X. Lu, and T. S. Huang, “Generative image inpainting with contextual attention,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 5505–5514.

[37] P. Cignoni, C. Rocchini, and R. Scopigno, “Metro: Measuring Error on Simplified Surfaces,” Computer Graphics Forum, vol. 17, no. 2, pp. 167–174, August 1998.

[38] S. Lefebvre and H. Hoppe, “Compressed random-access trees for spatially coherent data,” in Proceedings of the 18th Eurographics conference on Rendering Techniques, J. Kautz and S. Pattanaik, Eds. Eurographics Association, 2007, pp. 339–349.

[39] M. Kazhdan, M. Bolitho, and H. Hoppe, “Poisson surface reconstruction,” Proceedings of the fourth Eurographics symposium on Geometry processing, vol. 7, June 2006.

[40] A. Dai, A. X. Chang, M. Savva, M. Halber, T. Funkhouser, and M. Nießner, “ScanNet: Richly-annotated 3D Reconstructions of Indoor Scenes,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, July 2017.

[41] C. V. Nguyen, S. Izadi, and D. Lovell, “Modeling kinect sensor noise for improved 3d reconstruction and tracking,” in Second International Conference on 3D Imaging, Modeling, Processing, Visualization and Transmission (3DIMPVT). IEEE, October 2012, pp. 524–530.

[42] CyArk, “Brandenburg Gate - LiDAR - Terrestrial, Photogrammetry,” https://doi.org/10.26301/d51v-fq77, April 2018, collected by CyArk, Institute for Photogrammetry. Distributed by Open Heritage 3D.

S.1 PROXY STATISTICS

S.1.1 Illustration of Smoothed Local Histograms

Figure S.1 shows a close-up example of flat and salient proxy areas with representation of the associated histograms and distribution modes (section 3.2.2 of the paper).

Fig. S.1. Close-up example of local geometry stored in the smoothed local histogram (SLH) of each proxy cell. Accumulating depth samples into the histogram naturally reveals distribution modes indicating the local nature of the geometry. Here, cells over the wall and frame have a single mode, as all samples have a similar distance to the surface (light green and orange). Cells at the limit between wall and frame accumulate samples that can have the distance to surface of either the wall or frame, hence containing two histogram modes (light purple).

S.1.2 Real Data Example

Figure S.2 shows a real data example of statistics of occupancy and distance stored at the surface of our proxy shapes (section 3.2.2 of the paper).

S.2 IMPLEMENTATION DETAILS

S.2.1 Parallelization

Parallelization through OpenMP is used to process shapes faster. When tracking shapes, as more and more previous shapes have to be checked in new frames, their score in the new frame is computed in parallel. The update of proxies with the new shape observation is also conducted in parallel for all proxies. However, computations such as proxy clipping or cell update within a proxy cannot be parallelized because of concurrent access and interdependencies. Optimization along these axes would require heavy refactoring of the current implementation.

S.2.2 Sensor-dependent Tuning

The axial and lateral noise introduced by consumer depth cameras leads to artificial and erroneous data samples in the input. Based on the sensor noise model from Nguyen et al. [41], we estimate a noise threshold at a given distance to the camera, under which differences in geometry will not be considered as actual differences in the observed surfaces. Hence, in order to prevent wrong geometry from being modeled within the proxies, we modulate the distance thresholds throughout the processing by the noise estimated at the corresponding distance to the camera origin.
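A depth-dependent threshold of this kind could be sketched as below. The default coefficients are the quadratic axial-noise fit commonly quoted from [41] for a Kinect-class sensor and are used here only as an assumption: the text does not give the values used for the Structure sensor, and the scaling factor `k` and the function names are hypothetical.

```python
def axial_noise_m(z_m, a=0.0012, b=0.0019, z0=0.4):
    """Estimated axial noise (standard deviation, meters) at depth z_m.

    Quadratic model a + b * (z - z0)^2. The default coefficients are an
    ASSUMPTION and should be checked against [41] and the sensor at hand.
    """
    return a + b * (z_m - z0) ** 2


def exceeds_noise(offset_m, z_m, k=1.0):
    """True if a geometric offset is larger than k times the expected noise."""
    return abs(offset_m) > k * axial_noise_m(z_m)


# A 5 mm offset is significant at 1 m but falls under the noise level at 3 m.
near = exceeds_noise(0.005, 1.0)
far = exceeds_noise(0.005, 3.0)
```

The same offset can therefore be kept as real geometry close to the camera and discarded as noise farther away, which is the modulation described above.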

S.2.3 Scene Orientation

We aim at building a model which is consistent and intuitive with relation to the structure of the observed scene. We therefore orient the local frames of the proxies along the orientation of the scene, assuming that its structure follows the Manhattan world assumption [29]. This allows the cells on the walls to be oriented along the directions of the floor and orthogonal walls, and the cells on the floor and ceiling to be oriented along the directions of both walls. In order to detect the structural orientation of walls and floor in the scene, we look for nearly horizontal and vertical planes at the beginning of the processing, and use their orientations as a prior.

S.3 PROXY FILTERING

Figure S.3 illustrates the proxy filtering process (section 4.1 of the paper) on a planar example, where proj and norm are evaluated for a plane defined by its normal and its distance to origin l.
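For a planar proxy, the filtering can be worked through with a ray-plane intersection. The sketch below assumes the plane convention n · x = l with a unit normal n, and a camera center expressed in the same frame. This convention, the function names and the parameters `modes`, `mean_dist` and `tau` are assumptions for illustration, since the original expressions of proj and norm are not legible in this text.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def proj_on_plane(p, n, l, cam=(0.0, 0.0, 0.0)):
    """Intersect the camera ray through p with the plane n . x = l."""
    d = tuple(pi - ci for pi, ci in zip(p, cam))      # ray direction
    denom = dot(n, d)
    if abs(denom) < 1e-12:
        raise ValueError("camera ray is parallel to the plane")
    t = (l - dot(n, cam)) / denom
    return tuple(ci + t * di for ci, di in zip(cam, d))


def filter_point(p, n, l, modes, mean_dist, tau, cam=(0.0, 0.0, 0.0)):
    """Filter one inlier of a planar proxy using the statistics of its cell."""
    if modes != 1:                 # salient cell: keep the raw point
        return tuple(p)
    q = proj_on_plane(p, n, l, cam)
    if abs(mean_dist) > tau:       # offset is above the noise level: re-apply it
        q = tuple(qi + mean_dist * ni for qi, ni in zip(q, n))
    return q                       # for a plane, norm(p) is simply n


# Plane z = 2 seen from the origin, noisy sample slightly behind the plane.
p = (0.1, 0.0, 2.05)
flat = filter_point(p, (0.0, 0.0, 1.0), 2.0, modes=1, mean_dist=0.005, tau=0.01)
offset = filter_point(p, (0.0, 0.0, 1.0), 2.0, modes=1, mean_dist=0.03, tau=0.01)
```

In the first call the cell offset is below the threshold, so the point is snapped onto the plane; in the second it is lifted by the cell average along the normal.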

S.4 SYNTHETIC DATA

Figure S.4 shows proxies generated from a second artificial dataset (details in section 5.1 of the paper), along with a reference ground truth rendering at the same position.

Figure S.5 shows qualitative comparison of texture images used to generate the synthetic data and computed with the color point model of our proxies (section 5.1 of the paper).

Fig. S.2. Real data example of proxy statistics. Top left: Color image provided for information in order to understand the structure of the scene. Bottom left: Four planar proxies are represented in 3D with a single color per shape, on top of the current 3D point cloud. Right: Proxy cells are represented with a color code showing the current values of their statistics. Intensity represents occupancy probability, and hue – from blue (-5cm) to red (5cm) through white (0cm) – represents the distance to the shape. We can notice that cells behind table legs have a lower occupancy with darker shade, as they were not observed very often (green arrow). The geometric details due to the frame on the wall above the table are correctly recovered with a positive shape offset (yellow arrow).

Fig. S.3. Filtering 3D points using statistics from the proxy. Example of our custom filtering on a planar proxy inlier point p (yellow circle). Here, the 3D point p has normal norm(p) at the proxy surface (red arrow) and an orthogonal distance to the proxy (red line), which is used to update the average distance to shape. The point p is projected along the camera ray (yellow line) into proj(p) (blue circle), which allows retrieving its cell c (light blue). In this specific example, we assume that c is a cell whose smoothed local histogram has a single mode, hence we apply the shape offset (blue line) along the surface normal direction (green line) to get the filtered 3D point (green circle).

Figure S.6 shows proxies generated from a scan of the Brandenburger Tor (Berlin, Germany) rendered in Blender, along with reference ground truth renderings at the same positions.

Figure S.7 shows the fast convergence of the proxy statistics after about 30 accumulated samples, through the decreasing variation over time of the average distance to shape.

S.5.2 Timing Repartition

Figure S.8 shows the detailed breakdown of the processing time across all steps of the proxy construction.

S.5.3 JPEG Export Metrics

Table S.1 reports compression metrics for offline export and storage of the proxies using JPEG [34], detailed in section 5.3.2 of the paper.

Figure S.9 compares reconstructions with Proxies, BundleFusion and 3DLite from a local, close-up point of view. See section 6.2 of the paper for more details.

S.7 RECONSTRUCTION REGULARIZATION

Figure S.10 shows preliminary results of surface reconstruction applied before and after proxy filtering, as detailed in section 7 of the paper.

Fig. S.4. Simple shapes modeling a second synthetic scene, shown at two camera locations. The pink transparent curve represents the camera path. Proxies are built for planar, cylindrical and spherical synthetic elements from a set of rendered RGB-D images (left). For these simple shapes, the coarse proxy grid recovers most of the geometry and texture at a lower resolution than the original (right). The yellow circle close-ups show how proxies record local geometric information. Despite the lower resolution, slight irregularities are visible on some proxy cells at the surface of the purple cylinder, corresponding to fine geometric details present in the original 3D model.

TABLE S.1 Proxy JPEG export metrics

The compression ratio is the ratio between the size of the full raw proxy set and the size of the exported JPEG image files. The peak signal-to-noise ratio (PSNR) is computed using the average root mean square error (RMSE) between raw depth points and their positions after loading proxies exported to JPEG. The reported timing is the total time required to export and load all scene proxies.
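The metrics of Table S.1 can be illustrated with the short sketch below. It uses the standard definition PSNR = 20 log10(peak / RMSE); the choice of the `peak` value (for instance the depth range of the sensor) and the function names are assumptions of this example, since the exact settings are not stated here.

```python
import math

def rmse(points_a, points_b):
    """Root mean square of the Euclidean distances between paired 3D points."""
    total = 0.0
    for a, b in zip(points_a, points_b):
        total += sum((x - y) ** 2 for x, y in zip(a, b))
    return math.sqrt(total / len(points_a))

def psnr(rmse_value, peak):
    """Standard PSNR in dB; `peak` is an assumed maximum signal value."""
    if rmse_value == 0.0:
        return math.inf
    return 20.0 * math.log10(peak / rmse_value)

def compression_ratio(raw_bytes, jpeg_bytes):
    """Size of the raw proxy set divided by the size of the exported JPEG files."""
    return raw_bytes / jpeg_bytes
```

For example, with an assumed peak of 10 m and an RMSE of 1 cm, `psnr(0.01, 10.0)` evaluates to about 60 dB.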

Fig. S.5. Simple shape proxies modeling textures. Left: Texture images from the 3D models used to generate the RGB-D sequence in Blender. Right: Our local color point model embedded within the 2D grid of the proxies allows recovering the textures of 3D objects. Black areas on the proxy textures were not observed by the camera, e.g., the poles of the sphere at the center and the corners of the octahedron square (bottom right). As expected, the parameterization of our proxies leads to more consistent and less distorted textures, as can be seen when comparing, e.g., the octahedral and spherical parameterizations of the sphere (bottom). In addition, our proxies converted the texture of the cylinder from an automatic bin packing layout to a more meaningful cylindrical parameterization (middle).

Fig. S.6. Proxies generated from a scan of the Brandenburger Tor (Berlin, Germany). Blender was used to render RGB-D views of the scan (left) along the pink transparent camera path; three example views are shown. Most structural elements of the landmark are recovered by planar and cylindrical proxies (right). Data collected and processed by the CyArk non-profit organization [42].

Fig. S.7. Convergence of the average distance to shape in proxy cells over time. Absolute value of the variation of the average distance to shape with respect to the number of depth samples accumulated within the proxy cell. At each new sample, we compute the absolute difference between the previous and current average values of the distance to shape, and average it over all cells in the scene. We observe the fast convergence of the depth average after about 30 accumulated samples, where the change in the distance to shape value falls below 0.5mm. This shows that only a few seconds are needed for the statistics in the proxy to become stable.
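The convergence measure plotted in Figure S.7 can be sketched as below: a running mean is updated for each new sample, the absolute change of the mean is recorded, and the changes are averaged over cells. The function names and the assumption that every cell supplies its samples as a plain list are choices made for this example only.

```python
def mean_variations(samples):
    """Absolute change of the running mean after each new sample (from the 2nd on)."""
    mean = 0.0
    variations = []
    for k, x in enumerate(samples, start=1):
        new_mean = mean + (x - mean) / k  # incremental mean update
        if k > 1:
            variations.append(abs(new_mean - mean))
        mean = new_mean
    return variations

def scene_variation(cells):
    """Average the per-sample variations over all cells.

    `cells` is a list of per-cell sample lists; only the sample counts
    reached by every cell are reported.
    """
    per_cell = [mean_variations(samples) for samples in cells]
    steps = min(len(v) for v in per_cell)
    return [sum(v[i] for v in per_cell) / len(per_cell) for i in range(steps)]
```

Since the change of the mean at the k-th sample equals |x - previous mean| / k, it shrinks roughly as 1/k for bounded noise, which is consistent with the stabilization reported after a few tens of samples.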

Fig. S.8. Timing breakdown for geometric proxy generation, averaged over all frames of each scene. Shape tracking is the most time-consuming step with about 40% of the total time, as it iterates over all previous proxies and depth image samples. Proxy cell update is the second most time-consuming step with about 30%, due to the iterations over cells and over the color points of each cell. We can see that the tracking takes more time for the largest scene, offices (8518 frames), as its size implies more previous shapes to iterate over.

Fig. S.9. Close-up views of reconstructions using our Proxies, BundleFusion [22] and 3DLite [28]. Top: We can notice the high level of detail but also the high irregularity of the BundleFusion mesh (left). The 3DLite model (center) has sharp textures for high appearance quality, but no geometric details. The proxy mesh (right) has a lower color resolution, which is still sufficient to identify elements in the scene. Its local information preserves geometric details smoothed out by 3DLite, such as the laptop on the table. Bottom: The BundleFusion mesh (left) is shown with only 30% of the original geometry for better visualization, but we can see that the high level of detail implies a large number of polygons. 3DLite (center) is aware of the geometry and models large planar areas with large triangles, refining them at the boundaries of elements. The regular grid of the proxies (right) is lightweight while storing accurate geometry at all locations of the shapes, e.g., for the frame on the wall above the table.

Fig. S.10. Surface reconstruction on raw depth and proxy-filtered data. This preliminary result shows how geometric proxies can be used as a smoothing operator in the context of mesh reconstruction. Here, we apply Poisson surface reconstruction [39] to the raw depth point cloud (left) and to the point cloud generated by sampling our proxies (right). We can see that the planar elements (door on the left, ceiling at the top, and walls) are smoothly reconstructed and do not exhibit the artifacts seen on the raw depth due to sensor limitations.