Partial occlusions pose a major challenge to the successful recognition of visual objects because they reduce the evidence available to the brain. . . . As a result, recognition must rely not only on information about the physical object but also on information about the occlusion, scene context and perceptual experience [14].
For many scene understanding tasks such as creating a room mockup for VR or automatically estimating how many people a room can accommodate, it is sufficient to estimate positions, orientations, and rough proportions of the objects rather than exact point-wise surface geometry. Given a single 2D photograph, the goal of this paper is to select and place instances of 3D models, particularly the partially occluded ones, to recover the photographed scene arrangement under the estimated camera.
With the easy access to large volumes of image and 3D model repositories, and the availability of powerful supervised learning methods, researchers have investigated multiple subproblems relevant to the above goal, such as object recognition [15], localization [32], and pose prediction [37], or developed a complete system, IM2CAD [22], that selects and positions 3D CAD models that are similar to the input imaged scenes. While these approaches work reliably in rooms with relatively low occlusion, they quickly deteriorate under moderate to heavy occlusion. A common source of failure is that under significant occlusion, state-of-the-art semantic segmentation or region detection methods begin to break down, and hence any system relying on them also fails (see Figure 1).
Figure 1: We present SEETHROUGH, a method to detect objects (specifically chairs) from single images under medium to heavy occlusion by reasoning with 3D scene-level context information. Our method significantly improves detection rate over state-of-the-art alternatives.
Unlike images with limited occlusion where direct image-space information is sufficient, occluded scenes require a different treatment. One possibility is to train an end-to-end network to go from single images to parameterized scene mockups. However, a major bottleneck is obtaining suitable training data. On the one hand, in our experiments the networks trained with synthetic 3D scene data do not easily translate to real-world data. On the other hand, obtaining real-world training data is difficult to scale as it requires complex annotations in 3D from single images. Instead, we propose a novel approach that heavily relies on 3D contextual statistics that can be automatically extracted from synthetic scene arrangement data.
Our key insight is that typical indoor scenes exhibit significant regularity in terms of co-occurrence of objects, which can be exploited as explicit priors to make predictions about object identity, placement and orientation, even under significant inter- or intra-object occlusions. For example, a human observer can easily spot heavily occluded chairs due to the presence of other visible nearby chairs and a table (see Figure 1), as we have a good mental model of typical chair-table arrangements.
We introduce SEETHROUGH, which generates 2D keypoints from input images using a neural network, lifts the keypoints to candidate 3D object proposals, and then solves a selection problem to pick objects scored according to object co-occurrence statistics extracted from a scene database. We iterate the process by allowing already selected objects to reinforce the selection of weakly witnessed occluded ones.
We tested our approach quantitatively on a new scene mockup dataset including partially occluded objects and show significant improvement of recognition over baseline methods on multiple quantitative measures. Although our current implementation is focused on the chair class, the method itself is not inherently limited to this, and could be extended to other classes with appropriately annotated data. (Full code, training data, and scene statistics will be available for research use. Supplementary material is available at http://geometry.cs.ucl.ac.
Scene mockups. 3D scene inference from 2D indoor images has recently received significant research focus due to the ubiquity of the new generation capture methods that enable partial 3D and/or depth capture. A significant amount of progress has been made following the early work of Hoiem et al. [18], first with approximating only room shape [9, 28, 26, 16], then inferring cuboid-like structures as surrogate furniture [10, 7, 39, 38, 33]. However, for detailed geometry prediction, the image input is generally supplemented with additional per-pixel depth or point clouds [24]. Mattausch et al. [29] used 3D point cloud input to identify repeated objects by clustering similar patches. Li et al. [25] utilize an RGB-D sensor to scan an environment in real time, and use the depth input to detect 3D objects queried from a database. While these works take 3D data as input, our method relies only on a single RGB image.
Recently, Izadinia et al. [21] in their impressive IM2CAD system demonstrated scene reconstruction with CAD models from a single image using image-based object detection (using FRCNN) and pose estimation approaches. Although their objective is similar to ours, the performance is bounded by the individual vision algorithms utilized in their pipeline. For example, if the segmentation misses an object because of significant occlusion (inset shows top FRCNN [32] detections with scores), there is no mechanism to recover it in the reconstruction (see Section 5 for comparison). In contrast, our novel pairwise search incorporates high-level relationships typical of indoor scenes to recover from such failures.
3D2D alignment. Another way to create scene mockups is by directly fitting 3D models to the image. Pose estimation work [37, 35, 19, 26, 23, 4] also demonstrated that given object images, reliable 3D orientation can be predicted, which in turn might help with scene mockups. Lin et al. [27] used local image statistics along with image-space features to align a given furniture model to an image. Aubry et al. [4] utilized a discriminative visual element processing step for each shape in a 3D model database, which is then used to localize and align models to given 2D photographs of indoor scenes. Like most existing methods, their approach breaks down under moderate to high occlusion. Our method performs better, as other nearby objects can provide higher order information to fill in the lost information (see Section 5).
Priors for scene reconstruction. Scene arrangement priors have been successfully demonstrated in 3D reconstruction from unstructured 3D input, as well as in scene synthesis [12]. Shao et al. [34] demonstrated that scenes with significant occlusion can be reconstructed from depth images by reasoning about the physical plausibility of object placements. Monszpart et al. [30] use the insight that planar patches in indoor scenes are often oriented in a sparse set of directions to regularize the process of 3D reconstruction. Fisher et al. [13], on the other hand, leveraged human activity priors together with object relationships as a foundation for 3D scene synthesis. In contrast to the complex and high-order joint relationships used in these works, our object-centric templates are compact and primarily encode the repetition of similar shapes (such as two side-by-side chairs) across pose and location. This compact and simple template representation ensures that our search stays tractable at run-time.
In a scene with many chairs, we observe that the environment is not important for the recognition of an unoccluded chair: the shape of the object is clearly visible and immediately recognizable. Under occlusion, however, recognizing the object requires additional 3D contextual information. State-of-the-art methods based on FRCNN [32] correctly detect chairs that are visible, but miss partially occluded ones (see inset figure in Section 2). Recognition under occlusion becomes easier with more contextual and co-occurrence information (see Figure 2).
Figure 2: As humans, our understanding of scenes is heavily predicated on the context [14]. From left to right, less global information makes detection of the chair harder.
Motivated by the above insight, we design SEETHROUGH to run in three key steps: (i) an image-space keypoint detection trained on AMT-annotated real photographs (Section 4.1); (ii) a candidate generation step that takes the estimated camera to lift detected 2D keypoints to 3D (deformable) model candidates (Section 4.2); and (iii) an iterative scene mockup stage where we solve a selection problem to extract a scene arrangement that proposes a plausible object layout using a common object co-occurrence prior (Section 4.3).
We now describe the three main steps of the SEETHROUGH system in detail starting with keypoint detection, followed by our approach for candidate object detection, and ending with our scene inference.
4.1. Keypoint Detection
At this stage our goal is to detect very subtle cues for potential object placements in the form of keypoints. A keypoint is a salient 3D point that appears across all objects of the same class (e.g., tip of a chair leg). We expect that a small number of (projected) keypoints will still be visible even under severe occlusions, and be useful in creating reasonable hypotheses for potential object placement. We represent this signal in two flavors: first, a keypoint map, a per-pixel function that indicates how likely a particular keypoint is to occur at that pixel (each keypoint has a separate map), and second, keypoint locations, which define the 2D coordinates for each keypoint. Both sets of information are used at different stages of our algorithm. We collected our own training data and trained a convolutional neural network to detect a continuous keypoint probability function, which we further use to extract candidate keypoint locations.
We picked $N_k$ keypoints (see supplemental material) and fine-tuned a variant of the ResNet-50 neural network [15] to predict these keypoint maps in $N_k$ output channels (see supplemental material for architecture details). We also tested the CPM architecture [36], but it yielded slightly inferior performance. Although CPM focuses on keypoint detection, it was pre-trained on human poses rather than general images, which is why we believe it did not generalize as well to our particular task (see supplemental material).
The above network predicts continuous keypoint maps. To extract the final keypoint locations (2D positions in the image), we take the local maxima of each map that lie above a threshold (Figure 3). We refer to the resulting set as the detected keypoint locations.
Figure 3: We trained a neural network on real images to detect keypoint maps, which are then converted to 2D keypoint locations via thresholding and non-maximal suppression.
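The conversion from a keypoint map to keypoint locations can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `extract_keypoints`, the threshold value, and the suppression radius are all assumptions chosen for the example, and the map is a plain nested list rather than a network output.

```python
def extract_keypoints(prob_map, threshold=0.5, radius=1):
    """Return (row, col, score) for local maxima of prob_map above threshold.

    prob_map is a list of rows (lists of floats) for one keypoint type.
    A pixel is kept if its value is strictly above `threshold` and no
    neighbour within `radius` pixels has a larger value (a simple form
    of non-maximal suppression). Threshold and radius are hypothetical.
    """
    height, width = len(prob_map), len(prob_map[0])
    keypoints = []
    for r in range(height):
        for c in range(width):
            value = prob_map[r][c]
            if value <= threshold:
                continue
            is_peak = True
            for dr in range(-radius, radius + 1):
                for dc in range(-radius, radius + 1):
                    rr, cc = r + dr, c + dc
                    if (dr or dc) and 0 <= rr < height and 0 <= cc < width:
                        if prob_map[rr][cc] > value:
                            is_peak = False
            if is_peak:
                keypoints.append((r, c, value))
    return keypoints


# Toy 4x4 map with two clear peaks.
demo_map = [
    [0.1, 0.2, 0.1, 0.0],
    [0.2, 0.9, 0.3, 0.1],
    [0.1, 0.3, 0.2, 0.6],
    [0.0, 0.1, 0.4, 0.3],
]
detected = extract_keypoints(demo_map, threshold=0.5)
```

Note that a flat plateau of equal values would yield several detections under this rule; a practical implementation would need a tie-breaking step.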
4.2. Candidate Object Detection
The goal of this step is to propose multiple candidate objects based on the detected keypoints. While we do not know how to group points, we observe that a very small number of keypoints (as few as two) belonging to the same object provide enough constraints to infer the scale and the orientation of a proxy 3D object. Hence, we can generate multiple candidates even with a sparse signal under moderate to high levels of occlusion. Using these generated candidates, we can recast the global inference problem as a discrete graph optimization problem, where we only need to solve for indicator variables, selecting a subset of candidates. Thus, we want higher recall at the expense of lower precision in this step. Furthermore, in order to incorporate a slightly bigger context than a single keypoint, we select subsets of points that can compose an object. At training time we learn a deformable template from a database of 3D models, and at test time we optimize the fitting of these templates to various subsets of keypoints.
Object template. Given a database of consistently aligned 3D models M with manually labeled keypoints, we use Principal Component Analysis (PCA) to project the 3D coordinates of the keypoints to a lower-dimensional space (we take the eigenvectors that explain > 85% of the variance). Our template is parameterized by the weights $p$ of a linear combination of these eigenvectors (representing the offset from the mean $\mu$). The final object template is defined by a weighted linear combination of the eigenvectors $v_j$:
$$X(p) = \mu + \sum_j p_j v_j.$$
We formulate an optimization problem where we solve for object parameters (i.e., p) while making sure that the object aligns with the detected keypoints. To relate our 3D deformable model to 2D images, we need a camera estimate. We use a variant of Hedau et al. [17] to estimate a rotation matrix $R$ with respect to the ground plane and the focal length $f$, and define the camera's location $c$ to be at eye height (1.8 m) above the world origin, giving camera parameters $C = (R, f, c)$. For each object we solve for a 2D translation across the ground plane t, azimuth $\theta$, scale s, and 3D chair template parameters p. Hence, the reprojection $q_i$ of the i-th keypoint to image space is:
$$q_i = \Pi_C\big(s\, R_\theta\, X_i(p) + t\big),$$
where $X_i(p)$ is a keypoint on the deformed template, $R_\theta$ is a rotation around the up vector, and $\Pi_C$ is a projection to the camera space.
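To make the template and its reprojection concrete, the following is a minimal sketch under stated assumptions: a pinhole camera with no principal-point offset, the y axis as the up vector, and scaling applied before rotation and ground-plane translation. The function names, the array layout of the PCA basis, and the toy numbers are all hypothetical and are not taken from the paper.

```python
import numpy as np


def deform_template(mean, basis, p):
    """mean: (N, 3) mean keypoints; basis: (K, N, 3) PCA directions; p: (K,) weights."""
    return mean + np.tensordot(p, basis, axes=1)


def rotation_about_up(azimuth):
    """Rotation about the y axis (treating y as 'up' is an assumption)."""
    c, s = np.cos(azimuth), np.sin(azimuth)
    return np.array([[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]])


def reproject(mean, basis, p, scale, azimuth, t_xz, R_cam, focal, cam_pos):
    """Project the deformed template keypoints into the image plane."""
    pts = scale * deform_template(mean, basis, p) @ rotation_about_up(azimuth).T
    pts = pts + np.array([t_xz[0], 0.0, t_xz[1]])  # translate on the ground plane
    cam = (pts - cam_pos) @ R_cam.T                # world -> camera coordinates
    return focal * cam[:, :2] / cam[:, 2:3]        # perspective divide


# Toy template with two keypoints and one deformation mode.
mean = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
basis = np.zeros((1, 2, 3))
basis[0, 1, 0] = 1.0  # the single mode stretches the second keypoint along x
pixels = reproject(mean, basis, np.array([0.5]), scale=1.0, azimuth=0.0,
                   t_xz=(0.0, 5.0), R_cam=np.eye(3), focal=1.0,
                   cam_pos=np.array([0.0, 1.8, 0.0]))
```

With the camera placed 1.8 units above the origin and the object 5 units in front of it, both keypoints project below the image centre, as expected for points on the floor.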
As described next, we fit our template object in two stages: first, we propose a candidate based on a pair of points, and then, we refine these candidate parameters with respect to all keypoint maps.
(i) Initial proposals. To propose initial object candidates we sample all pairs of detected keypoints. We use a pair because it gives the smallest set to sample that provides enough constraints to extract an initial guess for object translation, scale, and orientation. For each pair, we initialize the object parameters and optimize a fitting objective in which two weights balance, respectively, the scale and the deformable template parameters.
(ii) Parameter refinement. For each of the initial proposals extracted above, we refine the fitting. Specifically, instead of considering point locations, we define our objective with respect to the soft keypoint maps, maximizing the probability that template corners align with keypoints predicted by the neural network, with the notation as defined in Equation 2. If the refined fit satisfies the acceptance criterion, we add the final parameters as a candidate placement to our candidate placement set O.
Selecting a 3D mesh. For the results presented in this paper we show 3D meshes rather than object templates. Particularly, we pick the closest 3D model from our database by projecting its keypoints into the object PCA space, finding the nearest neighbor of the deformed template, and finally deforming it using the optimized parameters p.
4.3. Scene Inference
We do not expect all individual objects selected as candidates to be in the scene, since they might overlap, or have inconsistent arrangement. First, we capture scene statistics obtained from a large scene dataset with a probabilistic model, and then use the model to formulate an alternating discrete and continuous optimization.
Learning scene model. We model higher-level scene statistics via a graphical model where each object is a node and edges between pairs of nodes capture object-to-object co-occurrence relationships. We used a Gaussian Mixture Model (GMM) with a fixed number of mixture components (set to 5 in our tests) to model the relative orientation and translation of pairs of chairs from a very large synthetic scene dataset [40]. We only take into account chairs that are within a distance threshold of each other, reasoning that far-away objects have weaker relationships. We use the Expectation-Maximization algorithm to fit the GMM and add a small bias (0.01) to the diagonal of the fitted covariance matrices since objects in the database are axis-aligned.
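The training samples for such a pairwise model can be gathered roughly as sketched below. This is an illustration only: representing each chair as `(x, z, azimuth)` on the ground plane, expressing the second chair in the first chair's local frame, and the sign convention of the rotation are assumptions, as are the function names and the distance value.

```python
import math


def relative_pose(chair_a, chair_b):
    """Pose of chair_b expressed in the local frame of chair_a.

    Each chair is (x, z, azimuth) on the ground plane. Returns
    (dx, dz, dtheta), with dtheta wrapped to [-pi, pi).
    """
    ax, az, a_theta = chair_a
    bx, bz, b_theta = chair_b
    wx, wz = bx - ax, bz - az
    cos_a, sin_a = math.cos(-a_theta), math.sin(-a_theta)
    dx = cos_a * wx - sin_a * wz
    dz = sin_a * wx + cos_a * wz
    dtheta = (b_theta - a_theta + math.pi) % (2.0 * math.pi) - math.pi
    return dx, dz, dtheta


def collect_pair_samples(chairs, max_distance):
    """Relative poses for all ordered pairs closer than max_distance."""
    samples = []
    for i, a in enumerate(chairs):
        for j, b in enumerate(chairs):
            if i != j and math.hypot(b[0] - a[0], b[1] - a[1]) <= max_distance:
                samples.append(relative_pose(a, b))
    return samples


# Two neighbouring chairs and one far-away chair that is filtered out.
chairs = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (10.0, 0.0, math.pi / 2)]
samples = collect_pair_samples(chairs, max_distance=2.0)
```

The resulting list of `(dx, dz, dtheta)` tuples is the kind of data a GMM could then be fitted to with Expectation-Maximization.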
Graph optimization. We formulate a graph labeling problem to decide which of the candidate objects should be included in the scene mockup, denoted by indicator variables $x_i \in \{0, 1\}$, where $x_i = 1$ iff object $o_i$ is included. We minimize the following objective function:
$$E(x) = \sum_i x_i\, U(o_i) + \sum_{i<j} x_i x_j\, P(o_i, o_j),$$
where $U(o_i)$ is a unary penalty for an included object, and $P(o_i, o_j)$ is a pairwise penalty for a pair of included objects.
We define the unary energy by projecting the object's keypoints to the image and convolving the resulting keypoint map with a Gaussian, following the same procedure we used to create ground truth keypoint maps. This provides a location map n. The unary energy is then computed from n and the predicted keypoint maps using the Frobenius norm $\|\cdot\|_F$, the Hadamard product $\odot$, and the logit function $\mathrm{logit}(x) = \log\frac{x}{1-x}$. Note that since we do not expect a single placement to explain the entire keypoint location map, we set up the score as a multiplicative one, whose value depends only on the agreement at the actual keypoints the placement exhibits.
We define the pairwise energy using the GMM learned from the scene dataset, evaluated at the relative orientation and translation of the two objects.
We solve for the indicator variables using OpenGM [3] by converting the above formulation into a linear program and feeding it to CPLEX [1] to find the final set of selected objects.
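For intuition, the selection problem can be illustrated on a handful of candidates by exhaustive search. This sketch is a stand-in for the OpenGM and CPLEX pipeline, not a reproduction of it: it assumes the energy is the sum of unary penalties of included objects plus pairwise penalties of included pairs, and all penalty values below are made up.

```python
from itertools import product


def select_objects(unary, pairwise):
    """Exhaustively minimise sum_i x_i*U_i + sum_{i<j} x_i*x_j*P_ij over x in {0,1}^n.

    Only practical for a few candidates; the cost grows as 2^n.
    """
    n = len(unary)
    best_labels, best_energy = None, float("inf")
    for labels in product((0, 1), repeat=n):
        energy = sum(unary[i] for i in range(n) if labels[i])
        energy += sum(pairwise[i][j]
                      for i in range(n) for j in range(i + 1, n)
                      if labels[i] and labels[j])
        if energy < best_energy:
            best_labels, best_energy = labels, energy
    return best_labels, best_energy


# Hypothetical penalties (lower is better):
# candidate 0 is well supported by keypoints, candidate 1 overlaps with 0,
# candidate 2 is weak on its own but fits the prior next to 0,
# candidate 3 is mildly supported and unrelated to the others.
unary = [-2.0, -1.5, 0.5, -0.2]
pairwise = [
    [0.0, 5.0, -1.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
]
labels, energy = select_objects(unary, pairwise)
```

In this toy case the weakly witnessed candidate 2 is selected only because of its favourable pairwise term with candidate 0, which is the effect the co-occurrence prior is meant to produce.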
Refined object fitting. After selecting the set of objects, the scene mockup is ready. However, we found that our scene priors can also improve the initial object fitting results. To achieve this, we add a term from our GMM to the regularization term in object fitting. We go through all candidate objects and re-optimize their parameters, keeping the selected objects fixed. As noted by Olson et al. [31], the structure of the negative log-likelihood (NLL) of a GMM does not lend itself to non-linear least squares optimization. Instead, we approximate the NLL of the full GMM by considering it as a Max-Mixture, reducing the NLL to the weighted distance from the closest mixture mean. We define the Max-Mixture likelihood function over the relative orientation and translation of the new candidate w.r.t. an already placed object, using the weight of the kth mixture in the model. We use the sum of the negative log-likelihoods of these terms for all selected objects that are within a distance threshold of the refined candidate; each term involves the normal distribution and the Gaussian normalization factor for the kth mixture. At optimization time, during each step we find the mixture component that minimizes this function, and then optimize w.r.t. the negative log-likelihood of the Gaussian of that component alone, which gives the term that is added to the objective function (Equation 2).
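The Max-Mixture approximation described here can be sketched as follows: instead of summing the mixture components, the likelihood keeps only the component with the largest weighted density, so the negative log-likelihood becomes a single quadratic (Mahalanobis) term plus a constant. The function name, the 3D relative pose layout `(dx, dz, dtheta)`, and the mixture parameters are assumptions for illustration; the paper's exact term is not reproduced.

```python
import numpy as np


def max_mixture_nll(delta, weights, means, covariances):
    """Negative log of max_k w_k * N(delta; mu_k, Sigma_k).

    Returns (nll, k), where k is the index of the dominant component.
    """
    best_nll, best_k = np.inf, -1
    for k, (w, mu, cov) in enumerate(zip(weights, means, covariances)):
        diff = delta - mu
        mahalanobis = diff @ np.linalg.solve(cov, diff)
        # Log of the Gaussian normalization factor for component k.
        log_norm = -0.5 * (len(mu) * np.log(2.0 * np.pi) + np.log(np.linalg.det(cov)))
        nll = -(np.log(w) + log_norm) + 0.5 * mahalanobis
        if nll < best_nll:
            best_nll, best_k = nll, k
    return best_nll, best_k


# Hypothetical two-component model over (dx, dz, dtheta).
weights = [0.6, 0.4]
means = [np.array([1.0, 0.0, 0.0]), np.array([0.0, 1.0, np.pi / 2])]
covariances = [0.1 * np.eye(3), 0.1 * np.eye(3)]
delta = np.array([0.9, 0.05, 0.0])  # relative pose of a candidate
nll, component = max_mixture_nll(delta, weights, means, covariances)
```

Once the dominant component is fixed, only its Mahalanobis term depends on the pose, which is what makes the approximation usable inside a non-linear least squares solver.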
Refined selection. Refined candidates and objects selected for the mockup can help in placing additional objects that have subtler cues. Hence, we iterate between the refined fitting and refined selection processes. In the refined selection, we assume that previously selected objects cannot be removed, and adjust the unary term to favor placing new candidates. So, for each candidate placement in the second iteration, we add a corresponding term to the energy in Eq. 5.
5.1. Training and test data
We curated three datasets to evaluate our method. (Datasets to be made available for research use.)
(a) 2D keypoints on indoor images. We downloaded 5000 images from the HOUZZ website using keywords like living room, kitchen, dining room, meeting room, etc. We utilized the Amazon Mechanical Turk platform to obtain keypoints on the images requiring at least 3 workers to agree per image. For each image, we asked the turkers to mark the keypoints of the chairs (maximum of 8 keypoints per chair). Please refer to the supplemental material for details about the web-based annotation interface. We convolved these keypoints with a Gaussian filter to simplify the CNN’s task of learning of smooth filters and averaged the results.
Figure 4: Number of chairs and their estimated visibility distribution in the sampling of images in our annotated HOUZZ dataset (horizontal axis: number of chairs, 1 to 11).
Figure 5: Qualitative comparison of the baseline methods: SEEINGCHAIRS (orange) and FASTERRCNN3D (blue) against SEETHROUGH (green). Annotated groundtruth poses (gray) are provided for reference in the top view. Note that our approach both detects more chairs and aligns them more accurately than the others.
(b) Scene mockup groundtruth. In order to quantitatively measure the performance of SEETHROUGH and compare with alternate methods, we require a set of ground truth annotated scenes, i.e., images for which all the 3D objects (chairs in our case) have been placed manually. We are not aware of a similar dataset with mockups for 3D objects including the (partially) occluded ones. Hence, we set up another annotation tool in which an object can be placed by clicking and dragging, as well as by annotating a number of keypoints of the object and optimizing for its location and scale. Moreover, objects can be copied and translated along their local coordinate axes, allowing for quick and precise annotation (see supplemental for details). We used the automatically estimated camera parameters for the automatic refinement, while discarding any image with grossly erroneous camera estimates. We used the tool to annotate 300 scenes (see Figure 4), which were randomly selected from our HOUZZ dataset.
(c) 3D models and scenes. For our database models, we used the chair models from the ShapeNet [6] database, and for scene statistics, we used 45K houses from the PBRS dataset [40]. The latter comes with 400K physically-based renderings, and we tried using these synthetic images to pretrain networks for predicting keypoint maps, but found that fine-tuning a variant of ResNet-50 with weights trained on ImageNet produced more accurate results (see Section 5.4 for more details).
5.2. Performance Measures and Parameters
Hyperparameters. Our optimization pipeline depends on a number of parameters that we optimized using HyperOpt [20], which employs a Tree of Parzen Estimators [5]. We used the LOCANG measure as our objective measure. As ground truth data, we used 10 scenes fully annotated specifically for this purpose, in the same way as the data used for evaluation (see above). See supplemental material for the list of resulting hyper parameter values.
Figure 6: Quantitative performance of SEETHROUGH against baselines built from state-of-the-art methods. We outperform the baselines significantly across all the measures. Please refer to supplemental for the tabulated values.
Quantitative measures. We use source and target to denote the two scenes between which a measure is computed. We specifically do not use ‘result scene’ and ‘ground truth scene’ as the ground truth acts as a target to compute precision, and acts as source to compute recall.
We denote the objects in the source and target scenes as $o_s \in S$ and $o_t \in T$, respectively. We use $\mathrm{IoU}_{3D}(o_s, o_t)$ and $\mathrm{IoU}_{2D}(o_s, o_t)$ to represent the Jaccard index or intersection-over-union (IoU) of the bounding boxes of $o_s$ and $o_t$ in 3D world space and 2D screen space, respectively. Finally, given an object $o_s \in S$, we define the 'MaxIoU correspondence' with T as the object with the maximum IoU with $o_s$:
$$\mathrm{MaxIoU}(o_s, T) = \arg\max_{o_t \in T} \mathrm{IoU}(o_s, o_t).$$
Intuitively, this returns, for a given object, the best matching object from the other scene in terms of overlap. Next, we briefly describe our selected measures (see supplemental for details).
(a) IOU3D: This measures the average IoU for 3D bounding boxes around objects. Specifically, given a source scene and a target scene, we average MaxIoU across all objects in the source scene (measuring IoU overlap with the corresponding object in the target).
(b) IOU2D: Similar to IOU3D, this measure averages IoU for 2D bounding boxes around projected objects.
(c) LOC: This measures the fraction of correct locations of objects in the source scene with respect to the target. We consider every object in the source scene whose MaxIoU correspondence has an IoU over a threshold to have a correct location.
(d) LOCANG: Similar to LOC, this measure additionally requires the angle difference to be under a threshold.
(e) ANGDIFF: This measures the average angle difference for the objects that have a correct location.
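The source/target convention can be illustrated with a small sketch for the 2D case. It assumes axis-aligned boxes given as `(x_min, y_min, x_max, y_max)`; the function names, the threshold value, and the toy boxes are hypothetical. Precision uses the result as source and the ground truth as target, and recall swaps the two.

```python
def iou_2d(box_a, box_b):
    """IoU of two axis-aligned boxes (x_min, y_min, x_max, y_max)."""
    ix = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    iy = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = ix * iy
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def max_iou(obj, targets):
    """IoU of obj with its best-overlapping object in the target scene."""
    return max((iou_2d(obj, t) for t in targets), default=0.0)


def mean_max_iou(source, target):
    """IOU2D-style measure: average MaxIoU over the source objects."""
    return sum(max_iou(o, target) for o in source) / len(source) if source else 0.0


def loc(source, target, iou_threshold):
    """LOC-style measure: fraction of source objects with MaxIoU above threshold."""
    if not source:
        return 0.0
    return sum(1 for o in source if max_iou(o, target) > iou_threshold) / len(source)


result = [(0.0, 0.0, 2.0, 2.0), (5.0, 5.0, 6.0, 6.0)]  # one hit, one false positive
ground_truth = [(1.0, 0.0, 3.0, 2.0)]

precision = loc(result, ground_truth, iou_threshold=0.25)  # result is the source
recall = loc(ground_truth, result, iou_threshold=0.25)     # ground truth is the source
f1 = 2 * precision * recall / (precision + recall)
```

The 3D variant would follow the same pattern with a 3D box intersection, and LOCANG would add an angle test to the hit condition.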
5.3. Baselines: State-of-the-art Alternatives
We are not aware of prior research focusing on producing scene mockups in the presence of significant occlusion. Hence, we created two baselines by combining relevant state-of-the-art methods. We convert the output of each baseline (in both cases 3D pose but 2D image space locations of chairs) to our comparable 3D scene mockup format.
(a) SEEINGCHAIRS. Aubry et al. [4] proposed a method to find chairs by matching so-called ‘discriminative visual elements’ (DVE) from a set of rendered views of 1000+ chair models with any input image. These DVEs are linear classifiers over HOG features [8] learned from the rendered views in a discriminative fashion. At training time, they are learned at multiple scales while keeping only the most discriminative ones for matching. At test time, a patch-wise matching process finds the best-matching image and rendered patch pairs, and then finds sets of pairs that come from the same rendered view (see [4] for details).
The above method outputs scored image space bounding boxes together with a specific chair model and pose. For our 3D performance measures, however, we need the output in the form of a 3D scene. Hence, we convert each set of bounding box, pose, and chair model to a 3D scene. Using our estimated camera, we optimize the location (in the xz-plane) of the 3D model without changing its pose, such that the 2D bounding box of the projected model matches as closely as possible with the detected bounding box using a least-squares formulation (solved using Ceres [2]).
(b) FASTERRCNN3D. As the second baseline, we combine a convolutional neural network (CNN) trained for image-space object detection and another CNN trained for 3D object interpretation. Specifically, we use FasterRCNN [32] to extract bounding boxes of chairs from the input image and then feed these regions of interest to 3D-INN [37], which produces a templated chair model consisting of a set of predefined 3D keypoints as well as a pose estimate. Since our set of keypoints is a subset of the keypoints produced by 3D-INN, we use our 3D candidate generation part of SEETHROUGH to convert the extracted keypoints to a 3D chair for the resultant scene mockup.
5.4. Evaluation and Discussion
We ran SEETHROUGH and the two baseline methods on the full ground truth annotated scene set (Section 5.1). A sampling of results can be seen in Figure 5. (Further visualization for 100 scenes in our groundtruth set can be found in the supplementary material.)
The baseline methods perform well when there is no occlusion in the scene. Specifically, chairs that are clearly visible are reconstructed reliably, as the direct visual information is sufficient to make an accurate inference about the objects' pose and identity. However, when chairs are partly occluded, the methods break down quickly. In contrast, SEETHROUGH, by incorporating an object co-occurrence model, is more often able to recover from these situations.
This difference in performance is also reflected in the quantitative results (see Figure 6). Our method outperforms the baselines on all counts. Additionally, in Figure 7, we show how the LOCANG measure changes under varying thresholds of angle and IoU.
Figure 7: Performance variation according to the LOCANG F1 measure for SEETHROUGH and the two baseline methods under varying angle and IoU thresholds. We perform significantly better across both threshold ranges.
Performance under increasing occlusion. In order to specifically test performance under varying occlusion, we sorted the groundtruth annotated HOUZZ dataset into categories based on the extent of the visible chairs. We approximate visibility as follows: we compute how many chairs lie along view rays connecting the estimated camera location with points on a discrete grid on the image plane. We used the objects’ bounding boxes for this visibility computation. Higher values denote more occlusion (as there are more chairs along the view rays). Figure 1 shows that while all the three methods perform comparably under low occlusion, only SEETHROUGH continues to have a high success rate under medium to heavy occlusion.
Effect of multiple iterations. In Section 5.5, we demonstrate the positive utility of multiple iterations to SEETHROUGH. One of our key observations is that high-confidence objects (e.g., unoccluded objects) are easier to detect, and hence can provide valuable contextual information in reinforcing the weaker signals (e.g., partially occluded objects). This behavior results in higher detection rates when iterating, and a similar mechanism is believed to operate in human perception [11, 14].
Utility of synthetic data. We found that training on synthetic datasets [40] for predicting image-space keypoint maps led to unsatisfactory results. For this experiment, we took all renderings from the 400K images that contain at least one of the annotated chairs and reprojected the keypoint locations from the corresponding 3D models into these renders, yielding one image/keypoint map pair as training data per render, resulting in a total of 8000 image/keypoint map pairs. We experimented with three different training setups: (i) network trained with only synthetic data; (ii) network first trained with synthetic data, and then refined using real data; and (iii) network trained with only real data.
Figure 8: Ablation study evaluating the importance of the different stages of SEETHROUGH.
The best performance on the test set resulted from setup #iii, i.e., training with only real data. One likely explanation is that training the network with the synthetic data first steers the network weights away from those that resulted from the ImageNet pretraining, which already encode a good general understanding of real photographs.
5.5. Ablation Study
We evaluated the importance of the individual steps of SEETHROUGH to the final performance (see Figure 8 and supplemental). Specifically, we ran our pipeline on the full test set under two weakening conditions: (a) we disable all pairwise costs and run the remaining pipeline based solely on the keypoint location maps; and (b) we disable iterations by running the second and third stage only once, thus removing the possibility of the candidate generation stage benefiting from previously placed objects.
Discussion. Although IOU2D recall increases when disabling scene statistics (option #a), the precision goes down significantly. This is expected, as the pairwise costs by themselves do not propose new objects; they only make output mockups more precise by pruning objects that do not agree with others. In contrast, using only a single iteration (option #b) increases precision, but recall takes a significant hit. This is not surprising, as in the later iterations the keypoint location maps have decreased influence relative to the pairwise costs. As a result, while objects with weaker keypoint response are more easily found, false positives also become more likely. Overall, both the combined IOU2D F1 measure and the LOCANG F1 measure are highest for the full SEETHROUGH.
We proposed SEETHROUGH, a method for automatically finding partially occluded chairs in a photograph of a structured scene. Our key insight is the incorporation of higher level scene statistics that allows more accurate reasoning in scenes containing medium to high levels of occlusion. We demonstrate considerable quantitative and qualitative performance improvements across multiple measures.
Our method suffers from limitations that suggest a number of future research directions. First, we plan to extend the evaluation to a more expansive class of objects beyond chairs. Second, we think exploring templates that can express a broader understanding of the multi-object spatial relationships is a promising future direction.
[1] IBM ILOG CPLEX Optimizer. https://www.gams.com/latest/docs/S_CPLEX.html.
[2] S. Agarwal, K. Mierle, and Others. Ceres solver. http://ceres-solver.org.
[3] B. Andres, T. Beier, and J. Kappes. OpenGM: A C++ library for discrete graphical models. CoRR, abs/1206.0111, 2012.
[4] M. Aubry, D. Maturana, A. A. Efros, B. C. Russell, and J. Sivic. Seeing 3d chairs: Exemplar part-based 2d-3d alignment using a large dataset of CAD models. In Proc. IEEE CVPR, pages 3762–3769, 2014.
[5] J. Bergstra, D. Yamins, and D. Cox. Making a science of model search: Hyperparameter optimization in hundreds of dimensions for vision architectures. In Proc. ICML, pages 115–123, 2013.
[6] A. X. Chang, T. Funkhouser, L. Guibas, P. Hanrahan, Q. Huang, Z. Li, S. Savarese, M. Savva, S. Song, H. Su, J. Xiao, L. Yi, and F. Yu. ShapeNet: An Information-Rich 3D Model Repository. Technical Report arXiv:1512.03012 [cs.GR].
[7] W. Choi, Y.-W. Chao, C. Pantofaru, and S. Savarese. Indoor scene understanding with geometric and semantic contexts. IJCV, 112(2):204–220, 2015.
[8] N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. In Proc. IEEE CVPR, pages 886–893, 2005.
[9] S. Dasgupta, K. Fang, K. Chen, and S. Savarese. Delay: Robust spatial layout estimation for cluttered indoor scenes. In Proc. IEEE CVPR, pages 616–624, 2016.
[10] L. Del Pero, J. Bowdish, D. Fried, B. Kermgard, E. Hartley, and K. Barnard. Bayesian geometric modeling of indoor scenes. In Proc. IEEE CVPR, pages 2719–2726. IEEE, 2012.
[11] J. J. DiCarlo, D. Zoccolan, and N. C. Rust. How does the brain solve visual object recognition? Neuron, 73(3):415–434, 2012.
[12] M. Fisher, D. Ritchie, M. Savva, T. Funkhouser, and P. Hanrahan. Example-based synthesis of 3d object arrangements. Proc. ACM/SIGGRAPH Asia, 2012.
[13] M. Fisher, M. Savva, Y. Li, P. Hanrahan, and M. Nießner. Activity-centric scene synthesis for functional 3d scene modeling. Proc. ACM/SIGGRAPH, 34(6):179, 2015.
[14] A. M. Fyall, Y. El-Shamayleh, H. Choi, E. Shea-Brown, and A. Pasupathy. Dynamic representation of partially occluded objects in primate prefrontal and visual cortex. eLife, 2017.
[15] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proc. IEEE CVPR, pages 770–778, 2016.
[16] V. Hedau, D. Hoiem, and D. Forsyth. Recovering the spatial layout of cluttered rooms. In Proc. ICCV, pages 1849–1856. IEEE, 2009.
[17] V. Hedau, D. Hoiem, and D. A. Forsyth. Recovering the spatial layout of cluttered rooms. In Proc. ICCV, pages 1849–1856, 2009.
[18] D. Hoiem, A. A. Efros, and M. Hebert. Automatic photo pop-up. ACM Transactions on Graphics (TOG), 24(3):577–584, 2005.
[19] Q. Huang, H. Wang, and V. Koltun. Single-view reconstruction via joint analysis of image and shape collections. ACM Transactions on Graphics (TOG), 34(4):87, 2015.
[20] HyperOpt. HyperOpt, 2017.
[21] H. Izadinia, Q. Shan, and S. M. Seitz. IM2CAD. CoRR, abs/1608.05137, 2016.
[22] H. Izadinia, Q. Shan, and S. M. Seitz. IM2CAD. In Proc. IEEE CVPR, 2017.
[23] N. Kholgade, T. Simon, A. Efros, and Y. Sheikh. 3d object manipulation in a single photograph using stock 3d models. ACM Transactions on Graphics (TOG), 33(4):127, 2014.
[24] Y. M. Kim, N. J. Mitra, D.-M. Yan, and L. Guibas. Acquiring 3d indoor environments with variability and repetition. ACM Transactions on Graphics, 31(6):138:1–138:11, 2012.
[25] Y. Li, A. Dai, L. J. Guibas, and M. Nießner. Database-assisted object retrieval for real-time 3d reconstruction. Comput. Graph. Forum, 34(2):435–446, 2015.
[26] J. J. Lim, A. Khosla, and A. Torralba. FPM: Fine pose parts-based model with 3d CAD models. In Proc. ECCV, pages 478–493. Springer, 2014.
[27] J. J. Lim, H. Pirsiavash, and A. Torralba. Parsing IKEA objects: Fine pose estimation. In Proc. ICCV, pages 2992–2999, 2013.
[28] A. Mallya and S. Lazebnik. Learning informative edge maps for indoor scene layout prediction. In Proc. ICCV, pages 936–944, 2015.
[29] O. Mattausch, D. Panozzo, C. Mura, O. Sorkine-Hornung, and R. Pajarola. Object detection and classification from large-scale cluttered indoor scans. Comput. Graph. Forum, 33(2):11–21, 2014.
[30] A. Monszpart, N. Mellado, G. J. Brostow, and N. J. Mitra. Rapter: rebuilding man-made scenes with regular arrangements of planes. Proc. ACM/SIGGRAPH, 34(4):103, 2015.
[31] E. Olson and P. Agarwal. Inference on networks of mixtures for robust robot mapping. I. J. Robotics Res., 32(7):826–840, 2013.
[32] S. Ren, K. He, R. B. Girshick, and J. Sun. Faster R-CNN: towards real-time object detection with region proposal networks. In Proc. NIPS, pages 91–99, 2015.
[33] A. G. Schwing, S. Fidler, M. Pollefeys, and R. Urtasun. Box in the box: Joint 3d layout and object reasoning from single images. In Proc. ICCV, pages 353–360, 2013.
[34] T. Shao, A. Monszpart, Y. Zheng, B. Koo, W. Xu, K. Zhou, and N. J. Mitra. Imagining the unseen: stability-based cuboid arrangements for scene understanding. Proc. ACM/SIGGRAPH, 33(6):209:1–209:11, 2014.
[35] S. Tulsiani and J. Malik. Viewpoints and keypoints. In Proc. IEEE CVPR, pages 1510–1519, 2015.
[36] S. Wei, V. Ramakrishna, T. Kanade, and Y. Sheikh. Convolutional pose machines. In Proc. IEEE CVPR, pages 4724–4732, 2016.
[37] J. Wu, T. Xue, J. J. Lim, Y. Tian, J. B. Tenenbaum, A. Torralba, and W. T. Freeman. Single image 3d interpreter network. In Proc. ECCV, pages 365–382. Springer, 2016.
[38] J. Xiao, B. Russell, and A. Torralba. Localizing 3d cuboids in single-view images. In Proc. NIPS, pages 746–754, 2012.
[39] Y. Zhang, S. Song, P. Tan, and J. Xiao. Panocontext: A whole-room 3d context model for panoramic scene understanding. In Proc. ECCV, pages 668–686. Springer, 2014.
[40] Y. Zhang, S. Song, E. Yumer, M. Savva, J. Lee, H. Jin, and T. A. Funkhouser. Physically-based rendering for indoor scene understanding using convolutional neural networks. In Proc. IEEE CVPR, 2017.