3D Shape Segmentation with Geometric Deep Learning

2020·arXiv

Abstract

The semantic segmentation of 3D shapes with a high density of vertices can be impractical due to large memory requirements. To make this problem computationally tractable, we propose a neural-network-based approach that produces 3D augmented views of the 3D shape and solves the whole segmentation as a set of sub-segmentation problems. 3D augmented views are obtained by projecting vertices and normals of a 3D shape onto 2D regular grids taken from different viewpoints around the shape. These 3D views are then processed by a Convolutional Neural Network to produce a probability distribution function (pdf) over the set of semantic classes for each vertex. These pdfs are then re-projected onto the original 3D shape and post-processed using contextual information through Conditional Random Fields. We validate our approach using 3D shapes from publicly available datasets and 3D shapes of real objects reconstructed using photogrammetry techniques. We compare our approach against state-of-the-art alternatives.

1 Introduction

Traditional Convolutional Neural Networks (CNNs) use a cascade of learned convolution filters, pooling operations and activation functions to transform image data into feature embeddings processable by fully connected layers that classify the image content [7]. Typically, 3D deep-learning approaches extend traditional 2D methods to non-Euclidean domains as the convolution operation is not well defined in 3D [15]. One of the most challenging research topics in 3D deep learning is the semantic segmentation of 3D shapes, as it is key to supporting computer graphics applications such as shape editing [24] and modelling [4]. Challenges in segmenting 3D shapes include dealing with different topologies, handling noisy geometries and different resolutions, and modeling semantic representations for different segments.

3D segmentation can be performed through multi-view [10, 22], volumetric [23] or intrinsic [15, 18] deep learning-based approaches. Multi-view and volumetric approaches use Euclidean structures, such as 2D or 3D grids, respectively, to process 3D shapes with 2D CNNs [10, 22, 23]. In particular, multi-view approaches simplify the representation of a 3D model using a set of rendered depth images taken from different viewpoints around the model, thus making the segmentation independent of the 3D-model polygon density [10, 22]. Multi-view approaches cannot fully exploit the geometric properties of the 3D shape (e.g. face normals) because geometric information can be lost when data are projected in 2D. Volumetric approaches approximate the 3D shape using voxels, which can obscure geometric details of the object [23]. Intrinsic approaches can be further divided into point-based and convolution-based approaches. Point-based approaches define feature extractors directly on the shape vertices [18], whereas convolution-based approaches extend the traditional convolution operations from grid-like structures to triangular meshes [15]. Point-based approaches mostly process each vertex of the shape independently and only loosely exploit local information [18]. The additional structures used by conventional convolution-based approaches increase the complexity of the shape representation, hence prohibiting the processing of high-density polygon models [15]. Typically, 3D segmentation approaches validate their performance on datasets collected in controlled scenarios, and they mostly lack an evaluation carried out on 3D models reconstructed using photogrammetric techniques [16].

In this paper we propose a novel 3D segmentation approach that retains both the advantages of view-based [10] and intrinsic approaches [15] by building 3D augmented views from multiple viewpoints around a 3D shape. 3D augmented views are a projection of 3D shape portions on 2D regular grids, where each cell of the grid encodes the information about depth and normal of the corresponding projected portion. This allows us to significantly reduce the number of parameters to learn and to perform 3D segmentation of shapes with diverse mesh topology (e.g. polygon structure and/or density). We evaluate our approach on synthetic 3D shapes from publicly available datasets, and on 3D shapes of objects we captured with a smartphone and reconstructed using photogrammetry techniques. Results show that the proposed approach can achieve state-of-the-art accuracy by using only 1% of the parameters used by the alternative approaches.

2 Our approach

2.1 Problem formulation

Given a 3D shape $X$ composed of vertices $x \in X$, we design a neural-network-based approach $f_\Theta$ that outputs a probability distribution $p(x)$ over the label space $\mathcal{L} = \{1, \dots, L\}$, where $L$ is the number of segmentation labels. The output segmentation of $X$ is computed as

$$h(x) = \arg\max_{l \in \mathcal{L}} p_l(x),$$

where $h(x)$ is a label defining the segment class of the vertex $x$ and $p_l(x)$ is the probability that $p(x)$ assigns to label $l$.

Fig. 1. Our approach outline. 3D augmented views from different viewpoints are computed from the 3D shape (shape decomposition). Point-wise features (i.e. coordinates and surface normals) are extracted from these 3D views and classified to obtain segmentation predictions. Predictions are re-projected and aggregated on the original shape, and refined through a Conditional Random Field for local prediction consistency.
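The labelling rule of this section can be illustrated with a minimal sketch, assuming the per-vertex pdfs are already available as plain Python lists. The function name `segment` and the example values are hypothetical, and labels are 0-based here rather than 1-based as in the label space above.

```python
def segment(pdfs):
    """Return, for each vertex, the index of its most probable label.

    pdfs: list of per-vertex probability lists, each of length L.
    """
    return [max(range(len(p)), key=p.__getitem__) for p in pdfs]


# Hypothetical pdfs for three vertices and L = 3 labels.
pdfs = [[0.7, 0.2, 0.1], [0.1, 0.3, 0.6], [0.2, 0.5, 0.3]]
labels = segment(pdfs)  # [0, 2, 1]
```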

The neural network $f_\Theta$ can be defined as a parametric function of the set of learnable parameters $\Theta$ (i.e. weights) and is composed of four modules, namely shape decomposition, feature extraction and classification, prediction aggregation, and prediction refinement. Shape decomposition transforms the input 3D shapes into 3D augmented views, or 3D views. Each 3D view is processed by a feature extraction and classification network, namely ViewNet, that predicts the class of each vertex. Prediction aggregation re-projects the predictions of ViewNet for each 3D view onto the original 3D shape. Prediction refinement improves class prediction using contextual information on the original shape. Fig. 1 depicts the block diagram.

2.2 Shape decomposition

We simplify the 3D shape representation (e.g. triangular meshes, quad meshes, CAD models) by decomposing the input shape into several components. Shape decomposition can be performed by clustering shape vertices [8], by using geometrical primitives [9], or by generating range scans from different viewpoints [10]. We use an approach similar to the latter in order to process the 3D shape regardless of its 3D representation, resolution and vertex topology.

Given $X$ in the form of a triangular mesh with vertices $x_i \in \mathbb{R}^3$, $i = 1, \dots, N$, we simplify $X$ by building 3D views from $M$ different viewpoints. Let $R_m = \{(u, v, d(u, v))\}$ be a range scan that is captured from the $m$th viewpoint, where $(u, v)$ is the coordinate of a pixel, $d(u, v)$ is the depth value of the 3D shape at that pixel, and $m = 1, \dots, M$. Let $V_m$ be the $m$th 3D view, whose vertices $x^m_j$ are obtained by registering the coordinates $(u, v, d(u, v))$ of the range scan to the coordinates of the vertices of $X$. The faces of the 3D view are obtained by connecting depth values using the typical regular grid pattern of 2D images. For each vertex $x^m_j$ we compute the surface normal $n(x^m_j)$ to define the signal on the 3D view as $f(x^m_j) = (x^m_j, n(x^m_j))$. The relation between a 3D view and the input 3D shape is defined by the correspondence function $\pi_m : V_m \to X$ that assigns the vertices of the $m$th 3D view to the corresponding vertices of the 3D shape.
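The construction of a single 3D view can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the range scan is assumed to be a small 2D list of depth values with `None` marking background pixels, each fully valid 2x2 cell is split into two triangles, and vertex normals are taken as the normalised sum of incident face normals. The function name `build_view` is hypothetical, and the registration to the original shape (the correspondence function) is not covered.

```python
import math


def build_view(depth):
    """Turn a depth grid into vertices, triangular faces and vertex normals.

    depth[v][u] is a depth value, or None where the ray misses the shape.
    """
    rows, cols = len(depth), len(depth[0])
    index, vertices = {}, []
    for v in range(rows):
        for u in range(cols):
            if depth[v][u] is not None:
                index[(u, v)] = len(vertices)
                vertices.append((float(u), float(v), float(depth[v][u])))

    # Connect neighbouring pixels with the regular grid pattern:
    # each fully valid 2x2 cell yields two triangles.
    faces = []
    for v in range(rows - 1):
        for u in range(cols - 1):
            corners = [(u, v), (u + 1, v), (u, v + 1), (u + 1, v + 1)]
            if all(c in index for c in corners):
                a, b, c, d = (index[k] for k in corners)
                faces.append((a, b, c))
                faces.append((b, d, c))

    # Vertex normal = normalised sum of the normals of incident faces.
    normals = [[0.0, 0.0, 0.0] for _ in vertices]
    for i, j, k in faces:
        p, q, r = vertices[i], vertices[j], vertices[k]
        e1 = [q[t] - p[t] for t in range(3)]
        e2 = [r[t] - p[t] for t in range(3)]
        n = [e1[1] * e2[2] - e1[2] * e2[1],
             e1[2] * e2[0] - e1[0] * e2[2],
             e1[0] * e2[1] - e1[1] * e2[0]]
        for idx in (i, j, k):
            for t in range(3):
                normals[idx][t] += n[t]
    for n in normals:
        length = math.sqrt(sum(c * c for c in n))
        if length > 0.0:
            for t in range(3):
                n[t] /= length
    return vertices, faces, normals


# Hypothetical flat 2x2 depth patch; each vertex carries a 6-D signal
# (coordinates followed by the unit normal).
vertices, faces, normals = build_view([[1.0, 1.0], [1.0, 1.0]])
signal = [v + tuple(n) for v, n in zip(vertices, normals)]
```

Because the faces follow the pixel grid, the vertex density of the view depends only on the grid resolution and not on the polygon density of the input mesh.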

Fig. 2. Example of 3D augmented views. Left-hand side: a synthetic 3D shape from the FAUST dataset [2]. Right-hand side: examples of the 3D augmented views extracted by the shape decomposition module. 3D views have a uniform vertex density and capture the underlying geometry even at a lower resolution.

2.3 Feature extraction and classification

The feature extraction and classification module processes the $M$ 3D views in parallel to learn features through a set of deep neural networks, namely ViewNets, with shared weights. Formally, each ViewNet is a non-linear parametric function $g_\Phi$ that takes the vertex-wise features $f(x^m_j)$ of a vertex $x^m_j$ of the $m$th 3D view $V_m$ as input and produces the probability distribution $p(x^m_j) \in [0, 1]^L$ as output, where $L$ is the number of segmentation classes and $\Phi$ is the set of ViewNet learnable weights. Let $P_m \in [0, 1]^{N_m \times L}$ be the matrix containing the pdfs of all $N_m$ vertices of $V_m$.

A ViewNet module is defined as the composition of Intrinsic Convolutional (IC), Fully Connected (FC) and Softmax layers. FC and Softmax are standard layers, whereas the IC layer replaces the convolutional layer used in traditional Euclidean CNNs to perform convolution operations on 3D views [15]. The convolution at a vertex $x$ using IC layers requires additional information, in the form of a local coordinate frame and a set of weighting functions that map the signal of the local neighbourhood of $x$ to a fixed grid.
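To make the role of the weighting functions concrete, the sketch below maps the signal of a vertex neighbourhood onto a fixed set of bins and then correlates the result with a filter. It assumes isotropic Gaussian weighting functions in the spirit of the mixture-model CNNs of [15], with weights normalised per bin; the actual ViewNet configuration is not specified in the text, and the names `gaussian_weight`, `patch_operator` and `intrinsic_conv` are hypothetical.

```python
import math


def gaussian_weight(u, mean, sigma):
    """Weight of local coordinate u under an isotropic Gaussian bin."""
    sq = sum((a - b) ** 2 for a, b in zip(u, mean))
    return math.exp(-0.5 * sq / sigma ** 2)


def patch_operator(neighbours, bins, sigma=0.5):
    """Map a neighbourhood signal onto a fixed set of bins.

    neighbours: list of (local_coords, feature_vector) pairs for the
                vertices around x, expressed in the local frame of x.
    bins:       list of bin centres (the fixed grid) in local coordinates.
    Returns one weighted-average feature vector per bin.
    """
    dim = len(neighbours[0][1])
    patch = []
    for mean in bins:
        weights = [gaussian_weight(u, mean, sigma) for u, _ in neighbours]
        total = sum(weights) or 1.0  # guard against numerical underflow
        patch.append([
            sum(w * f[t] for w, (_, f) in zip(weights, neighbours)) / total
            for t in range(dim)
        ])
    return patch


def intrinsic_conv(patch, filters):
    """Correlate the patch with filters (one filter per output channel)."""
    return [
        sum(g * p for g_bin, p_bin in zip(filt, patch)
            for g, p in zip(g_bin, p_bin))
        for filt in filters
    ]


# Hypothetical neighbourhood: two vertices with a constant 1-D signal.
neighbours = [((0.0, 0.0), [1.0]), ((1.0, 0.0), [1.0])]
bins = [(0.0, 0.0), (1.0, 0.0)]
patch = patch_operator(neighbours, bins)
response = intrinsic_conv(patch, [[[1.0], [1.0]]])
```

Once the neighbourhood has been resampled onto the fixed bins, the filter has the same size at every vertex, which is what allows weights to be shared across vertices and views.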

2.4 Prediction aggregation

Predictions inferred from each 3D view are re-projected and aggregated on the 3D shape $X$ in order to transfer the segmentation result to the original input. We name this operation ProjNet. ProjNet employs a pooling operation that takes as input the ViewNet predictions on each 3D view $V_m$ and the correspondence functions $\pi_m : V_m \to X$, for any $m$, and produces a single confidence map defined on $X$. For each vertex $x_i$, the pooling is computed over the distributions $p_{\tilde{m}}(x_i)$, $\tilde{m} \in \mathcal{M}_i$, where $\mathcal{M}_i$ is the set of 3D view indices relative to the vertex $x_i$, and $p_{\tilde{m}}(x_i)$ is the probability distribution over the segmentation classes associated to vertex $x_i$ in the $\tilde{m}$th 3D view.
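The aggregation step can be sketched as below. Since the pooling operator is not reproduced in the text above, this example assumes mean pooling over the views that observe a vertex; max pooling would be an equally plausible choice. The correspondence function is represented as a list of shape-vertex indices per view, and the function name `aggregate` is hypothetical.

```python
def aggregate(num_vertices, view_pdfs, correspondences):
    """Pool per-view predictions onto the vertices of the original shape.

    view_pdfs[m][j]       : pdf (list of L floats) of vertex j in view m.
    correspondences[m][j] : index of the shape vertex matched to vertex j
                            of view m (the correspondence function).
    Returns one pdf per shape vertex, or None if no view observed it.
    """
    sums = [None] * num_vertices
    counts = [0] * num_vertices
    for pdfs, corr in zip(view_pdfs, correspondences):
        for pdf, i in zip(pdfs, corr):
            if sums[i] is None:
                sums[i] = [0.0] * len(pdf)
            for k, p in enumerate(pdf):
                sums[i][k] += p
            counts[i] += 1
    # Assumption: mean pooling over the views that see each vertex.
    return [None if s is None else [p / c for p in s]
            for s, c in zip(sums, counts)]


# Hypothetical example: 3 shape vertices, 2 views, L = 2 labels.
pooled = aggregate(
    3,
    view_pdfs=[[[1.0, 0.0], [0.5, 0.5]], [[0.0, 1.0]]],
    correspondences=[[0, 1], [0]],
)
```

Vertices that no view projects onto keep an undefined prediction (`None` here), which is one of the cases a subsequent refinement step has to fill in.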

2.5 Prediction refinement

The output of ProjNet is a point-wise prediction, i.e. the label prediction of each vertex is estimated independently from its neighbors, thus leading to likely local label inconsistencies. Moreover, some vertices of the input 3D shape may not have been projected on any of the 3D views, thus leading to vertices with undefined label predictions on X. Therefore, we impose local label consistency by using a surface-based Conditional Random Field (CRF) approach [10,25] that exploits contextual information to produce structured and dense predictions.

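For each vertex $x_i$, let $y_i$ be a random variable that assigns a label to it, and let $\mathbf{y} = (y_1, \dots, y_N)$ be the set of the random variables associated to the $N$ vertices of $X$. The CRF energy associated to $\mathbf{y}$ is defined as:

$$E(\mathbf{y}) = \sum_{i} \psi_u(y_i) + \sum_{i < j} \psi_p(y_i, y_j), \qquad (1)$$

where the unary term $\psi_u(y_i)$ quantifies the cost of assigning $y_i$ to vertex $x_i$ and the pairwise term $\psi_p(y_i, y_j)$ quantifies the cost of jointly assigning $y_i, y_j$ to vertices $x_i, x_j$ [14]. Because $\psi_u(y_i)$ measures the cost of assigning the vertex $x_i$ to a label in $\mathcal{L}$, we define the unary term as $\psi_u(y_i) = -\log p_{y_i}(x_i)$. The pairwise potential is instead defined as the weighted sum of three Gaussian kernels, $k_1$, $k_2$ and $k_3$, whose definitions involve the geodesic distance $d(x_i, x_j)$ between the vertices $x_i$ and $x_j$, the identity function $\mathbb{1}$ on $X$, and a label compatibility term $\mu(y_i, y_j)$.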

Similarly to [10, 25], the first kernel $k_1$ favors local spatial consistency, while the second kernel $k_2$ promotes the assignment of similar labels to vertices with similar properties. The third kernel $k_3$ is novel and is introduced to disambiguate symmetries. Because symmetric parts are likely to be located far from each other (e.g. arms and legs in a human shape), we designed $k_3$ to prevent distant points from having similar labels. The CRF parameters are learnable. Fig. 3 shows how the CRF learns the relationships among segments through an example of learned parameters on human 3D shapes. We can observe that the head weights suggest a stronger relationship between head and torso than between head and right foot/right arm. Similarly, the torso weights suggest a stronger relationship between torso and arms/legs than between torso and feet/hands.

Fig. 3. Example of Conditional Random Field (CRF) learned weights in the case of human 3D shapes.

The most probable pdf configuration of y for X is obtained by minimizing the energy E(y) defined in Eq. 1. The exact inference of the CRF distribution is intractable, thus we use a mean-field approximation [10, 14]. The iterative algorithm for approximate mean-field inference can be implemented as a Recurrent Neural Network (RNN) by rephrasing each step of the algorithm as a CNN layer [25].
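A minimal sketch of the mean-field iteration is given below, following the general scheme of [14]. It is deliberately simplified: a single Gaussian kernel on a precomputed distance matrix stands in for the three kernels of the paper, and the compatibility matrix is expressed as a cost (Potts-like in the example). The function name `mean_field` and all numeric values are hypothetical.

```python
import math


def mean_field(unary, dist, compat, sigma=1.0, iters=5):
    """Approximate CRF inference by mean-field updates.

    unary[i][l]  : unary cost of label l at vertex i (e.g. -log p_l(x_i)).
    dist[i][j]   : distance between vertices i and j (e.g. geodesic).
    compat[l][k] : cost of labels l and k co-occurring on related vertices.
    Returns the approximate marginals q[i][l].
    """
    n, num_labels = len(unary), len(unary[0])
    kernel = [[0.0 if i == j else math.exp(-dist[i][j] ** 2 / (2 * sigma ** 2))
               for j in range(n)] for i in range(n)]

    def softmax(costs):
        m = min(costs)
        e = [math.exp(-(c - m)) for c in costs]
        z = sum(e)
        return [x / z for x in e]

    q = [softmax(u) for u in unary]
    for _ in range(iters):
        new_q = []
        for i in range(n):
            # Message passing: kernel-weighted sum of the other marginals.
            msg = [sum(kernel[i][j] * q[j][k] for j in range(n))
                   for k in range(num_labels)]
            # Compatibility transform, then add the unary cost.
            cost = [unary[i][l] + sum(compat[l][k] * msg[k]
                                      for k in range(num_labels))
                    for l in range(num_labels)]
            new_q.append(softmax(cost))
        q = new_q
    return q


# Hypothetical chain of three nearby vertices; the middle one weakly
# prefers label 1 while its neighbours confidently prefer label 0.
probs = [[0.9, 0.1], [0.45, 0.55], [0.9, 0.1]]
unary = [[-math.log(p) for p in row] for row in probs]
dist = [[0.0, 0.1, 0.2], [0.1, 0.0, 0.1], [0.2, 0.1, 0.0]]
potts = [[0.0, 1.0], [1.0, 0.0]]
q = mean_field(unary, dist, potts)
```

Each pass of the loop consists of message passing, a compatibility transform and a normalisation, which is why the iteration can be unrolled as the steps of a recurrent network as described in [25].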

3 Results

3.1 Experimental setup

We evaluate our 3D segmentation approach through two different experiments. Firstly, we use data from the publicly available Princeton Shape Benchmark (PSB) dataset [20] that contains synthetic shapes of several objects and animals; in particular, the rigid shapes of the Airplane class, and the non-rigid shapes of the Ant, Four Leg and Teddy classes. The segmentation labels of each object are defined as in [20]. Secondly, we use data of non-rigid human shapes; in particular, (i) synthetic people with different poses (FAUST dataset [2]), (ii) real people acquired with depth sensors (SCAPE dataset [1]) and with structured light 3D body scanners (SHREC14 dataset [5]), and (iii) real people that we acquired with a smartphone and reconstructed using the photogrammetry pipeline COLMAP [19]. We manually labelled the ground truth for FAUST and SCAPE datasets and used their training data to learn the neural network model for the human shapes. We have used this model to test our approach on all the other human shapes of FAUST, SCAPE, SHREC14 and COLMAP datasets. The segmentation labels for the non-rigid human shapes are: L = {head, torso, right arm, right hand, right leg, right foot, left arm, left hand, left leg, left foot}.

3.2 Training

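Given a labelled training set, where each vertex $x_i$ is associated to a ground-truth label $y^*_i \in \mathcal{L}$, the optimal parameters are obtained by minimizing the categorical cross-entropy loss,

$$\ell(\Theta) = -\sum_{i} \sum_{l \in \mathcal{L}} \delta_{l, y^*_i} \log p_l(x_i),$$

where $p_l(x_i)$ is the predicted probability of label $l$ at vertex $x_i$ and $\delta_{l, y^*_i}$ is the Kronecker delta defined for the ground-truth label $y^*_i$, i.e. it equals 1 if $l = y^*_i$ and 0 otherwise.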
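As a small worked example of the loss, the sketch below evaluates the categorical cross-entropy on per-vertex pdfs. It averages over vertices, which is an assumption that only rescales the loss, and clamps probabilities with a small `eps` for numerical safety; the function name `cross_entropy` is hypothetical.

```python
import math


def cross_entropy(pdfs, targets, eps=1e-12):
    """Mean categorical cross-entropy over labelled vertices.

    pdfs[i][l] : predicted probability of label l at vertex i.
    targets[i] : ground-truth label index of vertex i.
    """
    # The Kronecker delta keeps only the ground-truth term of each pdf.
    losses = [-math.log(max(p[t], eps)) for p, t in zip(pdfs, targets)]
    return sum(losses) / len(losses)


# Hypothetical example: one uncertain vertex and one perfectly classified.
loss = cross_entropy([[0.5, 0.5], [1.0, 0.0]], [0, 0])
```

In this example only the first vertex contributes, so the loss equals log(2) / 2.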

Our approach is trained end-to-end and from scratch. We use M = 10 3D views (Sec. 2.2, Fig. 2) taken from equi-spaced viewpoints around the shape. For training we use the Adam optimizer [13] with a learning rate of 0.001. The CRF weights are initialized with identity matrices, i.e. each segment class is only in relationship with itself.

3.3 Evaluation

PSB dataset: Table 1 shows the quantitative results of our approach on a subset of PSB’s 3D shapes. We compare the accuracy of our approach with ShapeBoost [11], Guo et al. [6] and ShapePFCN [10]. The first two approaches use classifiers that are learned from hand-crafted features, whereas the latter is an end-to-end deep learning approach similar to ours (i.e. features are also learned). We can observe that the accuracy of our approach is similar to that of state-of-the-art methods. However, compared to ShapePFCN [10], which is based on the VGG16 architecture [21] and uses 134M parameters, our neural network uses 14K parameters, i.e. 1% of ShapePFCN’s parameters [10]. Fig. 4 shows examples of segmentation results that are obtained on the Airplane category. The uncertainty map next to each segmentation result shows that the highest level of uncertainty is located where different segments intersect. Qualitatively, the results are very accurate and show only minor errors in the rudder region.
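The uncertainty maps can be understood as the per-vertex entropy of the predicted pdf. The sketch below uses the Shannon entropy in nats; the exact normalisation and colour mapping used for the figures are not stated, so this is an illustration only, and the function name `entropy` is hypothetical.

```python
import math


def entropy(pdf):
    """Shannon entropy (in nats) of one per-vertex label distribution."""
    return -sum(p * math.log(p) for p in pdf if p > 0.0)


# A confident prediction has zero entropy; a uniform one is maximal.
confident = entropy([1.0, 0.0])
uniform = entropy([0.5, 0.5])
```

A vertex lying where two segments meet typically receives comparable probabilities for both labels, which yields an entropy close to the maximum of log(2) for the two competing classes.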

Table 1. Segmentation mean accuracy (the higher the better [10]) on the Princeton Shape Benchmark dataset [20].

Fig. 4. Semantic segmentation results of our approach on PSB Airplane test shapes. Segmentation color key: green = body, blue = wings, purple = engine, yellow = stabilizer, and red = rudder. Each segmentation result (center) is accompanied by its ground-truth (on its left) and a confidence map (on its right) showing the uncertainty (entropy) of the network prediction over the 3D shape. The darker the color the higher the uncertainty.

Non-rigid human shapes: Fig. 5 to 8 show examples of segmentation results that are obtained on the non-rigid human shapes. Beside each segmented shape we can observe its associated entropy map. The larger the entropy, the higher the uncertainty. As expected, the largest level of uncertainty is located at the joints between two segments, that is, where the transition is not well defined. Because we have annotations for FAUST and SCAPE, we quantified the accuracy [10] and Intersection over Union (IoU) [18] of the segmentation results. On FAUST we achieved an accuracy of 93.8% and an IoU of 88.5%, while on SCAPE we achieved an accuracy of 72.1% and an IoU of 58.7%. These differences in accuracy and IoU are due to the unbalanced number of training samples in the two datasets: FAUST annotations are much more numerous than those of SCAPE. Moreover, a few of the poses of FAUST’s training shapes are also present in the test set. This does not occur in the case of SCAPE, where each pose is present only once. Fig. 6 shows examples of the segmentation errors that occurred on the SCAPE test set, e.g. in the right-hand block we can see that the legs of the shape in the middle have been segmented with inverted labels.
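The two metrics can be sketched as follows. This is a per-vertex version under stated assumptions: the accuracy of [10] may be weighted (e.g. by surface area), and the IoU is averaged here over the labels that occur in either the prediction or the ground truth. The function names `accuracy` and `mean_iou` are hypothetical.

```python
def accuracy(pred, truth):
    """Fraction of vertices whose predicted label matches the ground truth."""
    return sum(p == t for p, t in zip(pred, truth)) / len(truth)


def mean_iou(pred, truth, num_labels):
    """Intersection over Union averaged over the labels that occur."""
    ious = []
    for label in range(num_labels):
        inter = sum(p == label and t == label for p, t in zip(pred, truth))
        union = sum(p == label or t == label for p, t in zip(pred, truth))
        if union > 0:
            ious.append(inter / union)
    return sum(ious) / len(ious)


# Hypothetical labels for four vertices and two classes.
pred, truth = [0, 0, 1, 1], [0, 1, 1, 1]
acc = accuracy(pred, truth)
iou = mean_iou(pred, truth, num_labels=2)
```

IoU penalises a mislabelled vertex in both the class it was taken from and the class it was assigned to, which is consistent with IoU being lower than accuracy in the figures reported above.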

Results in Fig. 7 and 8 show that the method can generalize also to 3D shapes that have not been used for training. Interestingly, our approach can effectively generalize the mesh representation through the 3D augmented views and produce a reliable segmentation in the case of COLMAP’s shapes. Note that the mesh topology of COLMAP’s shapes is different from those used in training. This is because the meshing operation based on Poisson reconstruction of COLMAP produces highly irregular polygons [12]. However, it is also clear that COLMAP’s shapes are more challenging than SHREC14’s ones by looking at the respective confidence maps. Overall, results show that our approach can effectively segment 3D shapes of different subjects, despite their different pose.

4 Conclusions

We presented an approach to segment 3D shapes efficiently regardless of their mesh topology. To achieve this we decomposed the segmentation problem into sub-segmentation problems by using 3D augmented views generated from the underlying 3D shape. This enabled us to train a neural network with 1% of the parameters used by alternative state-of-the-art solutions, while maintaining similar accuracy. We showed that our approach is generic and can be used to segment 3D shapes with arbitrary mesh topologies, like those computed with photogrammetry reconstruction techniques (e.g. Poisson reconstruction [12]), which have a high density of irregularly distributed polygons. Moreover, our approach also showed evidence of being flexible enough to segment categories of 3D shapes other than human ones (e.g. airplanes).

Fig. 5. Semantic segmentation results of our approach on a subset of FAUST’s test shapes. Segmentation color key: yellow = head, green = torso, blue = right arm, light blue = right hand, orange = right leg, yellow = right foot, red = left arm, light red = left hand, purple = left leg, light purple = left foot. Each segmentation result (left) is accompanied by a confidence map (right) showing the uncertainty (entropy) of the network prediction over the 3D shape. The darker the color the higher the uncertainty.

Fig. 6. Semantic segmentation results of our approach on a subset of SCAPE’s test shapes (panels: ground truth, predictions and confidence maps). Segmentation color key is the same as that in Fig. 5.

Fig. 7. Semantic segmentation results of our approach on a subset of SHREC14’s shapes. Trained on FAUST and SCAPE training sets. Segmentation color key is the same as that in Fig. 5.

Fig. 8. Semantic segmentation results of our approach on a subset of COLMAP’s shapes. Trained on FAUST and SCAPE training sets. Segmentation color key is the same as that in Fig. 5.

Future research directions include an extensive analysis of the results, evaluating the impact of a multi-scale approach applied on the 3D augmented views and exploring next-best-view approaches [17] to select the most suitable 3D views of the object of interest. We will also exploit the structured output of the CRF to build models for surface matching between 3D shapes [3] and explore attention mechanisms to make the prediction of our approach robust to the clutter present on the 3D shape (i.e. untrained segmentation classes).

References

1. Anguelov, D., Srinivasan, P., Koller, D., et al.: SCAPE: Shape Completion and Animation of People. ACM Trans. Graph. 24(3), 408–416 (2005)

2. Bogo, F., Romero, J., Loper, M., et al.: FAUST: Dataset and Evaluation for 3D Mesh Registration. In: Proc. CVPR (2014)

3. Boscaini, D., Masci, J., Rodolà, E., et al.: Learning Shape Correspondence with Anisotropic Convolutional Neural Networks. In: Proc. NIPS (2016)

4. Chen, X., Zhou, B., Lu, F., et al.: Garment Modeling with a Depth Camera. ACM Trans. Graph. 34(6), 203:1–203:12 (2015)

5. Giachetti, A., Mazzi, E., Piscitelli, F., et al.: SHREC’14 Track: Automatic Location of Landmarks used in Manual Anthropometry. In: Proc. 3DOR (2014)

6. Guo, K., Zou, D., Chen, X.: 3D Mesh Labeling via Deep Convolutional Neural Networks. ACM Trans. Graph. 35(1), 3:1–3:12 (2015)

7. He, K., Zhang, X., Ren, S., et al.: Deep Residual Learning for Image Recognition. In: Proc. CVPR (2016)

8. Hua, Z., Huang, Z., Li, J.: Mesh Simplification Using Vertex Clustering Based on Principal Curvature. International Journal of Multimedia and Ubiquitous Engineering 10(9), 99–110 (2015)

9. Kaiser, A., Ybanez Zepeda, J.A., Boubekeur, T.: A Survey of Simple Geometric Primitives Detection Methods for Captured 3D Data. Computer Graphics Forum 38(1), 167–196 (2019)

10. Kalogerakis, E., Averkiou, M., Maji, S., et al.: 3D Shape Segmentation with Projective Convolutional Networks. In: Proc. CVPR (2017)

11. Kalogerakis, E., Hertzmann, A., Singh, K.: Learning 3D Mesh Segmentation and Labeling. ACM Trans. Graph. 29(3), 102:1–102:12 (2010)

12. Kazhdan, M., Bolitho, M., Hoppe, H.: Poisson Surface Reconstruction. In: Proc. SGP (2006)

13. Kingma, D.P., Ba, J.L.: Adam: A Method for Stochastic Optimization. In: Proc. ICLR (2015)

14. Krähenbühl, P., Koltun, V.: Efficient Inference in Fully Connected CRFs with Gaussian Edge Potentials. In: Proc. NIPS (2011)

15. Monti, F., Boscaini, D., Masci, J., et al.: Geometric Deep Learning on Graphs and Manifolds using Mixture Model CNNs. In: Proc. CVPR (2017)

16. Nocerino, E., Lago, F., Morabito, D., et al.: A Smartphone-based pipeline for the creative industry - The Replicate EU project. In: International Archives of Photogrammetry, Remote Sensing and Spatial Information Sciences (2017)

17. Potthast, C., Sukhatme, G.S.: A probabilistic framework for next best view estimation in a cluttered environment. Journal of Visual Communication and Image Representation 25(1), 148–164 (2014)

18. Qi, C.R., Su, H., Mo, K., et al.: PointNet: Deep Learning on Point Sets for 3D Classification and Segmentation. In: Proc. CVPR (2017)

19. Schönberger, J.L., Frahm, J.M.: Structure-from-Motion Revisited. In: Proc. CVPR (2016)

20. Shilane, P., Min, P., Kazhdan, M., et al.: The Princeton Shape Benchmark. In: Proc. SMI (2004)

21. Simonyan, K., Zisserman, A.: Very Deep Convolutional Networks for Large-Scale Image Recognition. In: Proc. ICLR (2015)

22. Su, H., Maji, S., Kalogerakis, E., et al.: Multi-View Convolutional Neural Networks for 3D Shape Recognition. In: Proc. ICCV (2015)

23. Wu, Z., Song, S., Khosla, A., et al.: 3D ShapeNets: A Deep Representation for Volumetric Shapes. In: Proc. CVPR (2015)

24. Yu, Y., Zhou, K., Xu, D., et al.: Mesh Editing with Poisson-based Gradient Field Manipulation. ACM Trans. Graph. 23(3), 644–651 (2004)

25. Zheng, S., Jayasumana, S., Romera-Paredes, B., et al.: Conditional Random Fields As Recurrent Neural Networks. In: Proc. ICCV (2015)
