SLOAM: Semantic Lidar Odometry and Mapping for Forest Inventory

2019·Arxiv

Abstract

Abstract

This paper describes an end-to-end pipeline for tree diameter estimation based on semantic segmentation and lidar odometry and mapping. Accurate mapping of this type of environment is challenging since the ground and the trees are surrounded by leaves, thorns and vines, and the sensor typically experiences extreme motion. We propose a semantic feature based pose optimization that simultaneously refines the tree models while estimating the robot pose. The pipeline utilizes a custom virtual reality tool for labeling 3D scans that is used to train a semantic segmentation network. The masked point cloud is used to compute a trellis graph that identifies individual instances and extracts relevant features that are used by the SLAM module. We show that traditional lidar and image based methods fail in the forest environment on both Unmanned Aerial Vehicle (UAV) and hand-carry systems, while our method is more robust, scalable, and automatically generates tree diameter estimations.

Index Terms—Robotics in Agriculture and Forestry, SLAM, Deep Learning in Robotics and Automation, Virtual Reality and Interfaces

I. INTRODUCTION

OBTAINING accurate timber inventory is crucial forforest managers. Foresters currently use Terrestrial Laser Scanning (TLS) sensors [1], [2] to extract tree metrics with high accuracy and reduced manpower [1]. However, these sensors need to be moved around to measure the timber inventory at multiple locations for full coverage of the forest, requiring additional time and manpower. We propose a robotic timber cruise involving a robot navigating a stand of timber, measuring samples, and estimating the total forest volume. Fig. 1 shows our Unmanned Aerial Vehicle (UAV) used in this work, and the resulting estimated timber map. This approach provides increased data granularity allowing foresters to make optimized management and harvesting decisions. Deploying UAV systems is appealing for a variety of applications including fruit counting [3], [4], disaster manage-

Manuscript received: September, 11, 2019; Revised November, 30, 2019; Accepted December, 23, 2019. This paper was recommended for publication by Editor Allison Okamura upon evaluation of the Associate Editor and Reviewers’ comments. This work is supported by Trefo LLC under the NSF SBIR grant #193856, by ARL grant ARL DCIST CRA 2911NF-17-2-0181, ONR grant N00014-07-1-0829, ARO grant W911NF-13-1-0350, INCT-INSac grants CNPq 465755/2014-3, FAPESP 2014/50851-0 and 2018/24526-5. This work is supported in part by the Semiconductor Research Corporation (SRC) and DARPA. We additionally thank NVIDIA for generously providing support through the NVAIL program.

of Pennsylvania, United States. {chenste,elslee,quchao, liuxu,kumar}@seas.upenn.edu. Authors with the Robot learning laboratory, ICMC - University of Sao Paulo, Brazil. {guinardari,rafrance}@usp.br Digital Object Identifier (DOI): see top of this page.

Fig. 1: Our UAV performing a timber inventory acquisition by simultaneously estimating its state and the tree diameters. Bounding boxes are manually added to illustrate the same landmarks in the 3D point cloud and the 2D image.

ment [5], and penstock inspection [6]. Previous works [7]– [10] focus on autonomous navigation in cluttered [10] or GPS-denied environments [7]. Many works employ UAVs in forests [8]–[14]. Most enable autonomous navigation by detecting and following forest trails using learning-based approaches [9], [11], [12]. Cui et al. [8] uses a 2D laser range finder to navigate autonomously, but makes the assumption that the UAV height does not drastically change. More recently, [10] develops a UAV platform equipped with a 3D lidar and IMU for navigating in timber environments and building point cloud maps, and [13], [14] review the UAV applications in forestry. None of these prior works focus on the problem of estimating timber volume.

Reche et al. [15] addresses the timber inventory problem by modeling trees from photographs, and Fritz et al. [16] uses photogrammetric point clouds collected over canopy to map tree stems. Other works estimate tree attributes using TLS [1], [2]. Relying on photography or TLS can be time consuming, and capturing data over dense forest canopy may not provide accurate tree diameter at breast height (DBH).

Our work leverages UAVs and lidars for measuring timber inventory, particularly tree DBH estimation, by focusing on solving the simultaneous localization and mapping (SLAM) problem. Previous methods, such as lidar odometry and mapping (LOAM) [17], rely on texture-based features that are brittle in forest environments. Our goal is to develop a Semantic LOAM (SLOAM) algorithm that extends ideas from semantic SLAM [18] for forestry applications. Our key idea is utilizing a parameterized landmark shape representation to obtain more robust solutions to the SLAM and DBH problems.

We present an end-to-end pipeline for solving the DBH estimation problem. While our approach is specific to our primary forestry problem, the framework and tools we develop can be generalized to other environments. The pipeline consists of: (1) 3D point cloud labeling with virtual reality (VR); (2) range image segmentation with a fully convolutional neural network (FCN); (3) landmark instance detection with trellis graphs; (4) semantic lidar odometry; and (5) semantic lidar mapping. We first present the semantic lidar odometry and mapping modules, as they form the core ideas of the work. We then present the first 3 modules which relate to accurate detection of landmarks from individual lidar sensor readings. Finally, we present experimental results of our system in the field. The primary contributions of this paper are:

• A complete end-to-end pipeline for solving the forest DBH estimation problem using UAV systems; and

• SLOAM, a semantic framework that is capable of handling challenging environments with aggressive motions while simultaneously estimating landmark parameters.

II. PROBLEM FORMULATION

Primary Problem (Forest DBH Estimation). Given a forest consisting of trees , estimate the number of trees N, and the position and diameter at breast height (DBH) of each tree .

The use of a mobile lidar sensor to solve the DBH estimation problem requires registering a sequence of lidar sweeps into a common reference frame. This registration typically involves estimating the trajectory of the lidar sensor over time, which leads to the following problem:

General Problem (Semantic LOAM). Given a sequence of lidar sweeps collected in an environment consisting of objects , estimate the sensor state trajectory X, the number of objects N, and the model parameters and classes of each object .

Under specific assumptions on the model parameterization and class of objects , a solution to the general Semantic LOAM problem will yield a solution to the primary DBH Estimation problem.

III. SEMANTIC LIDAR ODOMETRY AND MAPPING

The purpose of the lidar odometry algorithm is to estimate the rigid 6-DOF motion of the lidar within a lidar sweep. The purpose of the lidar mapping algorithm is to estimate the 6-DOF pose of the lidar in the world, and register the point cloud in the world frame. We follow the framework introduced in the original LOAM paper [17]. SLOAM incorporates semantic landmark features in an end-to-end system to increase robustness, scalability, and performance in challenging environments where LOAM fails, such as forests.

SLOAM relies on landmark models and features to estimate motion and register the lidar point clouds. The choice of which landmark features to use, and how to model the landmarks, depends on the environment. In forests, good landmark features are trees and ground modeled as cylinders and planes. We will assume that these features have been segmented and clustered into individual tree instances by an external segmentation and instance detection module described in Sec. IV.

For the previous lidar sweep that started with time stamp , let be the set of tree landmark models of size . For a given , let be the set of tree feature points of size associated with the landmark, and let be the aggregation of all tree features. Let be the set of ground feature points of size . Denote as the projections of to the lidar frame at time stamp . Let t be the current time stamp, and be the lidar pose transform between , where are the translations along the x, y, and z axes in the lidar frame {L}, and , and are the rotation angles following the right-hand rule. Let be similarly defined in the world frame {W}, which defines the sensor state trajectory . Both the semantic odometry and mapping algorithms estimate the pose transforms by performing a data association between tree and ground features in the current sweep with tree and ground models in a previous sweep (odometry) or map (mapping).

A. Model Parameterization and Distance Functions

We model the ground locally as a plane parameterized by , where is the normal of the plane, and is the offset such that the plane is defined by . Given a point p and plane , let be a point on the plane. We can then define a point to plane distance:

Since the ground is relatively flat and lidar upright, ground features provide constraints in the z axis, but not the x and y axes, and we need to also detect tree models and features.

We parameterize the tree cylinder models following the method in [19]. Let be the parameters of a cylinder model. To understand these parameters, let n be the cylinder normal at the point on the cylinder closest to the origin, and a be the axis of the cylinder. The first parameter is the distance from the origin to the closest point on the cylinder, is the angle between the projection of n onto the xy plane with the x axis, is the angle between n and the z axis, is the angle between a and the partial derivative of n with respect to , and is the radius of the cylinder. For an illustration of the cylinder model parameters and angles, please refer to [19, Fig. 3]. More specifically, given , we can compute the normal n and axis a as:

where is the partial derivative of n w.r.t , and is derived from the partial derivative of n w.r.t .

The point to cylinder distance between a point p and cylinder s is:

B. Estimating Tree and Ground Models

Given the set of feature points of size cor- responding to landmark , the geometric least squares approach to estimating the parameters of a cylinder s is to stack the point to cylinder distances for each feature point and optimize over the parameters of s. However, the point to cylinder distance can encounter singularities when the curvature decreases, which introduces difficulties in the numerical optimization procedure [19]. As a result, for a feature p and cylinder s, we approximate the point to cylinder distance with the following distance which has the same zero set and derivatives at the zero set as the true distance function, but additionally behaves well when the curvature decreases:

The geometric least squares optimization problem is then solved as follows:

It is easy to ensure that , by optimizing over . Thus given a set of tree features , solving Prob. (5) will yield a cylinder landmark model s.

Our approach to estimate the local ground plane models is similar to previous works [17], [20]. However, the challenge with forest environments is that the ground is frequently covered by brush and small shrubs that are irregularly shaped. While we describe a ground segmentation method that tries to filter out this noise in Sec. IV, it is extremely challenging to completely filter out all of the underbrush in natural forest environments. This noise can be problematic for standard ground plane estimation methods used in methods such as [20] which only utilize 3 points to estimate the plane.

Instead, we use the following robust ground estimation method. Given a set of m ground feature points denoted as the matrix G, we use Singular Value Decomposition (SVD) to estimate a model of the ground plane. Let G be the centroid, and be the mean shifted ground feature points. We can then compute the SVD of where and . The estimate of the normal will be the left singular vector of U corresponding to the least singular value. Given and the centroid G, we can compute by using the equation of the plane.

C. Motion Estimation and Data Association

Given a set of tree features and ground features in the current sweep , as well as tree features , ground features , and tree landmark models from the previous sweep projected to , we can estimate the pose transform by optimizing the following non-linear least squares problem:

where and balance the frequency between tree and ground features. In this optimization problem, we use the true point to cylinder distance defined in Eqn. (3), as the singularity from is eliminated since is now a constant. Notice that each of the distance function () depends on associating each feature point ) with an object model ), which is called data association. This data association process is different for the tree features and ground features.

For a tree feature , we need to find its landmark correspondence in . We first project the feature taken at time t to in the frame at time using an initial guess of at time t. There are two methods to perform the data association for tree features. In the first method, we match each feature to the model in with the smallest orthogonal distance to . In the second method, we find the nearest neighbor of in . Since each feature point in is associated with a cylinder model in , we then perform data association by matching to the cylinder associated with the nearest neighbor.

The first data association method is desirable, especially during the lidar mapping algorithm, because it only needs to maintain , whereas the second method needs to maintain both and . In a global map, the size of is independent of the number of sweeps seen, whereas the size of will grow over time. However, the second method can be more accurate since it preserves more detailed information through the features at the expense of having a map representation that increases with the number of sweeps. In this work, our main concern is to first obtain accuracy and robustness in forest environments, so we decide to employ the second method for data association. However, it is easy to switch to the first method when scaling to large environments.

Both data association methods are general to other environments beyond forests. The first method only requires a convex representation of the landmark models in order to efficiently compute a distance function, while the second method can be applied more generally as it directly compares features to features. As a result, our proposed data association methods extend beyond our primary problem of Forest DBH Estimation to the more general Semantic LOAM problem.

SLOAM handles the ground feature points differently than the tree feature points. For a given ground feature point at time t, we project it back to at time stamp . We then find the set of nearest neighbors to in , and compute the ground plane corresponding to this set using the SVD. We thus compute a ground plane for each feature , rather than explicitly maintaining and updating the plane model parameters as we do for the tree cylinders. Our treatment of the ground plane is thus similar to [20].

D. Semantic Lidar Odometry and Mapping

The semantic lidar odometry algorithm is presented in Alg. 1. Depending on implementation, multiple recursions can be executed with growing and as the sweep begins from and ends at , or it can be executed just once per sweep. It follows the overall framework of the

original lidar odometry algorithm in [17], except with the incorporation of semantic models and features.

The semantic lidar mapping algorithm has analogous inputs and outputs to the lidar odometry algorithm. It takes as input , which is the output of the lidar odometry algorithm. It then follows the similar steps in Alg. 1 lines 12- 26 to estimate . Rather than comparing the new features to , the mapping algorithm compares them to its map parameters .

The main difference is initializing and updating these map parameters. At time t = 0, the set of map landmarks is empty . At each subsequent sweep , we combine the odometry output with . For each tree feature in the odometry output , we assign it to a tree cylinder in using the same data association process as in the lidar odometry algorithm. However, if a feature point is not close to any tree cylinders, we mark it as an unassigned point. After performing this process, we check the unassigned points and determine which belong to the same tree in the odometry output . If a large enough number of points belong to the same tree in , we add that tree to , taking care to project the tree cylinder model into the world frame.

The process of aggregating the tree and ground features

is the same as in the original LOAM algorithm [17]. We can thus combine the semantic lidar odometry and mapping algorithms to estimate the global pose transform as well as accumulate both the feature points , and the tree landmark models . The explicit estimation of allows us to automatically retrieve the number of trees, along with an estimate of their radius, at the end of SLOAM. It thus demonstrates how a solution to SLOAM will yield a solution to our primary DBH Estimation problem.

IV. ROBUST TREE AND GROUND DETECTION

Sec. III assumes access to the tree features and ground features , and its performance is highly dependent on the quality and reliability of these features. However, obtaining these features is challenging in itself, and we present a 3 part module that reliably extracts tree features, as well a specialized segmentation method to extract ground features. These 3 modules for tree feature extraction can be generalized to other semantic features as well, and as a result can be combined with the SLOAM framework as a full end-to-end system for general environments.

A. Virtual Reality Point Cloud Labeling

Our data-driven deep learning segmentation method requires a large training data set in order to obtain good performance and generalization. However, there are few data sets for lidar point clouds in forests, and we need to collect and label our own data. Most current labeling tools use mouse and keyboard and are designed for 2D images, which are ill-suited for labeling 3D point clouds.

Virtual reality has frequently been used to visualize point clouds [21], and recently to label them [22]. However, most of these applications and tools have been developed in the graphics community, and are either not suitable nor easy to adapt to a robotics application. As a result, we designed a custom VR labeling tool that interfaces with ROS in order to obtain and label point clouds from standard mobile lidar platforms. The VR labeling tool is programmed using the Unity video game engine with the Oculus Rift.

In order to increase tree labeling efficiency, we design a labeling cylinder primitive, and other similar primitives can be used to label other objects. As shown in Fig. 2, the user labels points by placing the labeling primitive over the points. The user is able to easily rotate, translate and scale the point cloud using these handsets. After all tree points have been successfully labeled, the user can easily store the point cloud and cylinders into a database and proceed to the next point cloud without leaving the labeling environment, thus facilitating large acquisitions of labels.

B. Deep Learning Tree Segmentation

Natural forests feature large levels of occlusion and noise, and it is challenging to reliably segment lidar tree points at large scale [23]. Most approaches use clustering-based methods [24], which are limited to relatively clean forests. As a result, our approach is to use a deep neural network for point cloud segmentation to extract the tree features .

Fig. 2: Oculus Rift virtual reality point cloud labeling tool. The white points represent the point cloud, the red cylinder is the labeling primitive, and the gray hands represent the virtual hands of the labeling user.

Rather than directly operating on the point cloud and requiring the use of slower network architectures such as PointNet++ [25], our segmentation network takes as input a 2D range image of size where h is the number of beams and w is the number of azimuth readings of the sensor. Operating on the range image opens up the use faster FCN architectures. Prioritizing speed, we use a simplified version of ERFNet [26] with Non-bottleneck-1D structure, Downsampler and Deconvolution layers [26, Sec. IIIb].

The input is constructed by representing every 3D point as a pixel in the range image according to its position and the lidar sensor model. This representation captures spatial relationships through convolution operations and is more computationally efficient. One drawback is that far apart points with similar altitude and azimuth angles, but different depth ranges, will be neighbors in the range image, and as a result can be challenging to differentiate. To minimize this effect, during training and inference we sample the point cloud according to a radius range relative to the sensor. The segmentation network will output a predicted segmentation for each sampled image, and the final network prediction is given by the sum of all the sampled predictions. This output is a mask assigning semantic labels for every pixel which are projected back to the corresponding 3D points in the lidar point cloud.

C. Ground Segmentation

We next perform ground segmentation to extract . Rather than a neural network, we use simple heuristics due to their effectiveness and simplicity. We make the basic assumption that the ground is locally planar, but not necessarily globally planar. In a given scan, we first remove points that the neural network determines to be part of a tree. As a result, we are left with points that belong either to the ground, shrub, or leaves. The core idea is that the ground should appear below all of these other points. However, we cannot simply extract the lowest points in z from the lidar sweep, as this approach will fail if the terrain is sloped or features other complicated behavior. We instead divide the lidar sweep into a circular grid specified by the distance and angle of the space around

Fig. 3: Left: Trellis graph with 5 detected trees (lower beams at top, higher beams at bottom). Tree 13 exhibits a fork structure. Right: Detected tree points. Each of the 16 beams of our VLP-16 lidar passes over the trees, highlighting the lidar output structure and connection to trellis graphs.

the sensor. Within each grid cell, we retain the lowest points in the z direction, where the number of points to retain is a hyperparameter. By retaining fewer points, it is more likely that the points in are actually ground points, but the total number of ground features will be smaller.

D. Instance Detection

Once we segment the tree points in the lidar sweep, we next group them into individual tree landmarks . The lidar output has a natural structure as exhibited in the range image, specifically that points can be sorted by beam and arranged by the rotation direction of the lidar. This output can be viewed as a trellis graph [27], where each slice of the trellis represents a single lidar beam. We exploit this trellis structure to detect tree instances. The key observation motivating our approach is that in the presence of gravity, a higher beam detecting an object indicates that a lower beam will likely also detect that same object. As a result, in the trellis graph, an object will appear as a path that starts from earlier slices (lower beams) and ends at later slices (higher beams). By properly constructing a weighted trellis graph, we can identify tree instances by efficiently solving for shortest routes using the dynamic programming Viterbi Algorithm [27], or even more simply a greedy algorithm. Since this approach only assumes gravity, it generalizes to other objects besides trees.

The vertices of the trellis graph are clusters of points belonging to the same object in a beam, and the edges are the connections between these vertices across beams. We assume that points on the same object have similar depth readings, and construct vertices in the trellis graph by clustering contiguous points within a single beam according to a depth threshold. We create the edges by connecting the vertices between previous and succeeding slices. To reduce computation, we then remove edges where the distance between the centroid of the points in each vertex exceeds a distance threshold, as those vertices are

TABLE I: ERFNet inspired architecture used for semantic segmentation.

TABLE II: Error: Distance relative to trajectory length in hard UAV experiment loop.

unlikely to belong to the same tree. We specify the weights of each edge by constructing a score function, such as the distance between the vertex centroids.

Once we have specified the trellis graph, we greedily find the tree instances by starting from vertices in the early slices and finding shortest routes through the trellis. We add the route as a tree instance if it exceeds a minimum path length threshold indicating that the tree was detected by enough lidar beams, and if the total path cost is less than a path weight threshold, indicating that the centroids of each vertex in the tree are close together. Although these shortest routes can be solved optimally with the Viterbi algorithm, we found that a greedy approach of following the minimum-weight outgoing edges at each vertex worked for tree instances.

There are 2 auxiliary benefits of using this trellis approach. First, it immediately provides an initial guess for the cylinder parameters . Since the geometric least squares Prob. (5) is non-linear, it is extremely important to obtain a good intialization. We define the focus point of a vertex as the mean of the two points that are furthest from each other in the vertex. The focus point is an estimate of the center of the circle defining the shape of an individual vertex. The radius of this circle can be used to initialize the radius of the tree, and an Orthogonal Distance Regression (ODR) through the focus points of a tree can be used to estimate the axis a and normal n of the cylinder model of the tree. The second benefit of this trellis approach is that it can identify other properties of trees, for example forks in the tree where a single trunk branches into two parts. Fig. 3 displays two trees detected in an organized lidar scan. It also depicts a portion of the corresponding trellis graph with 5 trees, where tree 13 exhibits a fork structure.

V. EXPERIMENTS

We use a Velodyne Puck VLP-16 lidar in our experiments.

The frequency of each lidar sweep is 5Hz. For our method, we run semantic lidar odometry (Alg. 1) per sweep, and semantic lidar mapping once per sweep. The lidar is equipped onto a UAV platform that is manually flown, as well as a handheld sensor suite. The experiments were performed in Wharton State Forest in New Jersey.

We benchmark our method (SLOAM) against A-LOAM, Generalized ICP (GICP), and the Intel Realsense T265. ALOAM is an open source implementation of LOAM [17].

Fig. 4: Trajectories of benchmark methods in hard UAV experiment loop.

GICP [28] is an open source implementation of the Iterative Closest Point (ICP) algorithm available through PointCloud Library (PCL). In order to increase the speed of ICP, we apply a voxel grid filter to reduce the number of points. The Intel Realsense T265 is a commercial off-the-shelf tracking camera that relies on visual-inertial odometry (VIO).

We evaluate on 2 experiments, which we classify as medium and hard. The medium scenario involves walking the handheld sensor suite through a dense forest environment for 1 minute in a straight line. The sensor readings entail rotation motions and instabilities caused by the operator dodging vines and thorns. The hard scenario is the UAV flying for 2 minutes that features significant rotation motions. It starts from hover and flies in a 65m trajectory until it loops back to the initial position and lands on a landing platform. Since the start and end marks are different in the z axis, we offset the goal coordinate by 1 meter instead of using the origin.

The segmentation network is trained on 544 scans from which 16 are from the handheld dataset and the remaining are from 5 other regions of the same forest. No data from the UAV flight was used for training. The network architecture was chosen emphasizing inference speed. Our network architecture shown in Table I runs at 100Hz inference speeds with an Intel NUC5i7RYH on-board our UAV. The final model achieves 0.81 average IoU score on a 10-fold cross validation.

We consider a variety of qualitative and quantitative metrics. Qualitatively we evaluate the trajectory and the accumulated point cloud to observe if any ghosting or duplication of trees occur. Quantitatively, we evaluate the error between start and end in the UAV experiment which performs a loop. Since SLOAM explicitly estimates the radius of each tree, we also quantitatively evaluate the DBH estimation compared to field measurements using a tape measure.

Fig. 5: SLOAM is the only method succeeds in handling aggressive motions (yaw). Top Row (UAV dataset): Colored by z-axis. A-LOAM and GICP maps are blurry, indicating failure in mapping. Bottom Row (Handheld dataset): Colors illustrate the different ways each method treats the points.

Fig. 4 displays each method’s trajectory on the UAV dataset, and Table II quantifies the drift error between start and end. SLOAM achieves the lowest drift. A-LOAM clearly drifts, while the T265 outright fails. The VIO failure is expected, as it cannot handle extreme rotations. According to these trajectory metrics and plots, GICP seems to track closely with SLOAM, and achieves a similar magnitude of drift.

However, Fig. 5 demonstrates that GICP also has difficulty with these datasets. The top row demonstrates the differences in the point cloud maps for each of the lidar based methods. Both A-LOAM and GICP produce blurry maps with frequent ghosting of trees. This blurring is unacceptable when we need to estimate the diameter of the trees with high accuracy. SLOAM, on the other hand, produces a crisp map that preserves fine details in the tree shapes. The fact that GICP performs comparably to SLOAM in the trajectory metrics, but much worse when viewing the point cloud maps, indicates that it has difficulty with rotation motions such as yaw, since the trajectory only measures x,y, and z positions.

SLOAM outperforms A-LOAM and GICP because our semantic features are more reliable than texture-based lines and planes. Specifically, for both ground and tree features, data association is more robust since it inherently filters out noise, and the resulting cost function is more informative due to the use of landmark shapes. While these texture features are reliable in man-made environments, they are problematic in natural environments which lack clear planar and edge surfaces. The bottom row of Fig. 5 illustrates the different ways each method treats the points. SLOAM detects each semantic landmark, indicated by the different colorings of each tree instance. On the other hand, the A-LOAM features appear random, indicating that they are not distinctive and are thus prone to frequent data misassociation. Finally, GICP does not make a distinction between the points. This approach works well when the motion is slight, but it is also susceptible to

TABLE III: DBH Metrics in hard UAV experiment

data misassociation during extreme rotations.

Compared to SLOAM which uses a point to cylinder cost function, both A-LOAM and GICP only utilize a point to plane or point to line cost functions to compute the pose transformation. These approaches force a false planar model onto the cylinders, and will introduce slight errors that manifest as wider, blurry trees. While these small errors will not lead to an outright failure in the sensor state estimates, they are still unacceptable due to the high precision and accuracy necessary to measure tree diameters.

We next evaluate how well SLOAM can estimate the DBH of the tree landmarks. We obtain these estimate for free, as we can use the semantic models and features to extract out the diameter for each landmark. For the hard UAV dataset described in the previous section, we manually measured 35 trees that were in the path of the robot and use the models generated by our method to estimate DBH. We summarize the DBH estimation results compared to human measurements in Table III. SLOAM detected 29 trees with an average error is 0.67 in, which falls within the desired accuracy as typically in industry the measurements are taken to the nearest inch.

We found that using the diameter parameter of our cylinder models to estimate the DBH had a few large outliers. Instead, we found that it was more effective to take the median of all the radii estimates across all beams in all scans, which yielded the results presented in Table III. This process still requires accurate registration from SLOAM to group these beams together. We believe that since only half of the cylinder can be viewed from the lidar at a single scan, the presence of noise or insufficient features can cause instability in the geometric least squares optimization process. It does not seem to affect the pose estimation optimization process, as the presence of many tree landmarks offers robustness to noise. Beyond the median operation, no additional post-processing steps are required to obtain the DBH results.

The ability to quantify mapping performance through landmark ground truth measurements highlights another strength of our approach. It is difficult to quantify the performance of traditional SLAM algorithms, as ground truth measurements of trajectories are hard to obtain in natural environments. For example, the dense forest canopy prevents the use of GPS as ground truth, as the errors can range up to tens of meters. On the other hand, ground truth measurements of the landmark shapes are easily obtained, and provide an alternative quantitative metric to benchmark various algorithms.

VI. CONCLUSIONS

We pose the DBH estimation problem as a special case of the semantic lidar odometry and mapping problem, providing a generic formulation that can be extended to other scenarios. We develop a VR labeling tool to facilitate annotation of 3D lidar scans by fitting geometric primitives to the raw data. This enables us to train a segmentation model that is used with a graph based method for instance detection and extraction of relevant attributes of each tree such as radius and focus point. We demonstrate the difficulty of the DBH estimation problem by benchmarking against 3 other lidar state estimation and VIO methods and observing large drifts. Finally, we show that the semantic shape models are critical in achieving accuracy and scalability in challenging natural environments.

REFERENCES

[1] T. de Conto, K. Olofsson, E. B. G¨orgens, L. C. E. Rodriguez, and G. Almeida, “Performance of stem denoising and stem modelling algorithms on single tree point clouds from terrestrial laser scanning,” Computers and Electronics in Agriculture, vol. 143, no. November, pp. 165–176, 2017.

[2] X. Liang, V. Kankare, J. Hyypp¨a, Y. Wang, A. Kukko, H. Haggr´en, X. Yu, H. Kaartinen, A. Jaakkola, F. Guan et al., “Terrestrial laser scanning in forest inventories,” ISPRS Journal of Photogrammetry and Remote Sensing, vol. 115, pp. 63–77, 2016.

[3] S. W. Chen, S. S. Shivakumar, S. Dcunha, J. Das, E. Okon, C. Qu, C. J. Taylor, and V. Kumar, “Counting Apples and Oranges with Deep Learning: A Data-Driven Approach,” IEEE Robotics and Automation Letters, vol. 2, no. 2, pp. 781–788, 2017.

[4] X. Liu, S. W. Chen, C. Liu, S. S. Shivakumar, J. Das, C. J. Taylor, J. Underwood, and V. Kumar, “Monocular Camera Based Fruit Counting and Mapping with Semantic Data Association,” IEEE Robotics and Automation Letters, vol. 4, no. 3, pp. 2296–2303, 2019.

[5] S. Lee, D. Har, and D. Kum, “Drone-Assisted Disaster Management: Finding Victims via Infrared Camera and Lidar Sensor Fusion,” Proceedings - Asia-Pacific World Congress on Computer Science and Engineering 2016 and Asia-Pacific World Congress on Engineering 2016, APWC on CSE/APWCE 2016, no. C, pp. 84–89, 2017.

[6] T. Nguyen, S. S. Shivakumar, I. D. Miller, J. Keller, E. S. Lee, A. Zhou, T. zaslan, G. Loianno, J. H. Harwood, J. Wozencraft, C. J. Taylor, and V. Kumar, “Mavnet: An effective semantic segmentation micro-network for mav-based tasks,” IEEE Robotics and Automation Letters, vol. 4, no. 4, pp. 3908–3915, Oct 2019.

[7] F. J. Perez-Grau, R. Ragel, F. Caballero, A. Viguria, and A. Ollero, “An architecture for robust uav navigation in gps-denied areas,” Journal of Field Robotics, vol. 35, no. 1, pp. 121–145, 2018.

[8] J. Q. Cui, S. Lai, X. Dong, and B. M. Chen, “Autonomous Navigation of UAV in Foliage Environment,” Journal of Intelligent and Robotic Systems: Theory and Applications, vol. 84, no. 1-4, pp. 259–276, 2016.

[9] B. G. Maciel-Pearson, P. Carbonneau, and T. P. Breckon, “Extend- ing deep neural network trail navigation for unmanned aerial vehicle operation within the forest canopy,” in Annual Conference Towards Autonomous Robotic Systems. Springer, 2018, pp. 147–158.

[10] F. Gao, W. Wu, W. Gao, and S. Shen, “Flying on point clouds: Online trajectory generation and autonomous navigation for quadrotors in cluttered environments,” Journal of Field Robotics, vol. 36, no. 4, pp. 710–733, 2019.

[11] A. Giusti, J. Guzzi, D. C. Ciresan, F. L. He, J. P. Rodriguez, F. Fontana, M. Faessler, C. Forster, J. Schmidhuber, G. D. Caro, D. Scaramuzza, and L. M. Gambardella, “A Machine Learning Approach to Visual Perception of Forest Trails for Mobile Robots,” IEEE Robotics and Automation Letters, vol. 1, no. 2, pp. 661–667, 2016.

[12] B. G. Maciel Pearson, S. Akcay, A. Atapour-Abarghouei, C. Holder, and T. Breckon, “Multi-Task Regression-based Learning for Autonomous Unmanned Aerial Vehicle Flight Control within Unstructured Outdoor Environments,” IEEE Robotics and Automation Letters, vol. 4, no. 4, pp. 1–1, 2019.

[13] C. Torresan, A. Berton, F. Carotenuto, S. F. Di Gennaro, B. Gioli, A. Matese, F. Miglietta, C. Vagnoli, A. Zaldei, and L. Wallace, “Forestry applications of UAVs in Europe: a review,” International Journal of Remote Sensing, vol. 38, no. 8-10, pp. 2427–2447, 2017.

[14] S. Puliti, H. O. Ørka, T. Gobakken, and E. Næsset, “Inventory of small forest areas using an unmanned aerial system,” Remote Sensing, vol. 7, no. 8, pp. 9632–9654, 2015.

[15] A. Reche-Martinez, I. Martin, and G. Drettakis, “Volumetric recon- struction and interactive rendering of trees from photographs,” in ACM transactions on graphics (ToG), vol. 23, no. 3. ACM, 2004, pp. 720– 727.

[16] A. Fritz, T. Kattenborn, and B. Koch, “UAV-Based photogrammetric point clouds – Tree stem mapping in open stands in comparison to terrestrial laser scanner point clouds,” ISPRS - International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences, vol. XL-1/W2, no. September, pp. 141–146, 2013.

[17] J. Zhang and S. Singh, “Loam: Lidar odometry and mapping in real- time.” in Robotics: Science and Systems, vol. 2, 2014, p. 9.

[18] S. L. Bowman, N. Atanasov, K. Daniilidis, and G. J. Pappas, “Proba- bilistic data association for semantic slam,” in 2017 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2017, pp. 1722–1729.

[19] G. Luk´acs, A. Marshall, and R. Martin, “Geometric least-squares fitting of spheres, cylinders, cones and tori,” RECCAD, Deliverable Document 2 and 3, COPERNICUS project, no. 1068, 1997.

[20] T. Shan and B. Englot, “Lego-loam: Lightweight and ground-optimized lidar odometry and mapping on variable terrain,” in 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2018, pp. 4758–4765.

[21] G. Bruder, F. Steinicke, and A. N¨uchter, “Poster: Immersive point cloud virtual environments,” in 2014 IEEE Symposium on 3D User Interfaces (3DUI). IEEE, 2014, pp. 161–162.

[22] J. D. Stets, Y. Sun, W. Corning, and S. W. Greenwald, “Visualization and labeling of point clouds in virtual reality,” in SIGGRAPH Asia 2017 Posters. ACM, 2017, p. 31.

[23] J.-F. Tremblay, M. B´eland, F. Pomerleau, R. Gagnon, and P. Gigu`ere, “Automatic 3d mapping for tree diameter measurements in inventory operations,” arXiv preprint arXiv:1904.05281, 2019.

[24] M. Pierzchała, P. Gigu`ere, and R. Astrup, “Mapping forests using an unmanned ground vehicle with 3d lidar and graph-slam,” Computers and Electronics in Agriculture, vol. 145, pp. 217–225, 2018.

[25] C. R. Qi, L. Yi, H. Su, and L. J. Guibas, “Pointnet++: Deep hierarchical feature learning on point sets in a metric space,” in Advances in neural information processing systems, 2017, pp. 5099–5108.

[26] E. Romera, J. M. Alvarez, L. M. Bergasa, and R. Arroyo, “Erfnet: Effi- cient residual factorized convnet for real-time semantic segmentation,” IEEE Transactions on Intelligent Transportation Systems, vol. 19, no. 1, pp. 263–272, 2017.

[27] G. D. Forney, “The viterbi algorithm,” Proceedings of the IEEE, vol. 61, no. 3, pp. 268–278, 1973.

[28] A. Segal, D. Haehnel, and S. Thrun, “Generalized-icp.” in Robotics: science and systems, vol. 2, no. 4. Seattle, WA, 2009, p. 435.

designed for accessibility and to further open science