Autonomous and reliable grasping is crucial to robots performing useful tasks in the real world. A robust robotic grasper must have the ability to: operate in situations where objects may be moving; be robust to errors in sensing or actuation; grasp items that have never been seen previously.
To grasp robustly, with respect to dynamic scenes or sensor/actuation error, a closed-loop approach is required with the ability to perform grasp synthesis at a sufficient rate to use in the control loop. For example, the system [1], which is extended in this work, provides a grasp pose in just 19 ms given a depth image of the scene. However, all RGB-D cameras have a minimum sensing distance for range data, typically in the order of 30 cm as illustrated in Figure 1. For objects closer than this “standoff” distance the camera provides an RGB image but no valid depth data.
This means that a dynamic grasp planner such as [1] will fail during the final grasp phase if the object is still moving. However, RGB cameras can generally operate reliably at close range subject to constraints on depth of field, field of view and occlusion.
Centre for Robotic Vision (ACRV), Queensland University of Technology (QUT), Brisbane, Australia j.haviland@qut.edu.au, feras.dayoub@qut.edu.au, peter.corke@qut.edu.au. This research was conducted by the Australian Research Council project number CE140100016, and supported by the QUT Centre for Robotics.
Fig. 1: During the final approach phase depth information is no longer available from the RGB-D sensor. We use image-based visual servoing for the final motion phase and servo toward a goal feature configuration predicted from the last valid depth image.
In order to perform closed-loop control below the minimum depth sensing distance of the RGB-D camera, a visual servoing (VS) scheme is proposed. Image-based visual servoing (IBVS) is a control approach that uses a set of image-plane visual features (point coordinates [2], parameters of lines or ellipses [3], or image moments [4], [5]) to guide a camera to a desired pose with respect to the scene [2], [6]. A particular advantage of IBVS with point-coordinate features, compared to other VS schemes [2], is that the points are driven in straight lines on the image plane and never leave the field of view.
IBVS is simple and remarkably robust but in practice there are three challenges: we need to know the distance from the camera to the object; we need to establish robust correspondence between the current and the goal features; and we need to know the goal feature configuration. Firstly, in our case object distance can be inferred from a valid depth image and subsequent robot joint-encoder odometry. Secondly, many techniques exist for robustly establishing correspondence and we propose the use of scale- and rotation-invariant features. Thirdly, the goal feature configuration can be predicted from a valid depth image combined with the current and desired end-effector pose; this prediction is a key contribution of this work. Further details on each of these system components are provided in the remainder of this paper.
We assume that the object motion is planar, that the RGB camera can provide a focussed image at close range, and that the robot’s fingers do not occlude the object.
The contributions of this paper are:
1) The use of IBVS to extend the closed-loop working range of a depth-image-based grasping controller so as to allow more robust grasping of moving objects.
2) A novel method to predict the goal image-feature configuration for IBVS, from an RGB-D image.
3) Experimental validation with unmodeled moving objects.
Section II describes related work, Section III describes our control approach, Section IV describes our experimental setup and methodology and, finally, Section V details our experimental results and insights informed by the results.
A. Visual Servoing for Grasping
A key problem in robotic grasping is determining the grasp pose. Grasp pose synthesis is a well established field with many methods [7], [8]. Recently, grasp point synthesizers using deep learning approaches have proven very successful, even on never-before-seen items and scenes [1], [9]–[11]. These synthesizers have been able to learn effective and important features present in depth images to output reliable grasp poses.
Closed-loop grasping involves using a VS scheme to position the robot’s end-effector in such a way that it can grasp an object. The current state-of-the-art in robotic grasping uses position-based visual servoing (PBVS) to guide the end-effector to the grasp point [1], [9]–[11] based on the estimated camera-relative pose of the object grasp point. However, these approaches rely on depth information from an RGB-D camera, which has a minimum operating distance; the last stage of the grasp must therefore be completed in an open-loop manner.
B. Minimum sensing distance
Structured light [12] and stereo [13] cameras exploit multiple view geometry with a fixed transform between two different camera sensors to construct 3D information [14]. The transform between the two sensors is known as the baseline. On structured light cameras, the baseline causes the projected pattern to fall outside the field of view of the camera at ranges below the minimum standoff distance. For stereo cameras at close range, the amount of overlap between views is limited, and most algorithms enforce a maximum disparity search range; since disparity is inversely proportional to range, this limit sets a minimum measurable range.
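The relationship between a disparity search limit and the minimum measurable range can be illustrated with a short sketch. It assumes an ideal rectified stereo pair, for which disparity is d = f b / Z; the function name and the numbers in the usage note are hypothetical and do not describe any particular sensor.

```python
def min_stereo_range(focal_px: float, baseline_m: float, max_disparity_px: float) -> float:
    """Closest range (in metres) that still falls inside the disparity search window."""
    # Rectified stereo: disparity d = f * b / Z, so the largest disparity the
    # matcher will search for corresponds to the smallest measurable range.
    if max_disparity_px <= 0:
        raise ValueError("max_disparity_px must be positive")
    return focal_px * baseline_m / max_disparity_px
```

For example, `min_stereo_range(640.0, 0.055, 128.0)` evaluates the limit for an illustrative camera with a 640 pixel focal length, a 55 mm baseline and a 128 pixel disparity search range.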
Time-of-flight (ToF) cameras construct 3D information by measuring the round-trip time of a projected light signal [15]. However, these will not operate below the minimum standoff distance because there is a minimum measurable round-trip time for the light pulse emitted by the camera.
C. Using a Feature Detector for Visual Servoing
Image feature points are a popular choice for IBVS in unstructured environments. Feature descriptors describe a support region around the feature point and are essential for reliable matching across views – they are ideally invariant to scale, orientation, illumination, and affine transformations.
Common feature detector and descriptor combinations include: SIFT [16], SURF [17], ORB [18] and MSER [19]. A study [20] concluded that SIFT descriptors [16] were the best in all categories other than robustness to luminance changes. Despite the computation time, we have used SIFT in this work.
For IBVS it is critical, at every time step, to locate each goal feature in the current image. Features are initially matched between frames based on descriptor distance. Greater robustness can be achieved by various heuristics such as the ratio test [16], loop consistency [21], or epipolar constraints enforced by computing the fundamental matrix [14] with Random Sample Consensus (RANSAC) [22].
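As an illustration of descriptor matching with the ratio test, the following is a minimal NumPy sketch. It uses brute-force Euclidean distances; the function name and the 0.8 threshold (the value suggested in [16]) are assumptions made for the example, not parameters reported in this paper.

```python
import numpy as np


def ratio_test_matches(desc_a: np.ndarray, desc_b: np.ndarray, ratio: float = 0.8):
    """Return (i, j) pairs where descriptor i of A is unambiguously matched to j of B."""
    # Pairwise Euclidean distances, shape (len(desc_a), len(desc_b)).
    dists = np.linalg.norm(desc_a[:, None, :] - desc_b[None, :, :], axis=2)
    matches = []
    for i, row in enumerate(dists):
        if row.size < 2:
            continue  # the ratio test needs a best and a second-best candidate
        nearest, second = np.argsort(row)[:2]
        # Keep the match only if the best candidate is clearly better than the runner-up.
        if row[nearest] < ratio * row[second]:
            matches.append((i, int(nearest)))
    return matches
```

A match is discarded whenever two candidates are almost equally close, which is the typical situation for repeated texture.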
Other approaches exist which variously: use image reference features with SIFT descriptors [23], use epipolar lines to define a sliding visual servoing scheme [24], apply IBVS on a mobile robot [25], and use SIFT features and descriptors with a known 3D model of the goal object to guide a position-based visual servoing (PBVS) scheme [26]. Deep learning has provided alternatives to feature matching, such as monocular depth estimation, or depth reconstruction [27], [28]. However, these are computationally expensive and trained on large-scale scenes rather than close-up images.
The approaches in [23], [24], and [25] require prior knowledge of the goal feature configuration. Typically in IBVS this comes from moving the camera to the goal pose but for the problem we are considering this is not possible. Instead the goal feature configuration must be estimated, for a previously unseen object, from observed RGB-D data and measured robot pose.
This section outlines our proposed grasp controller. The primary sensor is an Intel RealSense D415 RGB-D camera and the control loops run at 30 Hz.
The key aspects of our controller are:
1) Perform continuous grasp pose synthesis and PBVS to approach the object, utilizing 3D information from an RGB-D camera. See Frame A in Figure 2 and Section III-A. We also find SIFT image features on the object and record their position, descriptor and depth.
2) At the lower depth limit of the camera, estimate the image-plane locations of the SIFT features for a camera at the grasp pose. See Frame B in Figure 2 and Section III-B.2.
3) Below the lower depth limit of the camera, robustly match features in the camera’s view to the last stored features and perform IBVS control to drive the former to the latter. See Frame C in Figure 2 and Section III-B.
4) Grasp the object.
Nomenclature: We use the notation of [29] where {x} denotes a coordinate frame, ${}^{x}\mathbf{P}$ is a point in 3D space defined with respect to the coordinate frame {x}, ${}^{y}\xi_{x}$ is a relative pose or rigid-body transformation of {x} with respect to {y}, and $\oplus$ represents composition. Additionally, we use ${}^{x}\mathbf{I} \in \mathbb{R}^{H \times W}$ (where H and W are the height and width of the image) to denote an image captured by a camera in the {x} frame, and ${}^{x}\mathbf{p}$ to denote an image-plane coordinate of ${}^{x}\mathbf{P}$. We define coordinate frames ‘w’ for world, ‘c’ for camera, and ‘e’ for end-effector. The superscript ‘*’ denotes demand.
Fig. 2: Overview of proposed switching grasp controller. The robot continuously performs grasp point synthesis while using a PBVS scheme to approach the grasp point (shown in A). At the depth limit of the RGB-D camera the controller sets a grasp pose, stores SIFT key points with locations of the current view of the scene, and predicts the target configuration of these features (shown in B). After this point, features in the current scene are matched to the stored information, before using the predicted target configuration of these features to IBVS to the goal (shown in C).
A. RGBD-based initial reaching phase
1) Grasp Pose Calculation: We utilize the grasp synthesizer of Morrison et al. [1] which, unlike competing deep learning approaches, provides a real-time output (reported as 19 ms per image), making it suitable for a closed-loop system. It takes a depth image and outputs an antipodal grasp encoded as
$$ {}^{c}\mathbf{G} = (\mathbf{\Phi}, \mathbf{W}, \mathbf{Q}) \in \mathbb{R}^{3 \times H \times W} $$
where $\mathbf{\Phi}$, $\mathbf{W}$ and $\mathbf{Q} \in \mathbb{R}^{H \times W}$ are maps whose values represent respectively the finger orientation, finger width and grasp quality at every pixel. The best visible grasp is given by
$$ {}^{c}\tilde{g} = \max_{\mathbf{Q}} {}^{c}\mathbf{G} $$
which is a tuple
$$ {}^{c}\tilde{g} = (s, \tilde{\phi}, \tilde{w}, q) $$
where s = (u, v) is the coordinate in $\mathbf{Q}$ with the greatest grasp quality and $\tilde{\phi}$, $\tilde{w}$ and $q$ are the corresponding finger orientation, width and grasp quality. The grasp is expressed in camera image coordinates and the camera axis is assumed normal to the table surface. Similarly ranked grasps may exist at multiple locations in the image, so the grasp point may be unstable, which degrades the speed and quality of the control. To counteract this, the previous best grasp is stored and the next best grasp is defined as the local maximum of $\mathbf{Q}$ closest to it in image-plane coordinates. The general form of the camera projection equation for a calibrated camera is
$$ (\tilde{u}, \tilde{v}, \tilde{w})^{\top} = \mathbf{C}\,(X, Y, Z, 1)^{\top} $$
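where the left-hand side is the homogeneous image-plane coordinate $(\tilde{u}, \tilde{v}, \tilde{w})^{\top}$ of the Cartesian point (X, Y, Z), with pixel coordinates $u = \tilde{u}/\tilde{w}$ and $v = \tilde{v}/\tilde{w}$, and $\mathbf{C} \in \mathbb{R}^{3 \times 4}$ is the camera matrix, a function of camera pose and intrinsics, which we can write in partitioned form
$$ \mathbf{C} = \begin{pmatrix} \mathbf{C}_{11} & \mathbf{c}_{12} & \mathbf{c}_{13} \\ \mathbf{c}_{21}^{\top} & c_{22} & c_{23} \end{pmatrix} $$
where $\mathbf{C}_{11} \in \mathbb{R}^{2 \times 2}$, $\mathbf{c}_{12}, \mathbf{c}_{13}, \mathbf{c}_{21} \in \mathbb{R}^{2}$ and the rest scalar. We can solve for (X, Y) since $\mathbf{p} = (u, v)^{\top}$ and Z are known
$$ \begin{pmatrix} X \\ Y \end{pmatrix} = \left(\mathbf{C}_{11} - \mathbf{p}\,\mathbf{c}_{21}^{\top}\right)^{-1} \left( (c_{22} Z + c_{23})\,\mathbf{p} - \mathbf{c}_{12} Z - \mathbf{c}_{13} \right) \qquad (1) $$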
The best grasp is
$$ {}^{c}g = ({}^{c}\mathbf{P}, \phi, W, q) $$
where ${}^{c}\mathbf{P} = (X, Y, Z)$ represents the 3D location of the grasp, $\phi$ represents the finger orientation (yaw) of the grasp, W represents the grasp width in metres, and q represents the grasp quality. The grasp pose can be conveniently represented as Cartesian position and XYZ roll, pitch, and yaw angles
$$ {}^{c}\xi_{g} = (X, Y, Z, 0, 0, \phi) $$
where roll and pitch angles are 0 since the fingers are assumed to be normal to the table.
The fingers can be closed when frame {e} equals {g}, but the grasp pose is measured with respect to the camera, which is offset from the end-effector by ${}^{e}\xi_{c}$. The desired camera pose, with respect to the current camera pose, is therefore
$$ {}^{c}\xi_{c^{*}} = {}^{c}\xi_{g} \oplus {}^{e}\xi_{c} $$
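The conversion from a grasp pixel to a desired camera pose can be sketched as follows. This is a simplified illustration rather than the authors' implementation: it assumes an ideal pinhole camera with intrinsics (fx, fy, u0, v0) instead of a general camera matrix, represents poses as 4×4 homogeneous transforms, and takes the desired camera pose to be the grasp pose composed with the fixed end-effector-to-camera offset. All function and variable names are hypothetical.

```python
import numpy as np


def back_project(u, v, Z, fx, fy, u0, v0):
    """Camera-frame point (X, Y, Z) observed at pixel (u, v) with known depth Z."""
    return np.array([(u - u0) * Z / fx, (v - v0) * Z / fy, Z])


def grasp_pose(point, yaw):
    """4x4 pose of the grasp frame {g} in the camera frame (roll = pitch = 0)."""
    c, s = np.cos(yaw), np.sin(yaw)
    T = np.eye(4)
    T[:3, :3] = [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]
    T[:3, 3] = point
    return T


def desired_camera_pose(T_c_g, T_e_c):
    """Camera pose, relative to the current camera, that places {e} on {g}."""
    # With {e} coincident with {g}, the camera sits at the fixed offset T_e_c from {g}.
    return T_c_g @ T_e_c
```

Here `T_e_c` is the calibrated pose of the camera with respect to the end-effector; the result of `desired_camera_pose` is the pose error that a position-based controller would drive to identity.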
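2) PBVS Controller: The PBVS controller is defined as
$$ {}^{w}\nu_{e} = \mathbf{\Lambda}\, e \qquad (2) $$
where ${}^{w}\nu_{e}$ is the end-effector spatial velocity in the world frame, $\mathbf{\Lambda}$ is the diagonal controller gain matrix, and e describes the pose error, i.e. the difference between the desired and current poses. This controller will run while depth data is available from the RGB-D camera.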
3) Finding visual features: At every time step we extract the n strongest SIFT features that belong to the object from the RGB image, and form a list of reference features and descriptors
$$ \mathcal{F} = \{(f_i, d_i, Z_i)\}, \quad i = 1, \ldots, n \qquad (3) $$
where $f_i = (u_i, v_i)$ is the feature position, $d_i$ is the corresponding SIFT descriptor, and $Z_i$ is the corresponding depth if valid depth data is available from the RGB-D camera.
For the last frame with valid depth information, we record the feature data and the end-effector pose for later use.
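B. RGB-based final approach phase
1) IBVS Controller: The IBVS controller acts on the image-plane feature error
$$ e = f^{*} - f \qquad (4) $$
where f is a set of detected image-plane feature coordinates and $f^{*}$ is a set of corresponding desired image-plane feature coordinates.
The 2-dimensional image-plane velocity of a pixel is related to the 3-dimensional velocity of the camera (rigidly attached to the end-effector) by an image Jacobian (also known as an interaction matrix) [2]
$$ \dot{p} = \mathbf{J}_{p}(p, Z)\, \nu \qquad (5) $$
where $\nu$ is the camera spatial velocity, and $\mathbf{J}_{p}$ is the image Jacobian
$$ \mathbf{J}_{p} = \begin{pmatrix} -\dfrac{\hat{f}}{Z} & 0 & \dfrac{\bar{u}}{Z} & \dfrac{\bar{u}\bar{v}}{\hat{f}} & -\dfrac{\hat{f}^{2} + \bar{u}^{2}}{\hat{f}} & \bar{v} \\ 0 & -\dfrac{\hat{f}}{Z} & \dfrac{\bar{v}}{Z} & \dfrac{\hat{f}^{2} + \bar{v}^{2}}{\hat{f}} & -\dfrac{\bar{u}\bar{v}}{\hat{f}} & -\bar{u} \end{pmatrix} \qquad (6) $$
where $\bar{u}$ and $\bar{v}$ are the feature coordinates relative to the principal point, extracted from $\mathcal{F}$ in (3), $\hat{f}$ is the focal length in pixels, and Z is updated based on forward kinematics. If at least three features are matched we can solve for the end-effector spatial velocity
$$ {}^{e}\nu = \lambda\, {}^{e}\mathbf{J}_{c}\, \mathbf{J}^{+} e \qquad (7) $$
where $\lambda$ is the gain of the controller, $\mathbf{J}^{+}$ is the Moore-Penrose pseudo inverse of the stacked image Jacobian from (6), ${}^{e}\mathbf{J}_{c}$ is a Jacobian which transforms the spatial velocity of the camera to the end-effector, and e is the error vector from (4).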
The depth values used in (6) can be estimated during the IBVS motion [30], [31] but [2] shows that small errors in depth will have a negligible effect on the performance of the controller. In this work, we fix the depth value Z in (5) at 5 cm.
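A single IBVS control step with a fixed depth estimate can be sketched as below. The sketch uses the standard point-feature image Jacobian in pixel units and returns the camera spatial velocity only; the transformation to the end-effector frame is omitted. The intrinsics `f`, `u0`, `v0`, the gain value and the function names are assumptions for illustration.

```python
import numpy as np


def point_jacobian(u, v, Z, f, u0, v0):
    """2x6 image Jacobian of one point feature (f in pixels, Z in metres)."""
    ub, vb = u - u0, v - v0  # coordinates relative to the principal point
    return np.array([
        [-f / Z, 0.0, ub / Z, ub * vb / f, -(f**2 + ub**2) / f, vb],
        [0.0, -f / Z, vb / Z, (f**2 + vb**2) / f, -ub * vb / f, -ub],
    ])


def ibvs_velocity(feats, goals, f, u0, v0, Z=0.05, gain=1.0):
    """Camera velocity (vx, vy, vz, wx, wy, wz) that drives feats toward goals."""
    feats = np.asarray(feats, dtype=float)
    goals = np.asarray(goals, dtype=float)
    # Stack one 2x6 Jacobian per feature, all evaluated at the same fixed depth.
    J = np.vstack([point_jacobian(u, v, Z, f, u0, v0) for u, v in feats])
    error = (goals - feats).ravel()
    return gain * np.linalg.pinv(J) @ error
```

With four features arranged in a square around the principal point and goals that are the same square scaled up by 10%, the computed velocity is a pure translation along the optical axis, as expected for a camera that must move closer to the object.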
2) Goal feature prediction: The IBVS controller requires the image-plane coordinates of the tracked features when the camera is at the grasping pose. We use the feature data and end-effector pose recorded from the last frame where depth information is available to estimate this. We compute the Cartesian coordinates of the SIFT features using (1), transform them to the camera pose when the end-effector is at the synthesized grasp pose, then re-project them to the image plane.
3) Robust feature matching: For each subsequent image we compute (3), without the depth, and attempt to robustly match the features to the stored reference features. We use a hierarchy of checks to ensure robust matching:
1) The distance ratio test outlined in [16].
2) Duplicate feature removal. SIFT can produce duplicate features with different scale and orientation, so we remove any feature within 5 pixels of another match (the higher quality match is retained).
3) The loop constraint [21].
4) The fundamental matrix is calculated using RANSAC [22]. This produces lists of inlier and outlier matches, of which only the inlier matches are retained.
5) A 20 × 20 grid is placed over the image, and a maximum of one matched feature is kept per grid cell. This ensures that the Jacobian in (5) is well conditioned and that feature points are well spread across the image.
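The goal feature prediction step (back-project, transform, re-project) can be sketched as follows. This is an illustrative version under stated assumptions: an ideal pinhole camera with a 3×3 intrinsic matrix `K`, a 4×4 homogeneous transform `T_c_cgoal` giving the goal camera pose relative to the camera pose at capture time, and an object that is stationary between capture and prediction. The names are hypothetical.

```python
import numpy as np


def predict_goal_features(feats, depths, K, T_c_cgoal):
    """Predict pixel coordinates of features as seen from the goal camera pose.

    feats:     (n, 2) pixel coordinates in the last frame with valid depth
    depths:    (n,) depth of each feature in that frame, in metres
    K:         3x3 camera intrinsic matrix
    T_c_cgoal: 4x4 pose of the goal camera frame relative to the current one
    """
    feats = np.asarray(feats, dtype=float)
    depths = np.asarray(depths, dtype=float)
    # Back-project each pixel to a 3D point in the current camera frame.
    pix_h = np.column_stack([feats, np.ones(len(feats))])
    points = (np.linalg.inv(K) @ pix_h.T) * depths
    # Express the same points in the goal camera frame.
    points_h = np.vstack([points, np.ones(points.shape[1])])
    points_goal = (np.linalg.inv(T_c_cgoal) @ points_h)[:3]
    # Re-project onto the image plane of the goal camera.
    proj = K @ points_goal
    return (proj[:2] / proj[2]).T
```

The returned coordinates play the role of the desired feature set in the IBVS error; object motion after the prediction is then handled by the feedback loop rather than by the prediction itself.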
C. Switching VS Control Scheme
The RealSense D415 camera has a rated minimum sensing distance of 16 cm, which agrees with our experience. Violation of the minimum distance results in depth values of NaN. Rather than count NaNs in the image and choose a threshold, we adopt a simple and conservative strategy that deems range data invalid when the object is sensed to be within 25 cm of the camera.
Our controller uses PBVS when range data is available and IBVS when it is not. This allows us to exploit the benefits of each, while avoiding their major shortcomings. PBVS provides an optimal Cartesian path to the goal but is prone to errors introduced from camera calibration and robot odometry. This makes PBVS best suited to getting the robot close to the goal. IBVS is very robust to sensor error but may produce sub-optimal Cartesian paths and is best utilized in the final approach. Figure 2 demonstrates this approach. A simple filter ensures continuity of velocity at the transition.
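The switching logic can be sketched as a small state machine. The 25 cm threshold comes from the text; everything else is an assumption made for illustration, in particular the one-way latch into IBVS, the first-order low-pass filter used to smooth the commanded velocity, and the smoothing factor `alpha`. The paper does not specify the form of its filter.

```python
import numpy as np

DEPTH_SWITCH_M = 0.25  # range below which depth data is treated as invalid


class SwitchingController:
    """Select the PBVS or IBVS velocity and smooth the commanded output."""

    def __init__(self, alpha: float = 0.3):
        self.alpha = alpha       # smoothing factor (hypothetical value)
        self.use_ibvs = False    # latched once the switch has happened
        self.v_prev = np.zeros(6)

    def step(self, object_range_m, v_pbvs, v_ibvs):
        # Latch into IBVS the first time the object is sensed inside the threshold.
        if not self.use_ibvs and object_range_m < DEPTH_SWITCH_M:
            self.use_ibvs = True
        v_raw = np.asarray(v_ibvs if self.use_ibvs else v_pbvs, dtype=float)
        # First-order low-pass filter: avoids a velocity step at the transition.
        self.v_prev = self.alpha * v_raw + (1.0 - self.alpha) * self.v_prev
        return self.v_prev
```

Because the switch is latched, a NaN range reading after the transition does not return the controller to PBVS.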
D. Grasping
When the error calculated in (4) becomes sufficiently small, the servoing is considered complete and a grasp can be attempted. This is completed by instructing the fingers of the robot’s grippers to close.
We validate and evaluate our approach through testing on an arm-type robot. Our approach is realized using ROS middleware and primarily Python code. Our experiments first seek to validate that the approach works given ideal conditions, before evaluating the robustness and consistency on repeated grasping trials.
Fig. 3: (a) The target features printed on A4 paper. (b) The robot’s position before attempting to reach the goal. (c) The robot in the goal state.
Fig. 4: (a) The 8 household objects which are to be grasped. (b) The robot’s position before attempting a grasp.
Fig. 5: Experiment 1: Target Feature Error in Static and Dynamic Test with Known Features. Horizontal axis: Time (s).
Fig. 6: Experiment 2: Target Feature Error in a Static and Dynamic Grasping Trial. Horizontal axis: Time (s).
A. Equipment
As shown in Figure 1, all experiments are performed using a Franka-Emika Panda robot, equipped with 3D-printed grippers using a design from [32]. We use an Intel RealSense D415 camera to provide RGB and depth information, which is mounted to the robot’s end-effector.
B. Experiment 1: Predicting Target Feature Configuration
We first validate our target feature configuration predictor through an IBVS controller with both static and dynamic targets. These tests seek to demonstrate that the target feature configuration can be predicted using only initial depth information, to enable closed-loop control.
Fig. 7: Experiment 1: Test with Binary Features (a) Photo of Initial and Desired Feature Configuration from the Initial Camera POV. (b) Actual Target Feature Trajectory in the Image Plane.
To remove any potential unreliability due to feature matching we print a simple planar target with blobs of different sizes and shapes which we analyze with classical binary vision techniques, see Figure 3a. The goal of the robot is to place its end-effector at the reference pose of this target, see Figure 3c.
In the static test, the robot is initialized 40 cm above the table, see Figure 3b, with the target having a random pose and located within the camera’s field of view. In the dynamic tests, the target is moved by hand in a random translational motion such that it remains visible to the camera.
C. Experiment 2: Grasping Trials
In this experiment we evaluate our switching visual servoing scheme on grasping tasks with common household objects, using SIFT features. We perform dynamic grasping trials where the object on the table is moved in a random fashion while the robot attempts to grasp it. The target objects are displayed in Figure 4a and vary in size and grasp difficulty. In each test, one of the objects is placed randomly within a 30 × 30 cm zone on the table. The starting configuration of the robot with an object is shown in Figure 4b. The robot then attempts to grasp the object ten times while the object is being moved.
The results from the Experiment 1 static test show that, given perfect correspondence between features, the approach will allow the robot to achieve the goal. Figure 5 shows the average feature error between each feature point and the desired location of that feature point in the image plane. Figure 7a displays the initial view of the features from the camera’s point of view, the desired feature configuration as predicted by our algorithm and the ideal IBVS path to be followed. Figure 7b displays the feature path actually taken on the image plane. The paths are close to straight but deviation is due to the fixed depth value used to compute the image Jacobian.
The results from the Experiment 1 dynamic test verify that the controller can operate with dynamic scenes. Figure 5 displays the target feature error. We observe large upward spikes when the target is moved in the scene but the controller continues to drive the features to their goal configuration.
TABLE I: Experimental Results From Grasping Trials
Results for Experiment 2 dynamic grasping are shown in Figure 6, which displays the feature-point error in a static and a dynamic object grasp attempt. We observe that there is more noise present in the system when using SIFT features compared to the simple binary features of Figure 5; however, the robot consistently reaches its goal.
Summary results from the repeated dynamic grasping trials are displayed in Table I. The results in rows 1 and 2 are taken directly from [1] and highlight the issue that a closed-loop grasper has for the case of moving objects due to the camera minimum sensing distance.
In contrast, our switching controller (row 3) shows a significant increase in performance for closed-loop grasping of moving objects, with a grasping success rate of 76.25%. The performance of our system was stronger on larger objects and objects which had many unique SIFT features. This is expected since our approach relies on numerous unique features being present in the scene, with enough of them remaining visible to the camera at the grasp point. While this limits the types of scenes and objects our approach will reliably work on, it could be mitigated through alternative or additional features such as lines, object shapes, or image moments.
The main failure mode was with objects that moved fast and left the camera’s field of view. This could be mitigated by incorporating an object velocity estimator and feed-forward control. Other failure modes included blurry images due to extreme object velocity, and some weakness in the use of SIFT features such as stability of feature position and descriptor over very large changes of scale.
Robotic grasping controllers that are not closed-loop will fail to grasp moving objects. Closed-loop grasp controllers based on RGB-D imagery, such as [1], can track a moving object, but fail when the sensor’s minimum object distance is violated just before grasping.
This paper has shown how image-based visual servoing can improve the performance of closed-loop RGB-D-based grasping algorithms for the case of moving objects. We achieve this by servoing toward image-plane goal features that are predicted from a depth image, robot encoder-based pose, and the grasp synthesiser’s goal pose. This is quite different to most previous IBVS work where the goal feature configuration is assumed to be known. Using IBVS in this way retains all the advantages of the RGB-D-based grasp synthesizer such as not requiring a model of the object being grasped.
We have demonstrated the robustness of this new approach in the context of dynamic closed-loop grasping and shown a greatly improved grasp success rate.
[1] D. Morrison, J. Leitner, and P. Corke, “Closing the loop for robotic grasping: A real-time, generative grasp synthesis approach,” in Robotics: Science and Systems XIV. Robotics: Science and Systems Foundation, June 2018.
[2] S. Hutchinson, G. D. Hager, and P. I. Corke, “A tutorial on visual servo control,” IEEE Transactions on Robotics and Automation, vol. 12, no. 5, pp. 651–670, Oct 1996.
[3] B. Espiau, F. Chaumette, and P. Rives, “A new approach to visual servoing in robotics,” IEEE Transactions on Robotics and Automation, vol. 8, no. 3, pp. 313–326, June 1992.
[4] F. Chaumette, “Image moments: a general and useful set of features for visual servoing,” IEEE Transactions on Robotics, vol. 20, no. 4, pp. 713–723, Aug 2004.
[5] F. Chaumette and S. Hutchinson, “Visual servo control. II. Advanced approaches [Tutorial],” IEEE Robotics Automation Magazine, vol. 14, no. 1, pp. 109–118, March 2007.
[6] ——, “Visual servo control. I. Basic approaches,” IEEE Robotics Automation Magazine, vol. 13, no. 4, pp. 82–90, Dec 2006.
[7] A. Bicchi and V. Kumar, “Robotic grasping and contact: a review,” in Proceedings 2000 ICRA. Millennium Conference. IEEE International Conference on Robotics and Automation. Symposia Proceedings (Cat. No.00CH37065), vol. 1, April 2000, pp. 348–353.
[8] K. Shimoga, “Robot Grasp Synthesis Algorithms: A Survey,” The International Journal of Robotics Research, vol. 15, no. 3, pp. 230–266, 1996. [Online]. Available: https://doi.org/10.1177/027836499601500302
[9] E. Johns, S. Leutenegger, and A. J. Davison, “Deep learning a grasp function for grasping under gripper pose uncertainty,” in 2016 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Oct 2016, pp. 4461–4468.
[10] I. Lenz, H. Lee, and A. Saxena, “Deep learning for detecting robotic grasps,” The International Journal of Robotics Research, vol. 34, no. 4-5, pp. 705–724, 2015. [Online]. Available: https://doi.org/10.1177/0278364914549607
[11] J. Mahler, J. Liang, S. Niyaz, M. Laskey, R. Doan, X. Liu, J. Aparicio, and K. Goldberg, “Dex-net 2.0: Deep learning to plan robust grasps with synthetic point clouds and analytic grasp metrics,” in Robotics: Science and Systems XIII. Robotics: Science and Systems Foundation, July 2017.
[12] D. Fofi, T. Sliwa, and Y. Voisin, “A comparative survey on invisible structured light,” in Machine vision applications in industrial inspection XII, vol. 5303. International Society for Optics and Photonics, 2004, pp. 90–98.
[13] U. R. Dhond and J. K. Aggarwal, “Structure from stereo-a review,” IEEE Transactions on Systems, Man, and Cybernetics, vol. 19, no. 6, pp. 1489–1510, Nov 1989.
[14] R. Hartley and A. Zisserman, Multiple View Geometry in Computer Vision, 2nd ed. New York, NY, USA: Cambridge University Press, 2003.
[15] G. J. Iddan and G. Yahav, “Three-dimensional imaging in the studio and elsewhere,” in Three-Dimensional Image Capture and Applications IV, B. D. Corner, J. H. Nurre, and R. P. Pargas, Eds., vol. 4298, International Society for Optics and Photonics. SPIE, 2001, pp. 48–55. [Online]. Available: https://doi.org/10.1117/12.424913
[16] D. G. Lowe, “Distinctive Image Features from Scale-Invariant Keypoints,” International Journal of Computer Vision, vol. 60, no. 2, pp. 91–110, Nov 2004.
[17] H. Bay, T. Tuytelaars, and L. Van Gool, “SURF: Speeded Up Robust Features,” in Computer Vision – ECCV 2006, A. Leonardis, H. Bischof, and A. Pinz, Eds. Berlin, Heidelberg: Springer Berlin Heidelberg, 2006, pp. 404–417.
[18] E. Rublee, V. Rabaud, K. Konolige, and G. Bradski, “ORB: An efficient alternative to SIFT or SURF,” in 2011 International Conference on Computer Vision, Nov 2011, pp. 2564–2571.
[19] J. Matas, O. Chum, M. Urban, and T. Pajdla, “Robust wide baseline stereo from maximally stable extremal regions,” Image and Vision Computing, vol. 22, pp. 761–767, 09 2004.
[20] K. Mikolajczyk and C. Schmid, “A performance evaluation of local descriptors,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 27, no. 10, pp. 1615–1630, Oct. 2005.
[21] C. Zach, M. Klopschitz, and M. Pollefeys, “Disambiguating visual relations using loop constraints,” in 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition. IEEE, 2010, pp. 1426–1433.
[22] M. A. Fischler and R. C. Bolles, “Random sample consensus: A paradigm for model fitting with applications to image analysis and automated cartography,” Commun. ACM, vol. 24, pp. 381–395, 1981.
[23] A. Maxim, C. Lazar, A. Burlacu, and C. Copot, “Robotic visual servoing system based on SIFT features,” in 2012 16th International Conference on System Theory, Control and Computing (ICSTCC), Oct 2012, pp. 1–6.
[24] J. Xin, B. Ran, and X. Ma, “Robot visual sliding mode servoing using SIFT features,” in 2016 35th Chinese Control Conference (CCC), July 2016, pp. 4723–4729.
[25] H. Lang, Y. Wang, and C. W. de Silva, “Vision based object identification and tracking for mobile robot visual servo control,” in IEEE ICCA 2010, June 2010, pp. 92–96.
[26] F. Liefhebber and J. Sijs, “Vision-based control of the Manus using SIFT,” in 2007 IEEE 10th International Conference on Rehabilitation Robotics, June 2007, pp. 854–861.
[27] H. Zhan, R. Garg, C. Saroj Weerasekera, K. Li, H. Agarwal, and I. Reid, “Unsupervised learning of monocular depth estimation and visual odometry with deep feature reconstruction,” in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.
[28] Y. Kuznietsov, J. Stuckler, and B. Leibe, “Semi-supervised deep learning for monocular depth map prediction,” in 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, July 2017, pp. 2215–2223.
[29] P. Corke, Robotics, Vision and Control, 2nd ed. Springer International Publishing, 2017.
[30] J. T. Feddema and O. R. Mitchell, “Vision-guided servoing with feature-based trajectory generation (for robots),” IEEE Transactions on Robotics and Automation, vol. 5, no. 5, pp. 691–700, Oct 1989.
[31] N. P. Papanikolopoulos and P. K. Khosla, “Adaptive robotic visual tracking: theory and experiments,” IEEE Transactions on Automatic Control, vol. 38, no. 3, pp. 429–445, March 1993.
[32] M. Guo, D. V. Gealy, J. Liang, J. Mahler, A. Goncalves, S. McKinley, J. A. Ojea, and K. Goldberg, “Design of parallel-jaw gripper tip surfaces for robust grasping,” in 2017 IEEE International Conference on Robotics and Automation (ICRA), May 2017, pp. 2831–2838.