nine different target containers for robot experiments. The first five rows in Table I list the properties of these nine containers, where containers 1-3 are included in the training dataset while 4-9 are only used for testing. We kept the distance between the target containers and the microphone the same as in our original dataset. Then we set the desired length of the air column to 60 mm. We synthesized an audio signal by playing the PR2 robot noise from a pair of loudspeakers located at positions 1&1 as shown in Fig. 10(a). Here, 1&1 means that both loudspeakers were placed at position 1. We repeated the experiments five times for each container.
Fig. 8(a) shows that the absolute mean errors of the liquid height are below 8 mm and the standard deviations are below 4 mm for both MP-Net and AP-Net on the known target containers. Both networks perform best on container 3, whose stainless steel material produces the crispest sound. Furthermore, we converted the height error of each cup to a weight error, shown in rows seven and eight of Table I. We can see that AP-Net performs well on known containers and even outperforms MP-Net a
TABLE I THE PROPERTIES OF NINE TARGET CONTAINERS AND CONTROLLING RESULTS OF POURING WATER INTO A DESIRED LENGTH OF THE AIR COLUMN.
Fig. 8. Robot experiment results (a) on pouring water into different target containers (desired air column length 60 mm; ID 1-3 are in the training dataset, ID 4-9 are novel), (b) on four different pouring heights (a pouring height of 310 mm was used for dataset collection), and (c) on one known source container S1 and two novel source containers S2, S3. In these robot experiments, the noise level was set to 5 dB.
little on containers 2 and 3. On the unseen containers, especially containers 7 and 9, MP-Net exhibits a stronger generalization ability than AP-Net in a noisy environment.
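The conversion from a height error to a weight error can be illustrated with a minimal sketch. It assumes a perfectly cylindrical container filled with water; the inner radius of 30 mm is a hypothetical value chosen for illustration and is not taken from Table I, and the function name is our own.

```python
import math

# Density of water expressed in grams per cubic millimetre (1 g/cm^3).
WATER_DENSITY_G_PER_MM3 = 1e-3


def height_error_to_weight_error(height_error_mm, inner_radius_mm,
                                 density_g_per_mm3=WATER_DENSITY_G_PER_MM3):
    """Return the weight (g) of a liquid slab of the given thickness.

    Assumption: the container is a cylinder, so the slab volume is
    pi * r^2 * height_error.
    """
    slab_volume_mm3 = math.pi * inner_radius_mm ** 2 * height_error_mm
    return density_g_per_mm3 * slab_volume_mm3


# Hypothetical example: an 8 mm height error in a cup with a 30 mm inner radius.
weight_error_g = height_error_to_weight_error(8.0, 30.0)
print(f"{weight_error_g:.2f} g")
```

For containers whose radius varies with height, the same idea would require integrating the cross-section over the error interval rather than using a single radius.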
In this experiment, we placed the source container at four different heights, [310, 260, 210, 160] mm. The height of the source container is the vertical distance from the mouth center of the source container to the scale plane. We carried out five robot experiments at each height using target container 2 at a noise level of 5 dB. Fig. 8(b) indicates that our algorithm performs well for the three higher heights but not at the lowest height. As the pouring height decreases, the volume of the pouring sound also decreases. With the source container at a height of 160 mm, the pouring sound is too quiet, and haptics alone is not sufficient for the network to perceive the height of the air column.
container influences the flow rate and the initial force/torque data. Therefore, we tested two novel source containers, S2 (44 g) and S3 (484 g), which are shown on the left side of Fig. 2, to compare with the container S1 (64 g) used for network training and all other experiments. We kept the pouring height at 310 mm, and the rest of the experimental setup was the same as in the evaluation of the different pouring heights. Fig. 8(c) suggests that different source containers hardly affect network performance.
To verify the performance of our MP-Net model in different
Fig. 9. Robot experiment results of the performance of MP-Net, AP-Net, and AP-Net* in environments with different noise levels (in dB). We evaluate five different target liquid heights, and the results are shown in five different colors. The dashed lines represent the desired lengths of the air column, while the solid dots (with error bars) show the actual ones when the pouring terminates.
noise conditions, we implemented a set of experiments using target container 2 under six different noise levels of [-5, 0, 5, 10, 15, 20] dB. The loudspeakers were again located at positions 1&1. For each noise level, we tested five different target lengths of the air column, namely [40, 50, 60, 70, 80] mm. We carried out five robot experiments for each combination of noise level and target length.
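The size of this evaluation grid can be made explicit with a short sketch. The noise levels, target lengths, and repetition count are taken from the text; the dictionary keys and the function name are our own illustrative choices.

```python
from itertools import product

NOISE_LEVELS_DB = [-5, 0, 5, 10, 15, 20]
TARGET_LENGTHS_MM = [40, 50, 60, 70, 80]
TRIALS_PER_CONDITION = 5


def build_trial_plan():
    """Enumerate every (noise level, target length, trial index) combination."""
    return [
        {"noise_level_db": noise, "target_length_mm": target, "trial": trial}
        for noise, target, trial in product(
            NOISE_LEVELS_DB,
            TARGET_LENGTHS_MM,
            range(1, TRIALS_PER_CONDITION + 1),
        )
    ]


plan = build_trial_plan()
print(len(plan))  # 6 noise levels x 5 target lengths x 5 trials
```

Under the assumption that every condition is run for a given network, this amounts to 150 pours per network.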
As visualized in Fig. 9, MP-Net has a substantial advantage over AP-Net when , which further indicates the advantages of multimodal fusion. In this experiment, we also tested AP-Net* under different noise conditions. AP-Net* performed well while
, but when
, the robot either stopped pouring immediately or overfilled the target containers. Therefore we did not list the experiment results of AP-Net* tested on audio with
.
To assess whether MP-Net is sensitive to the direction of the noise source, we set up six different position combinations [1&1, 2&2, 3&3, 4&4, 2&4, 1&3] of the two loudspeakers.
The two loudspeakers played a synthetic noise signal at each position. We used the target container 2 and a desired air column length of 40 mm. The other experimental setup was the same as in the evaluations of the different target containers. Then at each combined position of the two loudspeakers, we poured water five times. As shown in Fig. 10(b), when the loudspeakers are at position 1&1, both models perform best as the loudspeakers are behind the microphone. MP-Net generalizes better than AP-Net to
Fig. 10. (a) Schematic diagram of 4 different loudspeaker positions relative to the target containers, the UR5 robot position, and the control box of the UR5 robot. Evaluation results of (b) six combinations of two loudspeaker positions, (c) varying initial heights of the source container, (d) different types of liquids and (e) different types of noise sources. In these experiments, the noise level was set to 5 dB and the target height was set to 40 mm.
the different positions of the loudspeakers, as indicated by its consistently lower mean height error across all tested positions.
Containers: In this experiment, we poured water into target container 2, starting from five different initial liquid heights [0, 10, 20, 30, 40] mm. We ran five trials from each initial level. We placed both loudspeakers at position 1&1 and kept the rest of the test setup the same as in the evaluations of the varying direction of noise sources. The results in Fig. 10(c) demonstrate that MP-Net is again more robust than AP-Net. Force and torque data yield a meaningful indication of how much water was poured out.
pouring experiments with different liquids: pure water, orange juice and 1.8% fat milk. We used the same experimental setting as in the evaluations of different microphone positions and poured each type of liquid five times. As shown in Fig. 10(d), MP-Net can generalize to common household liquids like water and orange juice, while AP-Net cannot handle the task of pouring orange juice well at a noise level of 5 dB. However, similar to [6], due to the high viscosity of milk, neither model can generate correct height predictions.
We also assessed our model with three noise types: PR2 robot noise, human voices, and a continuous piece of piano music. The human voice is represented by discrete sounds of a man counting numbers in English. We poured water under each type of noise five times. All experimental settings were the same as in the evaluations of different types of liquid. Fig. 10(e) shows that MP-Net is not affected by different noise types, but the accuracy of AP-Net fluctuates slightly under a musical disturbance.
C. Shape Prediction of Target Containers
In this section, we applied MP-Net to predict the shape of symmetric target containers. In this case, the edge profile is sufficient to describe the shape of the containers, which is determined by the correlation between height and radius [29]. Fig. 11(a) shows a volume profile filled with liquid of density $\rho$, where $\Delta V$, $\Delta m$, and $\Delta h$ are the poured liquid volume, and the weight and liquid height differences during a time interval $\Delta t$, respectively. Assuming that $\Delta t$ is very small, $\Delta V$ can be calculated by approximating the shape of $\Delta V$ as a cylinder,
$$\Delta V = \frac{\Delta m}{\rho} = \pi r^2 \Delta h.$$
We can determine $r$ by the $\Delta m$ values from the force/torque sensor and $\Delta h$ through our neural network output. In the robot experiments, the frequency of $\Delta m$ and $\Delta h$ was 500 Hz and 12 Hz, respectively. To get a smooth and accurate estimation of the container shape, we used a quadratic function to fit the scatter points,
$$r(h) = a h^2 + b h + c,$$
where $a$, $b$, and $c$ are the fitted coefficients.
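The radius estimation described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name, the linear interpolation used to align the 500 Hz mass signal with the 12 Hz height signal, and the filtering of non-increasing height steps are our own choices, and the inputs are assumed to be the liquid mass in kg and the liquid height in m.

```python
import numpy as np

WATER_DENSITY_KG_PER_M3 = 1000.0


def estimate_edge_profile(t_mass, mass_kg, t_height, height_m,
                          density=WATER_DENSITY_KG_PER_M3):
    """Estimate radius samples and a quadratic edge profile r(h).

    Each pair of consecutive height samples is treated as a thin cylinder:
    delta_m / density = pi * r^2 * delta_h.
    """
    # Resample the fast mass signal onto the slower height timestamps.
    mass_at_height_times = np.interp(t_height, t_mass, mass_kg)
    delta_mass = np.diff(mass_at_height_times)
    delta_height = np.diff(height_m)

    # Keep only steps where both mass and height actually increased.
    valid = (delta_height > 1e-6) & (delta_mass > 0)
    radius = np.sqrt(delta_mass[valid] / (density * np.pi * delta_height[valid]))
    mid_height = (0.5 * (height_m[:-1] + height_m[1:]))[valid]

    # Quadratic least-squares fit; coefficients are ordered [a, b, c].
    coeffs = np.polyfit(mid_height, radius, deg=2)
    return mid_height, radius, coeffs


# Synthetic check: a cylinder of radius 0.03 m filled at a constant rate.
true_radius = 0.03
t_mass = np.arange(5000) / 500.0   # 10 s of mass samples at 500 Hz
t_height = np.arange(120) / 12.0   # 10 s of height samples at 12 Hz
height = 0.008 * t_height
mass = WATER_DENSITY_KG_PER_M3 * np.pi * true_radius ** 2 * (0.008 * t_mass)

mid_h, radii, coeffs = estimate_edge_profile(t_mass, mass, t_height, height)
fitted_radius = np.polyval(coeffs, 0.04)
```

With real sensor data, both signals are noisy, so some smoothing before differencing would likely be needed; the validity mask above only guards against division by zero and negative values.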
For target containers 1, 2, 4, 5, 6, and 7, we conducted five trials of the experiment in which the robot pours into these target containers. We recorded the real-time estimation of the liquid height and the force data into a rosbag. When the target container was filled to about 90% of its total height, we stopped the pouring and the recording. Using the data from these rosbags, we calculated the edge profiles of the target containers. In Fig. 11(b), the thick black curves are the ground truth profiles, while the five colored curves around the black curves depict the experimental results. The magenta area in the middle of each target container visualizes the mean error of the radius prediction. As expected, the mean radius estimation error is highest for an empty container, when our recurrent network cannot yet rely on its memory, but it stabilizes as the liquid level rises. Due to the restriction to quadratic functions, the reconstruction works best for containers with low edge curvature (such as containers 1 and 7).
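The per-height mean radius error over several trials could be computed along the following lines. This is a sketch under assumptions: it takes the error to be the mean absolute difference between the fitted quadratic profiles and the ground truth, which the text does not specify, and the function name and example values are hypothetical.

```python
import numpy as np


def mean_radius_error(trial_coeffs, true_radius_fn, heights):
    """Mean absolute radius error at each height, averaged over trials.

    trial_coeffs: one [a, b, c] coefficient triple per trial.
    true_radius_fn: callable mapping heights to ground-truth radii.
    """
    estimates = np.array([np.polyval(c, heights) for c in trial_coeffs])
    truth = true_radius_fn(heights)
    return np.mean(np.abs(estimates - truth), axis=0)


# Hypothetical example: two trials around a true constant radius of 0.03 m.
heights = np.linspace(0.0, 0.08, 5)
trial_coeffs = [[0.0, 0.0, 0.031], [0.0, 0.0, 0.029]]
errors = mean_radius_error(trial_coeffs,
                           lambda h: np.full_like(h, 0.03),
                           heights)
```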
In this paper, we motivate the need for combining audio and haptic information for robot pouring tasks. We recorded a robot pouring dataset that includes 300 complete robot pouring sequences with audio and force/torque data. We propose a novel audio-haptic recurrent deep network (MP-Net) trained on this dataset that predicts liquid height in real time. The multimodal perception system is systematically tested against four baselines and in a wide range of robotic pouring experiments in a noisy environment. The results substantiate that MP-Net is robust against noise and against changes in tasks and environments.
Fig. 11. (a, left) Schematic diagram of a symmetric container. In a specified time interval $\Delta t$, the change in mass $\Delta m$ can be determined by the F/T sensor and the change in height $\Delta h$ can be derived through MP-Net. Then the radius $r$ at each height can be calculated to form an edge profile of this container. (b, right) Prediction result of estimating the target container shape. The black curve is the ground truth, while the five different colored curves are the estimated target container shapes in five trials. The mean error of the estimated container radius at different heights is plotted in the middle of each subplot (the shaded magenta area).
Finally, the multimodal nature of our network lets us reconstruct the shape of the target container. The dataset and associated software are public and are available at https://lianghongzhuo.github.io/MultimodalPouring.
One surprising limitation of our approach is the poor generalization to liquids like milk or fruit juices, which would be considered quite similar to water by many humans, while the pouring noises are actually quite different. Training on different liquid types would improve network performance, but MP-Net will still fail in situations where the auditory signal is too weak. Another issue is our use of raw force/torque data as the network input, which changes significantly for different grasp types and pouring motions. This could be resolved by training on many grasps, or simply by feeding preprocessed weight data into the network.
For future work, using audio and haptic information for dynamic control of robotic pouring would be an exciting research direction.
[1] C. Schenck and D. Fox, “Perceiving and reasoning about liquids using fully convolutional networks,” The Int. Journal of Robotics Research (IJRR), pp. 452–471, 2018.
[2] R. Mottaghi, C. Schenck, D. Fox, and A. Farhadi, “See the glass half full: Reasoning about liquid containers, their volume and content,” in IEEE Int. Conf. on Computer Vision (ICCV), 2017, pp. 1871–1880.
[3] C. Schenck and D. Fox, “Reasoning about liquids via closed-loop simulation,” in Robotics: Science and Systems (RSS), 2017.
[4] S. Clarke, T. Rhodes, C. G. Atkeson, and O. Kroemer, “Learning audio feedback for estimating amount and flow of granular material,” in Proc. of The 2nd Conf. on Robot Learning, 2018, pp. 529–550.
[5] J. Wilson, A. Sterling, and M. C. Lin, “Analyzing liquid pouring sequences via audio-visual neural networks,” in IEEE Int. Conf. on Intelligent Robots and Systems (IROS), 2019, pp. 7696–7703.
[6] H. Liang, S. Li, X. Ma, N. Hendrich, T. Gerkmann, F. Sun, and J. Zhang, “Making sense of audio vibration for liquid height estimation in robotic pouring,” in IEEE Int. Conf. on Intelligent Robots and Systems (IROS), 2019, pp. 5333–5339.
[7] Y. Huang and Y. Sun, “Learning to pour,” in IEEE Int. Conf. on Intelligent Robots and Systems (IROS), 2017, pp. 7005–7010.
[8] S. Ikeno, R. Watanabe, R. Okazaki, T. Hachisu, M. Sato, and H. Kajimoto, “Change in the amount poured as a result of vibration when pouring a liquid,” in Haptic Interaction. Springer, 2015, pp. 7–11.
[9] C. Weadon, “Pouring in the dark,” Future Reflections, vol. 10, no. 3, 1991.
[10] K. J. Pithadiya, C. K. Modi, and J. D. Chauhan, “Selecting the most favourable edge detection technique for liquid level inspection in bottles,” Int. Journal of Computer Information Systems and Industrial Management Applications (IJCISIM), pp. 2150–7988, 2011.
[11] C. Schenck and D. Fox, “Visual closed-loop control for pouring liquids,” in IEEE Int. Conf. on Robotics and Automat. (ICRA), 2017, pp. 2629–2636.
[12] C. Dong, M. Takizawa, S. Kudoh, and T. Suehiro, “Precision pouring into unknown containers by service robots,” in IEEE Int. Conf. on Intelligent Robots and Systems (IROS), 2019, pp. 5875–5882.
[13] C. Do and W. Burgard, “Accurate pouring with an autonomous robot using an RGB-D camera,” in Intelligent Autonomous Systems (IAS), 2018, pp. 210–221.
[14] C. Do, C. Gordillo, and W. Burgard, “Learning to pour using deep deterministic policy gradients,” in IEEE Int. Conf. on Intelligent Robots and Systems (IROS), 2018, pp. 3074–3079.
[15] P. Sermanet, K. Xu, and S. Levine, “Unsupervised perceptual rewards for imitation learning,” arXiv preprint arXiv:1612.06699, 2016.
[16] S. Griffith, V. Sukhoy, T. Wegter, and A. Stoytchev, “Object categorization in the sink: Learning behavior–grounded object categories with water,” in ICRA Workshop on Semantic Perception, Mapping and Exploration, 2012.
[17] Y. Huang and Y. Sun, “A dataset of daily interactive manipulation,” The Int. Journal of Robotics Research (IJRR), pp. 879–886, 2019.
[18] L. Rozo, P. Jiménez, and C. Torras, “Force-based robot learning of pouring skills using parametric hidden Markov models,” in IEEE Int. Workshop on Robot Motion and Control, 2013, pp. 227–232.
[19] H. P. Saal, J.-A. Ting, and S. Vijayakumar, “Active estimation of object dynamics parameters with tactile sensors,” in IEEE Int. Conf. on Intelligent Robots and Systems (IROS), 2010, pp. 916–921.
[20] C. Matl, R. Matthew, and R. Bajcsy, “Haptic perception of liquids enclosed in containers,” in IEEE Int. Conf. on Intelligent Robots and Systems (IROS), 2019, pp. 7142–7149.
[21] R. Sanchez-Matilla, K. Chatzilygeroudis, A. Modas, N. F. Duarte, A. Xompero, P. Frossard, A. Billard, and A. Cavallaro, “Benchmark for human-to-robot handovers of unseen containers with unknown filling,” IEEE Robotics and Automat. Lett., vol. 5, no. 2, pp. 1642–1649, 2020.
[22] C. Gan, Y. Zhang, J. Wu, B. Gong, and J. B. Tenenbaum, “Look, listen, and act: Towards audio-visual embodied navigation,” in IEEE Int. Conf. on Robotics and Automat. (ICRA), 2020.
[23] T. Wu, J. Lin, T. Wang, C. Hu, J. C. Niebles, and M. Sun, “Liquid pouring monitoring via rich sensory inputs,” in European Conf. on Computer Vision (ECCV), 2018, pp. 335–351.
[24] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural computation, vol. 9, no. 8, pp. 1735–1780, 1997.
[25] D. Park, Z. Erickson, T. Bhattacharjee, and C. C. Kemp, “Multimodal execution monitoring for anomaly detection during robot manipulation,” in IEEE Int. Conf. on Robotics and Automat. (ICRA), 2016.
[26] K. Cho, B. van Merriënboer, D. Bahdanau, and Y. Bengio, “On the properties of neural machine translation: Encoder-decoder approaches,” in 8th Workshop on Syntax, Semantics and Structure in Statistical Translation (SSST-8), 2014, pp. 103–111.
[27] A. P. French, “In vino veritas: A study of wineglass acoustics,” American Journal of Physics, pp. 688–694, 1983.
[28] E. S. Webster and C. E. Davies, “The use of Helmholtz resonance for measuring the volume of liquids and solids,” Sensors, vol. 10, no. 12, pp. 10663–10672, 2010.
[29] M. Kennedy, K. Schmeckpeper, D. Thakur, C. Jiang, V. Kumar, and K. Daniilidis, “Autonomous precision pouring from unknown containers,” IEEE Robotics and Automat. Lett., vol. 4, no. 3, pp. 2317–2324, 2019.