There is an increasing demand for task automation in robots. Contact-rich tasks, wherein multiple contact transitions occur in a series of operations, are being studied extensively to realize high accuracy. In this study, we propose a methodology that uses reinforcement learning (RL) to achieve high performance in robots executing assembly tasks that require precise contact with objects without causing damage. The proposed method generates stiffness matrices online, which improves the performance of local trajectory optimization. The method has the advantage of rapid response owing to the short sampling time of the trajectory planning. The effectiveness of the method was verified via experiments involving two contact-rich tasks. The results indicate that the proposed method can be applied to various contact-rich manipulations.
Automation of tasks involving contact is currently in great demand; for instance, robots performing assembly tasks must make contact with the objects to be assembled without damaging the parts. High-order assembly may involve multiple contact transitions in a series of operations, and such tasks are termed “contact-rich tasks”. Because the actions to be performed by the robots change at each contact transition, the trajectory planning and controller design are complex. Moreover, assembly often requires fitting parts into small gaps measuring tens of micrometers, which raises the issue of achieving a level of accuracy that exceeds the robot’s own position accuracy.
Impedance control and admittance control are known to be effective in achieving higher accuracy in the assembly process than the robot’s position accuracy. Because contact-rich tasks require further consideration of contact transitions, hybrid control was introduced, which involves recognizing the contact transitions and switching control modes. The complexity of designing control mode switching during contact transitions is an issue in contact-rich tasks, because controllers tend to be conservative as the robot motion becomes unstable due to mode switching. Moreover, accurate modeling of robots and their contact with objects is difficult. To support both autonomy and performance in assembly robots, a robust technology is required that recognizes the contact states without an accurate model and determines the input.
Reinforcement learning (RL) is a technique in which agents optimize unknown steps through interactions with the environment; it is a promising method for solving the aforementioned issues. In manual assembly, humans recognize the contact states from the sensation of touching the object based on heuristics, and provide an appropriate trajectory by considering the contact state. RL renders this process autonomous. Several studies have already been conducted on this technology. In these studies, the local trajectories were optimized by having the agent output position or force information to the controller as actions. However, owing to the nature of RL algorithms, obtaining the output of actions in short cycles (under a few tens of milliseconds) is difficult. The command output cycle is directly linked to the performance of the control system; this is a fundamental problem in robots that implement RL. The objective of this study is to maintain high control performance by outputting the position and force commands in short cycles under RL. As a solution, we propose an RL technique in which the agent outputs the stiffness matrices as actions, and demonstrate that the performance of local trajectory optimization is fundamentally improved. We verify the usefulness of the proposed method through experiments on two high-accuracy contact-rich tasks: a peg-in-hole task and a gear insertion task.
Many studies have been conducted on automating assembly tasks using deep reinforcement learning. Because learning requires physical interaction with the environment, the number of learning trials must be limited. Sim-to-real approaches reduce the number of real-world trials by first training the model in simulation and then retraining it in the real world.
However, for high-accuracy contact-rich tasks involving multipoint contacts with gaps of a few tens of micrometers, the sim-to-real technique is not a realistic option owing to the divergence of the contact phenomenon between the simulation and the real world. A technique is therefore required to acquire the motor skills to perform high-accuracy contact-rich tasks through a relatively low number of real-world trials. Inoue et al. used the long short-term memory (LSTM) scheme for learning two independent policies to perform a peg-in-hole task. Luo et al. performed the contact-rich task of assembling a gear without pre-defined heuristics through an RL technique that derived the appropriate force control command values from the position and force responses. These methods show good performance based on adaptive impedance behavior through the adjustment of position/force commands, but their sampling periods are longer than a few tens of milliseconds. This paper shows the advantage of an agent that outputs stiffness matrices as the action of the RL.
On the basis of the aforementioned background, this study makes two main contributions. First, it shows that local trajectory optimization against an external force can be achieved without changing the reference trajectory, by adjusting only the stiffness matrix. Second, it shows the advantage of selecting stiffness matrices as the action produced by the RL agent. A shorter sampling period for the admittance model generating the position command leads to higher performance in tasks with contact motion. Remote center compliance, a method that utilizes mechanical compliance, is a good candidate for optimizing the local trajectories of assembly actions without time delays. However, we apply admittance control instead, because stiffness matrices set in software can be changed easily in response to fast-changing environmental conditions and tasks. The method proposed in this paper can also be applied to techniques that use mechanical elements.
In this section, we first state the problem of this study. Subsequently, the control architecture of the admittance control is described. The important feature of this study is that the stiffness matrices are selected from the discrete action space by the RL agent. Hence, a method to design a stiffness matrix for the discrete action space is described. Then, the learning architecture with a DQN is shown after the trajectory planning algorithm.
A. Problem statement
This study deals with peg-in-hole and gear-insertion tasks as the targets for high-precision assembly. The peg-in-hole task can be divided into two phases: a search phase and an insertion phase. An additional phase, the teeth alignment phase, must be considered in the gear-insertion task.
The robot places the peg center within the clearance region of the hole center during the search phase. We use a 6-axis force-torque sensor to detect the relative relationship between the peg and the hole. Since it is difficult to obtain a precise model of the physical interaction between these two, RL has been an effective solution for detecting the relative relationship. During the insertion phase, the robot adjusts the orientation of the peg by the admittance control and pushes the peg into the hole. The peg often gets stuck in the initial stage of the insertion when the clearance is small and some error in attitude angle exists. In the gear-insertion task, the gear wheel needs to be matched to the other gear wheels by aligning the gear teeth after the insertion. This study shows that the gear aligning trajectory can be generated by local trajectory modification using a stiffness matrix.
Fig. 1. Block diagram of the admittance control
In practice there are two cases: inserting a grasped peg into a fixed hole, and inserting a fixed peg into a hole on a grasped part. Both cases are handled with the same control architecture. This paper describes the method using the first case; it can also be applied to the second case, as shown in the experimental verification results.
B. Control architecture
Fig. 1 shows a block diagram of the admittance control in this study. A disturbance observer (DOB) was used in the control system to cancel the interference between the admittance model and the PD controller. Note that admittance control based on a simple PD controller may not properly imitate the admittance model, because the PD controller itself works as a kind of stiffness controller. Table I lists the control parameters. We used a six degrees-of-freedom (six-DOF) serial-link manipulator. The system is an ordinary admittance control, and the trajectory planner providing the trajectory input is also simple. The characteristic feature of the architecture is that the stiffness matrix is given by the agent.
TABLE I PARAMETERS OF CONTROLLER
C. Trajectory modification using stiffness matrices
This subsection outlines the technique used for the online generation of non-diagonal stiffness matrices using deep RL. The agent determines the action based on the state $s$, as shown below.
\[ s = [p_x, p_y, p_z, f_x, f_y, f_z, \tau_x, \tau_y] \tag{1} \]
where $p$, $f$, and $\tau$ denote tip position, external force, and external torque, respectively, and x, y, and z in subscripts denote the axes. The action selected by the agent is a non-diagonal stiffness matrix.
Generally, the outputs in Q-learning are discrete values. Thus, the output is the action with the highest action value among a few non-diagonal stiffness matrices designed in advance. The number and type of stiffness matrices to be designed vary depending on the task. Fig. 2 shows the inputs/outputs of the DQN and the flow of input into the robot. The current robot state is input into the DQN, and the value for each action (stiffness matrix) is obtained as the output. The stiffness matrix with the highest action value is passed to the robot, and the stiffness matrix in the admittance control is updated. As discussed in earlier work, the robot motion is made relative to the external force by adapting the non-diagonal terms of the stiffness matrices. This indicates that, while the trajectories are actively modified by the agent in the conventional techniques, they can be passively modified based on the external force in the proposed methodology. While the trajectory is modified in both techniques, the introduction of a passive methodology improves the responsiveness. The basis for this is described as follows.
The diagrams for action generation considering the contact states for both the proposed and conventional methods are compared in Fig. 3. Generally, the sampling cycle of the agent is longer than the control cycle of the robot. In the proposed method, the agent generates the reference stiffness in cycles of 20 ms, and the admittance controller generates a position command based on the reference trajectory and force response in cycles of 1 ms. By setting up the optimized reference stiffness beforehand, the optimized position command is generated in cycles of 1 ms in response to the contact state, as the non-diagonal stiffness matrix modifies the trajectory passively in line with the external force. Note that the reference stiffness can be set up before the instant of contact because it does not interfere with the trajectory during free motion. In contrast, the agent generates the position/force command in cycles of 20 ms in the conventional method. In summary, while the proposed method generates actions corresponding to the contact states in cycles of 1 ms, the cycle for generating actions is 20 ms in the conventional method. Both methods modify the trajectory, but the responsiveness is improved in the proposed method owing to the shorter sampling period of command generation.
Fig. 2. Action selected by the DQN
Fig. 3. Architecture of the proposed method
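The two update rates described above can be illustrated with a small simulation. This is a minimal sketch under stated assumptions, not the authors' controller: it assumes a second-order admittance model with inertia `M`, damping `D`, and stiffness `K`, a semi-implicit Euler integrator, and a hypothetical `q_function` standing in for the trained DQN. Only the 1 ms and 20 ms cycle times come from the text.

```python
import numpy as np

CONTROL_DT = 0.001  # admittance/control cycle: 1 ms (from the text)
AGENT_DT = 0.020    # agent cycle: 20 ms (from the text)
STEPS_PER_ACTION = round(AGENT_DT / CONTROL_DT)  # 20 control steps per agent step


def select_stiffness(q_values, stiffness_set):
    """Pick the pre-designed stiffness matrix with the highest action value."""
    return stiffness_set[int(np.argmax(q_values))]


def admittance_step(dp, dv, force, K, M, D, dt=CONTROL_DT):
    """One semi-implicit Euler step of the assumed model M*a + D*v + K*dp = force."""
    acc = np.linalg.solve(M, force - D @ dv - K @ dp)
    dv = dv + acc * dt
    dp = dp + dv * dt
    return dp, dv


def run(q_function, stiffness_set, force_fn, duration, M, D):
    """Simulate the two-rate loop; return final deviation and agent update count."""
    dp, dv = np.zeros(3), np.zeros(3)
    agent_updates = 0
    K = stiffness_set[0]
    for k in range(round(duration / CONTROL_DT)):
        force = force_fn(k * CONTROL_DT)
        if k % STEPS_PER_ACTION == 0:
            # Slow loop: the agent refreshes the reference stiffness.
            K = select_stiffness(q_function(dp, force), stiffness_set)
            agent_updates += 1
        # Fast loop: the position deviation reacts to the force every 1 ms.
        dp, dv = admittance_step(dp, dv, force, K, M, D)
    return dp, agent_updates
```

Between two agent updates the stiffness is fixed, yet the deviation still follows the measured force at every control step, which is the passive trajectory modification the paragraph describes.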
D. Design of stiffness matrices
As described in the previous subsection, the stiffness matrix, which is the action in the discrete space, is designed in advance so that the robot motion is guided in the desired direction. One of the advantages of outputting the stiffness matrices as the action of the agent is that the admittance model leads to a solid design of stiffness matrices, as described below.
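First, when the admittance model is stable, the trajectory deviation given by the admittance model converges to the product of the inverse stiffness matrix and the external force (3). This indicates that the local trajectory deviation can be modified by setting up the stiffness matrix for an expected force. Suppose an external force occurs only in the z direction; substituting this force into (3) yields the deviation caused by that force alone (4). To design the stiffness matrix for the desired deformation against the expected force, (4) is solved for the corresponding elements of the inverse stiffness matrix (5). After deriving the elements for the other axes similarly, the stiffness matrix can be calculated by inverting the resulting matrix (6).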
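The design steps above can be written out as a short derivation. The following is a reconstruction under stated assumptions rather than the original equations: it assumes a second-order admittance model with inertia $M$, damping $D$, and stiffness $K$, and the symbols $\Delta p$ (trajectory deviation), $\Delta p^{\mathrm{des}}$ (desired deviation), and $c_x, c_y, c_z$ (columns of $K^{-1}$) are introduced here only for illustration.

```latex
\begin{align}
M\,\Delta\ddot{p} + D\,\Delta\dot{p} + K\,\Delta p &= f
  && \text{(assumed admittance model)} \\
\Delta p &= K^{-1} f
  && \text{(steady state of a stable model)} \\
\Delta p &= K^{-1}\begin{bmatrix} 0 \\ 0 \\ f_z \end{bmatrix} = f_z\, c_z
  && \text{(force only along } z\text{)} \\
c_z &= \frac{\Delta p^{\mathrm{des}}}{f_z}
  && \text{(desired deviation for the expected } f_z\text{)} \\
K &= \begin{bmatrix} c_x & c_y & c_z \end{bmatrix}^{-1}
  && \text{(after choosing } c_x, c_y \text{ in the same way)}
\end{align}
```

Non-zero off-axis entries in $c_z$ are what make $K$ non-diagonal: a force along z then produces a deviation in x and y as well.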
Fig. 4. Start and hole position in peg-in-hole task
Although the above description gives an example in which the expected force acts along only one axis, the design can be extended to any direction by applying a rotation matrix.
Multiple types of stiffness matrices are derived in advance because different actions should be selected to converge the robot motion to a certain condition from different states.
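The design idea can be tried numerically. The sketch below is an illustration under assumptions, not the authors' design code: the function names are hypothetical, the steady-state relation "deviation = inverse stiffness times force" is taken from the text, and the only number taken from the paper is the +5 mm / -5 mm deviation per 1 N of z force used later in the peg-in-hole section. All other entries are placeholders.

```python
import numpy as np


def design_stiffness(compliance_columns):
    """Build K from desired steady-state deviations per 1 N of force.

    compliance_columns[:, j] is the desired deviation [m] caused by a 1 N
    force along axis j (x, y, z), so that deviation = inv(K) @ force.
    """
    compliance = np.asarray(compliance_columns, dtype=float)
    if compliance.shape != (3, 3):
        raise ValueError("expected a 3x3 matrix of deviations per unit force")
    return np.linalg.inv(compliance)


def rotate_stiffness(K, R):
    """Express K in a frame rotated by the rotation matrix R."""
    return R @ K @ R.T


# Columns: response to 1 N along x, y, z. The x and y entries of the z column
# follow the +5 mm / -5 mm example; every other number is a hypothetical
# placeholder chosen only to keep the matrix invertible.
C = np.array([
    [1e-3, 0.0, 5e-3],
    [0.0, 1e-3, -5e-3],
    [0.0, 0.0, 1e-3],
])
K_example = design_stiffness(C)
```

Rotating `K_example` by 90 degrees about z with `rotate_stiffness` turns the (+x, -y) guidance direction into (+x, +y), which is one way to obtain several matrices for different approach directions from a single design.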
E. Trajectory planning
As shown in Fig. 4, a simple, pre-determined trajectory is provided, moving from the start position downward along the z-axis at constant velocity. Moreover, for simplification, the trajectory begins with negative $p_x$ and positive $p_y$ in the experiments. The external torques $\tau_x$ and $\tau_y$ are related to the position errors in the x-y plane when the peg and the hole partially overlap during contact. To detect minute torque through the peg and let the agent estimate the contact state, the bottom of the peg and the hole need to be aligned. The preconditions for this peg-in-hole task are as follows.
Preconditions for peg-in-hole task
The RL algorithm starts with a random exploration of the action space and strives to maximize the cumulative reward $R_t$:
where $\gamma$ and $r_t$ are the discount factor and the current reward, respectively.
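The cumulative reward can be made concrete with a short function. This sketch assumes the standard discounted return, that is, the sum of $\gamma^k r_{t+k}$ over future steps; the function name is hypothetical and the finite reward list stands in for one episode.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**k * rewards[k], accumulated backward for numerical clarity."""
    total = 0.0
    for r in reversed(rewards):
        total = r + gamma * total
    return total
```

For example, three rewards of 1.0 with a discount factor of 0.5 give 1 + 0.5 + 0.25 = 1.75.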
Q-learning finds a policy that maximizes the expected value of the total reward over successive steps, starting from the current state as described in the following formula.
Fig. 5. Robot arm used in this study
Here, the action value function $Q$ is updated as follows:
where $\alpha$ is the learning rate.
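The update can be sketched in its standard tabular textbook form, Q(s, a) ← Q(s, a) + α (r + γ max Q(s', ·) − Q(s, a)). This is an assumption about the form of the update, shown on a small table for readability; the paper itself approximates Q with a deep network, and the function name and the `done` flag are hypothetical.

```python
import numpy as np


def q_update(Q, s, a, r, s_next, alpha, gamma, done=False):
    """Apply one temporal-difference update to the table Q in place."""
    # At a terminal step there is no future value to bootstrap from.
    target = r if done else r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (target - Q[s, a])
    return Q
```

With α = 0.5, γ = 0.9, a reward of 1.0, and a best next-state value of 2.0, an entry that starts at 0 moves to 0.5 × (1.0 + 0.9 × 2.0) = 1.4.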
Deep Q-learning is a method to approximate the action value function Q by a deep neural network (DNN) model. Substituting the parameters of the DNN model w into (9), the following formula is given.
The $\epsilon$-greedy algorithm is introduced to avoid local minima, while $\epsilon$ is gradually reduced by a fixed reduction ratio. The policy is described as follows by using the $\epsilon$-greedy algorithm
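A minimal sketch of this exploration scheme is given below. It assumes the common form in which a random action is taken with probability ε and the greedy action otherwise, and in which ε is multiplied by a constant ratio after each episode; the function names, the `floor` parameter, and the multiplicative schedule are assumptions, since the exact schedule is not recoverable from the text.

```python
import numpy as np


def epsilon_greedy(q_values, epsilon, rng):
    """Return a random action index with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))
    return int(np.argmax(q_values))


def decay_epsilon(epsilon, ratio, floor=0.0):
    """Shrink epsilon by a multiplicative ratio, never going below floor."""
    return max(floor, epsilon * ratio)
```

Setting `epsilon` to zero at test time makes the policy purely greedy, which removes the occasional improper selections caused by exploration.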
A. Verification system set-up
Fig. 5 shows the six-DOF robot manipulator used in the experiment. A three-finger gripper was fixed on a six-axis force sensor, and the net force and moment applied on the grasped object were measured. Fig. 6 gives an overview of the proposed system. TCP/IP communication is used to transmit NN input data between the robot’s control computer and the NN control computer.
B. Comparison of control performance
First, control performances with different sampling times of the admittance model calculation were compared in a peg-in-hole task. Since the admittance model generates the position reference of local trajectory optimization, the result with a 20 ms sampling time should be similar to that of a conventional RL algorithm with position command output. By shortening the sampling time, the deviation of the force response during contact was reduced. Additionally, the recognition time of contact, that is, the time interval between the contact time and the time the agent selected the action for contact, was also shortened owing to the smaller deviation of force information. The trial with a 20 ms sampling time ended in unstable motion, which implies that the conventional methods require a more conservative control setup for contact motion. In sum, shortening the sampling time of the admittance model is essential for improving the performance of contact tasks.
Fig. 6. Configuration of the proposed system
TABLE II CONTROL PERFORMANCE WITH DIFFERENT SAMPLING TIME OF
C. Peg-in-hole task
The diameters of the peg and hole are 10.05 and 10.07 mm, respectively, leaving a gap of 20 µm. For the peg-in-hole task, a stiffness matrix is used where the interference is only in the translational direction. The matrix in the rotational direction has a constant value as below.
Considering the preconditions for this task, the stiffness matrices can be narrowed down to four, as shown in Fig. 7. When the peg is at location 1), the stiffness matrix for that location should be selected as the action, and the displacement in the direction of the hole is generated by the admittance model. For example, if a 1 N contact force is generated on the z-axis, the admittance model modifies the trajectory by +5 mm and -5 mm in the x- and y-axes, respectively. When the peg is at location 2) or 3), the corresponding stiffness matrix should be selected as the action, and the displacement is generated in the x or y direction, respectively. The peg being at location 4) constitutes the insertion phase; thus, the task can be performed robustly against any angle disturbance by decreasing the stiffness in the x and y directions and increasing the stiffness in the z direction. The state accepted by the agent is redefined as follows. The state in (1) is normalized because the amplitudes of the force and torque differ from each other.
where i, g, and m in superscripts denote initial, goal, and maximum values, respectively. $f_x^m$, $f_y^m$, $f_z^m$, $\tau_x^m$, and $\tau_y^m$ were set as 20 N, 20 N, 40 N, 1 Nm, and 1 Nm, respectively.
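A possible normalization consistent with this description is sketched below. It is an assumption, not the original formula: forces and torques are divided by the maximum values listed above, and each position component is mapped so that the initial position becomes 1 and the goal position becomes 0. The function name and the exact position mapping are hypothetical.

```python
import numpy as np

# Maximum force/torque values from the text: 20 N, 20 N, 40 N, 1 Nm, 1 Nm.
WRENCH_MAX = np.array([20.0, 20.0, 40.0, 1.0, 1.0])


def normalize_state(p, wrench, p_init, p_goal):
    """Return an 8-element state: normalized position (3), force (3), torque (2).

    Assumption: position is scaled per axis so that p_init -> 1 and p_goal -> 0;
    the wrench [fx, fy, fz, tx, ty] is divided by its maximum values.
    """
    p, p_init, p_goal = (np.asarray(v, dtype=float) for v in (p, p_init, p_goal))
    span = p_init - p_goal
    if np.any(span == 0.0):
        raise ValueError("initial and goal positions must differ on every axis")
    position = (p - p_goal) / span
    return np.concatenate([position, np.asarray(wrench, dtype=float) / WRENCH_MAX])
```

Scaling every component to a comparable range keeps the small torque signals from being dominated by the much larger force values at the network input.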
The reward was designed as follows:
K is the maximum number of steps, which was 500 in this study.
Fig. 8 shows the learning progress in the case of a 20 µm clearance with a 0 degree tilt angle and a 3 mm initial offset. Means over a moving window of 20 episodes are shown as a solid line, and their 90% confidence interval is overlaid as the gray area. The figure shows that the reward converged in about 150 episodes.
Fig. 9 shows the position during task execution and the force responses. The maximum absolute force and moment applied on the finger are 6 N and 0.75 Nm, respectively, which are much smaller than a reaction force that would damage the parts. Fig. 11 shows the success rates of the task with different initial position errors. Since the position accuracy of the robot is not finer than 1 mm, various initial position errors were produced by adjusting the initial position of the hole with a manual x-y linear stage. When the position errors for the x and y axes were both smaller than 3 mm, the success rate was 100%. On the other hand, the success rate was low when the position error was larger than 3 mm in one of the axes and small in the other axis. Since the overlap between the peg and the hole was too small, the agent could not find a proper action. When the error was larger in both axes, the success rate was still high although precondition 1 was not met. Since the stiffness matrix for location 1) was selected as the action when there was no overlap between the peg and the hole, the peg moved toward the hole and, as a result, precondition 1 was eventually met.
Fig. 7. Ideal selection of stiffness matrices in peg-in-hole task
Fig. 12 shows the actions selected during the learning phase of a peg-in-hole task. Some samples show that improper stiffness matrices were selected at times because of the $\epsilon$-greedy algorithm. This phenomenon can be eliminated during the test phase by disabling the $\epsilon$-greedy algorithm. Other samples show that one stiffness matrix was mainly selected at the early stage, when the position value was larger, that a different one was selected when the position was around -0.033 m, and that yet another was selected when the peg center was located in the center of the hole. Note that the exact position of the hole is unknown because the robot generally has a position error of a few millimeters. Hence, the results imply that the force/torque information had a strong influence on the agent's selection of the stiffness matrices. Additionally, it is evident from Fig. 12 that the agent was trained so that the stiffness matrices guide the peg in the direction in which the hole exists.
To show the robustness and the rapid responses of the proposed method, we performed experiments with pegs of 20 µm clearance. The performance can be seen in a video at https://youtu.be/gxSCl7Tp4-0. The peg-in-hole task was executed 100 times after learning to measure the time for the peg-in-hole motion. We examined two cases: (a) (-3 mm, 3 mm) initial offsets in the x-y plane and a 0 degree tilt angle, and (b) (-3 mm, 3 mm) initial offsets in the x-y plane and a 2 degree tilt angle. Fig. 13 shows the distribution of the execution time using histograms. The results show that the search time and the insertion time were shortened with the proposed method in both cases. The average total times were 1.64 s and 1.87 s, respectively, which are less than 50% of the time reported in the previous state-of-the-art study.
Fig. 8. Reward during learning
Fig. 9. Position and force responses during a peg-in-hole task
D. Gear-insertion task
For the gear-insertion task, conditions similar to those of the peg-in-hole task were used: a 20 µm clearance and a 10 mm radius peg. The same four non-diagonal stiffness matrices were used in the action space. However, the robot also needed to align the gear teeth with those of two pre-fixed gears. Therefore, another non-diagonal stiffness matrix was added to the action space as follows:
Fig. 10. Snapshots of a peg-in-hole task
Fig. 11. Success rate of peg-in-hole task with different position error
Since the angles of the two pre-fixed gears were unknown, gear-teeth alignment was done by rotating around the z-axis with a sinusoidal wave after insertion. Fig. 14 shows snapshots of an experiment. Commercial spur gears with module 2 for gear box assembly were used in this study to evaluate the performance under a practical standard. Fig. 15 shows the histogram of the execution time. The results show that the execution time strongly depends on the teeth alignment time, which varies stochastically. One noticeable point of this experiment is that a contact-rich task with several contact transitions was accomplished with a simple linear trajectory and a variable admittance model. The gear-insertion task in the video (see https://youtu.be/gxSCl7Tp4-0) shows that the grasped gear moved toward the peg center after contact and also rotated around the z-axis after insertion, although the planner only generates a linear trajectory when there is no contact. Another interesting point is that the distribution of the search time was similar to that of Fig. 13, despite the difference between inserting a grasped peg into a hole and inserting a fixed peg into a grasped part. The results show that these two cases can be treated with a similar control architecture.
Fig. 12. Selected actions during peg-in-hole task
Fig. 13. Histogram of execution time for peg-in-hole task
Fig. 14. Snapshots of gear-insertion task
This study proposed a method for the online generation of non-diagonal stiffness matrices for admittance control using deep Q-learning. In conventional learning-based approaches to contact-rich tasks, actions corresponding to the contact states are generated at the agent’s update cycle. In contrast, the proposed method performs local trajectory optimization in line with the contact states at the robot’s control cycle. The responsiveness and the robustness were evaluated through experiments on a peg-in-hole task and a gear-insertion task under different conditions. This study handled contact-rich tasks with an approach involving a simple trajectory and online generation of stiffness matrices. As the proposed method allows for parallel connection with a trajectory planning module, it is expected to be applied to a variety of contact-rich manipulations.
Fig. 15. Histogram of execution time for gear-insertion task
N. Hogan, “Impedance control: An approach to manipulation: Part I, theory,” Journal of Dynamic Systems, Measurement, and Control, vol. 107, no. 1, pp. 1–7, 1985.
 J. K. Salisbury, “Active stiffness control of a manipulator in cartesian coordinates,” in Proc. Conference on Decision and Control Including the Symposium on Adaptive Processes. IEEE, pp. 95–100, 1980.
D. Lawrence and R. Stoughton, “Position-based impedance control: achieving stability in practice,” in Proc. Guidance, Navigation and Control Conference, p. 2265, 1987.
H. Kazerooni, B. Waibel, and S. Kim, “On the stability of robot compliant motion control: Theory and experiments,” Journal of Dynamic Systems, Measurement, and Control, vol. 112, no. 3, pp. 417–426, 1990.
 O. Khatib, “A unified approach for motion and force control of robot manipulators: The operational space formulation,” IEEE Journal on Robotics and Automation, vol. 3, no. 1, pp. 43–53, 1987.
 S. Levine, C. Finn, T. Darrell, and P. Abbeel, “End-to-end training of deep visuomotor policies,” The Journal of Machine Learning Research, vol. 17, no. 1, pp. 1334–1373, 2016.
M. P. Deisenroth, G. Neumann, J. Peters et al., “A survey on policy search for robotics,” Foundations and Trends in Robotics, vol. 2, no. 1–2, pp. 1–142, 2013.
 J. Kober, J. A. Bagnell, and J. Peters, “Reinforcement learning in robotics: A survey,” The International Journal of Robotics Research, vol. 32, no. 11, pp. 1238–1274, 2013.
 T. Inoue, G. De Magistris, A. Munawar, T. Yokoya, and R. Tachibana, “Deep reinforcement learning for high precision assembly tasks,” in Proc. IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2017, pp. 819–825.
 M. Vecerik, O. Sushkov, D. Barker, T. Rothorl, T. Hester, and J. Scholz, “A practical approach to insertion with variable socket position using deep reinforcement learning,” arXiv preprint arXiv:1810.01531, 2018.
 T. Ren, Y. Dong, D. Wu, and K. Chen, “Learning-based variable compliance control for robotic assembly,” Journal of Mechanisms and Robotics, vol. 10, no. 6, p. 061008, 2018.
M. A. Lee, Y. Zhu, K. Srinivasan, P. Shah, S. Savarese, L. Fei-Fei, A. Garg, and J. Bohg, “Making sense of vision and touch: Self-supervised learning of multimodal representations for contact-rich tasks,” in Proc. 2019 IEEE International Conference on Robotics and Automation (ICRA), 2019.
 T. Ren, Y. Dong, D. Wu, and K. Chen, “Fast skill learning for variable compliance robotic assembly,” arXiv preprint arXiv:1905.04427, 2019.
 G. Schoettler, A. Nair, J. Luo, S. Bahl, J. A. Ojea, E. Solowjow, and S. Levine, “Deep reinforcement learning for industrial insertion tasks with visual inputs and natural reward signals,” arXiv preprint arXiv:1906.05841, 2019.
 J. Luo, E. Solowjow, C. Wen, J. A. Ojea, A. M. Agogino, A. Tamar, and P. Abbeel, “Reinforcement learning on variable impedance controller for high-precision robotic assembly,” in Proc. International Conference on Robotics and Automation (ICRA), pp. 3080–3087, 2019.
 C. Liu, B. Zoph, M. Neumann, J. Shlens, W. Hua, L.-J. Li, L. Fei-Fei, A. Yuille, J. Huang, and K. Murphy, “Progressive neural architecture search,” in Proc. European Conference on Computer Vision (ECCV), 2018, pp. 19–34.
A. A. Rusu, M. Vecerik, T. Rothörl, N. Heess, R. Pascanu, and R. Hadsell, “Sim-to-real robot learning from pixels with progressive nets,” arXiv preprint arXiv:1610.04286, 2016.
 X. B. Peng, M. Andrychowicz, W. Zaremba, and P. Abbeel, “Sim-to-real transfer of robotic control with dynamics randomization,” in Proc. International Conference on Robotics and Automation (ICRA). IEEE, 2018, pp. 1–8.
 M. Ang and G. B. Andeen, “Specifying and achieving passive compliance based on manipulator structure,” IEEE Trans. on Robotics and Automation, vol. 11, no. 4, pp. 504–515, 1995.
S. Lee, “Development of a new variable remote center compliance (VRCC) with modified elastomer shear pad (ESP) for robot assembly,” IEEE Trans. on Automation Science and Engineering, vol. 2, no. 2, pp. 193–197, 2005.
K. Sharma, V. Shirwalkar, and P. K. Pal, “Intelligent and environment-independent peg-in-hole search strategies,” in Proc. Int. Conf. on Control, Automation, Robotics and Embedded Systems (CARE), 2013.
 K. Ohnishi, M. Shibata, and T. Murakami, “Motion control for advanced mechatronics,” IEEE/ASME Trans. on Mechatronics, vol. 1, no. 1, pp. 56–67, 1996.
T. Kaneko, M. Sekiya, K. Ogata, S. Sakaino, and T. Tsuji, “Force control of a jumping musculoskeletal robot with pneumatic artificial muscles,” in Proc. IEEE/RSJ Int. Conf. on Intelligent Robots and Systems (IROS), pp. 5813–5818, 2016.