
Multimodal and Force-Matched Imitation Learning with a See-Through Visuotactile Sensor

arXiv Github Supplementary Video

Trevor Ablett1, Oliver Limoyo1, Adam Sigal2, Affan Jilani3, Jonathan Kelly1, Kaleem Siddiqi3, Francois Hogan2, Gregory Dudek2,3

1University of Toronto, 2Samsung AI Centre, Montreal, QC, Canada, 3McGill University

*work completed while Trevor Ablett, Oliver Limoyo, and Affan Jilani were on internship at Samsung AI Centre, Montreal

Submitted to IEEE Transactions on Robotics (T-RO): Special Section on Tactile Robotics


Summary

A summary of our system for multimodal and force-matched imitation learning.
A video showing: (1) kinesthetic teaching with force recording (parts 1 and 2 in diagram above), (2) a force-matched replay with mode switch label (parts 3 and 4 above), and (3) a learned multimodal policy with mode-switch action (part 5 above).

Motivation

When we open doors, our fingers roll, slip, use normal and shear forces, respond to dense tactile feedback, move quickly, and can manipulate small, challenging knobs. In this paper, we combine tactile sensing with imitation learning to give robots some of these same desirable qualities.

Kinesthetic teaching is a popular approach to collecting expert robotic demonstrations of contact-rich tasks for imitation learning (IL), but it typically only measures motion, ignoring the force placed on the environment by the robot. Furthermore, contact-rich tasks require accurate sensing of both reaching and touching, which can be difficult to provide with conventional sensing modalities. We address these challenges with a see-through-your-skin (STS) visuotactile sensor, using the sensor both (i) as a measurement tool to improve kinesthetic teaching, and (ii) as a policy input in contact-rich door manipulation tasks.

Why Kinesthetic Teaching?

Teleoperation (video in real time). Requires a proxy for contact feedback, and is slow.

Teleoperation is a viable strategy for collecting demonstrations, but it requires a proxy for contact feedback. In the teleoperation example above, we used the vibration motor in the VR controller and a force-torque sensor to provide coarse feedback, but demonstrations were still difficult to collect. Even with considerable practice, we would often trigger the robot's protective stop by applying excessive force to the environment.


Left: Kinesthetic teaching.   Right: Replays of kinesthetic teaching without force matching.

Kinesthetic teaching, as shown above, provides contact feedback, alleviating this issue by allowing the operator to naturally react to environmental contact forces.

Kinesthetic teaching has a significant downside, however: the observation and action spaces during demonstrations do not match those of the robot during autonomous execution. This can be partially resolved by using replays, where the robot is commanded to reach the same poses achieved by the human during the demonstration. However, if the human imparts a force on the environment during their demonstration, this replay trajectory can fail.

Approach

See-through-Your-Skin (STS) Tactile Sensor

Left: 2022 version of STS sensor1.   Right: Version from this work, in visual mode (LEDs off) and tactile mode (LEDs on).

An STS sensor1 is a visuotactile sensor, comparable to other gel-based tactile sensors2, that can be switched between visual and tactile modes by leveraging a semi-transparent surface and controllable lighting, allowing for both pre-contact visual sensing and during-contact tactile sensing with a single sensor. In this work, we use the sensor in tactile mode to record a signal linearly related to force during demonstrations, and show its value in both visual and tactile mode as an input to learned policies through extensive experiments.

Tactile Force Matching

Left: Our force-matched trajectories are generated using a spring relationship between desired poses and recorded poses, matching the functionality of our Cartesian impedance controller. Right: Kinesthetic teaching while reading the signal linearly related to force with an STS sensor.

Our force-matched poses were generated by measuring a signal linearly related to force with an STS sensor, which was then used to modify the desired poses before they were input back into our Cartesian impedance controller. For more details, see our paper.
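As a rough illustration of this idea (the paper has the full formulation), the sketch below assumes the Cartesian impedance controller behaves approximately like a spring with a diagonal translational stiffness, and that the STS signal has been calibrated into a force estimate; the function and variable names here are ours, not from the released code.

```python
import numpy as np

def force_matched_poses(recorded_poses, recorded_forces, stiffness):
    """Offset recorded poses so an impedance controller reproduces the
    demonstrated forces during a replay.

    A Cartesian impedance controller roughly acts as a spring,
        f = K (x_desired - x_actual),
    so commanding x_desired = x_recorded + K^{-1} f_recorded makes the
    controller push on the environment with approximately the force the
    human applied during kinesthetic teaching.

    recorded_poses:  (T, 3) end-effector positions from the demonstration
    recorded_forces: (T, 3) force estimates derived from the STS signal
                     (assumed linearly related to true contact force)
    stiffness:       (3,) diagonal translational stiffness of the controller
    """
    offsets = recorded_forces / stiffness   # element-wise K^{-1} f
    return recorded_poses + offsets         # force-matched desired poses
```

During a force-matched replay, these offset poses would be streamed to the impedance controller in place of the raw recorded poses.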

Learned STS Mode Switching

Left: Our learned policy includes sensor mode as an output, in addition to robot motion. Right: Mode switch labels are provided during the force-matched replay by the demonstrator, who pushes a key when the robot makes contact with the environment.

To further leverage an STS sensor for imitation learning, we add mode switching as a policy output, allowing the policy to learn the appropriate moment to switch an STS from its visual to its tactile mode.
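Concretely, the demonstrator's keypress during a replay can be converted into per-timestep supervision for this output. The snippet below is a minimal sketch of one way to do this (the function name is hypothetical): every timestep at or after the keypress is labeled tactile, and every earlier timestep is labeled visual.

```python
def mode_switch_labels(num_steps, switch_step):
    """Per-timestep STS mode labels (0 = visual, 1 = tactile).

    num_steps:   length of the replayed trajectory
    switch_step: index at which the demonstrator pressed the key,
                 i.e., the moment of contact with the environment
    """
    return [0 if t < switch_step else 1 for t in range(num_steps)]
```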

Experiments

To verify the efficacy of force matching, learned mode switching, and tactile sensing in general, we study multiple observation configurations to compare and contrast the value of visual and tactile data from an STS with visual data from a wrist-mounted eye-in-hand camera.

Experimental Tasks

Demonstrations and force-matched replays of our four experimental tasks, GlassKnobOpen, GlassKnobClose, FlatHandleOpen, and FlatHandleClose.

In this work, we study four challenging robotic tasks: door opening and closing with both a flat handle and a glass knob, with a randomized initial position requiring a reach before first making contact. This setup is meant to simulate a true mobile manipulator, where the initial pose between the arm and the handle/knob would not be consistent between each open/close. The closing tasks require shutting the door "quietly," without allowing it to slam.

Training details

We collected 20 raw demonstrations for each task, and then completed replays of each demonstration under different conditions, such as including/excluding force matching and/or mode switching.

As shown in the Learned STS Mode Switching section, we trained policies with STS images, wrist camera images, and relative robot pose (i.e., relative to the beginning of an episode) as inputs, with delta end-effector poses and mode switch control as outputs.
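The sketch below illustrates this input/output structure only; it is not the architecture from the paper, and the encoder sizes, layer widths, and module names are illustrative assumptions.

```python
import torch
import torch.nn as nn

class MultimodalPolicy(nn.Module):
    """Sketch of a policy mapping (STS image, wrist image, relative pose)
    to (delta end-effector pose, STS mode-switch logit)."""

    def __init__(self, img_feat_dim=64, pose_dim=7, hidden_dim=256):
        super().__init__()
        def make_encoder():
            return nn.Sequential(
                nn.Conv2d(3, 16, 5, stride=2), nn.ReLU(),
                nn.Conv2d(16, 32, 3, stride=2), nn.ReLU(),
                nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                nn.Linear(32, img_feat_dim), nn.ReLU(),
            )
        self.sts_encoder = make_encoder()     # STS image (visual or tactile mode)
        self.wrist_encoder = make_encoder()   # eye-in-hand wrist camera image
        self.trunk = nn.Sequential(
            nn.Linear(2 * img_feat_dim + pose_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
        )
        self.delta_pose_head = nn.Linear(hidden_dim, 6)   # delta translation + rotation
        self.mode_switch_head = nn.Linear(hidden_dim, 1)  # logit: visual vs. tactile mode

    def forward(self, sts_img, wrist_img, rel_pose):
        feat = torch.cat([
            self.sts_encoder(sts_img),
            self.wrist_encoder(wrist_img),
            rel_pose,                          # pose relative to episode start
        ], dim=-1)
        h = self.trunk(feat)
        return self.delta_pose_head(h), self.mode_switch_head(h)
```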

For all experiments, we trained 3 seeds of each policy, and then ran each policy for 10 episodes in each configuration.

Results – Overall

The videos below show learned policies with force matching, mode switching, and STS data included.

Examples of our learned policies for the four tasks we tested in this work.   Left: GlassKnobOpen, GlassKnobClose.   Right: FlatHandleOpen, FlatHandleClose.
Left: Performance results with/without force matching, mode switching, and STS data. Right: Example runs with no force matching, no mode switching, and no STS data.

Results – Force Matching

Check back soon!

Results – Mode Switching

Check back soon!

What if we only use the Wrist Camera?

Check back soon!

Conclusion

Through over 3,000 test episodes from real-world manipulation experiments, we find that the inclusion of force matching raises average policy success rates by 62.5%, STS mode switching by 30.3%, and STS data as a policy input by 42.5%. Our results highlight the utility of see-through tactile sensing for IL, both for data collection to allow force matching, and for policy execution to allow accurate task feedback.

Code

Available on GitHub

Citation

@misc{ablettMultimodalForceMatchedImitation2023,
  title = {Multimodal and Force-Matched Imitation Learning with a See-Through Visuotactile Sensor},
  author = {Ablett, Trevor and Limoyo, Oliver and Sigal, Adam and Jilani, Affan and Kelly, Jonathan and Siddiqi, Kaleem and Hogan, Francois and Dudek, Gregory},
  year = {2023},
  month = dec,
  number = {arXiv:2311.01248},
  eprint = {2311.01248},
  primaryclass = {cs},
  publisher = {arXiv},
  doi = {10.48550/arXiv.2311.01248},
  urldate = {2024-02-15},
  archiveprefix = {arxiv},
  keywords = {Computer Science - Artificial Intelligence,Computer Science - Machine Learning,Computer Science - Robotics}
}

Bibliography

  1. F. R. Hogan, J.-F. Tremblay, B. H. Baghi, M. Jenkin, K. Siddiqi, and G. Dudek, “Finger-STS: Combined Proximity and Tactile Sensing for Robotic Manipulation,” IEEE Robotics and Automation Letters, vol. 7, no. 4, pp. 10865–10872, Oct. 2022, doi: 10.1109/LRA.2022.3191812. 

  2. C. Chi, X. Sun, N. Xue, T. Li, and C. Liu, “Recent Progress in Technologies for Tactile Sensors,” Sensors, vol. 18, no. 4, Art. no. 4, Apr. 2018, doi: 10.3390/s18040948. 


Space and Terrestrial Autonomous Systems Lab - University of Toronto Institute for Aerospace Studies