
Value-Penalized Auxiliary Control from Examples for Learning without Rewards or Demonstrations

arXiv Github

Trevor Ablett1, Bryan Chan2, Jayce Haoran Wang1, Jonathan Kelly1

1University of Toronto, 2University of Alberta

Submitted to Conference on Robot Learning (CoRL) 2024


Value-penalized auxiliary control from examples (VPACE) and SQIL [1] as learning progresses on our real-world door opening task.

Table of contents

  1. Summary
  2. Approach
  3. Real Panda Results
    1. Exploratory Episodes over Time
      1. Door
      2. Drawer
    2. Success Examples for Training
    3. Final Performance
  4. Simulation Results
    1. Exploratory Episodes over Time
      1. Unstack-Stack
      2. Insert
      3. sawyer_drawer_open
      4. sawyer_box_close
      5. sawyer_bin_picking
      6. hammer-human-v0-dp
      7. relocate-human-v0-najp-dp
    2. Success Examples for Training
      1. Panda Tasks
      2. Sawyer Tasks
      3. Adroit Hand Tasks
    3. Final Performance (All Tasks)
      1. Panda Tasks
      2. Sawyer Tasks
      3. Adroit Hand Tasks
  5. Code
  6. Citation
  7. Bibliography

Summary

A visual summary of VPACE.

Learning from examples of success is an appealing approach to reinforcement learning that eliminates many of the disadvantages of using hand-crafted reward functions or full expert-demonstration trajectories, both of which can be difficult to acquire, biased, or suboptimal. However, learning from examples alone dramatically increases the exploration challenge, especially for complex tasks. In this work, the fundamental question that we aim to address is:

Is it possible to learn policies efficiently given only example states of completed tasks?

Average performance of VPACE compared with SQIL [1], RCE [2], and DAC [3].

This work introduces value-penalized auxiliary control from examples (VPACE); we significantly improve exploration in example-based control by adding scheduled auxiliary control and examples of auxiliary tasks. Furthermore, we identify a value-calibration problem, where policy value estimates can exceed their theoretical limits based on successful data. We resolve this problem, which is exacerbated by learning auxiliary tasks, through the addition of an above-success-level value penalty. Across three simulated and one real robotic manipulation environment, and 21 different main tasks, we show that our approach substantially improves learning efficiency.
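To make the calibration issue concrete, here is a schematic statement of the intuition (not the exact formulation from the paper): because example states \( s^\ast \) correspond to completed tasks, value estimates encountered during learning should not rise above the value estimated at those example states under the current policy \( \pi \),

\[ Q^\pi(s, a) \;\lesssim\; \mathbb{E}_{s^\ast}\!\left[ Q^\pi\big(s^\ast, \pi(s^\ast)\big) \right]. \]

Bootstrapped estimates that overshoot this level are the above-success-level values that the penalty suppresses.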

Approach

VPACE boils down to three main changes to standard off-policy inverse reinforcement learning:

  1. Expert buffers are replaced with example states \( s^\ast \in \mathcal{B}^\ast \), where the only expert data provided to the agent are examples of successfully completed tasks.
  2. Auxiliary task data is provided, in addition to main task data, following the design established by SAC-X [4] (for RL) and LfGP [5] (for IRL).
  3. To mitigate highly erroneous value estimates derived from bootstrapping, which are exacerbated by the addition of auxiliary task data, we introduce a simple scheme for value penalization based on the current value estimate for example states (see the sketch after this list).
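A minimal PyTorch-style sketch of the value penalty in change 3. This is illustrative only: the names (q_net, policy, example_states, penalty_weight) are placeholders, the choice of a max over a mean for the success level is an assumption, and the exact penalty form used in VPACE is given in the paper.

```python
# Illustrative sketch: squared-hinge penalty on Q-estimates that exceed
# the current value estimate at example (success) states.
import torch
import torch.nn.functional as F

def critic_loss_with_value_penalty(q_net, policy, states, actions, target_q,
                                   example_states, penalty_weight=1.0):
    # Standard bootstrapped TD loss on sampled transitions.
    q_pred = q_net(states, actions)
    td_loss = F.mse_loss(q_pred, target_q)

    # Current value estimate at example success states drawn from B*.
    with torch.no_grad():
        success_q = q_net(example_states, policy(example_states))
        success_level = success_q.max()  # a mean or quantile is another design choice

    # Penalize estimates that rise above the success-level value.
    overshoot = torch.relu(q_pred - success_level)
    penalty = (overshoot ** 2).mean()

    return td_loss + penalty_weight * penalty
```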

We find that our approach improves performance and efficiency both with a separately learned reward function (as in DAC [3]) and without (as in SQIL [1] and RCE [2]).
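For reference, the two settings differ only in where the scalar reward fed to the critic comes from. A brief, hedged sketch (the helper names are illustrative, and the discriminator-based reward shown is one common adversarial imitation learning form):

```python
# Illustrative sketch of the two reward sources compared above.
import torch

def sqil_style_rewards(is_example: torch.Tensor) -> torch.Tensor:
    # SQIL-style: reward 1 for example/success transitions, 0 for agent transitions.
    return is_example.float()

def learned_rewards(discriminator, states, actions) -> torch.Tensor:
    # DAC-style: reward derived from a learned discriminator D(s, a) in (0, 1).
    d = torch.sigmoid(discriminator(states, actions))
    eps = 1e-6
    return torch.log(d + eps) - torch.log(1.0 - d + eps)
```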

For more details on our approach, see our corresponding paper.

Real Panda Results

Exploratory Episodes over Time

Door

(See video above)

Drawer

Success Examples for Training

The numerical state data corresponding to these example success images was the only signal (i.e., no reward function and no full trajectories) used for training policies in this work. We also show examples from the initial state distributions.

Final Performance

Simulation Results

Exploratory Episodes over Time

Unstack-Stack

Insert

sawyer_drawer_open

sawyer_box_close

sawyer_bin_picking

hammer-human-v0-dp

relocate-human-v0-najp-dp

Success Examples for Training

The numerical state data corresponding to these example success images was the only signal (i.e., no reward function and no full trajectories) used for training policies in this work. We also show examples from the initial state distributions.

Panda Tasks

Sawyer Tasks

Adroit Hand Tasks

The same data was used for the original and the delta-position variants.

Final Performance (All Tasks)

Panda Tasks

Sawyer Tasks

Adroit Hand Tasks

Code

Available on Github

Citation

Check back soon!

Bibliography

  1. S. Reddy, A. D. Dragan, and S. Levine, “SQIL: Imitation Learning via Reinforcement Learning with Sparse Rewards,” in Proceedings of the International Conference on Learning Representations (ICLR’20), Addis Ababa, Ethiopia, Apr. 2020.

  2. B. Eysenbach, S. Levine, and R. Salakhutdinov, “Replacing Rewards with Examples: Example-Based Policy Search via Recursive Classification,” in Advances in Neural Information Processing Systems (NeurIPS’21), Virtual, Dec. 2021.

  3. I. Kostrikov, K. K. Agrawal, D. Dwibedi, S. Levine, and J. Tompson, “Discriminator-Actor-Critic: Addressing Sample Inefficiency and Reward Bias in Adversarial Imitation Learning,” in Proceedings of the International Conference on Learning Representations (ICLR’19), New Orleans, LA, USA, May 2019.

  4. M. Riedmiller et al., “Learning by Playing - Solving Sparse Reward Tasks from Scratch,” in Proceedings of the 35th International Conference on Machine Learning (ICML’18), Stockholm, Sweden, Jul. 2018, pp. 4344–4353.

  5. T. Ablett, B. Chan, and J. Kelly, “Learning From Guided Play: Improving Exploration for Adversarial Imitation Learning With Simple Auxiliary Tasks,” IEEE Robotics and Automation Letters, vol. 8, no. 3, pp. 1263–1270, Mar. 2023, doi: 10.1109/LRA.2023.3236882.

