# Learning from Guided Play: Improving Exploration in Adversarial Imitation Learning with Simple Auxiliary Tasks

### Trevor Ablett*, Bryan Chan*, Jonathan Kelly (*equal contribution)

#### Submitted to Robotics and Automation Letters (RA-L) with IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS’22) Option

Also presented as Learning from Guided Play: A Scheduled Hierarchical Approach for Improving Exploration in Adversarial Imitation Learning

#### Poster at NeurIPS 2021 Deep Reinforcement Learning Workshop

Left: Off-policy Adversarial Imitation Learning (DAC) [1]. Right: Learning from Guided Play (LfGP).

In this work, we investigated the efficacy of Adversarial Imitation Learning (AIL) on manipulation tasks. AIL is a popular form of Inverse Reinforcement Learning (IRL) in which a discriminator, acting as a reward, and a policy are learned simultaneously using expert data. Empirically, we found that a state-of-the-art off-policy method for AIL [1] is unable to effectively solve a variety of manipulation tasks. We demonstrated that this is because AIL is susceptible to deceptive rewards [2], where a locally optimal policy sufficiently matches the expert distribution without necessarily solving the task. A simplified example where this occurs is shown below:

A simple MDP where AIL learns a deceptive reward and a suboptimal policy.

The example above can be thought of as analogous to a stacking task. States $$s^2$$ through $$s^6$$ represent the first block being reached, grasped, lifted, moved to the second block, and dropped, respectively, while $$s^1$$ is the reset state. Action $$a^{nm}$$ refers to moving from $$s^n$$ to $$s^m$$, so $$a^{15}$$ represents the second block being reached without grasping the first block. Taking action $$a^{55}$$ in $$s^5$$ represents opening the gripper. After taking $$a^{15}$$, this results in a return of $$-1$$ (because $$R(s^1, a^{15}) = -5$$), since the first block has not actually been grasped in this case.

AIL learns to exploit the $$a^{15}$$ action without actually completing the full trajectory.
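To make the idea of a discriminator acting as a reward more concrete, here is a minimal sketch, not the implementation used in this work. It assumes a logistic-regression discriminator over toy 2D state-action features and one common AIL reward form, $$\log D - \log(1 - D)$$, which equals the discriminator logit. The function names (`train_discriminator`, `ail_reward`) and the synthetic data are hypothetical and chosen only for illustration.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def train_discriminator(expert_sa, policy_sa, steps=500, lr=0.5):
    """Fit D(s, a) = sigmoid(w . x + b) to separate expert from policy samples."""
    x = np.vstack([expert_sa, policy_sa])
    # Label expert samples 1 and policy samples 0.
    y = np.concatenate([np.ones(len(expert_sa)), np.zeros(len(policy_sa))])
    w = np.zeros(x.shape[1])
    b = 0.0
    for _ in range(steps):
        p = sigmoid(x @ w + b)
        grad = p - y  # gradient of binary cross-entropy w.r.t. the logits
        w -= lr * (x.T @ grad) / len(y)
        b -= lr * grad.mean()
    return w, b


def ail_reward(sa, w, b):
    """Reward log D - log(1 - D), which simplifies to the logit w . x + b."""
    return sa @ w + b


# Toy, synthetic data (an assumption for illustration only).
rng = np.random.default_rng(0)
expert_sa = rng.normal(loc=1.0, scale=0.5, size=(200, 2))
policy_sa = rng.normal(loc=-1.0, scale=0.5, size=(200, 2))

w, b = train_discriminator(expert_sa, policy_sa)
expert_reward = ail_reward(expert_sa, w, b).mean()
policy_reward = ail_reward(policy_sa, w, b).mean()
print(f"mean reward, expert-like samples: {expert_reward:.2f}")
print(f"mean reward, policy samples:      {policy_reward:.2f}")
```

Note that a reward of this kind only measures how expert-like a state-action pair looks to the discriminator, not whether the task was solved, which is what leaves room for a deceptive reward like the one in the example above.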

To mitigate this problem, we introduced a scheduled hierarchical modification [3] to off-policy AIL in which multiple discriminators, policies, and critics are all learned simultaneously. The agent solves a variety of auxiliary tasks in addition to a main task, while still ultimately attempting to maximize main task performance. We called this method Learning from Guided Play (LfGP), inspired by the play-based learning found in children, as opposed to goal-directed learning. Using expert data, the agent is guided to playfully explore parts of the state and action space that would otherwise have been avoided. The title also refers to the collection of this expert data: the expert is guided, in our case by a uniform sampler, to fully explore an environment through play. This not only significantly improved the performance of AIL with an equivalent amount of expert data, but also allowed for the reuse of auxiliary task models and expert data between main tasks through transfer learning.
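The scheduling idea described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the scheduler from the LfGP repository: it assumes a uniform sampler that picks a new task at fixed intervals within an episode. The function name `uniform_schedule`, the episode length, and the switching period are hypothetical values chosen for the example.

```python
import random

# Auxiliary tasks named on this page, plus one main task (Stack).
TASKS = ["open-gripper", "close-gripper", "reach", "lift", "move-object", "stack"]


def uniform_schedule(tasks, episode_length, period, rng):
    """Return the active task at each timestep of one episode.

    A new task is sampled uniformly at random every `period` steps
    (an assumed, illustrative scheduling rule).
    """
    schedule = []
    current = None
    for t in range(episode_length):
        if t % period == 0:
            current = rng.choice(tasks)
        schedule.append(current)
    return schedule


rng = random.Random(0)
# Hypothetical episode length and switching period.
schedule = uniform_schedule(TASKS, episode_length=360, period=45, rng=rng)

# Print the task chosen at the start of each period.
print([schedule[i] for i in range(0, len(schedule), 45)])
```

In a full system, the policy for the currently scheduled task would choose the action at each timestep, so a single episode composes several tasks rather than executing only the main task.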

An example of DAC’s poor performance is shown on the left side of the video at the top of the page, and the improved exploration exhibited by LfGP is shown on the right. The diagram below is a simplified description of our multitask environment and the different types of play used in our method.

The main components of our system for learning from guided play.

We created a simulated multitask environment which is available for use here, and is automatically installed and used when training using the open-source repository for LfGP.

Here are examples of each of the four main tasks studied in this work:

Left to right: Stack, Unstack-Stack, Bring, Insert.

As stated, we also used simple-to-define auxiliary tasks to assist in learning and allow for the reuse of expert data and learned models:

Left to right: Open-Gripper, Close-Gripper, Reach, Lift, Move-Object.

### Results

We compared LfGP with the following multitask methods:

• A no-scheduler variant of our method, where only the main task is executed (LfGP-NS)
• Multitask Behaviour Cloning (BC (Multi))

and the following single-task methods:

• Discriminator Actor-Critic (DAC)
• Behaviour Cloning (BC)
• Behaviour Cloning with the same amount of main task data as the multitask methods (BC (less data))
• Generative Adversarial Imitation Learning (GAIL)

To make the comparison as fair as possible, we gave the single-task methods the same total amount of expert data as the multitask methods. However, it is important to note that the single-task methods cannot reuse data between tasks. The following table describes this idea in further detail:

Expert dataset sizes for each task, including quantity of auxiliary task data. Each letter under "Dataset Sizes" corresponds to an auxiliary or main task, and bolded letters correspond to datasets that were reused (e.g., Open-Gripper, Stack). Numbers are total number of (s,a) pairs.

Our main performance results are shown below.

Final performance results on each main task.

Although single-task BC beats LfGP in three out of four tasks, remember that it cannot reuse expert data or learned models between tasks. This result also shows that there is further work to be done to better understand the differences in performance between AIL and BC, especially since, with the exact same data, single-task BC dramatically outperforms DAC in all of our environments.

The results of our simple transfer learning are shown here:

Transfer learning using existing models. See our code and paper for more implementation details.

In three out of four main tasks, transfer learning shows improved learning speed. We used a very simple method for transfer learning, in which we reused the existing buffer and models as warm-starts for a new main task. We believe that, with future work, a more efficient method for transfer learning could do even better.
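A warm-start of this kind might look roughly like the sketch below. It is an illustration only and does not reflect the actual LfGP code: the dictionary layout, the `warm_start` function, and the choice to freshly initialize the new main task model are all assumptions made for the example.

```python
import copy


def warm_start(previous_run, new_main_task, init_model):
    """Build the starting point for a new main task from a previous run.

    Assumptions (hypothetical): auxiliary task models are copied over,
    the new main task model starts from scratch, and the replay buffer
    is reused as-is.
    """
    models = {
        task: copy.deepcopy(model)
        for task, model in previous_run["models"].items()
        if task != previous_run["main_task"]
    }
    models[new_main_task] = init_model()
    return {
        "main_task": new_main_task,
        "models": models,
        "buffer": list(previous_run["buffer"]),
    }


# Placeholder "models" and "transitions" standing in for real networks and data.
previous_run = {
    "main_task": "stack",
    "models": {
        "reach": {"weights": [0.3, -0.1]},
        "lift": {"weights": [0.7, 0.2]},
        "stack": {"weights": [0.5, 0.5]},
    },
    "buffer": [("s0", "a0", "s1"), ("s1", "a1", "s2")],
}

new_run = warm_start(previous_run, "bring", init_model=lambda: {"weights": [0.0, 0.0]})
print(sorted(new_run["models"]))
```

Copying the models rather than sharing them keeps the previous run intact, so the same auxiliary models can serve as a starting point for several new main tasks.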

### Analysis

We also visualized the learned stack models for LfGP and DAC.

LfGP and DAC trajectories for eight episodes throughout learning, with manually set consistent initial conditions. The LfGP trajectories contain many tasks composed, whereas the DAC trajectories only execute the main stacking task.

The LfGP policies explore significantly more diversely than the DAC policies do, and the DAC policies eventually learn to partially reach the blue block and hover near the green block. This is understandable: DAC has learned a deceptive reward for hovering above the green block regardless of the position of the blue block, because it hasn’t sufficiently explored the alternative of grasping and moving the blue block closer. Even if hovering above the green block doesn’t fully match the expert data, the DAC policy receives some reward for doing so, as evidenced by its learned Q-value (DAC is the right-most image):

A view of a single plane of learned mean policy actions (arrow for velocity direction/magnitude, blue indicates open gripper, green indicates close) and Q values (red: high, yellow: low) for each LfGP task and DAC at 200k environment steps.

Again, the LfGP policies have made significantly more progress towards each of their individual tasks than DAC has. The LfGP Stack policy, in particular, has already learned to reach and grasp the block, while learning high value for being near either block. Further on in training, it learns to only have high value near the green block after having grasped the blue block, an important step that DAC never achieves.

For more details, see our arXiv paper!

## Code

Available on GitHub

## Citation

@misc{ablett2021learning,
title={Learning from Guided Play: A Scheduled Hierarchical Approach for Improving Exploration in Adversarial Imitation Learning},
author={Trevor Ablett and Bryan Chan and Jonathan Kelly},
year={2021},
eprint={2112.08932},
archivePrefix={arXiv},
primaryClass={cs.LG}
}


## Bibliography

1. I. Kostrikov, K. K. Agrawal, D. Dwibedi, S. Levine, and J. Tompson, “Discriminator-Actor-Critic: Addressing Sample Inefficiency and Reward Bias in Adversarial Imitation Learning,” in Proceedings of the International Conference on Learning Representations (ICLR’19), New Orleans, LA, USA, May 2019.

2. A. Ecoffet, J. Huizinga, J. Lehman, K. O. Stanley, and J. Clune, “First return, then explore,” Nature, vol. 590, no. 7847, pp. 580–586, Feb. 2021.

3. M. Riedmiller et al., “Learning by Playing: Solving Sparse Reward Tasks from Scratch,” in Proceedings of the 35th International Conference on Machine Learning (ICML’18), Stockholm, Sweden, Jul. 2018, pp. 4344–4353.

Space and Terrestrial Autonomous Systems Lab - University of Toronto Institute for Aerospace Studies