Combating False Negatives in Adversarial Imitation Learning
2020·arXiv
Abstract

In adversarial imitation learning, a discriminator is trained to differentiate agent episodes from expert demonstrations representing the desired behavior. However, as the trained policy learns to be more successful, the negative examples (the ones produced by the agent) become increasingly similar to expert ones. Despite the fact that the task is successfully accomplished in some of the agent’s trajectories, the discriminator is trained to output low values for them. We hypothesize that this inconsistent training signal for the discriminator can impede its learning, and consequently leads to worse overall performance of the agent. We show experimental evidence for this hypothesis and that the ’False Negatives’ (i.e. successful agent episodes) significantly hinder adversarial imitation learning, which is the first contribution of this paper. Then, we propose a method to alleviate the impact of false negatives and test it on the BabyAI environment. This method consistently improves sample efficiency over the baselines by at least an order of magnitude.

Progress in Deep Reinforcement Learning is impeded by the necessity of handcrafting reward functions, which may be especially difficult for grounded language tasks (Luketina et al. 2019). To avoid using a reward function to judge the agent’s behavior, Imitation Learning (IL) trains an agent to mimic an expert’s policy using demonstrations. The simplest version of IL, Behavioral Cloning (BC) (Pomerleau 1989), trains a policy to regress expert actions from demonstrations in a supervised setup. This approach is appealing due to its simplicity but suffers from the problem of compounding errors (Ross, Gordon, and Bagnell 2011). Another IL method, Generative Adversarial Imitation Learning (GAIL) (Ho and Ermon 2016), jointly learns a reward function and a policy. GAIL trains a discriminator to differentiate agent from expert trajectories, and the discriminator simultaneously acts as a reward function. Hence, the agent tries to act more and more like the expert in order to fool the discriminator and get a higher reward. While GAIL works well in the initial phase of the learning procedure, we observed that its performance tends to be unstable as the agent reaches high success rates on the given task. We hypothesize that this is because the discriminator has to classify successful agent episodes as negative examples, even though they are very similar to expert demonstrations. This problem is negligible during the initial phase when the agent executes a random policy, but as the policy improves, the number of such successful trajectories labeled as non-expert increases. We refer to this phenomenon as the False Negatives (FN) problem, as successful agent trajectories are falsely labeled as negative examples.

Figure 1: Instructions and initial states for three BabyAI tasks. In (b), the reward for the episode cannot be viewed as a function of the last observation, and hence memory is required for the discriminator in GAIL. Best viewed in color.

Our first contribution is a diagnostic method which shows that the FN problem significantly hinders adversarial imitation learning and measures the effect’s strength. We use the BabyAI platform (Chevalier-Boisvert et al. 2018) as the testbed since it is well-suited to analyse imitation learning sample efficiency. We focus on four levels of increasing difficulty (see Figure 1) and show that a naive application of GAIL is not able to solve any level due to false negatives.

We furthermore propose a method that can be applied for any goal-conditioned task and, by leveraging its multi-goal nature, addresses the FN problem. In particular, we simultaneously train two discriminators which take roles of reward functions as in ordinary GAIL. We use goal-conditioning to ensure that the FN problem does not occur when the first discriminator is trained. The second discriminator is trained to penalize the agent for exploiting inaccuracies of the first one. We show that the proposed technique enables the agent’s performance to approach a 100% success rate, while a naive application of GAIL fails.

We study the problem of training agents to follow natural language instructions grounded in a partially-observable environment using adversarial imitation learning. In particular, we assume that $N$ pairs $(c_i, \tau_i)$ of instructions $c_i$ and their respective trajectories $\tau_i = \big((o^i_1, a^i_1), \ldots, (o^i_{T(i)}, a^i_{T(i)})\big)$ are provided for the agent to learn from. Here, $o^i_t$ and $a^i_t$ are the observation and action at time step $t$, and $T(i)$ is the length of the $i$-th episode.

GAIL (Ho and Ermon 2016) is an imitation learning framework which combines ideas from Inverse Reinforcement Learning (Ng and Russell 2000; Abbeel and Ng 2004a) and Generative Adversarial Networks (GANs) (Goodfellow et al. 2014). In GAIL, the actor network represents the agent’s policy $\pi$ while the discriminator network $D$ serves as a local reward function differentiating between the expert policy $\pi_E$ and the actor using observation-action pairs $(o, a)$. Ho and Ermon (2016) define the optimization objective of GAIL as follows:

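$$\min_{\pi} \max_{D} \; \mathbb{E}_{\pi}\big[\log(1 - D(o, a))\big] + \mathbb{E}_{\pi_E}\big[\log D(o, a)\big] - \lambda H(\pi), \tag{1}$$

where $H$ is the policy entropy term from the maximum-entropy framework of Ziebart et al. (2008), weighted by a coefficient $\lambda \geq 0$, which ensures adequate exploration of the environment. GAIL was originally proposed for fully observable environments, where the observation $o$ in Equation 1 must capture the complete state of the environment.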

Similarly to Fu et al. (2019), we extend GAIL to learn instruction-conditioned policies by conditioning both the generator (policy) $\pi$ and the discriminator $D$ on the instruction $c$. The resulting objective is

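$$\min_{\pi} \max_{D} \; \mathbb{E}_{\pi}\big[\log(1 - D(o, c, a))\big] + \mathbb{E}_{\pi_E}\big[\log D(o, c, a)\big] - \lambda H(\pi), \tag{2}$$

where $\mathbb{E}_{\pi}$ stands for sampling $(o, c, a)$ by acting with the policy $\pi(a \mid o, c)$ conditioned on both the observation $o$ and the instruction $c$, and $\lambda H(\pi)$ is the entropy regularizer of Equation 1.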

2.1 Conditioned Recurrent Discriminator

For partially-observable environments it is necessary to equip the discriminator with a memory to model the true environment reward (see Figure 1(b) for an example). To this end, we implement the discriminator as a recurrent neural network. Its input is not a single observation-action pair $(o_i, a_i)$, as in the original GAIL formulation, but a full trajectory $\tau = ((o_1, a_1), \ldots, (o_T, a_T))$. To address the goal-conditioned nature of the problem, the discriminator is still conditioned on the instruction $c$, as described in the previous section. Since our recurrent discriminator receives only full trajectories as the training signal, its predictions for incomplete sub-trajectories are not reliable, and hence its outputs should not be used to reward the policy at each intermediate step, as Equation 2 would require. For this reason, we reward the policy, which is also implemented as a recurrent network, only at the last step of the episode. The resulting training objective is

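$$\min_{\pi} \max_{D} \; \mathbb{E}_{(c, \tau) \sim \pi}\big[\log(1 - D(c, \tau))\big] + \mathbb{E}_{(c, \tau) \sim \pi_E}\big[\log D(c, \tau)\big] - \lambda H(\pi). \tag{3}$$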
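The sparse reward scheme described above can be illustrated with a minimal Python sketch. The function name `episode_rewards` and its arguments are hypothetical and only show where the single discriminator-based reward is placed within an episode; this is not the authors' implementation.

```python
def episode_rewards(num_steps, final_reward):
    """Per-step rewards for one episode under the full-trajectory formulation.

    Every intermediate step gets zero reward; only the last step receives
    the reward computed from the discriminator's score of the full trajectory.
    """
    if num_steps < 1:
        raise ValueError("an episode needs at least one step")
    return [0.0] * (num_steps - 1) + [final_reward]


# Hypothetical 5-step episode whose full trajectory earned a reward of 2.3.
rewards = episode_rewards(5, 2.3)
```

With this layout the recurrent discriminator is never queried on an incomplete sub-trajectory.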

An alternative way to adapt GAIL to partially observable environments would be to train the discriminator to distinguish not just between complete trajectories but also between incomplete sub-trajectories. The policy could then be rewarded at each step, as is done in the original GAIL. Such a dense reward formulation may be necessary when the discriminator does not have memory, but with memory in place the dense rewards become optional. We have chosen the full trajectory approach to avoid some of the known issues of the dense reward GAIL formulation, such as the survival bias (Kostrikov et al. 2018). In Section 7.1, we present additional experiments in which incomplete sub-trajectories are used to train the discriminator.

To compute a batched version of the discriminator loss we use two finite buffers $B_{\text{agent}}$ and $B_{\text{expert}}$ that contain agent trajectories and expert demonstrations, respectively:

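$$L_D(\theta) = \mathbb{E}_{(c, \tau) \sim B_{\text{agent}}}\big[-\log(1 - D_\theta(c, \tau))\big] + \mathbb{E}_{(c, \tau) \sim B_{\text{expert}}}\big[-\log D_\theta(c, \tau)\big]. \tag{4}$$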
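As a minimal sketch of how the batched loss in Equation 4 could be estimated, the Python snippet below averages the two cross-entropy terms over discriminator outputs. The function name and the example scores are hypothetical; the scores are assumed to be probabilities strictly between 0 and 1.

```python
import math


def discriminator_loss(agent_scores, expert_scores):
    """Monte Carlo estimate of the two-term loss in Equation 4.

    agent_scores:  D(c, tau) for pairs drawn from the agent buffer (negatives).
    expert_scores: D(c, tau) for pairs drawn from the expert buffer (positives).
    """
    agent_term = sum(-math.log(1.0 - d) for d in agent_scores) / len(agent_scores)
    expert_term = sum(-math.log(d) for d in expert_scores) / len(expert_scores)
    return agent_term + expert_term


# A discriminator that outputs 0.5 everywhere pays log(2) for each term.
undecided = discriminator_loss([0.5, 0.5], [0.5, 0.5])
# A discriminator that separates the two buffers has a lower loss.
confident = discriminator_loss([0.1, 0.2], [0.9, 0.8])
```

When the agent and expert buffers become indistinguishable, the undecided case is the best the discriminator can do without overfitting.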

The main premise of GAIL is that discriminator-based rewards will be high for the episodes that manifest expected behaviour. In practice, however, the discriminator is trained to output higher rewards for the expert and lower ones for the agent, which, unfortunately, is not the same.

In the starting phase of the learning procedure, the discriminator can reliably assign a high value to a successful episode. Indeed, such an episode must be an expert demonstration, as the agent always fails. However, the streams of positive and negative examples that the discriminator receives can become very similar as the agent gets better and more likely to succeed. The discriminator can no longer assume that all successful trajectories come from the expert and has to detect idiosyncratic features in expert demonstrations that are not necessarily related to solving a given task. In the extreme case of the perfect agent, the discriminator has to overfit to nuisance features of the expert demonstrations to minimize its loss, because the agent and expert behaviours are indistinguishable. Discriminator-based rewards for successful agent episodes can potentially become as low as for unsuccessful ones.

The culprit of the aforementioned problem is the naive labeling of successful agent episodes as negative examples for the discriminator. We call such examples False Negatives (FN).

3.1 Oracle Filtering

We propose an approach, which we call Oracle Filtering (OF), to diagnose whether false negatives hinder the performance of adversarial imitation learning. In particular, OF uses the environment’s true reward signal to identify the successful trajectories generated by the agent and filter them out. In this setting, the buffer of negative examples contains only unsuccessful trajectories. The discriminator’s loss is given as in Equation 4, but instead of $B_{\text{agent}}$ we use $B^{\text{oracle}}_{\text{agent}}$, the subset of $B_{\text{agent}}$ that contains unsuccessful trajectories only:

$$L^{O}_D(\theta) = \mathbb{E}_{(c, \tau) \sim B^{\text{oracle}}_{\text{agent}}}\big[-\log(1 - D_\theta(c, \tau))\big] + \mathbb{E}_{(c, \tau) \sim B_{\text{expert}}}\big[-\log D_\theta(c, \tau)\big]. \tag{5}$$

Note that when the OF method is applied, the environment rewards are used solely to filter out successful agent trajectories – the agent does not have access to them.
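The filtering step can be sketched in a few lines of Python. The episode record layout (a dict with `instruction`, `trajectory` and `success` keys) is an assumption made for illustration only; the source does not specify how buffers are stored.

```python
def oracle_filter(agent_buffer):
    """Keep only unsuccessful agent episodes as discriminator negatives.

    The "success" flag comes from the environment's true reward and is used
    solely for filtering; it is never passed to the policy.
    """
    return [episode for episode in agent_buffer if not episode["success"]]


# Hypothetical agent buffer with one successful and two failed episodes.
agent_buffer = [
    {"instruction": "go to the red ball", "trajectory": ["left", "forward"], "success": True},
    {"instruction": "go to the blue key", "trajectory": ["forward"], "success": False},
    {"instruction": "go to the red ball", "trajectory": ["right"], "success": False},
]
oracle_agent_buffer = oracle_filter(agent_buffer)
```

The filtered buffer then takes the place of the agent buffer in the loss of Equation 5.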

We found that using OF significantly improves the performance, and reduces sample complexity by at least an order of magnitude. This confirms our hypothesis that false negatives have a detrimental effect on GAIL training.

3.2 Fake Conditioning

Since the Oracle Filtering technique requires access to environment rewards, it is only a diagnostic method suited to assess the impact of false negatives. We propose a technique that does not need environment rewards to address the FN problem. Our technique is aimed at tasks where the policy is goal-conditioned. In our case, we assume that the policy is conditioned on language instructions but the technique is general and can be applied in any multi-goal setup.

First, we maintain a set of possible language instructions $S$, initialized from all the unique instructions in the expert demonstrations and, optionally, updated with the instructions collected when the agent interacts with the environment. Secondly, for each instruction-trajectory pair $(c, \tau)$, we replace its original instruction $c$ with $\tilde{c}$, which is randomly sampled from $S \setminus \{c\}$.

We call this technique Fake Conditioning (FC). It is motivated by the fact that the success of a trajectory depends on the instruction. Therefore, by replacing the instruction of a trajectory with a new one, we produce a new instruction-trajectory pair that is very likely to be unsuccessful, even when the trajectory is successful under the original instruction. The FC technique can be used to prepare discriminator training data with a greatly reduced number of false negatives.
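A minimal Python sketch of the relabeling step is given below. The function `fake_condition`, the pair layout and the example instructions are hypothetical; the only behaviour taken from the text is that each instruction is replaced by a different one sampled uniformly from the instruction set.

```python
import random


def fake_condition(pairs, instruction_set, rng):
    """Replace each pair's instruction c with one sampled from S minus {c}."""
    fake_pairs = []
    for instruction, trajectory in pairs:
        # Sorting makes the sampling reproducible for a fixed seed.
        candidates = sorted(instruction_set - {instruction})
        if not candidates:
            raise ValueError("fake conditioning needs at least two distinct instructions")
        fake_pairs.append((rng.choice(candidates), trajectory))
    return fake_pairs


# Hypothetical demonstrations; trajectories are reduced to action lists.
pairs = [
    ("go to the red ball", ["forward", "done"]),
    ("go to the blue key", ["left", "done"]),
]
S = {c for c, _ in pairs} | {"go to the green box"}
fake_pairs = fake_condition(pairs, S, random.Random(0))
```

Note that a sampled instruction can occasionally still be satisfied by the trajectory, which is why the text says such pairs are only very likely to be unsuccessful.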

Expert FC We call Expert FC a technique whereby the discriminator is trained to distinguish the expert trajectories with the original instructions from fake conditioned expert trajectories. In this case, the agent trajectories are not used at all, and hence the training is no longer adversarial. The discriminator’s loss is given as follows:

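$$L^{EFC}_D(\theta) = \mathbb{E}_{(c, \tau) \sim B_{\text{expert}}}\big[-\log(1 - D_\theta(\tilde{c}, \tau))\big] + \mathbb{E}_{(c, \tau) \sim B_{\text{expert}}}\big[-\log D_\theta(c, \tau)\big], \tag{6}$$

where $(\tilde{c}, \tau)$ is the fake conditioned instruction-trajectory pair obtained from $(c, \tau)$.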

Agent FC The FC technique can also be used with agent trajectories. In this case, which we call Agent FC, all agent trajectories are fake conditioned while expert ones are left unaltered. It means that positive examples stay as in Expert FC formulation but negatives are constructed from the fake conditioned agent trajectories instead of expert ones. In this case, the discriminator’s loss is given as follows:

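$$L^{AFC}_D(\theta) = \mathbb{E}_{(c, \tau) \sim B_{\text{agent}}}\big[-\log(1 - D_\theta(\tilde{c}, \tau))\big] + \mathbb{E}_{(c, \tau) \sim B_{\text{expert}}}\big[-\log D_\theta(c, \tau)\big]. \tag{7}$$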

We note that the policy that generates agent trajectories is always conditioned on the original instruction, for both Expert FC and Agent FC. The fake instruction-trajectory pairs are used to train the discriminator only.

3.3 Auxiliary Rewards

The FC technique greatly reduces the percentage of FN. Unfortunately, the discriminator is no longer trained on agent trajectories with the original instructions, and hence the rewards that the discriminator issues to the policy can potentially be less accurate. These reward inaccuracies could be exploited by the agent.

Another potential problem can occur when the fakeness of an instruction-trajectory pair can be inferred solely from the initial observation (without the agent’s actions). The discriminator could use this to identify negative examples with certainty, since only these are fake conditioned. For example, when the instruction requires reaching an object that is not present in the scene, the discriminator could spot that and assign the agent a low reward regardless of its actions.

To address the aforementioned problems, we rely on auxiliary rewards provided by additional discriminators. We design these trainable auxiliary rewards to discourage degenerate behaviours that may arise from using the FC technique.

In this section we propose two possible ways of training an additional discriminator, which are Blank Conditioning and Done Detector. Together with the previously proposed Agent FC and Expert FC techniques, four different discriminators can be trained. Each discriminator is trained separately to minimize its own loss and provides rewards of the form $-\log(1 - D_\theta(\cdot))$. In practice, a small $\epsilon > 0$ is added to the discriminator’s predictions to make the rewards bounded.
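The reward form can be sketched as follows. The text does not say exactly how the small constant enters the expression, so the placement inside the logarithm, the default value of `eps` and the function name are all assumptions made for this illustration.

```python
import math


def discriminator_reward(d, eps=1e-6):
    """Reward -log(1 - d + eps) for a discriminator output d in [0, 1].

    Assumption: eps is added inside the logarithm, which caps the reward
    at -log(eps) when the discriminator outputs exactly 1.
    """
    return -math.log(1.0 - d + eps)


low = discriminator_reward(0.1)    # trajectory judged unlike the expert
high = discriminator_reward(0.9)   # trajectory judged expert-like
capped = discriminator_reward(1.0) # bounded thanks to eps
```

Without the constant, the reward would diverge as the discriminator's output approaches 1.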

Blank Conditioning When Expert FC is used, the discriminator’s training distribution includes only expert trajectories. As a result, its predictions on agent-generated trajectories may be less accurate, which can be exploited by the policy. To address this problem, we propose to train an extra discriminator which distinguishes between agent and expert trajectories but is not conditioned on the instruction. This means that the agent is additionally rewarded for generating trajectories which resemble the expert’s (for some goal or instruction, not necessarily the given one). To get the highest rewards, its trajectories have to match the distribution of expert trajectories on which the main discriminator is trained. The auxiliary reward for Blank Conditioning is based on the discriminator trained with the following loss:

$$L^{B}_D(\theta) = \mathbb{E}_{(c, \tau) \sim B_{\text{agent}}}\big[-\log(1 - D_\theta(c_\emptyset, \tau))\big] + \mathbb{E}_{(c, \tau) \sim B_{\text{expert}}}\big[-\log D_\theta(c_\emptyset, \tau)\big], \tag{8}$$

where $c_\emptyset$ is a fixed blank instruction that masks the original ones. We call this technique Blank Conditioning.

Table 1: Comparison of positives and negatives used for different types of discriminators.

Unfortunately, Blank Conditioning can be impeded by the FN problem as well. However, we hypothesize that the repercussions will be less severe because only an auxiliary reward is affected. In particular, even when the agent masters the task and generates a lot of FN in the auxiliary discriminator’s training data, the main reward will still be informative since the main discriminator uses training data without FN.

Done Detector The second auxiliary discriminator is trained to detect if a given trajectory is finished. Its negative examples are unfinished expert demonstrations (with the original instruction), while the finished ones are positive examples. In both positive and negative examples the original instruction is used. Hence, the loss forces the discriminator to focus on both the trajectory and the instruction to understand if the goal is reached. The loss is the following:

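$$L^{D}_D(\theta) = \mathbb{E}_{(c, \tau) \sim B^{\text{sub}}_{\text{expert}}}\big[-\log(1 - D_\theta(c, \tau))\big] + \mathbb{E}_{(c, \tau) \sim B_{\text{expert}}}\big[-\log D_\theta(c, \tau)\big], \tag{9}$$

where $B^{\text{sub}}_{\text{expert}}$ is built from $B_{\text{expert}}$ and contains all possible incomplete sub-trajectories (i.e. each element of $B^{\text{sub}}_{\text{expert}}$ is a trajectory from $B_{\text{expert}}$ that is cut before it has finished). We call this discriminator Done Detector.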

We note that the loss for Done Detector does not depend on the agent’s performance, and hence the extra discriminator is not trained adversarially. There are also no false negatives since all negative examples are constructed in a controlled way and are never successful.
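One way the buffer of incomplete expert sub-trajectories could be built is sketched below in Python. It assumes that a sub-trajectory is a non-empty proper prefix of a demonstration, which matches the description of trajectories cut before they have finished; the function name and data layout are hypothetical.

```python
def incomplete_subtrajectories(expert_buffer):
    """Return every non-empty proper prefix of each expert trajectory.

    Each prefix keeps the original instruction, so the Done Detector must
    look at both the instruction and the trajectory to spot completion.
    """
    sub_buffer = []
    for instruction, trajectory in expert_buffer:
        for cut in range(1, len(trajectory)):
            sub_buffer.append((instruction, trajectory[:cut]))
    return sub_buffer


# Hypothetical demonstration made of (observation, action) pairs.
demo = [("o1", "left"), ("o2", "forward"), ("o3", "done")]
expert_buffer = [("go to the red ball", demo)]
sub_expert_buffer = incomplete_subtrajectories(expert_buffer)
```

Because every negative is a strict prefix of a successful demonstration, none of the negatives built this way is a complete demonstration.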

Blank Conditioning is supposed to address the main limitation of Expert FC (lack of agent trajectories in discriminator training data), while Done Detector is for Agent FC (fake instructions with agent trajectories may be too easy to identify). However, both auxiliary rewards can be used with any FC method. It means that we can train all four discriminators simultaneously (one for each loss introduced in this subsection) to get four different rewards which can be mixed together to train the agent.

In Table 1, we provide the overview of positives and negatives used to train discriminators presented in this work. OF is skipped as it is only a diagnostic tool.

Instruction Following with Natural Language We consider the problem of following natural language instructions. Recently, reinforcement learning methods have been applied to make progress in this area across a variety of environments (Chaplot et al. 2017; Hermann et al. 2017; Mirowski et al. 2018). In these approaches, an agent is rewarded when it successfully follows an instruction. However, designing the respective reward function is non-trivial (Luketina et al. 2019). Formalizing the completion semantics, implicit in natural language instruction, has been an open research problem since early efforts (Winograd 1972). Human handcrafting of the reward function becomes increasingly infeasible as the number of instructions and complexity of the environment scale. Applying IL methods in this area is therefore promising (Duvallet, Kollar, and Stentz 2013).

Imitation Learning As mentioned before, IL alleviates the need to handcraft reward functions by learning policies from demonstrations. In this work we consider Behavioral Cloning (BC) (Pomerleau 1989) as a baseline, and our method improves Generative Adversarial Imitation Learning (GAIL) (Ho and Ermon 2016), which is closely related to Inverse Reinforcement Learning (Abbeel and Ng 2004b). While all imitation learning methods can be used for learning instruction-conditioned policies (Mei, Bansal, and Walter 2016; Fu et al. 2019; Bahdanau et al. 2019), prior literature only features instructions that can be verified by the final observation and does not discuss the topic of converging to near-perfect performance (with the exception of Bahdanau et al. (2019); see more on that below).

False Negatives We found that the False Negative problem significantly impedes adversarial imitation learning. The issue was also found problematic in prior work (Zolna et al. 2019). However, prior work lacked a diagnostic tool to assess the impact of the FN problem and estimate how much performance can be improved if the problem is addressed. Zolna et al. (2019) proposed a heuristic method to address the FN problem called Actor Early Stopping. This method terminates well-performing episodes to limit the number of successful agent states used to train the discriminator. AGILE, proposed by Bahdanau et al. (2019), is another heuristic solution which deals with False Negatives in the very similar setup of training a reward model. The method filters out the states that were most highly rewarded by the discriminator so that it is not trained on them. Finally, Xu and Denil (2019) reformulate discriminator training as Positive-Unlabeled (PU) learning, which can be seen as a way of combating False Negatives.

5.1 The BabyAI Environment

BabyAI is a deterministic, partially-observable 2D gridworld based on MiniGrid. The natural-looking language instruction is supplied as a string. At each time step, the agent receives a visual input of the $7 \times 7$ cells in front of it. Precise details of the visual input are described in (Chevalier-Boisvert et al. 2018). We report performances on the following four single-room levels.

GoToRedBall is the simplest level among the four and can be solved purely from visual inputs. The agent is given the instruction "go to the/a red ball" in the presence of distracting objects of other colors and shapes.

GoToLocal extends GoToRedBall. An agent is tasked with instructions of the form go to the <color> <object>, where <color> and <object> are no longer limited to red and ball. As a consequence, this level can no longer be solved by using just visual inputs – a given instruction has to be parsed and understood.

PickupLoc and PutNextLocal are harder tasks which additionally require equipping the reward model with memory. In PickupLoc the instructions refer to objects not just by their type and color but also by their location relative to the initial position of the agent, e.g. "go to the red ball in front of you". In the case of PutNextLocal, the instruction requires putting an object next to another object, each described with a type and color, e.g. "put the blue ball next to the blue key". Since putting object1 next to object2 is not the same as putting object2 next to object1, the reward model needs to remember the trajectory in order to give a correct reward at the end.

5.2 Episode Termination

In the BabyAI platform the episode is terminated when the task is solved, or after a maximum allowed number of steps is reached, which is a standard practice for similar RL setups. However, in the GAIL setup, no environment reward is available. The naive approach would be to always run every episode for the maximum number of steps.

We let the agent perform a special Done action. At the same time, all expert demonstrations are padded with this action as the last one (i.e. just before termination of the episode). It makes the agent perform the Done action when it considers the task to be done and it can be used to terminate the episode during training. It turns out that it significantly speeds up the training procedure as it collects richer data (more episodes for the same number of frames experienced by the agent). Done termination is used for all methods, including baselines.
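The two ingredients of Done termination can be sketched in Python as follows. The action name `"done"`, both function names and the representation of a trajectory as a plain list of actions are assumptions made for illustration.

```python
DONE = "done"  # hypothetical name for the special Done action


def pad_with_done(expert_actions):
    """Append the Done action so that every demonstration ends with it."""
    return list(expert_actions) + [DONE]


def rollout(policy_actions, max_steps):
    """Consume actions until Done is chosen or the step limit is reached."""
    taken = []
    for action in policy_actions:
        if len(taken) >= max_steps:
            break
        taken.append(action)
        if action == DONE:
            break  # the policy itself decided that the task is finished
    return taken


padded_demo = pad_with_done(["left", "forward"])
early_stop = rollout(["left", DONE, "forward"], max_steps=10)
time_limit = rollout(["left"] * 5, max_steps=3)
```

Episodes that end early leave more of the frame budget for additional episodes, which is the source of the richer data mentioned above.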

The approach is related to actor early stopping introduced by Zolna et al. (2019). The difference is that we terminate episodes earlier based on the policy’s predictions while the cited method terminates based on the discriminator’s predictions.


Figure 2: Model architectures. Best viewed in color.

5.3 Architecture and Training

In our experiments, as mentioned in Section 2.1, the agent is provided with a non-zero reward only when it finishes an episode by performing a Done action described in the previous section. This makes rewards sparse and similar to the original rewards used in BabyAI platform to train RL algorithms. This choice allows us to use the agent architecture and the hyperparameters from the original BabyAI paper (Chevalier-Boisvert et al. 2018) without extra tuning.

Actor-critic The model underlying the RL agent is presented in Figure 2(a). It consists of standard components to predict the next action based on the current observation, the memory of the past observations and the instruction. It uses a unidirectional GRU to encode the instruction and a convolutional network with two batch-normalized FiLM layers to jointly process the observation and the instruction. An LSTM memory of 128 units is used to integrate the representations produced by the FiLM module at each step.

Discriminator The discriminator architecture (see Figure 2(b)) is similar to the actor-critic model for symmetry and efficiency reasons. We just replace the final actor and critic layers with a 3-layer MLP, placed after the FiLM block. The LSTM here takes as input not only the FiLM embedding but also a one-hot action vector (both concatenated).

Training In all our experiments, we use the same hyperparameters for actor training as used for RL training in (Chevalier-Boisvert et al. 2018). For both models, we used the Adam optimizer with the hyperparameters $\alpha = 10^{-4}$, $\beta_1 = 0.9$, $\beta_2 = 0.999$ and $\epsilon = 10^{-5}$. We used the Proximal Policy Optimization algorithm with parallelized data collection: we performed 4 epochs of PPO using 64 rollouts of length 40 collected with multiple processes. We truncated the backpropagation through time at 20 steps for both actor-critic and discriminator.

Table 2: Success rate for different algorithms. For each task we report performance for three expert demonstration sets of different sizes (hence three columns are allocated for each task). The largest set is the minimum necessary demonstration set needed to solve the task using BC (the column headed with 1). The other two are 8 times and 64 times smaller and headed with 1/8 and 1/64, respectively. A task is considered solved if the agent achieves more than 99% success rate (bold values). In our method we combine the Agent FC and Done Detector techniques.

During learning the agent is evaluated every 100 updates on 500 random episodes. The learning procedure is terminated after 10 successful evaluations (99% success rate) or after 48 hours. We run each configuration considered in the paper with 3 different seeds, and report the average success rate. The variances between seeds are almost always below 0.2% for solved tasks.

5.4 Behavioral Cloning Performance

We use BC results as the baseline for demonstration efficiency. First, for each task we tested how many demonstrations are needed to solve the task (achieve more than 99% success rate) using BC. Our implementation of BC needs 125, 250, and 354 (in thousands) demonstrations for GoToLocal, PutNextLocal, and PickupLoc, respectively. These results are very close to the values reported in the original BabyAI paper (Chevalier-Boisvert et al. 2018), which is expected as the architectures and hyperparameters are the same.

Since BC is the most basic IL method, the largest demonstration sets considered in our experiments have the same number of demonstrations that was required to solve the tasks using BC. We also consider two subsets for each task, roughly one and two orders of magnitude smaller (precisely, 8 and 64 times smaller). These serve as a testbed for more efficient methods.
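As a worked example of the resulting set sizes, the short Python snippet below divides the reported BC requirements by 8 and 64. The source gives the sizes only in thousands and does not say how fractional sizes are handled, so rounding down with integer division is an assumption.

```python
# Demonstrations needed by BC, as reported in the text (in units, not thousands).
bc_demos = {"GoToLocal": 125_000, "PutNextLocal": 250_000, "PickupLoc": 354_000}

# Full set, 8 times smaller, 64 times smaller (rounded down; an assumption).
subset_sizes = {task: (n, n // 8, n // 64) for task, n in bc_demos.items()}

for task, sizes in subset_sizes.items():
    print(task, sizes)
```

Under this assumption the smallest sets contain only a few thousand demonstrations per task.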

5.5 Auxiliary Reward Mixing

In Section 3 four different losses that can be used to train the discriminators were introduced. Instead of training four separate models, we train just one neural network with multiple heads, each trained with one loss. We average all rewards used with equal weights to obtain the final reward used to train the agent.
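The equal-weight mixing can be written as a plain average over the heads that are switched on. The function name, the head names and the reward values below are hypothetical and serve only to illustrate the arithmetic.

```python
def mixed_reward(head_rewards):
    """Average the rewards of the active discriminator heads with equal weights."""
    if not head_rewards:
        raise ValueError("at least one discriminator head is required")
    return sum(head_rewards.values()) / len(head_rewards)


# Hypothetical per-head rewards for one finished episode.
final_reward = mixed_reward({"agent_fc": 2.0, "done_detector": 1.0})
```

Using a dictionary keyed by head name makes it easy to switch individual heads on or off when comparing combinations.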

6.1 Baseline GAIL and Oracle Filtering

Our first contribution is to diagnose whether false negatives impede adversarial imitation learning and how strong the effect is. To do that, we first run a naive memory-equipped adaptation of GAIL, or Baseline GAIL, and demonstrate that it does not converge to the required performance level (see Table 2). Even when the full set of demonstrations is used, which is enough for BC to solve the task, Baseline GAIL does not reliably achieve 99% success rate.

Then, we use the OF technique to diagnose the impact of the FN problem. The results are also presented in Table 2. OF clearly improves the performance, and all levels are solved using an order of magnitude (8 times) fewer expert demonstrations than BC. For the GoToLocal task even 64 times fewer demonstrations are enough. We reiterate that this significant improvement is achieved by only filtering out successful agent trajectories, which provides experimental evidence that the FN problem may be a major limiting factor for GAIL performance.

6.2 Fake Conditioning

Once the negative effect of the FN problem is confirmed, we test the FC technique, the solution proposed in Section 3.

In Table 2, the results for Agent FC along with Done Detector are referred to as our method. The method allows us to solve all tasks with an order of magnitude fewer demonstrations than BC. Even when a 64 times smaller expert dataset is used, the resulting agent achieves over 96% success rate on all tasks. The performance on PutNextLocal is better than that of the OF GAIL method despite the fact that the latter needs the environment rewards.

We experimentally found that simultaneous use of Agent FC along with Done Detector works the best for the considered task suite and we will refer to this particular combination as our method in the rest of the paper. However, other combinations built with the use of FC methods also prove to be very effective and significantly outperform Baseline GAIL. The comparisons between them and detailed results are presented in Section 7.3.

6.3 FN Problem in Single Instruction Case

We have so far focused on multi-goal tasks for which the FC technique is well suited. However, the FN problem is not specific to such tasks and can potentially hinder training when a single-goal (unconditioned) task is considered. In this section, we show that the FN problem is indeed a general problem and we propose a way to apply the FC technique in the single-goal case.

FC cannot be naively applied for unconditioned tasks because they have one fixed goal (instruction), and hence no fake instructions can be generated. However, for a given single-goal task, we can build a more complex multi-goal task where only one of the possible goals is the original task. Then, the FC technique can be applied to solve the thereby constructed task. Once a well-performing agent is obtained for the multi-goal task, it can also solve the original task as it is one among all goals the agent can be conditioned on.

The GoToRedBall task is a simplified version of the GoToLocal task in which the agent is always given the same instruction (see Section 5.1). We test the idea described in the previous paragraph: we train our agent on the GoToLocal task using the Agent FC method (with Done Detector) and test the trained agent on GoToRedBall.

Baseline GAIL and OF GAIL agents are trained using demonstration sets with only GoToRedBall instructions. For a fair comparison, the same number of demonstrations are used to train agents on GoToLocal and GoToRedBall. We note that even though the total numbers of demonstrations are the same, GoToLocal agent is trained with many fewer demonstrations with go to red ball as the instruction. The results are presented in Table 3.

Table 3: The results show success rate on GoToRedBall. We report performance for three expert demonstrations sets of different sizes (in three columns). The largest one is the minimum necessary demonstration set needed to solve GoToLocal tasks using BC. Our method stands for GAIL enhanced with Agent FC and Done Detector and trained using GoToLocal demonstrations.


OF GAIL clearly outperforms Baseline GAIL, which confirms our hypothesis that false negatives can have a negative impact on the training also in the single-goal case.

The agent trained with the FC technique to solve GoToLocal achieves around 99% success rate when evaluated on GoToRedBall, which is similar to the OF GAIL agent and significantly better than the Baseline GAIL agent. On top of that, the FC agent can also solve all the remaining GoToLocal instructions, which the baseline methods cannot. The performance difference on GoToRedBall between our method and Baseline GAIL is larger for smaller expert data sizes.

7.1 Sub-Trajectories

As described in Section 2.1, the agent is rewarded only at the very end of an episode, i.e. once the full trajectory has been provided, and the discriminator is trained using full trajectories only. However, one can argue that a straightforward application of GAIL is to train the discriminator using incomplete sub-trajectories as well as complete ones. In that case, the discriminator loss is the following:

image

where $B^{sub}_{agent}$ and $B^{sub}_{expert}$ consist of incomplete sub-trajectories for the agent and expert, respectively. Note that the elements of $B^{sub}_{expert}$ are used here as positive examples, in contrast to how Done Detector is trained.
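For concreteness, the loss can be sketched as follows. This is only a plausible form, assuming that Equation 4 is the usual binary cross-entropy GAIL objective over full-trajectory batches; the symbols $B_{agent}$, $B_{expert}$ and $D$ (the discriminator), as well as the equal weighting of full and incomplete trajectories, are assumptions rather than the original notation.

```latex
\mathcal{L}_D =
  - \mathbb{E}_{\tau \in B_{expert} \cup B^{sub}_{expert}} \left[ \log D(\tau) \right]
  - \mathbb{E}_{\tau \in B_{agent} \cup B^{sub}_{agent}} \left[ \log \left( 1 - D(\tau) \right) \right]
```

Under this reading, expert sub-trajectories contribute to the positive term exactly like full expert trajectories.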

We conducted experiments analysing the effect of using incomplete sub-trajectories. We trained Baseline GAIL discriminators in two ways: with sub-trajectories, as in Equation 10, and using only full trajectories, as in Equation 4. The results are presented in Table 4.

Table 4: Success rate for Baseline GAIL using sub-trajectories or only full trajectories. For each task, we used the minimum demonstration set needed to solve the task using BC.

Even though both methods fail to solve the tasks (neither FC nor any auxiliary rewards are used in these experiments), it is clear that using incomplete sub-trajectories has a detrimental effect. We hypothesize that this is because short trajectories in $B^{sub}_{expert}$ are hard to discriminate from unsuccessful agent trajectories. Hence, the elements of $B^{sub}_{expert}$ should be treated as negatives (as in Done Detector), not positives.
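The labelling convention argued for above can be illustrated with a minimal sketch. It assumes that incomplete sub-trajectories are proper prefixes of expert trajectories and that full expert trajectories serve as the positives; the function names and the representation of a trajectory as a plain list are hypothetical and are not taken from the paper's implementation.

```python
from typing import List, Sequence, Tuple

Trajectory = Sequence[int]  # hypothetical: a trajectory as a sequence of step ids


def proper_prefixes(trajectory: Trajectory) -> List[Trajectory]:
    """Return all incomplete sub-trajectories (non-empty proper prefixes)."""
    return [trajectory[:t] for t in range(1, len(trajectory))]


def done_detector_examples(
    expert_trajectories: Sequence[Trajectory],
) -> List[Tuple[Trajectory, int]]:
    """Label full expert trajectories 1 and their incomplete prefixes 0."""
    examples: List[Tuple[Trajectory, int]] = []
    for trajectory in expert_trajectories:
        examples.append((trajectory, 1))  # completed episode: positive
        examples.extend((prefix, 0) for prefix in proper_prefixes(trajectory))
    return examples


examples = done_detector_examples([[1, 2, 3]])
print(examples)  # [([1, 2, 3], 1), ([1], 0), ([1, 2], 0)]
```

Under Equation 10, the same prefixes would instead carry label 1, which is the setting that Table 4 shows to be harmful.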

7.2 Done Termination

As mentioned in Section 5.2, we terminate the episode when the special Done action is performed by the agent. This technical detail turned out to be critical for achieving good performance. It also significantly speeds up the training procedure because richer data is collected: more episodes for the same number of frames experienced by the agent. The results showing our method's performance with and without Done termination are presented in Table 5.
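The effect on data collection can be seen in a toy rollout loop. This is a sketch under simplifying assumptions: there is no real environment, the policy is a fixed action cycle, and the names `DONE`, `collect_episodes` and `toy_policy` are hypothetical.

```python
from itertools import cycle
from typing import Callable, List

DONE = "done"  # hypothetical name for the special Done action


def collect_episodes(
    policy: Callable[[], str],
    frame_budget: int,
    max_steps: int,
    terminate_on_done: bool,
) -> List[List[str]]:
    """Roll out `policy` for `frame_budget` frames and split them into episodes."""
    episodes: List[List[str]] = []
    current: List[str] = []
    for _ in range(frame_budget):
        action = policy()
        current.append(action)
        timed_out = len(current) >= max_steps
        if timed_out or (terminate_on_done and action == DONE):
            episodes.append(current)
            current = []
    return episodes  # an unfinished trailing episode is dropped


def toy_policy() -> Callable[[], str]:
    """Toy policy: two moves, then Done, repeated forever."""
    actions = cycle(["forward", "forward", DONE])
    return lambda: next(actions)


with_done = collect_episodes(toy_policy(), frame_budget=12, max_steps=6, terminate_on_done=True)
without_done = collect_episodes(toy_policy(), frame_budget=12, max_steps=6, terminate_on_done=False)
print(len(with_done), len(without_done))  # 4 2
```

With the same budget of 12 frames, terminating on Done yields twice as many episodes in this toy setting, and therefore more end-of-episode rewards for the agent.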

Table 5: Drop in success rate when Done termination is not used. Roughly an order of magnitude (8 times) fewer demonstrations than needed for BC are used, and Agent FC and Done Detector are applied. Similar results are achieved for other demonstration sizes and methods.

When Done termination is not used, the success rate drops significantly for all tasks, falling to around 15% for the PutNextLocal task.

Table 6: Success rate for different models. For each task we report performance for three expert demonstration sets of different sizes (hence three columns per task). The largest is the minimum demonstration set needed to solve the task using BC. Two smaller subsets are also tested (8 times and 64 times smaller). A task is considered solved if the agent achieves a success rate above 99% (bold values).
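The evaluation protocol in the caption can be written down as a small sketch. The helper names, the example size of 6400 demonstrations, and the smallest-to-largest ordering are hypothetical; only the factors 8 and 64 and the 99% threshold come from the caption.

```python
from typing import List


def demonstration_subset_sizes(bc_minimum: int) -> List[int]:
    """Sizes of the three demonstration sets, ordered smallest to largest (assumed order)."""
    return [bc_minimum // 64, bc_minimum // 8, bc_minimum]


def is_solved(success_rate: float, threshold: float = 0.99) -> bool:
    """A task counts as solved only if the success rate is strictly above 99%."""
    return success_rate > threshold


print(demonstration_subset_sizes(6400))  # [100, 800, 6400]
print(is_solved(0.995), is_solved(0.99))  # True False
```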

7.3 Auxiliary Rewards

In this section we present results comparing different approaches using FC. Specifically, we consider three models: Agent FC, Expert FC, and Agent FC + Expert FC. Each of these models can be run in four variants: without auxiliary rewards, with Done Detector, with Blank Conditioning, or with both. The results are presented in Table 6. We label the rows alphabetically (from (a) to (m)) for ease of reference.

We first focus on auxiliary rewards. Applying Done Detector always leads to a better-performing agent (compare the variants without auxiliary rewards, (b), (f) and (j), with their improved versions (c), (g) and (k), respectively). On the other hand, Blank Conditioning helps in some cases and hurts in others. As mentioned before, Blank Conditioning can suffer from the FN problem, and we hypothesise that this is the main reason why Blank Conditioning is worse than Done Detector. Done Detector is so beneficial that additionally applying Blank Conditioning never yields a significant further improvement. Hence, in the rest of this subsection we focus mainly on the variants with Done Detector and without Blank Conditioning (the best-performing ones), i.e. (c), (g) and (k).
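The row labelling of Table 6 can be reproduced with a short sketch. The text fixes (a) as Baseline GAIL, (b), (f), (j) as the variants without auxiliary rewards and (c), (g), (k) as the Done Detector variants; the relative order of the two remaining variants within each model is an assumption, as are the exact strings used below.

```python
from itertools import product
from string import ascii_lowercase

MODELS = ["Agent FC", "Expert FC", "Agent FC + Expert FC"]
# The order of the last two variants is assumed, not stated in the text.
VARIANTS = [
    "no auxiliary reward",
    "Done Detector",
    "Blank Conditioning",
    "Done Detector + Blank Conditioning",
]

# Baseline first, then every model with each of its four variants.
rows = ["Baseline GAIL"] + [f"{model}, {variant}" for model, variant in product(MODELS, VARIANTS)]
labels = dict(zip(ascii_lowercase, rows))

for letter in "acgk":
    print(f"({letter}) {labels[letter]}")
```

This gives 13 rows, (a) to (m): one baseline plus three models with four variants each.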

The best variant of Agent FC (c) tends to be better than Expert FC (g), especially when larger demonstration sets are used. The difference is more pronounced when variants without Done Detector are considered. Using both Agent FC and Expert FC at the same time does not seem to provide any further benefits.

All FC-based methods with Done Detector perform clearly better than Baseline GAIL (a). Among the tested methods, Agent FC with the Done Detector auxiliary reward (c) achieves the highest performance in most cases. Agent FC without any auxiliary rewards also performs very well. On the other hand, Expert FC needs Done Detector to achieve good results. When Done Detector is not used, the discriminator, which is trained on expert trajectories only, seems to get exploited by the agent, and the final performance is sometimes even worse than that of naive GAIL. However, the Expert FC variants with Done Detector are the only ones that solve GoToLocal with 64 times fewer demonstrations.

We note that the Expert FC and Done Detector discriminators do not depend on agent performance. Hence, they can be pretrained before the agent's learning procedure starts. This means that additional improvements known from supervised learning can easily be added to the aforementioned pretraining. For example, early stopping on a validation set can be very beneficial for preventing over-fitting to the fixed and limited expert demonstrations when the Expert FC discriminator is trained. We leave that for future work.
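Validation early stopping for such a pretraining phase could look like the following sketch. It is a generic patience-based loop, not the authors' procedure (which is left for future work); `train_epoch` and `validation_loss` are hypothetical callables standing in for one pass over the expert training split and an evaluation on a held-out expert split.

```python
from typing import Callable, Tuple


def pretrain_with_early_stopping(
    train_epoch: Callable[[], None],
    validation_loss: Callable[[], float],
    max_epochs: int = 100,
    patience: int = 5,
) -> Tuple[int, float]:
    """Train until the validation loss has not improved for `patience` epochs.

    Returns the index of the best epoch and its validation loss. A real
    implementation would also checkpoint the discriminator at the best epoch.
    """
    best_loss = float("inf")
    best_epoch = -1
    for epoch in range(max_epochs):
        train_epoch()
        loss = validation_loss()
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            break  # no improvement for `patience` consecutive epochs
    return best_epoch, best_loss


# Toy usage with a scripted validation curve that starts to over-fit after epoch 2.
curve = iter([0.9, 0.7, 0.6, 0.65, 0.7, 0.8, 0.9])
result = pretrain_with_early_stopping(lambda: None, lambda: next(curve), patience=2)
print(result)  # (2, 0.6)
```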

The take-away message from this section is that FC methods generally outperform Baseline GAIL, which provides experimental evidence that addressing the problem of false negatives is important for obtaining well-performing agents. The choice of a particular FC method presented in the paper is of lesser importance; however, using Agent FC with an auxiliary reward based on Done Detector seems to be the best choice.

We show that the problem of false negatives can significantly hinder the performance of adversarial imitation learning. We contribute an extensive analysis of the phenomenon and a diagnostic tool, Oracle Filtering, to measure its impact. The tool is fully general and can be applied to any task.

We propose Fake Conditioning, a method to overcome the problem, and we show that it significantly improves over baselines for multi-goal tasks. We also present a way to apply the method in the single-goal case.

Finally, we show that auxiliary rewards obtained with extra discriminators can further improve the agent's performance.

We thank Louis Maestrati, Charles Guille-Escuret, Baptiste Goujaud, Anirudh Srinivasan, David Venuto, Junhao Wang, Christopher Beckham, and Anne-Flore Baron for useful discussions. Konrad Żołna is supported by the National Science Center, Poland (2017/27/N/ST6/00828, 2018/28/T/ST6/00211). This research was mostly performed at Mila with funding by the Government of Quebec and CIFAR, and enabled by Compute Canada (www.computecanada.ca).

[2004a] Abbeel, P., and Ng, A. Y. 2004a. Apprenticeship learning via inverse reinforcement learning. In Proceedings of the Twenty-first International Conference on Machine Learning. ACM Press.

[2004b] Abbeel, P., and Ng, A. Y. 2004b. Apprenticeship learning via inverse reinforcement learning. In Proceedings of the twenty-first international conference on Machine learning, 1. ACM.

[2019] Bahdanau, D.; Hill, F.; Leike, J.; Hughes, E.; Hosseini, A.; Kohli, P.; and Grefenstette, E. 2019. Learning to Understand Goal Specifications by Modelling Reward. In International Conference on Learning Representations, ICLR 2019.

[2017] Chaplot, D. S.; Sathyendra, K. M.; Pasumarthi, R. K.; Rajagopal, D.; and Salakhutdinov, R. 2017. Gated-attention architectures for task-oriented language grounding. CoRR abs/1706.07230.

[2018] Chevalier-Boisvert, M.; Bahdanau, D.; Lahlou, S.; Willems, L.; Saharia, C.; Nguyen, T. H.; and Bengio, Y. 2018. BabyAI: A platform to study the sample efficiency of grounded language learning. In ICLR.

[2013] Duvallet, F.; Kollar, T.; and Stentz, A. 2013. Imitation learning for natural language direction following through unknown environments. In 2013 IEEE International Conference on Robotics and Automation, 1047–1053. IEEE.

[2019] Fu, J.; Korattikara, A.; Levine, S.; and Guadarrama, S. 2019. From Language to Goals: Inverse Reinforcement Learning for Vision-Based Instruction Following. arXiv preprint arXiv:1902.07742.

[2014] Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; and Bengio, Y. 2014. Generative adversarial nets. In Ghahramani, Z.; Welling, M.; Cortes, C.; Lawrence, N. D.; and Weinberger, K. Q., eds., Advances in Neural Information Processing Systems 27. Curran Associates, Inc. 2672–2680.

[2017] Hermann, K. M.; Hill, F.; Green, S.; Wang, F.; Faulkner, R.; Soyer, H.; Szepesvari, D.; Czarnecki, W. M.; Jaderberg, M.; Teplyashin, D.; Wainwright, M.; Apps, C.; Hassabis, D.; and Blunsom, P. 2017. Grounded language learning in a simulated 3d world. CoRR abs/1706.06551.

[2016] Ho, J., and Ermon, S. 2016. Generative adversarial imitation learning. In NeurIPS.

[2018] Kostrikov, I.; Agrawal, K. K.; Dwibedi, D.; Levine, S.; and Tompson, J. 2018. Discriminator-actor-critic: Addressing sample inefficiency and reward bias in adversarial imitation learning. arXiv preprint arXiv:1809.02925.

[2019] Luketina, J.; Nardelli, N.; Farquhar, G.; Foerster, J. N.; Andreas, J.; Grefenstette, E.; Whiteson, S.; and Rocktäschel, T. 2019. A survey of reinforcement learning informed by natural language. arXiv.

[2016] Mei, H.; Bansal, M.; and Walter, M. R. 2016. Listen, Attend, and Walk: Neural Mapping of Navigational Instructions to Action Sequences. In Proceedings of the 2016 AAAI Conference on Artificial Intelligence.

[2018] Mirowski, P.; Grimes, M. K.; Malinowski, M.; Hermann, K. M.; Anderson, K.; Teplyashin, D.; Simonyan, K.; Kavukcuoglu, K.; Zisserman, A.; and Hadsell, R. 2018. Learning to navigate in cities without a map. CoRR abs/1804.00168.

[2000] Ng, A. Y., and Russell, S. J. 2000. Algorithms for inverse reinforcement learning. In Proceedings of the Seventeenth International Conference on Machine Learning, ICML ’00, 663–670. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc.

[1989] Pomerleau, D. A. 1989. Alvinn: An autonomous land vehicle in a neural network. In NIPS.

[2011] Ross, S.; Gordon, G.; and Bagnell, D. 2011. A reduction of imitation learning and structured prediction to no-regret online learning. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, 627–635.

[1972] Winograd, T. 1972. Understanding Natural Language. Orlando, FL, USA: Academic Press, Inc.

[2019] Xu, D., and Denil, M. 2019. Positive-unlabeled reward learning. arXiv preprint arXiv:1911.00459.

[2008] Ziebart, B. D.; Maas, A.; Bagnell, J. A.; and Dey, A. K. 2008. Maximum entropy inverse reinforcement learning. In AAAI.

[2019] Zolna, K.; Reed, S.; Novikov, A.; Colmenarejo, S. G.; Budden, D.; Cabi, S.; Denil, M.; de Freitas, N.; and Wang, Z. 2019. Task-relevant adversarial imitation learning.

