Safe Exploration of State and Action Spaces in Reinforcement Learning

2014·arXiv

Abstract

1. Introduction

Reinforcement learning (RL) (Sutton & Barto, 1998) is a type of machine learning whose main goal is that of finding a policy that moves an agent optimally in an environment, generally formulated as a Markov Decision Process (MDP). Many RL methods are being used in important and complex tasks (e.g., robot control see Smart & Kaelbling, 2002; Hester, Quinlan, & Stone, 2011, stochastic games see Mannor, 2004; Konen & Bartz-Beielstein, 2009 and control optimization of complex dynamical systems see Salkham, Cunningham, Garg, & Cahill, 2008). While most RL tasks are focused on maximizing a long-term cumulative reward, RL researchers are paying increasing attention not only to long-term reward maximization, but also to the safety of approaches to Sequential Decision Problems (SDPs) (Mihatsch & Neuneier, 2002; Hans, Schneegass, Sch¨afer, & Udluft, 2008; Mart´ın H. & Lope, 2009; Koppejan & Whiteson, 2011). Well-written reviews of these matters can also be found (Geibel & Wysotzki, 2005; Defourny, Ernst, & Wehenkel, 2008). Nevertheless, while it is important to ensure reasonable system performance and consider the safety of the agent (e.g., avoiding collisions, crashes, etc.) in the application of RL to dangerous tasks, most exploration techniques in RL offer no guarantees on both issues. Thus, when using RL techniques in dangerous control tasks, an important question arises; namely, how can we ensure that the exploration of the state-action space will not cause damage or injury while, at the same time, learning (near-)optimal policies? The matter, in other words, is one of ensuring that the agent be able to explore a dangerous environment both safely and efficiently. There are many domains where the exploration/exploitation process may lead to catastrophic states or actions for the learning agent (Geibel & Wysotzki, 2005). The helicopter hovering control task is one such case involving high risk, since some policies can crash the helicopter, incurring catastrophic negative reward. Exploration/exploitation strategies such as may even result in constant helicopter crashes (especially where there is a high probability of random action selection). Another example can be found in portfolio theory where analysts are expected to find a portfolio that maximizes profit while avoiding risks of considerable losses (Luenberger, 1998). Since the maximization of expected returns does not necessarily prevent rare occurrences of large negative outcomes, a different criteria for safe exploration is needed. The exploration process in which new policies are evaluated must be conducted with extreme care. Indeed, for such environments, a method is required which not only explores the state-action space, but which does so in a safe manner.

In this paper, we propose the Policy Improvement through Safe Reinforcement Learning (PI-SRL) algorithm for safe exploration in dangerous and continuous control tasks. Such a method requires a predefined (and safe) baseline policy which is assumed to be suboptimal (otherwise, learning would be pointless). Predefined baseline policies have been used in different ways by other approaches. In the work of Koppejan and Whiteson (2011), singlelayers perceptrons are evolved, albeit starting from a prototype network whose weights correspond to a baseline policy provided by helicopter control task competition software (Abbeel, Coates, Hunter, & Ng, 2008). This approach can be viewed as a simple form of population seeding which has proven to be advantageous in numerous evolutionary methods (e.g. see Hern´andez-D´ıaz, Coello, Perez, Caballero, Luque, & Santana-Quintero, 2008; Poli & Cagnoni, 1997). In the work of Mart´ın and de Lope (2009), the weights of neural networks are also evolved by inserting several baseline policies (including that provided in the helicopter control task competition software) into the initial population. To minimize the possibility of evaluating unsafe policies, their approach prevents crossover and mutation operators from permitting anything more than tiny changes to the initial baseline policies. In this paper, we present the PI-SRL algorithm, a novel approach to improving baseline policies in dangerous domains using RL. The PI-SRL algorithm is composed of two different steps. In the first, baseline behavior (robust albeit suboptimal) is approximated using behavioral cloning techniques (Anderson, Draper, & Peterson, 2000; Abbott, 2008). In order to achieve this goal, case-based reasoning (CBR) techniques (Aamodt & Plaza, 1994; Bartsch-Sprl, Lenz, & Hbner, 1999) were used which have been successfully applied to imitation tasks in the past (Floyd & Esfandiari, 2010; Floyd, Esfandiari, & Lam, 2008). In the second step, the PI-SRL algorithm attempts to safely explore the state-action space in order to build a more accurate policy from previously-learned behavior. Thus, the set of cases (i.e., state-action pairs) obtained in the previous phase is improved through the safe exploration of the state-action space. To perform this exploration, small amounts of Gaussian noise are randomly added to the greedy actions of the baseline policy approach. The exploration strategy has been used successfully in previous works (Argall, Chernova, Veloso, & Browning, 2009; Van Hasselt & Wiering, 2007).

The novelty of the present study is in the use of two new, main components: (i) a risk function to determine the degree of risk of a particular state and (ii) a baseline behavior capable of producing safe actions in supposedly risky states (i.e., states that can lead to damage or injury). In addition, we present a new definition of risk based on what for the agent is unknown and known space. As will be described in Section 5 in greater detail, this new definition is completely different from traditional definitions of “risk” found in the literature (Geibel, 2001; Mihatsch & Neuneier, 2002; Geibel & Wysotzki, 2005). The paper also reports the experimental results obtained from the application of the new approach in four different domains: (i) automatic car parking (Lee & Lee, 2008), (ii) pole-balancing (Sutton & Barto, 1998), (iii) 2009 RL Competition helicopter hovering (Ng, Kim, Jordan, & Sastry, 2003) and (iv) business management (Borrajo, Bueno, de Pablo, Santos, Fern´andez, Garc´ıa, & Sagredo, 2010). In each domain, we propose the learning of a near-optimal policy which, in the learning phase, will minimize car crashes, pole disequilibrium, helicopter crashes and company bankruptcies, respectively. It is important to note that the comparison of our approach with an agent with an optimal exploration policy is not possible since, in the proposed domains (each with a high-dimensional and continuous state and action space, as well as complex stochastic dynamics), we do not know what the optimal exploration policy is.

Regarding the organization of the remainder of the paper, Section 2 introduces key definitions, while Section 3 describes in detail the learning approach proposed. In Section 4, the evaluation performed in the four above mentioned domains is presented. Section 5 discusses related work and Section 6 summarizes the main conclusions of our study. In these sections, the term return is used to refer to the expected cumulative future discounted reward , while the term reward is used to refer to a single real value used to evaluate the selection of an action in a particular state and it is denoted by r.

2. Deﬁnitions

To illustrate the concept of safety used in our approach, a navigation problem is presented below in Figure 1. In the navigation problem presented in Figure 1, a control policy must be learned to get from a particular start state to a goal state, given a set of demonstration trajectories. In this environment, we assume the task to be difficult due to a stochastic and complex dynamic of the environment (e.g., an extremely irregular surface in the case of a robot navigation domain or wind effects in the case of the helicopter hover task). This stochasticity makes it impossible to complete the task using exactly the same trajectory every time. Additionally, the problem supposes that a set of demonstrations from a baseline controller performing the task (the continuous black lines) are also given. This set of demonstrations is composed of different trajectories covering a well-defined region of the state space (the region within the rectangle).

Our approach is based on the addition of small amounts of Gaussian noise or perturbations to the baseline trajectories in order to find new and better ways of completing the task. This noise will affect the baseline trajectories in different ways, depending on the amount of noise added which, in turn, depends on the amount of risk to be taken. If no risk is desired, the noise added to the baseline trajectories will be 0 and, consequently, no new or improved behavior will be discovered (nevertheless, the robot will never fall off the cliff and the helicopter will never crash). If, however, an intermediate level of risk is desired, small amounts of noise will be added to the baseline trajectories and new trajectories (the

Figure 1: Exploration strategy based on the addition of small amounts of noise to baseline policy behavior. Continuous lines represent the baseline behavior, while newly explored behaviors are indicated by the dotted and dashed lines.

dotted blue lines) to complete the task are discovered. In some cases, the exploration of new trajectories leads the robot to unknown regions of the state space (the dashed red lines). The robot is assumed to be able to detect such situations with a risk function and use the baseline behavior to return to safe, known states. If, instead, a very high risk is desired, large amounts of noise will be added to the baseline trajectories, leading to the discovery of new trajectories (but also to a higher probability that the robot gets damaged). The iteration of this process leads the robot to progressively and safely explore the state and action spaces in order to find new and improved ways to complete the task. The degree of safety in the exploration, however, will depend on the risk taken.

2.1 Error and Non-Error States

In this paper, we follow as far we can the notation presented in Geibel et al. (2005) for the definition of our concept of risk. In their study, Geibel et al. associate risk with error states and non-error states, with the former understood as a state in which it is considered undesirable or dangerous to enter.

Definition 1 Error and non-error states. Let S be a set of states and Φ of error states. A state is an undesirable terminal state where the control of the agent ends when s is reached with damage or injury to the agent, the learning system or any external entities. The set Γ is considered a set of non-error terminal states with Γ and where the control of the agent ends normally without damage or injury.

In terms of RL, if the agent enters an error state, the current episode ends with damage to the learning system (or other systems); whereas if it enters a non-error state, the episode ends normally and without damage. Thus, Geibel et al. define the risk of s with respect to policy ), as the probability that the state sequence (, generated by the execution of policy , terminates in an error state Φ. By definition, if . For states Γ, the risk taken depends on the actions selected by the policy . With these definitions, we have the theoretical framework with which to introduce our own definition of the risk associated with known and unknown states.

2.2 Known and Unknown States in Continuous Action and State Spaces

We assume a continuous, n-dimensional state space where each state is a vector of real numbers and each dimension has an individual domain Similarly, we assume a continuous and m-dimensional action space action is a vector of real numbers and each dimension has an individual domain Additionally, the agent considered here is endowed with a memory, or case-base B, of the size . Each memory element represents a state-action pair, or case, the agent has experienced before.

Definition 2 (Case-base). A case-base is a set of cases . Every case consists of a state-action pair (the agent has experienced in the past and with an associated value , where the first element represents the case’s problem part and corresponds to the state , the following element depicts the case solution (i.e., the action expected when the agent is in the state ) and the final element is the value function associated with the state . Each state is composed of n continuous state variables and each action is composed of m continuous action variables.

When the agent receives a new state , it first retrieves the nearest neighbor of B according to a given similarity metric and then performs the associated action. In this paper, we use Euclidean distance as our similarity metric (Equation 1).

The Euclidean distance metric is useful when the value function is expected to be continuous and smooth throughout the state space (Santamar´ıa, Sutton, & Ram, 1998). However, since the value function is unknown a priori and the Euclidean distance metric is not particularly suitable for many problems, many researchers have begun to ask how the distance metric itself can learn or adapt in order to achieve better results (Taylor, Kulis, & Sha, 2011). While the use of distance metric learning techniques would certainly be desirable in order to induce a more powerful distance metric for a specific domain, such a consideration lies outside the scope of the present study. In this paper, therefore, we have focused only on domains in which Euclidean distance has been proven successful (i.e., it has been successfully applied to car parking (Cichosz, 1995), pole-balancing (Martin H & de Lope, 2009), helicopter hovering control (Martin H & de Lope, 2009) and SIMBA (Borrajo et al., 2010).

Traditionally, case-based approaches use a in order to determine when a new case should be added to the memory. When the distance of the nearest neighbor to is greater than , a new case is added. In this sense, the parameter defines the size of the classification region for each case in B (Figure 2). If a new case is within the classification region of a case , it is considered to be a known state. Hence, the cases in B describe a case-based policy of the agent and its associated value function

Figure 2: Known and Unknown states.

Definition 3 (Known/Unknown states). Given a case-base of cases and a density threshold is considered known when minand unknown in all other cases. Formally, Ω is the set of known states, while Υ is the set of unknown states with Ω

With Definition 3, states can be identified as known or unknown. When the agent receives a new state Ω, it performs the action of the case min). However, if the agent receives a state Υ where, by definition, the distance to any state in B is larger than , no case is retrieved. Consequently, the action to be performed from that state is unknown to the agent.

Definition 4 (Case-Based risk function). Given a case base of cases , the risk for each state s is defined as Equation 2.

Thus, is unknown), such that the state s is not associated with any case and, hence, the action to be performed in the given situation is unknown. If

Definition 5 (Safe case-based policy). The case-based policy derived from a case- base is safe when, from any initial known state with respect to B, the execution of always produces known non-error states with respect to B.

Additionally, it is assumed here that the probability that the state sequence (any known state Ω, generated by executing policy , terminates in an error state

Definition 6 (Safe case-based coverage). The coverage of a single state s with respect to a safe case-base is defined as the state Therefore, we assume that the safe case-based does not provide actions for the entire state space, but rather only for known states

Figure 3 graphically represents the relationship between known/unknown and error/non-error states. The green area in the image denotes the safe case-based policy area of the state space corresponding to the initial known space. An agent following the policy will always be in the green area and all resulting episodes will end without damages. Consequently, a subset of non-error states will also form part of the known space. Formally, let Γbe subsets of non-error states belonging to the known and unknown spaces, respectively, with ΓΩ. The yellow area in the Figure, by contrast, represents the unknown space Υ. In this space will be found all error states, as well as a subset of remaining non-error states. Formally, Γ

Understood in this way, the PI-SRL algorithm can be summed up as follows:

• As a first step, learn the known space (green area) from the safe case-based policy

• As a second step, adjust the known space (green area) and unknown space (yellow area) in order to explore new and improved behaviors while avoiding error states (red area). During this process of adjusting the known space to the space used for safe and better policies, the algorithm can “forget” ineffectual known states, as will be shown in Section 4.

Figure 3: Known/unknown and error/non-error states given the Case Base B.

2.3 The Advantages of Using Prior Knowledge and Predetermined Exploration Policies

In the present subsection, the advantages of using teacher knowledge in RL, namely (i) to provide initial knowledge about the task to be learned and (ii) to support the exploration process, are highlighted. Furthermore, we explain why we believe this knowledge to be indispensable in RL for tackling highly complex and realistic problems with large, continuous state and action spaces and in which a particular action may result in an undesirable consequence.

2.3.1 Providing Initial Knowledge about the Task

Most RL algorithms begin learning without any previous knowledge about the task to be learnt. In such cases, exploration strategies such as are used. The application of this strategy results in the random exploration of the state and action spaces to gather knowledge about the task. Only when enough information is discovered from the environment does the algorithm’s behavior improve. Such random exploration policies, however, waste a significant amount of time exploring irrelevant regions of the state and action spaces in which the optimal policy will never be encountered. This problem is compounded in domains with extremely large and continuous state and action spaces in which random exploration will never likely visit the regions of the spaces necessary to learn (near-)optimal policies. Additionally, in many real RL tasks with real robots, a random exploration to gather information from the environment cannot even be applied. With real robots, what is considered to be sufficient information can be much more information than a real robot can gather from the environment. Finally, as it is impossible to avoid undesirable situations in high-risk environments without a certain amount of prior knowledge about the task, the use of random exploration would require that an undesirable state be visited before it can be labeled as undesirable. However, such visits to undesirable states may result in damage or injury to the agent, the learning system or external entities. Consequently, visits to these states should be avoided from the earliest steps of the learning process.

Mitigating the difficulties described above, finite sets of teacher-provided examples or demonstrations can be used to incorporate prior knowledge into the learning algorithm. This teacher knowledge can be used in two general ways, either (i) to bootstrap the learning algorithm (i.e., a sort of initialization procedure) or (ii) to derive a policy from such examples. In the first case, the learning algorithm is provided with examples or demonstrations from which to bootstrap the value function approximation and lead the agent through the more relevant regions of the space. The second way in which teacher knowledge can be used refers to Learning from Demonstration (LfD) approaches in which a policy is derived from a finite set of demonstrations provided by a teacher. The principal drawback to this approach, however, is that the performance of the derived policy is heavily limited by teacher ability. While one way to circumvent the difficulty and improve performance is by exploring beyond what is provided in the teacher demonstrations, this again raises the question of how the agent should act when it encounters a state for which no demonstration exists (an unknown state).

2.3.2 Supporting the Exploration Process

While furnishing the agent with initial knowledge helps mitigate the problems associated with random exploration, this alone is not sufficient to prevent the undesirable situations that arise in the subsequent explorations undertaken to improve learner ability. An additional mechanism is necessary to guide this subsequent exploration process in such a way that the agent may be kept far away from catastrophic states. In this paper, a teacher, rather than the policy derived from the current value function approximation is used for the selection of actions in unknown states. One way to prevent the agent from encountering unknown states during the exploration process would be by requesting from the beginning a teacher demonstration for every state in the state space. However, such a strategy is not possible due to (i) its computational infeasibility given the extremely large number of states in the state space and (ii) the fact that the teacher should not be forced to give an action for every state, given that many states will be ineffectual for learning the optimal policy. Consequently, PI-SRL requests teacher action only when such action is actually required (i.e., when the agent is in an unknown state).

As this paper supposes that such a teacher is available for the task to be learned, the teacher is taken as the baseline behavior. Although some studies have examined the use of robotic teachers, hand-written control policies and simulated planners, the great majority to date have made use of human teachers. This paper uses suboptimal automatic controllers as teachers, with taken as the teacher’s policy.

is considered the baseline behavior about which three assumptions are made: (i) it is able to provide safe demonstrations of the task to be learnt from which prior knowledge can be extracted; (ii) it is able to support the subsequent exploration process, advising suboptimal actions in unknown states to reduce the probability of entering into error states and return the system to a known situation; and (iii) its performance is far from optimal.

While optimal baseline behaviors are certainly ideal to behave safely, non-optimal behaviors are often easy (or easier) to implement or generate than optimal ones. The PI-SRL algorithm uses the baseline behavior in two different ways. First, it uses the safe demonstrations of to provide prior knowledge about the task. In this step, the algorithm builds the initial known space of the agent derived from the safe case-based policy purpose of mimicking . In the second step, PI-SRL uses to support the subsequent exploration process conducted to improve the abilities of the previously-learnt . As the exploration process continues, an action of is requested only when required, that is, when the agent is in an unknown state (Figure 4). In this step, acts as a backup policy in the case of an unknown state with the intention of guiding the learning away from catastrophic errors or, at least, reducing their frequency. It is important to note that the baseline behavior cannot demonstrate the correct action for every possible state. However, while the baseline behavior might not be able to indicate the best action in all cases, the action it supplies should, at the very least, be safer than that obtained through random exploration.

2.4 The Risk Parameter

In order to maximize exploration safety, it seems advisable that movement through the state space not be arbitrary, but rather that known space be expanded only gradually by starting from a known state. Such an exploration is carried out through the perturbation of the state-action trajectories generated by the policy . Perturbation of the trajectories is accomplished by the addition of Gaussian random noise to the actions in B in order to obtain new ways of completing the task. Thus, the Gaussian exploration takes place

Figure 4: The exploration process in PI-SRL requests actions of the baseline behavior, when it is really required.

around the current approximation of the action for the current known state ). The action performed is sampled from a Gaussian distribution with the mean at the action output given by the instance selected in denotes the algorithm action output, the probability of selecting action ) is computed using Equation 4.

The shape of the Gaussian distribution depends on parameter (standard deviation). In this study, is used as a width parameter. While large values imply a wide bell-shaped distribution, increasing the probability of selecting actions very different from the current action value implies a narrow bell-shaped distribution, increasing the probability of selecting actions very similar to the current action we assume value is directly related to the amount of perturbation added to the state-action trajectories generated by the policy values imply greater perturbations (more Gaussian noise) and a greater probability of visiting unknown states.

Definition 8 (Risk Parameter). The parameter is considered a risk parameter. Large values of increase the probability of visiting distant unknown states and, hence, increase the probability of reaching error states.

These exploratory actions drive the agent to the edge of the known space and force it to go slightly beyond, into the unknown space, in search of better, safer behaviors. After a period of time, the execution of these exploratory actions increases the known space and improves the abilities of the previously-learned safe case-based policy parameter , as well as , are design parameters that must be selected by the user. In Section 3.3, guidelines for this selection are offered.

It is important to note that the approach proposed in this study is based on two logical assumptions in RL derived from the following generalization principles (Kaelbling, Littman, & Moore, 1996; Sutton & Barto, 1998):

(i) Nearby states have similar optimal actions. In continuous state spaces, it is impossible for the agent to visit every state and store its value (or optimal action) in a table. This is why generalization techniques are needed. In large, smooth state spaces, similar states are expected to have similar values and similar optimal actions. Therefore, it is possible to use experience gathered from the environment with a limited subset of the state space and produce a reliable approximation over a much larger subset (Boyan, Moore, & Sutton, 1995; Hu, Kostiadis, Hunter, & Kalyviotis, 2001; Fern´andez & Borrajo, 2008). One must also note that, in the proposed domains, an optimal action is also considered to be a safe action in the sense that it never produces error states (i.e., no action is considered optimal that leads the agent to a catastrophic situation).

(ii) Similar actions in similar states tend to produce similar effects. Considering a deterministic domain, the action performed in state always produces the same state . In a stochastic domain, it is understood intuitively that the execution of the action will produce similar effects (i.e., it produces states where 0). Additionally, the execution of the action in a state produces states As explained earlier, the present study uses Euclidean distance as a similarity metric, as it has been proven successful in the proposed domains. As a result of this assumption, approximation techniques can be used, such that actions that generate similar effects can be grouped together as one action (Jiang, 2004). In continuous action spaces, the need for generalization techniques is even greater (Kaelbling et al., 1996). In this paper, the assumption also allows us to assume that low values of increase the probability of visiting known states and, hence, of exploring less and taking less risks, while greater values of increase the probability of reaching error states.

3. The PI-SRL Algorithm

The PI-SRL algorithm is composed of two main steps described in detail below.

3.1 First Step: Modeling Baseline Behaviors by CBR

The first step of PI-SRL is an approach for behavioral cloning, using CBR to allow a software agent to behave in a similar manner to a teacher policy (baseline behavior) (Floyd et al., 2008). Whereas LfD approaches are named differently according to what is learned (Argall et al., 2009), to prevent terminological inconsistencies here, we consider behavioral cloning (also known as imitation learning) to be an area of LfD whose goal is the reproduction/mimicking of the underlying teacher policy (Peters, Tedrake, Roy, & Morimoto, 2010; Abbott, 2008).

When using CBR for behavioral cloning, a case can be built using the agent’s state received from the environment, as well as the corresponding action command performed by the teacher. In PI-SRL, the objective of the first step is to properly imitate the behavior of using the cases stored in a case-base. At this point, an important question arises; namely, how a case-base can be learnt using the sample trajectories provided by such that, at the end of the learning process, the resulting policy derived from mimics the behavior of ? Baseline behavior is a function that maps states to actions or, in other words, a function that, given a state , provides the corresponding action this paper, we want to build a policy derived from a case-base composed of cases (such that, for a new state , the case with the minimum Euclidean distance retrieved and the corresponding action is returned. Intuitively, it can be assumed that can be built simply by storing all cases () gathered from one interaction between and the environment during a limited number of episodes K. At the end of K episodes, one expects the resulting to be able to properly mimic the behavior of . However, informal experimentation in the helicopter hovering domain shows this not to be the case (Section 4.3). In helicopter hovering, after K = 100 episodes and the prohibitive number of 600,000 cases stored, the policy derived from the case-base is unable to correctly imitate the baseline behavior and, instead, continuously crashes the helicopter. Indeed, in order for in large continuous and stochastic domains, the approach requires a larger number of episodes and, consequently, a prohibitive number of cases. In fact, to perfectly mimic in these domains, an infinite number of cases would be required. Figure 5 attempts to explain why we believe that this learning process does not work. In it, the region of the space represented by simply storing cases derived from in the form c = (s, a) is shown. Each stored case (red circles) covers an area of the space and represents the centroid of a Voronoi region.

Figure 5: Effects of storing all training cases.

If the previously-learned policy is used when a new state is presented, the action is performed, corresponding to the case ) where the Euclidean distance ) is less than that with all other stored cases. However, if we use the policy provide an action in the situation , the action is provided which is different than At this point, the policy can be said to classify the state while the policy can be said to classify the state (insofar as is the policy to be mimicked), with 0. Furthermore, is understood as the classification error. If the case-base stored all the possible pairs (were able to generate in the domain, the actions would always be identical, with is impossible to store all such cases. The sum of all such classification errors in an episode leads to the visiting of unexplored regions of the case space (i.e., regions where the new state received from the environment has a Euclidean distance respect to the closest case ). When these unexplored regions are visited, the difference between the obtained class derived from and the desired class derived from is large (i.e., 0) and the probability that error states might be visited greatly increases.

It may be concluded, therefore, that simply storing the pairs c = (s, a) generated by is not sufficient to properly mimic its behavior. For this reason, the algorithm in Figure 6 below has been proposed.

Figure 6: CBR algorithm for behavioral cloning.

In the first step of the algorithm, the state-value function ) is initialized to 0 (see line 07). The value ) for each case is computed in the second step of the algorithm in Section 3.2. Additionally, this step uses the case-based risk function (Equation 2) to determine whether a new state should be considered risky (line 08). If the new state is not risky (i.e., it is a known state Ω), a 1-nearest neighbor strategy is followed (line 09). Otherwise, the algorithm performs the action using the baseline behavior new case 0) is built and added to the case-base B (line 13). Starting with an empty case-base, the learning algorithm continuously increases its competence by storing new experiences. However, there are a number of reasons why the inflow of new cases should be limited. Large case-bases increase the time required to find the closest cases to a new example. While this may be partially solved using techniques to reduce the retrieval time (e.g., k-d trees that have been used in this work), they nevertheless do not reduce the storage requirements. Several approaches to the removal of ineffectual cases during training exist, including Aha’s IBx algorithms (Aha, 1992) or any nearest prototype approach (Fernandez & Isasi, 2008). When the number of cases stored in B exceeds a critical value that the realization of a retrieval within a certain amount of time cannot be guaranteed, the removal of some cases is inevitable. An efficient approach to such a problem is through the removal of the least-frequently-used elements of B (line 18).

The result of this step is a constrained case-base B describing the safe case-based policy that mimics the baseline behavior , though perhaps with some deviation (line 20). Formally, let ) be an estimate of the utility of the baseline behavior computed by averaging the sum of rewards accumulated in each of trials. Then,

3.2 Second Step: Improving the Learned Baseline Behavior

In this step of the PI-SRL algorithm, the safe case-based policy learned in the previous step is improved by the safe exploration of the state-action space. First, for each case B, the state-value function ) is computed following a Monte Carlo (MC) approach (Figure 7).

Figure 7: Monte Carlo algorithm for the computation of state-value function for each case.

This algorithm is similar in spirit to a first-visit MC method for (Sutton & Barto, 1998), adapted in this paper to work with a policy given by a case-base. In the algorithm shown in Figure 7, all returns for each state are accumulated and averaged, following the policy derived by the case base B (see line 09). It is important to note that in the algorithm the term return following the first occurrence of s refers to the expected return of s (i.e., the expected cumulative future discounted reward starting from that state), whereas Returns refers to a list composed of each return of s in different episodes. One of the principal reasons for using the MC method is that it allows us to quickly and easily estimate state values ) for each case . In addition, MC methods have been shown to be successful in a wide variety of domains (Sutton & Barto, 1998). Once the state-value function ) is computed for each case , small amounts of Gaussian noise are randomly added to the actions of the policy in order to obtain new and improved ways to complete the task. The algorithm used to improve the baseline behavior learned in the previous step is depicted in Figure 8. The algorithm is composed of four steps performed in each episode.

- (a) Initialization step. The algorithm initializes the list used to store cases occurring during an episode and sets the cumulative reward counter of the episode to 0.

- (b) Case Generation. The algorithm builds a case for each step of an episode. For each new state , the closest case is computed using the Euclidean distance metric from Equation 1 (see line 09 in algorithm of Figure 8). In order to determine the perceived degree of risk of the new state , the case-based risk function is used (line 10). If Ω (known state). In this case, the action performed is computed using Equation 4 and a new case is built to be added to the list of cases having occurred in the episode (line 13). It is important to note that the new case is built replacing the action a corresponding to the closest case in , with the new action resulting from the application of random Gaussian noise to a in the Equation 4. Thus, the algorithm only produces smooth changes in the cases of . If, however, Υ (i.e., unknown state [line 14]). In unknown states, the action performed is suggested by the baseline behavior which defines safe behavior (line 15). A new case is built and added to the list of cases in the episode and actions will be performed using agent is not in a known state. Finally, the reward obtained in the episode is accumulated, where ) is the immediate reward obtained when action is performed in state (line 18).

- (c) Computing the state-value function for the unknown states. In this step, the state-value function of the states considered to be unknown in the previous step is computed. In the previous step (line 17), the state-value function for these states is set at 0. The algorithm proceeds in a manner similar to the first-visit MC algorithm in Figure 7. In this case, the return for each unknown state is computed, but not averaged since only one episode is considered (line 24 and 25). The return for each is computed, taking into account the first visit of the state in the episode (each occurrence of a state in an episode is called a visit to ), although the state could appear multiple times in the rest of the episode.

- (d) Updating the cases in B using experience gathered. Updates in B are made with the cases gathered from episodes with a cumulative reward similar to that of the best episode found to that point using the threshold Θ (line 27). In this way, good sequences are provided for the updates since it has been shown that such sequences of experiences can cause an adaptive agent to converge to a stable and useful policy, whereas bad sequences may cause an agent to converge to an unstable or bad policy (Wyatt, 1997). This also prevents the degradation of the initial performance of B as computed in the first step of the algorithm through the use of bad episodes, or episodes with errors, for updates. In this step, two types of updates appear, namely, replacements and additions of new cases. Again, the algorithm iterates for each case (line 29). If is a known state (line 30), we compute the case corresponding to the state (line 31). One should note that the case was built in line 13 of the algorithm, replacing the action a corresponding to the case new action and resulting from the application of random Gaussian noise to the action a

00 Given the case-base B, and the maximum number of cases 01 Given the baseline behavior 02 Given the update threshold Θ 03 1. Set maxTotalRwEpisode = 0, the maximum cumulative reward reached in an episode 04 2. Repeat 05 (a) Initialization step:

20 Set k = k + 1 21 (c) Computing the state-value function for the unknown states: listCasesEpisode 26 (d) Updating the cases in B using the experience gathered: 27 if totalRwEpisode > (maxTotalRwEpisode 28 maxTotalRwEpisode := max(maxTotalRwEpisode, totalRwEpisode) listCasesEpisode

Figure 8: Description of step two of PI-SRL algorithm.

by the Equation 4. Then, the temporal distance (TD) error is computed (line 32). If performing the action results in a positive change for the value of a state. The action, in turn, could potentially lead to a higher return and, thus, to a better policy. Van Hasselt and Wiering (2007) also update the value function using only the actions that potentially lead to a higher return. If the TD error is positive, is considered to be a good selection and is reinforced. In the algorithm, this reinforcement is carried out by updating the output of the case (line 34). Therefore, an update to the case-base only occurs when the TD error is positive. This is similar to a linear reward-inaction update for learning automata (Narendra & Thathachar, 1974, 1989) in which the sign of the TD error is used as a measure of success. PI-SRL only updates the case-base when actual improvements have been observed, thus avoiding slow learning when there are plateaus in the value space and TD errors are small. It has been shown empirically that this procedure can result in better policies than when step size depends on the size of the TD error (Van Hasselt & Wiering, 2007). It is important to note that these replacements produce smooth changes in the case-base B since an action a is replaced only if results in a higher This form of updating can be understood as a risk-seeking approach, overweighting only transitions to successor states that promise an above-average return (Mihatsch & Neuneier, 2002). Additionally, it prevents the degradation of B, ensuring that replacements are made only when an action can potentially lead to a higher

If, instead, is not a known state, the case is added to B (line 37). Finally, the algorithm removes cases from B if necessary (line 39). Complex scoring metrics to calculate which cases are to be removed for a given moment have been proposed by several authors. Forbes and Andres (2002) suggest the removal of cases that contribute least to the overall approximation, while Driessens and Ramon (2003) pursue a more error-oriented view and propose the deletion of cases that contribute most to the prediction error of other examples. The principal drawback of these more sophisticated measures is their complexity. The determination of the case(s) to be removed involves the computation of a score value for each , which in turn requires at least one retrieval and regression, respectively, for each ). Such entire repeated sweeps through the case-base entail an enormous computational load. Gabel and Riedmiller (2005) compute a different score metric for each , requiring the computation of the set of the k-nearest neighbors around Such approaches are not well-suited to systems learning with adjusted time requirements and with a high-dimensional state space, requiring the use of larger case-bases than those proposed here. Rather, in this paper, we propose the removal of the least-frequently-used cases. The idea seems intuitive insofar as the least-frequently-used cases usually contain worse estimates of a corresponding state’s value; although the strategy might lead to a function approximator that “forgets” some of the valuable experience made in the past (e.g., corner cases). Despite this, PI-SRL performs successfully in all domains proposed using the strategy, as demonstrated in Section 4. Thus, the ability to forget ineffectual known states described in Section 2 is a result of the algorithm removing from the least-frequently-used cases of B.

3.3 Parameter Setting Design

One of the main difficulties of applying the PI-SRL algorithm to a given problem is to decide on an appropriate set of parameter values for the threshold , the risk parameter the update threshold Θ and the maximum number of cases . An incorrect value for the parameter can lead to mislabeling a state as known when it is really unknown, potentially leading to damage or injury in the agent. In the case of the risk parameter , high values can continuously result in damage or injury; while low values are safe, but do not allow for exploration of the state-action space sufficient for reaching a near-optimal policy. Unlike , the parameter Θ is not related to risk, but instead is directly related to the performance of the algorithm. Parameter Θ is used to determine how good an episode must be with respect to the best episode obtained, since only the best episodes are used to update the case-base B. If the Θ value is too large, bad episodes may be used to update B (influencing the convergence and performance of the algorithm). If, instead, Θ is too low, the number of updates in B may be insufficient for improving the baseline behavior. Finally, a very high value allows for large case-bases, increasing the computational effort during retrieval and degrading the efficiency of the system. By contrast, a very low value might excessively restrict the size of the case-base and thus negatively affect the final performance of the algorithm. In this subsection, a solid perspective is given on the automatic definition of these parameters. The parameter setting proposed here are taken as a suitable set of heuristics tested successfully in a wide variety of domains (Section 4).

• : The parameter is domain-dependent and related to the average size of the actions. In this paper, the value for this parameter has been established by computing the mean distance between states during an execution of the baseline behavior . Expressed in another way, the execution of the policy provides a state-action sequence of the form . Thus, the value of is computed using Equation 5.

• : Several authors agree that it is impossible to completely avoid all accidents (Moldovan & Abbeel, 2012; Geibel & Wysotzki, 2005). It is important to note that PI-SRL is completely safe only if the first step of the algorithm is executed. However, by proceeding in this way, the performance of the algorithm is heavily limited by the abilities of the baseline behavior. The running of the subsequent exploratory process is inevitable if learner performance is to be improved beyond that of the baseline behavior. Since the agent operates in a state of incomplete knowledge of the domain and its dynamic, it is inevitable during the exploratory process that unknown regions of the state space will be visited where the agent may reach an error state. However, it is possible to adjust the risk parameter to determine the level of risk assumed during this exploratory process. In this paper, we start with low values (low risk) which we gradually increase. Specifically, we propose beginning with and increasing this value iteratively until either an accurate policy is obtained or the amount of damage or injury is high.

• Parameter Θ: The value of this parameter is set relative to the best episode obtained. In this paper, the Θ value is set to 5% of the cumulative reward of the best episode obtained.

Figure 9: Trajectories generated by the baseline policy in a deterministic, slightly stochastic and highly stochastic domain.

• : Previously, we estimated the maximum number of cases to be stored in the case-base as being the estimated maximum number of cases required to properly mimic the baseline behavior . What follows is a description of how this value is computed. Figure 9 presents the trajectories (sequences of states) followed by the baseline policy in three different domains: deterministic, slightly stochastic and highly stochastic. For each domain, different sequences of the states produced by represented where the initial state and the final state of the resulting trajectory in episode j. In the deterministic domain, the m different executions of always result in the same trajectory. In this case, we set the maximum number of cases to with all the cases computed in the episode being stored.

In the slightly stochastic domain, the trajectories produced in m different episodes are different, but only slightly so. Here, we suppose the case-base at the beginning to be empty. Additionally, we assume that all states corresponding to the first trajectory produced in the domain will be stored in the case-base. Furthermore, for each domain we execute m different episodes, obtaining m different trajectories. Following the execution of the m episodes, we compute the maximum distance between the i-th state of the first trajectory (previously added to the case-base) and the i-th state produced in the trajectory j such that maxslightly stochastic domain, this maximum distance does not exceed the threshold any case such that max. At this point, we assume the i-th state in trajectory j to have at least one neighbor with a distance less than (corresponding to the state ). Thus, the i-th state in j is not added to the case-base.

By contrast, in a highly stochastic domain, this maximum distance greatly exceeds the threshold in all the cases such that max. In this domain, we estimate the total number of cases that will be added to the case-base in the following

way. For each i-th state in the sequence of the first trajectory, we estimate the number of cases to be added to the case-base asor, in other words, we compute the number of intervals in the range [0)] with a width of (the threshold used to decide whether a new case is to be added or not to the case-base). Consequently, the estimated number of cases added to the case-base, taking into account all states in the sequence, is computed as the estimated maximum number of cases is computed as shown in Equation 6.

It is important to remember that in a deterministic domain, the summation in equation 6 is equal to 0 and that, therefore, . The increase of the value of this element is related to the increase of stochasticity of the environment, insofar as the greater stochasticity of the environment increases the number of cases required. Finally, if the number of cases is very large or nearly infinite, the threshold can be increased to make more restrictive the addition of new cases to the case-base. However, this increase may also adversely affect the final performance of the algorithm.

4. Experimental Results

This section presents the experimental results collected from the use of PI-SRL for policy learning in four different domains presented in order of increasing complexity (i.e., increasing number of variables describing states and actions): the car parking problem (Lee & Lee, 2008), pole-balancing (Sutton & Barto, 1998), helicopter hovering (Ng et al., 2003) and the business simulator SIMBA (Borrajo et al., 2010). For each of these domains, we have proposed the learning of a near-optimal policy which minimizes car accidents, pole disequilibrium, helicopter crashes and company bankruptcies, respectively, during the learning phase. All four of the domains are stochastic in our experimentation. While both helicopter hovering and the business simulator SIMBA are, in themselves, stochastic and, additionally, generalized domains, we have made the car parking and pole-balancing domains stochastic with the intentional addition of random Gaussian noise to the actions and reward function. The results of PI-SRL in the four domains are compared to those yielded by two additional techniques, namely, the evolutionary RL approach selected winner of the helicopter domain in the 2009 RL Competition (Mart´ın H. & Lope, 2009) and Geibel and Wysotzki’s risk-sensitive RL approach (Geibel & Wysotzki, 2005). In the evolutionary approach, several neural networks cloning error-free teacher policies are added to the initial population (guaranteeing rapid convergence of the algorithm to a near-optimal policy and, indirectly, minimizing agent damage or injury). Indeed, as the winner of the helicopter domain is the agent with the highest cumulative reward, the winner must also indirectly minimize helicopter crashes insofar as these incur large catastrophic negative rewards. On the other hand, the risk-sensitive approach defines risk as the probability ) of reaching a terminal error state (e.g., a helicopter crash ending agent control), starting at some initial state s. In this case, a new value function with the weighted sum of the risk probability, , and value function, , is used (Equation 7).

The parameter 0 determines the influence of the )-values compared to the values. For corresponds to the computation of minimum risk policies. For large values, the original value function multiplied by dominates the weighted criterion. While Geibel and Wysotzki (2005) consider only finite (discretized) action sets in their study, their algorithm has been adapted here for continuous action sets. We use CBR for value and risk function approximation and a Gaussian exploration around the current action. In the experiments, for each domain, three different values are used, modifying the influence of the V -values compared to the -values. In all cases, the goal is to improve the control policy while, at the same time, minimizing the number of episodes with agent damage or injury. In each domain, we establish different risk levels by modifying risk parameter values according to the procedure described in subsection 3.3. It is important to note that one baseline behavior used to initialize the evolutionary RL approach is exactly the same as that used subsequently in the first and second step of PI-SRL. Furthermore, the case-base in the risk-sensitive approach does not begin from scratch since it is initialized with the safe case-based policy . This makes the comparison of performances as fair as possible, but taking into account that the different techniques make its own use of the baseline behaviors.

4.1 Car Parking Problem

The car parking problem is represented in Figure 10 and originates from the RL literature (Cichosz, 1996). A car, represented as the rectangle in Figure 10, is initially located inside a bounded area, represented by the dark solid lines, referred to as the driving area. The goal for the learning agent is to navigate the car from its initial position into the garage, such that the car is entirely inside, in a minimum number of steps. The car cannot move outside of the driving area. Figure 10 (b) shows the two possible paths the car can take from the starting point to the garage with an obstacle in between in order to correctly perform the task. We consider the optimal policy for the domain to be that which reaches the goal state in the shortest time and which, at the same time, is free of failures.

The state space of the domain is described by three continuous variables, namely, the coordinates of the center of the car and the angle between the car’s axis and the X of the coordinate system. While the car can be modeled essentially with two control inputs, speed v and steering angle , let us suppose here that the car is controlled only by the steering angle (i.e., it moves at a constant speed). Thus, the action space is described by one continuous variable 1] corresponding to the turn radius, as used in the equations below. The agent receives a positive reward value of ) is the center of the car, ) is the center of the garage (i.e., the goal position) and is a normalizing function scaling the Euclidean distance to a range [0, 1] when the car is inside the garage (i.e., the reward value is greater if the car is parked correctly in the center of the garage). The agent receives a reward of -1 whenever it hits the wall or obstacle. All other steps receive a reward of -0.1. Thus, the difficulty of the problem lies not only in the reinforcement delay, but also in the fact that

Figure 10: Car Parking Problem: (a) Model of the car parking problem. (b) Examples of trajectories generated by the agent to park the car in the garage.

punishments are much more frequent than positive rewards (i.e., it is much easier to hit a wall than park the car correctly). The motion of the car is described by the following equations (Lee & Lee, 2008)