Learning to Identify Irrelevant State Variables
Nicholas K. Jong and Peter Stone
Department of Computer Sciences, University of Texas at Austin, Austin, Texas

Abstract

When they are available, safe state abstractions improve the efficiency of reinforcement learning algorithms by allowing an agent to ignore irrelevant distinctions between states while still learning an optimal policy. Prior work investigated how to incorporate state abstractions into existing algorithms, but most approaches required the user to provide the abstraction. How to discover this kind of domain knowledge automatically remains a challenging open problem. In this paper, we introduce a general approach for testing the validity of a potential state abstraction. We reduce the problem to one of determining whether an action is optimal in every state in a given set. To decide optimality we give two statistical methods, which trade off between computational and sample complexity. One of these methods applies statistical hypothesis testing directly to learned state-action values, and the other applies Monte Carlo sampling to a learned Bayesian model. Finally, we demonstrate the ability of these methods to discriminate between safe and unsafe state abstractions in the familiar Taxi domain.

1 Introduction

Reinforcement learning (RL) addresses the problem of how an agent ought to select actions in a Markov decision problem (MDP) so as to maximize its expected reward despite not knowing the transition and reward functions beforehand. Early work in this field led to simple algorithms that guarantee convergence to optimal behavior in the limit, but the rate of convergence has proven unacceptable for large, real-world applications. One key problem is the choice of state representation. The representation must include enough state variables for the problem to be Markov, but too many state variables incur the curse of dimensionality.
Since the number of potential state variables is typically quite large for interesting problems, an important step in specifying an RL task is selecting those variables that are most relevant for learning. In this paper, we consider the task of automatically recognizing that a certain state variable is irrelevant. We define a state variable as irrelevant if an agent can completely ignore the variable and still behave optimally. For each state variable that it learns to ignore, an agent can significantly increase the efficiency of future training. The overall learning efficiency thus becomes more robust to the initial choice of state representation.
In general, an agent must learn a particular task rather well before it can reach safe conclusions about relevancy. One premise of our work is that an abstraction learned in one problem instance is likely to apply to other, similar problems. Learning in these subsequent problems can be accomplished with fewer state variables and therefore more efficiently. In this way an agent might learn from a comparatively easy problem a state representation that applies to a more difficult but related problem. Our work is motivated in part by recent work on temporal abstractions and hierarchy in RL [1, 4, 7, 8]. The introduction of reusable subtasks creates an opportunity for applying dynamic state abstractions, which apply at some parts of the hierarchy but not others. In this context a flexible mechanism for the automated discovery of abstractions is particularly important, since otherwise the user must consider individually each task in a potentially large hierarchy. Furthermore, a method for discovering the conditions under which a state abstraction applies may prove useful in the discovery of the task decomposition itself. For this reason we develop our approach in the context of a non-hierarchical learning algorithm yet in a domain familiar from the hierarchical learning literature. The main contributions of this paper are (i) our reformulation of the question of state irrelevance into a question of action optimality and (ii) the methods we give for answering this latter question. In Section 2 we describe the domain in which we develop our ideas. In Section 3 we give our definition of state irrelevance in terms of action optimality. In Section 4 we describe two distinct statistical methods for deciding whether an action is optimal. In Section 5 we show that both methods yield the desired results, but with differing levels of computational and sample complexity. In Section 6 we discuss related work, and in Section 7 we conclude.
2 Safe state abstractions in the Taxi domain

We use Dietterich's Taxi domain [4], illustrated in Figure 1, as the setting for our work. This domain has four state variables. The first two correspond to the taxi's current position in the grid world. The third indicates the passenger's current location, at one of the four labeled positions (Red, Green, Blue, and Yellow) or inside the taxi. The fourth indicates the labeled position where the passenger would like to go. The domain therefore has 25 × 5 × 4 = 500 possible states. At each time step, the taxi may move north, move south, move east, move west, attempt to pick up the passenger, or attempt to put down the passenger. Actions that would move the taxi through a wall or off the grid have no effect. Every action has a reward of -1, except illegal attempts to pick up or put down the passenger, which have a reward of -10. The agent receives a reward of +20 for achieving a goal state, in which the passenger is at the destination (and not inside the taxi). In this paper, we consider the stochastic version of the domain. Whenever the taxi attempts to move, the resulting motion occurs in a random perpendicular direction with probability 0.2. Furthermore, once the taxi picks up the passenger and begins to move, the destination changes with probability 0.3.

Figure 1: The Taxi domain.

Dietterich demonstrates that a handcrafted task hierarchy can facilitate learning in this domain. The crucial reusable tasks in his hierarchy are those that take the taxi to each of the four landmarks. For example, an agent can execute a task that navigates to the Red landmark whenever it must pick up a passenger there and also whenever it must deliver a passenger there. Dietterich also observes that the location of the passenger and the passenger's final destination are irrelevant to the task of travelling to the Red landmark. State abstractions such as this one are what allow his MAXQ framework to learn the Taxi domain efficiently.
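The stochastic motion model described above can be sketched in a few lines. This is a minimal illustration of the slip dynamics, not the authors' implementation; all names are our own:

```python
# In the stochastic Taxi domain, a movement action succeeds with
# probability 0.8 and slips to one of the two perpendicular directions
# with probability 0.1 each (0.2 total, as described in the text).

PERPENDICULAR = {
    "north": ("east", "west"),
    "south": ("east", "west"),
    "east": ("north", "south"),
    "west": ("north", "south"),
}

def move_distribution(action):
    """Return the probability of each actual direction of motion."""
    left, right = PERPENDICULAR[action]
    return {action: 0.8, left: 0.1, right: 0.1}
```

A sampler for the environment would draw the realized direction from this distribution before applying walls and grid boundaries.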
How might a learning algorithm discover this abstraction autonomously, based only on experience with the domain? We consider this question in a non-hierarchical framework. Even without a task decomposition, an agent can safely ignore the passenger's final destination in any state where the passenger is not inside the taxi, since the optimal action then does not depend on the final destination. Our approach learns this static state abstraction, which we can then apply to both non-hierarchical and hierarchical algorithms.

3 Defining irrelevance

Suppose without loss of generality that the n + 1 state variables of an MDP are X₁, X₂, ..., Xₙ (written collectively as the vector x) and Y. Let X denote a set of possible values for x, determining a region of the state space. We wish to determine whether or not knowing the value of Y affects the quality of an agent's decisions in this region of the state space. One simple sufficient condition is that the agent's learned policy π̂ ignores Y:

∀x ∈ X, ∀y₁, y₂ : π̂(x, y₁) = π̂(x, y₂).

However, this condition is too strong in practice. If some states have more than one optimal action, then the learned policy may specify one action when Y = y₁ and a different one when Y = y₂, due to variance in the learned Q-values. We instead examine the Q-values directly. We check that in every case there exists some action that achieves the maximum expected reward regardless of the value of Y:

∀x ∈ X, ∃a, ∀y : Q̂(x, y, a) ≥ V̂(x, y).

Essentially, this condition examines the learned Q-values to determine whether a policy exists that ignores Y. However, our determination of whether an action maximizes the expected reward must be robust to uncertainty in the value estimates. Learning algorithms that mix exploration and exploitation are especially likely to attain accurate value estimates for only one optimal action from a given state.
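With exact Q-values in hand, the condition above reduces to a search for a common optimal action across all values of Y. The sketch below is illustrative only (toy data, hypothetical names), with a tolerance standing in for the statistical tests developed later:

```python
def y_irrelevant(Q, xs, ys, actions, tol=1e-6):
    """Return True if, for every x in xs, some single action is optimal
    (within tol) for every value y of the candidate-irrelevant variable.
    Q maps (x, y, a) to a learned Q-value."""
    for x in xs:
        # V(x, y) = max over actions of Q(x, y, a)
        V = {y: max(Q[(x, y, a)] for a in actions) for y in ys}
        if not any(all(Q[(x, y, a)] >= V[y] - tol for y in ys)
                   for a in actions):
            return False
    return True

# Toy example: "a0" attains the maximum for both values of y,
# so Y is irrelevant here even though "a1" is also optimal when y = 0.
Q = {("x0", 0, "a0"): 5.0, ("x0", 0, "a1"): 5.0,
     ("x0", 1, "a0"): 3.0, ("x0", 1, "a1"): 2.0}
```

In practice the learned Q̂-values are noisy, which is exactly why the paper replaces the tolerance check with hypothesis tests.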
For example, consider a state in the stochastic Taxi domain where the passenger is in the upper left corner and the taxi is in the upper right corner. To maximize expected reward, an agent must navigate to the passenger as quickly as possible. Due to the configuration of obstacles in the world, both moving south and moving west are optimal actions from this state, regardless of the passenger's eventual destination. The following table shows some of the learned Q-values for this situation, obtained using Q-learning with Boltzmann exploration.¹

Dest     Action   Q
Blue     West
Blue     South
Green    West
Green    South

For each destination, the value of moving south and of moving west should be approximately the same, but the exploitation component of the learning policy caused Q-learning to converge to a correct estimate for only one of the two optimal actions. Q-learning with an exploitative policy is an extreme case, but even an algorithm that explores the domain in a more balanced fashion is likely to have different estimates for Q-values that are in truth the same, simply due to the stochastic nature of the domain. To determine whether one Q-value is greater than another, we must take into account the uncertainty in our estimate.

¹ For all the Q-learning runs in this paper, we used a starting temperature of 50, a cooling rate of , a learning rate of 0.25, and no discount factor.
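Boltzmann exploration, as used for the runs in the footnote, selects actions with probability proportional to exp(Q/τ), where the temperature τ is gradually cooled. A minimal sketch of the action-selection rule (illustrative names, not the paper's code):

```python
import math

def boltzmann_probs(q_values, temperature):
    """Softmax action probabilities: a high temperature explores almost
    uniformly, a low temperature concentrates on the greedy action."""
    # Subtract the max Q-value for numerical stability before exponentiating.
    m = max(q_values)
    exps = [math.exp((q - m) / temperature) for q in q_values]
    z = sum(exps)
    return [e / z for e in exps]
```

Cooling the temperature over training shifts the policy from exploration toward exploitation, which is why late in learning only one of several tied-optimal actions tends to keep an accurate estimate.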
4 Testing hypotheses

To evaluate whether a state-action value is optimal, we draw inspiration from statistical hypothesis testing. In this family of techniques, we consider a default hypothesis (called the null hypothesis) and wish to determine whether some data tends to refute this hypothesis. We calculate a certain scalar statistic of the data and determine the distribution of this statistic assuming the null hypothesis is true. Then we compute the likelihood p of observing a value as extreme as the observed statistic given that distribution. We reject the null hypothesis if and only if this likelihood falls below some predetermined threshold (called the significance level), indicating that the statistic lies in the unlikely tail of the distribution. In our framework we define for each state (x, y) and action a a separate null hypothesis that a is an optimal action in (x, y): Q(x, y, a) ≥ V(x, y). If for a given x we accept the null hypothesis for all y, then action a is optimal regardless of the value of Y. In this case Y is irrelevant given x according to our definition of irrelevance. Conversely, Y is relevant given x if for every action we reject the null hypothesis for some value y.

4.1 Classical hypothesis testing

If we regard each state-action value as a random variable, then we can apply established statistical tests for determining whether the means of two random variables differ. This straightforward approach requires us to draw independent samples of the estimated value for each state-action pair. We can obtain this sample by repeatedly running any RL algorithm that computes these estimates until it converges to an optimal policy. After n runs we have a sample of size n of each state-action value. Instead of directly testing the hypothesis that a is an optimal action in state s, we test the hypothesis that Q(s, a) ≥ Q(s, a′) for each other action a′. Only if we accept all of these hypotheses do we accept the hypothesis that a is optimal.
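The aggregation of pairwise hypotheses described above can be sketched as follows. The helper names and the toy p-value table are our own; the pairwise test itself (a t-test, Wilcoxon, or anything else) is left pluggable:

```python
def action_optimal(s, a, actions, pairwise_p, threshold=0.05):
    """Accept 'a is optimal in state s' iff, for every rival action a2,
    the p-value against the null hypothesis Q(s, a) >= Q(s, a2) stays
    above the significance threshold. pairwise_p(s, a, a2) supplies it."""
    return all(pairwise_p(s, a, a2) > threshold
               for a2 in actions if a2 != a)

# Toy table: the data gives no reason to reject "a0 >= a1" (p = 0.6),
# but strongly supports rejecting "a1 >= a0" (p = 0.01).
table = {("s0", "a0", "a1"): 0.6, ("s0", "a1", "a0"): 0.01}

def lookup(s, a, a2):
    return table[(s, a, a2)]
```

Only actions that survive every pairwise comparison are accepted as optimal, mirroring the conjunction of null hypotheses in the text.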
If we assume that our sample of Q-values has a Gaussian distribution (for each state-action pair), we could use a paired t-test to test these hypotheses. In general, we have reason to believe that the actual distribution is somewhat skewed, since these values are the max of other values. Fortunately, the statistical literature provides a test that does not require us to know the distribution of our sample: the Wilcoxon signed ranks test [3]. This test computes a statistic of the difference between Q(s, a) and Q(s, a′) for each run that is known to converge to a Gaussian distribution for sufficiently large n. It outputs the maximum significance level at which we should still accept the hypothesis that Q(s, a) ≥ Q(s, a′). We then accept the hypothesis that a is optimal in state s if and only if the maximum significance level for each a′ is greater than our threshold.

4.2 Monte Carlo simulation

The straightforward implementation described above makes very poor use of experience data, since each time step contributes to only one of the sample solutions. Here we develop an alternate approach that draws upon recent work in Bayesian MDP models [2]. This technique regards the successor state that results from a given state-action pair as a random variable drawn from a multinomial distribution. Using Bayesian parameter estimation techniques, we start with a prior probability distribution over the parameters of each multinomial and then update these distributions given experience data. The joint distribution over the transition probabilities and one-step rewards for each state-action pair comprises the Bayesian model. This Bayesian MDP model thus represents a single probability distribution over MDPs whose mean converges in the limit to the MDP that generated the data. In our approach we use all of the experience data to learn a single Bayesian model of the domain. We then draw sample MDPs that are independent given the model and apply Monte Carlo simulation to make probabilistic statements about the Q-values of the underlying MDP. We directly estimate the probability that an action is optimal in a given state (given our prior distribution) as the fraction of samples in which the action is in fact optimal. We then accept the hypothesis that an action is optimal unless the estimated probability of optimality is too low: we accept Q̂(x, y, a) = V̂(x, y) if and only if

Pr(Q(x, y, a) = V(x, y) | h) ≥ p,

where h denotes the observed experience data and p is a significance level as in classical hypothesis testing.

5 Results

We verified the correctness of both our statistical hypothesis testing and Monte Carlo approaches on the stochastic version of the Taxi domain. Both methods generally proceed in two phases. In the first phase, we run an established RL algorithm until convergence, perhaps multiple times. In the second phase, we use the output of the RL algorithm to accept or to reject the hypotheses that certain state variables are irrelevant in certain conditions. Evaluating all possible hypotheses of this form would be prohibitively expensive, so here we examine just two cases in the Taxi domain to demonstrate that we can discriminate between hypotheses that we should reject and hypotheses that we should accept. In the first case, the passenger is at the upper left landmark. In this case we wish to show that the passenger's destination is irrelevant to the optimal action, which is always to navigate towards the upper left landmark. In the second case, the passenger is inside the taxi. Here we wish to show that the passenger's destination is not irrelevant, since the optimal action is to navigate towards the destination.

5.1 The Wilcoxon signed ranks test

To obtain the sample Q-values necessary to apply the Wilcoxon signed ranks test, we ran 25 independent instances of Q-learning with Boltzmann exploration, each for enough time steps to ensure convergence to an optimal policy.
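The signed-ranks statistic and its large-n Gaussian approximation, as described in Section 4.1, can be sketched directly. This is a simplified version that ignores ties and zero differences (a production implementation such as scipy.stats.wilcoxon handles those corrections):

```python
import math

def wilcoxon_p(diffs):
    """One-sided p-value against the alternative that the paired
    differences are centered above zero, via the large-n normal
    approximation to the signed-rank statistic W+ (no tie correction)."""
    n = len(diffs)
    # Rank the differences by absolute value; W+ sums the ranks of the
    # positive differences.
    ranked = sorted(range(n), key=lambda i: abs(diffs[i]))
    w_plus = sum(rank + 1 for rank, i in enumerate(ranked) if diffs[i] > 0)
    # Under the null hypothesis, W+ is approximately Gaussian:
    mean = n * (n + 1) / 4.0
    sd = math.sqrt(n * (n + 1) * (2 * n + 1) / 24.0)
    z = (w_plus - mean) / sd
    # Upper-tail probability of the standard normal.
    return 0.5 * math.erfc(z / math.sqrt(2.0))
```

Here diffs would hold the per-run differences between two sampled Q-values; a small p-value is evidence that the first Q-value genuinely exceeds the second.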
Then, for the two cases above, we applied the Wilcoxon signed ranks test to determine, for each possible location of the taxi, the maximum significance level at which we would conclude that the passenger's destination is relevant to the optimal policy. The following table displays the values obtained in a typical run for the first case, in which the passenger's destination is not relevant.

[Table: maximum significance levels for each of the 25 taxi locations, passenger at the upper left landmark]

Consider the upper right hand square. The value we obtain there means that there exists some action a for which we accept, across all possible passenger destinations, the null hypothesis that a is optimal, so long as we choose a significance level p below this value. Only if we choose a larger significance level does there exist a passenger destination and an alternative action a′ such that we should reject the hypothesis that a is as good as a′. We see from the table that, on this trial, for sufficiently small p this approach correctly identifies the passenger destination as irrelevant for all 25 taxi locations (when the passenger is currently at the upper left landmark). The next table represents the case when the passenger is inside the taxi and the destination is generally relevant.
[Table: maximum significance levels for each of the 25 taxi locations, passenger inside the taxi]

All but four locations in this case have extremely low p-values, suggesting that we reject the hypothesis that the passenger destination is irrelevant (thus indicating that it is relevant) in these states. In the four locations with higher p-values the passenger destination actually is irrelevant: although the passenger is already inside the taxi, moving north is an optimal first action towards all four of the possible destinations. (Recall the possible passenger destinations as indicated in Figure 1.) These values indicate that on this trial this approach avoids false positive identifications of irrelevant state for suitably small p. In ten trials, the p-values this approach generated for states where the null hypothesis was false never overlapped with those it generated for states where the null hypothesis was true. Over these ten trials, a typical significance level of 0.05 would have correctly classified the relevancy of the passenger destination in every state.

5.2 Monte Carlo simulation

We also validated our Monte Carlo approach on the Taxi domain. We used prioritized sweeping [6] with t_Bored = 10 to ensure that the Bayesian model had at least ten samples for each reachable input to the transition function. We allowed the agent to explore for 40,000 time steps, enough to ensure that it completed its exploration. The agent assumed that the reward function was deterministic, so it knew all the one-step rewards after visiting each state-action pair at least once. In general, if we do not make this assumption, then we must choose some prior distribution over rewards for each state-action pair. Since the Taxi domain has a deterministic reward function, we chose to avoid this complication in the work reported here. Furthermore, we initialized each parameter of the Dirichlet distributions to 0. This prior distribution is not formally a Dirichlet distribution, which assumes that each parameter is positive.
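In code, drawing a transition distribution from this zero-count posterior reduces to normalized gamma draws over the observed successors only, since a Dirichlet sample can be built from independent gamma variates. A sketch under our own naming:

```python
import random

def sample_transition(counts, rng=random):
    """Draw one transition distribution from a Dirichlet posterior whose
    parameters are raw visit counts. Successor states with a zero count
    keep probability exactly 0, matching the improper zero-initialized
    prior described in the text."""
    draws = {s: rng.gammavariate(c, 1.0) if c > 0 else 0.0
             for s, c in counts.items()}
    total = sum(draws.values())
    return {s: g / total for s, g in draws.items()}
```

Each call yields one sampled multinomial for a state-action pair; repeating this for every pair yields one sampled MDP from the Bayesian model.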
However, we can still sample from these distributions by assuming that unobserved state transitions have probability 0. This improper prior has the advantage of yielding a Bayesian model whose mean is identical to the maximum likelihood model, and it is slightly more computationally efficient than the approach of Dearden et al. [2]. After the exploration phase, we sampled 100 MDPs from the learned Bayesian model. We solved each of these using value iteration and examined the same two cases as in Section 5.1. The following table shows, for each of the 25 taxi locations, the maximum probability at which some action is optimal across all passenger destinations, given that the passenger is still waiting at the upper left landmark. In other words, each cell contains the quantity

max_a min_y Pr̂(Q(x, y, a) = V(x, y)),

where x corresponds to the taxi location and passenger location.

[Table: estimated optimality probabilities for each of the 25 taxi locations, passenger at the upper left landmark]

Although these estimated probabilities do not convey the same formal meaning as the significance values that statistical hypothesis tests output, we may interpret them in a somewhat similar fashion. Consider the taxi location with the smallest estimated probability, 0.20. If we start with the null hypothesis that some action is optimal at that location across all passenger destinations, our Monte Carlo simulation gives us no reason to reject that hypothesis, since at least one action was optimal in 20 of the 100 sampled MDPs. The next table shows the estimated probabilities for the second case, when the passenger is inside the taxi.

[Table: estimated optimality probabilities for each of the 25 taxi locations, passenger inside the taxi]

Note that for all the locations where the passenger destination is in fact relevant, no action was optimal across passenger destinations in any of the 100 sampled MDPs. We can easily imagine setting a probability threshold similar in meaning to the significance level of statistical hypothesis tests. We would then reject the null hypothesis only when the estimated probability falls below that threshold. In the ten trials that we ran, a threshold of 0.05 never caused any false negatives but did lead the algorithm erroneously to classify the passenger's destination as relevant in three instances out of 1,000. (In each trial, the destination is irrelevant for each combination of four passenger locations and 25 taxi locations.)

The principal cost of the Monte Carlo approach is computational. The process of learning the Bayesian model, sampling 100 MDPs, and performing value iteration until convergence 100 times required 335 seconds on a 2.8 GHz Pentium 4 CPU, in contrast to the 9 seconds required to run 25 instances of Q-learning and to apply the Wilcoxon signed ranks test. On the other hand, the Monte Carlo approach makes more efficient use of the data, requiring only the 40,000 steps of direct experience with the environment from a single exploration phase instead of 25 complete Q-learning runs. Thus one method emphasizes computational efficiency and the other sample complexity. Learning a state abstraction even from a solved task could be well worth the cost. For example, our implementation of prioritized sweeping required over 24 minutes to solve a random instance of the Taxi domain.
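Each sampled MDP is solved by value iteration, and the optimality probability is then the fraction of samples in which an action attains the maximum value. A compact undiscounted, episodic sketch on a deterministic toy MDP (the sampled Taxi MDPs are stochastic; names are illustrative):

```python
def value_iteration(states, actions, step, goal, tol=1e-9):
    """Undiscounted value iteration for an episodic MDP with a single
    absorbing goal state. step(s, a) returns (next_state, reward)."""
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            if s == goal:
                continue
            backups = []
            for a in actions:
                nxt, reward = step(s, a)
                backups.append(reward + V[nxt])
            best = max(backups)
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < tol:
            return V

# Toy 3-state corridor: each move costs -1, entering the goal pays +20,
# echoing the Taxi reward structure. Pr(a optimal in s) would then be
# estimated as the fraction of sampled MDPs in which Q(s, a) equals the
# maximum value found by this routine.
def step(s, a):
    nxt = min(2, s + 1) if a == "right" else max(0, s - 1)
    return nxt, -1.0 + (20.0 if nxt == 2 and s != 2 else 0.0)
```

Running this on the corridor gives V = 18 one step from the penultimate state and V = 19 adjacent to the goal, matching the -1-per-step, +20-at-goal rewards.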
In contrast, solving the original 5 × 5 domain, applying the Monte Carlo approach to discover when the passenger destination was irrelevant, and then using this abstraction solved the same instance in only 12.5 minutes.

As with all forms of statistical hypothesis testing, random chance will occasionally cause these procedures to accept an incorrect hypothesis or to reject a correct hypothesis. Collecting more data can reduce the likelihood of error, but our results show that our approach already discriminates fairly reliably between situations when the passenger destination is relevant to behaving optimally and when it is irrelevant.

6 Related work

Our work bears a strong resemblance to aspects of McCallum's U-tree algorithm [5], which uses statistical hypothesis testing to determine what features to include in its state representation. U-tree is an online instance-based algorithm that adds a state variable to its representation if different values of the variable predict different distributions of expected future reward. The algorithm computes these distributions of values in part from the current representation, resulting in a circularity that prevents it from guaranteeing convergence to an optimal state abstraction. In contrast, our approach explicitly employs only state abstractions that preserve an optimal policy.

Both of the methods we described in Section 4 require information obtained from a complete solution to the given task, so at present they are most likely to be useful for finding state abstractions in small problems that might apply in similar but much larger problems. We leave for future work the question of how we might apply these techniques online to tasks that are not yet fully learned. In this situation the uncertainty in the value function is much larger, and our approach will tend to assume that all state variables are irrelevant in the absence of sufficient evidence to the contrary. We also leave for future work how to determine which candidate state abstractions to test, if we cannot afford to test them all.

7 Conclusion

This paper has addressed the problem of determining which state variables are relevant to the solution of an RL task. We defined the relevancy of a state variable in terms of the existence of an action that is optimal across all values of that state variable. We described two statistical methods for determining whether an action is optimal in a given state. One method applies an established statistical hypothesis test to Q-values obtained from independent runs of an RL algorithm. This method is as computationally efficient as the RL algorithm used. The other method applies Monte Carlo simulation to a learned Bayesian model and requires far less experience data. Finally, we demonstrated that both methods accurately identify the conditions under which a certain state variable is irrelevant in the Taxi domain.

Acknowledgments

We would like to thank Greg Kuhlmann for helpful comments and suggestions. This research was supported in part by NSF CAREER award IIS.

References

1. Andrew G. Barto and Sridhar Mahadevan. Recent advances in hierarchical reinforcement learning. Discrete Event Dynamic Systems, 13:41-77, 2003. Special Issue on Reinforcement Learning.
2. Richard Dearden, Nir Friedman, and David Andre. Model based Bayesian exploration. In Proceedings of the Fifteenth Conference on Uncertainty in Artificial Intelligence, 1999.
3. Morris H. DeGroot. Probability and Statistics. Addison-Wesley, 2nd edition, 1986.
4. Thomas G. Dietterich. Hierarchical reinforcement learning with the MAXQ value function decomposition. Journal of Artificial Intelligence Research, 13:227-303, 2000.
5. Andrew Kachites McCallum. Reinforcement Learning with Selective Perception and Hidden State. PhD thesis, University of Rochester, 1995.
6. Andrew W. Moore and Christopher G. Atkeson. Prioritized sweeping: Reinforcement learning with less data and less real time. Machine Learning, 13:103-130, 1993.
7. Ronald Parr and Stuart Russell. Reinforcement learning with hierarchies of machines. In Advances in Neural Information Processing Systems 10, 1998.
8. Richard S. Sutton, Doina Precup, and Satinder Singh. Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence, 112(1-2):181-211, 1999.
More informationModeling Human Understanding of Complex Intentional Action with a Bayesian Nonparametric Subgoal Model
Modeling Human Understanding of Complex Intentional Action with a Bayesian Nonparametric Subgoal Model Ryo Nakahashi Chris L. Baker and Joshua B. Tenenbaum Computer Science and Artificial Intelligence
More informationUSE AND MISUSE OF MIXED MODEL ANALYSIS VARIANCE IN ECOLOGICAL STUDIES1
Ecology, 75(3), 1994, pp. 717-722 c) 1994 by the Ecological Society of America USE AND MISUSE OF MIXED MODEL ANALYSIS VARIANCE IN ECOLOGICAL STUDIES1 OF CYNTHIA C. BENNINGTON Department of Biology, West
More informationSLAUGHTER PIG MARKETING MANAGEMENT: UTILIZATION OF HIGHLY BIASED HERD SPECIFIC DATA. Henrik Kure
SLAUGHTER PIG MARKETING MANAGEMENT: UTILIZATION OF HIGHLY BIASED HERD SPECIFIC DATA Henrik Kure Dina, The Royal Veterinary and Agricuural University Bülowsvej 48 DK 1870 Frederiksberg C. kure@dina.kvl.dk
More information1 What is an Agent? CHAPTER 2: INTELLIGENT AGENTS
1 What is an Agent? CHAPTER 2: INTELLIGENT AGENTS http://www.csc.liv.ac.uk/ mjw/pubs/imas/ The main point about agents is they are autonomous: capable of acting independently, exhibiting control over their
More informationGene Selection for Tumor Classification Using Microarray Gene Expression Data
Gene Selection for Tumor Classification Using Microarray Gene Expression Data K. Yendrapalli, R. Basnet, S. Mukkamala, A. H. Sung Department of Computer Science New Mexico Institute of Mining and Technology
More informationEvolutionary Programming
Evolutionary Programming Searching Problem Spaces William Power April 24, 2016 1 Evolutionary Programming Can we solve problems by mi:micing the evolutionary process? Evolutionary programming is a methodology
More informationSurvival Skills for Researchers. Study Design
Survival Skills for Researchers Study Design Typical Process in Research Design study Collect information Generate hypotheses Analyze & interpret findings Develop tentative new theories Purpose What is
More informationObjectives. Quantifying the quality of hypothesis tests. Type I and II errors. Power of a test. Cautions about significance tests
Objectives Quantifying the quality of hypothesis tests Type I and II errors Power of a test Cautions about significance tests Designing Experiments based on power Evaluating a testing procedure The testing
More informationAI: Intelligent Agents. Chapter 2
AI: Intelligent Agents Chapter 2 Outline Agents and environments Rationality PEAS (Performance measure, Environment, Actuators, Sensors) Environment types Agent types Agents An agent is anything
More informationTwo-sided Bandits and the Dating Market
Two-sided Bandits and the Dating Market Sanmay Das Center for Biological and Computational Learning Massachusetts Institute of Technology Cambridge, MA 02139 sanmay@mit.edu Emir Kamenica Department of
More informationAgents and Environments
Agents and Environments Berlin Chen 2004 Reference: 1. S. Russell and P. Norvig. Artificial Intelligence: A Modern Approach. Chapter 2 AI 2004 Berlin Chen 1 What is an Agent An agent interacts with its
More informationMITOCW conditional_probability
MITOCW conditional_probability You've tested positive for a rare and deadly cancer that afflicts 1 out of 1000 people, based on a test that is 99% accurate. What are the chances that you actually have
More informationDecision Analysis. John M. Inadomi. Decision trees. Background. Key points Decision analysis is used to compare competing
5 Decision Analysis John M. Inadomi Key points Decision analysis is used to compare competing strategies of management under conditions of uncertainty. Various methods may be employed to construct a decision
More informationLec 02: Estimation & Hypothesis Testing in Animal Ecology
Lec 02: Estimation & Hypothesis Testing in Animal Ecology Parameter Estimation from Samples Samples We typically observe systems incompletely, i.e., we sample according to a designed protocol. We then
More informationSparse Coding in Sparse Winner Networks
Sparse Coding in Sparse Winner Networks Janusz A. Starzyk 1, Yinyin Liu 1, David Vogel 2 1 School of Electrical Engineering & Computer Science Ohio University, Athens, OH 45701 {starzyk, yliu}@bobcat.ent.ohiou.edu
More informationProgress in Risk Science and Causality
Progress in Risk Science and Causality Tony Cox, tcoxdenver@aol.com AAPCA March 27, 2017 1 Vision for causal analytics Represent understanding of how the world works by an explicit causal model. Learn,
More informationChapter 11. Experimental Design: One-Way Independent Samples Design
11-1 Chapter 11. Experimental Design: One-Way Independent Samples Design Advantages and Limitations Comparing Two Groups Comparing t Test to ANOVA Independent Samples t Test Independent Samples ANOVA Comparing
More informationOutlier Analysis. Lijun Zhang
Outlier Analysis Lijun Zhang zlj@nju.edu.cn http://cs.nju.edu.cn/zlj Outline Introduction Extreme Value Analysis Probabilistic Models Clustering for Outlier Detection Distance-Based Outlier Detection Density-Based
More informationIntelligent Agents. Russell and Norvig: Chapter 2
Intelligent Agents Russell and Norvig: Chapter 2 Intelligent Agent? sensors agent actuators percepts actions environment Definition: An intelligent agent perceives its environment via sensors and acts
More informationMarcus Hutter Canberra, ACT, 0200, Australia
Marcus Hutter Canberra, ACT, 0200, Australia http://www.hutter1.net/ Australian National University Abstract The approaches to Artificial Intelligence (AI) in the last century may be labelled as (a) trying
More informationERA: Architectures for Inference
ERA: Architectures for Inference Dan Hammerstrom Electrical And Computer Engineering 7/28/09 1 Intelligent Computing In spite of the transistor bounty of Moore s law, there is a large class of problems
More informationRemarks on Bayesian Control Charts
Remarks on Bayesian Control Charts Amir Ahmadi-Javid * and Mohsen Ebadi Department of Industrial Engineering, Amirkabir University of Technology, Tehran, Iran * Corresponding author; email address: ahmadi_javid@aut.ac.ir
More informationTime Experiencing by Robotic Agents
Time Experiencing by Robotic Agents Michail Maniadakis 1 and Marc Wittmann 2 and Panos Trahanias 1 1- Foundation for Research and Technology - Hellas, ICS, Greece 2- Institute for Frontier Areas of Psychology
More informationHebbian Plasticity for Improving Perceptual Decisions
Hebbian Plasticity for Improving Perceptual Decisions Tsung-Ren Huang Department of Psychology, National Taiwan University trhuang@ntu.edu.tw Abstract Shibata et al. reported that humans could learn to
More informationBayesian Models for Combining Data Across Subjects and Studies in Predictive fmri Data Analysis
Bayesian Models for Combining Data Across Subjects and Studies in Predictive fmri Data Analysis Thesis Proposal Indrayana Rustandi April 3, 2007 Outline Motivation and Thesis Preliminary results: Hierarchical
More informationIntroduction to Machine Learning. Katherine Heller Deep Learning Summer School 2018
Introduction to Machine Learning Katherine Heller Deep Learning Summer School 2018 Outline Kinds of machine learning Linear regression Regularization Bayesian methods Logistic Regression Why we do this
More informationArtificial Intelligence Programming Probability
Artificial Intelligence Programming Probability Chris Brooks Department of Computer Science University of San Francisco Department of Computer Science University of San Francisco p.1/25 17-0: Uncertainty
More informationBayesian Nonparametric Methods for Precision Medicine
Bayesian Nonparametric Methods for Precision Medicine Brian Reich, NC State Collaborators: Qian Guan (NCSU), Eric Laber (NCSU) and Dipankar Bandyopadhyay (VCU) University of Illinois at Urbana-Champaign
More informationSUPPLEMENTAL MATERIAL
1 SUPPLEMENTAL MATERIAL Response time and signal detection time distributions SM Fig. 1. Correct response time (thick solid green curve) and error response time densities (dashed red curve), averaged across
More informationCitation for published version (APA): Ebbes, P. (2004). Latent instrumental variables: a new approach to solve for endogeneity s.n.
University of Groningen Latent instrumental variables Ebbes, P. IMPORTANT NOTE: You are advised to consult the publisher's version (publisher's PDF) if you wish to cite from it. Please check the document
More informationRepresenting Problems (and Plans) Using Imagery
Representing Problems (and Plans) Using Imagery Samuel Wintermute University of Michigan 2260 Hayward St. Ann Arbor, MI 48109-2121 swinterm@umich.edu Abstract In many spatial problems, it can be difficult
More informationBelief Management for Autonomous Robots using History-Based Diagnosis
Belief Management for Autonomous Robots using History-Based Diagnosis Stephan Gspandl, Ingo Pill, Michael Reip, Gerald Steinbauer Institute for Software Technology Graz University of Technology Inffeldgasse
More informationIntelligent Machines That Act Rationally. Hang Li Bytedance AI Lab
Intelligent Machines That Act Rationally Hang Li Bytedance AI Lab Four Definitions of Artificial Intelligence Building intelligent machines (i.e., intelligent computers) Thinking humanly Acting humanly
More informationIntelligent Machines That Act Rationally. Hang Li Toutiao AI Lab
Intelligent Machines That Act Rationally Hang Li Toutiao AI Lab Four Definitions of Artificial Intelligence Building intelligent machines (i.e., intelligent computers) Thinking humanly Acting humanly Thinking
More informationEmergence of Emotional Appraisal Signals in Reinforcement Learning Agents
Autonomous Agents and Multiagent Systems manuscript No. (will be inserted by the editor) Emergence of Emotional Appraisal Signals in Reinforcement Learning Agents Pedro Sequeira Francisco S. Melo Ana Paiva
More informationExploring the Influence of Particle Filter Parameters on Order Effects in Causal Learning
Exploring the Influence of Particle Filter Parameters on Order Effects in Causal Learning Joshua T. Abbott (joshua.abbott@berkeley.edu) Thomas L. Griffiths (tom griffiths@berkeley.edu) Department of Psychology,
More informationAnalyses of Markov decision process structure regarding the possible strategic use of interacting memory systems
COMPUTATIONAL NEUROSCIENCE ORIGINAL RESEARCH ARTICLE published: 24 December 2008 doi: 10.3389/neuro.10.006.2008 Analyses of Markov decision process structure regarding the possible strategic use of interacting
More informationLecture 13: Finding optimal treatment policies
MACHINE LEARNING FOR HEALTHCARE 6.S897, HST.S53 Lecture 13: Finding optimal treatment policies Prof. David Sontag MIT EECS, CSAIL, IMES (Thanks to Peter Bodik for slides on reinforcement learning) Outline
More informationSawtooth Software. MaxDiff Analysis: Simple Counting, Individual-Level Logit, and HB RESEARCH PAPER SERIES. Bryan Orme, Sawtooth Software, Inc.
Sawtooth Software RESEARCH PAPER SERIES MaxDiff Analysis: Simple Counting, Individual-Level Logit, and HB Bryan Orme, Sawtooth Software, Inc. Copyright 009, Sawtooth Software, Inc. 530 W. Fir St. Sequim,
More informationI. INTRODUCTION /$ IEEE 70 IEEE TRANSACTIONS ON AUTONOMOUS MENTAL DEVELOPMENT, VOL. 2, NO. 2, JUNE 2010
70 IEEE TRANSACTIONS ON AUTONOMOUS MENTAL DEVELOPMENT, VOL. 2, NO. 2, JUNE 2010 Intrinsically Motivated Reinforcement Learning: An Evolutionary Perspective Satinder Singh, Richard L. Lewis, Andrew G. Barto,
More informationExpert System Profile
Expert System Profile GENERAL Domain: Medical Main General Function: Diagnosis System Name: INTERNIST-I/ CADUCEUS (or INTERNIST-II) Dates: 1970 s 1980 s Researchers: Ph.D. Harry Pople, M.D. Jack D. Myers
More informationIntroduction to Artificial Intelligence 2 nd semester 2016/2017. Chapter 2: Intelligent Agents
Introduction to Artificial Intelligence 2 nd semester 2016/2017 Chapter 2: Intelligent Agents Mohamed B. Abubaker Palestine Technical College Deir El-Balah 1 Agents and Environments An agent is anything
More informationPlan Recognition through Goal Graph Analysis
Plan Recognition through Goal Graph Analysis Jun Hong 1 Abstract. We present a novel approach to plan recognition based on a two-stage paradigm of graph construction and analysis. First, a graph structure
More informationArtificial Intelligence
Artificial Intelligence Intelligent Agents Chapter 2 & 27 What is an Agent? An intelligent agent perceives its environment with sensors and acts upon that environment through actuators 2 Examples of Agents
More informationUsing Bayesian Networks to Analyze Expression Data. Xu Siwei, s Muhammad Ali Faisal, s Tejal Joshi, s
Using Bayesian Networks to Analyze Expression Data Xu Siwei, s0789023 Muhammad Ali Faisal, s0677834 Tejal Joshi, s0677858 Outline Introduction Bayesian Networks Equivalence Classes Applying to Expression
More informationBayesian Logistic Regression Modelling via Markov Chain Monte Carlo Algorithm
Journal of Social and Development Sciences Vol. 4, No. 4, pp. 93-97, Apr 203 (ISSN 222-52) Bayesian Logistic Regression Modelling via Markov Chain Monte Carlo Algorithm Henry De-Graft Acquah University
More informationBayesian (Belief) Network Models,
Bayesian (Belief) Network Models, 2/10/03 & 2/12/03 Outline of This Lecture 1. Overview of the model 2. Bayes Probability and Rules of Inference Conditional Probabilities Priors and posteriors Joint distributions
More informationBayesian models of inductive generalization
Bayesian models of inductive generalization Neville E. Sanjana & Joshua B. Tenenbaum Department of Brain and Cognitive Sciences Massachusetts Institute of Technology Cambridge, MA 239 nsanjana, jbt @mit.edu
More informationSheila Barron Statistics Outreach Center 2/8/2011
Sheila Barron Statistics Outreach Center 2/8/2011 What is Power? When conducting a research study using a statistical hypothesis test, power is the probability of getting statistical significance when
More informationA Reinforcement Learning Approach Involving a Shortest Path Finding Algorithm
Proceedings of the 003 IEEE/RSJ Intl. Conference on Intelligent Robots and Systems Proceedings Las Vegas, of Nevada the 003 October IEEE/RSJ 003 Intl. Conference on Intelligent Robots and Systems Las Vegas,
More informationIntroduction. Patrick Breheny. January 10. The meaning of probability The Bayesian approach Preview of MCMC methods
Introduction Patrick Breheny January 10 Patrick Breheny BST 701: Bayesian Modeling in Biostatistics 1/25 Introductory example: Jane s twins Suppose you have a friend named Jane who is pregnant with twins
More informationProbabilistic Graphical Models: Applications in Biomedicine
Probabilistic Graphical Models: Applications in Biomedicine L. Enrique Sucar, INAOE Puebla, México May 2012 What do you see? What we see depends on our previous knowledge (model) of the world and the information
More informationHow Does Analysis of Competing Hypotheses (ACH) Improve Intelligence Analysis?
How Does Analysis of Competing Hypotheses (ACH) Improve Intelligence Analysis? Richards J. Heuer, Jr. Version 1.2, October 16, 2005 This document is from a collection of works by Richards J. Heuer, Jr.
More informationTHE USE OF MULTIVARIATE ANALYSIS IN DEVELOPMENT THEORY: A CRITIQUE OF THE APPROACH ADOPTED BY ADELMAN AND MORRIS A. C. RAYNER
THE USE OF MULTIVARIATE ANALYSIS IN DEVELOPMENT THEORY: A CRITIQUE OF THE APPROACH ADOPTED BY ADELMAN AND MORRIS A. C. RAYNER Introduction, 639. Factor analysis, 639. Discriminant analysis, 644. INTRODUCTION
More informationFurther Properties of the Priority Rule
Further Properties of the Priority Rule Michael Strevens Draft of July 2003 Abstract In Strevens (2003), I showed that science s priority system for distributing credit promotes an allocation of labor
More informationLearning Navigational Maps by Observing the Movement of Crowds
Learning Navigational Maps by Observing the Movement of Crowds Simon T. O Callaghan Australia, NSW s.ocallaghan@acfr.usyd.edu.au Surya P. N. Singh Australia, NSW spns@acfr.usyd.edu.au Fabio T. Ramos Australia,
More informationEECS 433 Statistical Pattern Recognition
EECS 433 Statistical Pattern Recognition Ying Wu Electrical Engineering and Computer Science Northwestern University Evanston, IL 60208 http://www.eecs.northwestern.edu/~yingwu 1 / 19 Outline What is Pattern
More informationCognitive modeling versus game theory: Why cognition matters
Cognitive modeling versus game theory: Why cognition matters Matthew F. Rutledge-Taylor (mrtaylo2@connect.carleton.ca) Institute of Cognitive Science, Carleton University, 1125 Colonel By Drive Ottawa,
More informationA Learning Method of Directly Optimizing Classifier Performance at Local Operating Range
A Learning Method of Directly Optimizing Classifier Performance at Local Operating Range Lae-Jeong Park and Jung-Ho Moon Department of Electrical Engineering, Kangnung National University Kangnung, Gangwon-Do,
More informationA SYSTEM FOR COMPUTER-AIDED DIAGNOSIS
p- %'IK- _'^ PROBLEM- SOLVING STRATEGIES IN A SYSTEM FOR COMPUTER-AIDED DIAGNOSIS 268-67 George Anthony Gorry June 1967 RECEIVED JUN 26 1967 Abstract A system consisting of a diagnostic program and a
More informationMODELING NONCOMPENSATORY CHOICES WITH A COMPENSATORY MODEL FOR A PRODUCT DESIGN SEARCH
Proceedings of the ASME 2015 International Design Engineering Technical Conferences & Computers and Information in Engineering Conference IDETC/CIE 2015 August 2 5, 2015, Boston, Massachusetts, USA DETC2015-47632
More informationKatsunari Shibata and Tomohiko Kawano
Learning of Action Generation from Raw Camera Images in a Real-World-Like Environment by Simple Coupling of Reinforcement Learning and a Neural Network Katsunari Shibata and Tomohiko Kawano Oita University,
More informationA Comparison of Collaborative Filtering Methods for Medication Reconciliation
A Comparison of Collaborative Filtering Methods for Medication Reconciliation Huanian Zheng, Rema Padman, Daniel B. Neill The H. John Heinz III College, Carnegie Mellon University, Pittsburgh, PA, 15213,
More informationDopamine neurons activity in a multi-choice task: reward prediction error or value function?
Dopamine neurons activity in a multi-choice task: reward prediction error or value function? Jean Bellot 1,, Olivier Sigaud 1,, Matthew R Roesch 3,, Geoffrey Schoenbaum 5,, Benoît Girard 1,, Mehdi Khamassi
More informationAssigning B cell Maturity in Pediatric Leukemia Gabi Fragiadakis 1, Jamie Irvine 2 1 Microbiology and Immunology, 2 Computer Science
Assigning B cell Maturity in Pediatric Leukemia Gabi Fragiadakis 1, Jamie Irvine 2 1 Microbiology and Immunology, 2 Computer Science Abstract One method for analyzing pediatric B cell leukemia is to categorize
More information26:010:557 / 26:620:557 Social Science Research Methods
26:010:557 / 26:620:557 Social Science Research Methods Dr. Peter R. Gillett Associate Professor Department of Accounting & Information Systems Rutgers Business School Newark & New Brunswick 1 Overview
More informationBayesian Networks in Medicine: a Model-based Approach to Medical Decision Making
Bayesian Networks in Medicine: a Model-based Approach to Medical Decision Making Peter Lucas Department of Computing Science University of Aberdeen Scotland, UK plucas@csd.abdn.ac.uk Abstract Bayesian
More informationFoundations of AI. 10. Knowledge Representation: Modeling with Logic. Concepts, Actions, Time, & All the Rest
Foundations of AI 10. Knowledge Representation: Modeling with Logic Concepts, Actions, Time, & All the Rest Wolfram Burgard, Andreas Karwath, Bernhard Nebel, and Martin Riedmiller 10/1 Contents Knowledge
More informationDynamic Rule-based Agent
International Journal of Engineering Research and Technology. ISSN 0974-3154 Volume 11, Number 4 (2018), pp. 605-613 International Research Publication House http://www.irphouse.com Dynamic Rule-based
More informationRational Agents (Chapter 2)
Rational Agents (Chapter 2) Agents An agent is anything that can be viewed as perceiving its environment through sensors and acting upon that environment through actuators Example: Vacuum-Agent Percepts:
More informationModel-Based fmri Analysis. Will Alexander Dept. of Experimental Psychology Ghent University
Model-Based fmri Analysis Will Alexander Dept. of Experimental Psychology Ghent University Motivation Models (general) Why you ought to care Model-based fmri Models (specific) From model to analysis Extended
More information