Enhancing Inference in Relational Reinforcement Learning via Truth Maintenance Systems


Enhancing Inference in Relational Reinforcement Learning via Truth Maintenance Systems

Mandana Hamidi, Amir Fijany, Jean-Guy Fontaine
Tele Robotics and Applications Department, Italian Institute of Technology (IIT), Genova, Italy

Abstract — Computational complexity remains a challenging problem for intelligent systems operating in complex environments. To tackle it, an agent has to handle perceptual information intelligently. In this paper, we propose an efficient and adaptive reasoning system that builds on the Adaptive Logic Interpreter reasoning system, a mechanism for guiding inference through relational reinforcement learning, and adds a variation of truth maintenance systems to speed up inference. Relational reinforcement learning guides inference toward the most rewarding parts of the knowledge base, while the truth maintenance system maintains beliefs, avoids repetitive inferences, and reduces the state space. Empirical results demonstrate higher performance than the basic approach in terms of the number of inferred instances, average reward, and average reward accuracy.

Keywords: inference engine; truth maintenance system; relational reinforcement learning; dynamic environments.

I. INTRODUCTION

Logical reasoning systems have been widely used for solving problems from different application domains. However, these systems are usually limited by the computational complexity of their inference process, particularly as the number of inference rules and the amount of available background knowledge grow. The inference mechanism is the core element of an artificial intelligent agent. When such an agent operates in real time, it is subject to time and computational resource constraints.
An agent such as a robot living in a complex perceptual environment must guarantee a reasonable response time given its computational resources. One common approach to overcoming the complexity bottleneck is to guide the reasoning process toward the most rewarding parts of the knowledge base, which leads to better performance under time and computational constraints [1]. Under such constraints an agent clearly cannot try all possible inferences. A viable approach is therefore to give priority to drawing the more important conclusions first and then, if time permits, to move on to less important ones. Given such a criterion, the question is how to focus the reasoning of a reactive agent that cannot afford all possible inferences and must therefore allocate its time and computational resources efficiently. In other words, how can it attend to the most important, in terms of reward, parts of the knowledge base? Some methods have attempted to speed up the inference mechanism by combining clever indexing schemes with exhaustive search. For instance, the Rete match algorithm operates by constructing a network that determines which rules should be triggered [3]. Truth Maintenance Systems (TMSs) apply a collection of techniques for maintaining beliefs and performing belief revision [2]. In our work, we combine a variation of TMSs with a learning algorithm to obtain a more efficient inference system. Other approaches have incorporated heuristics and control strategies to guide inference. The meta-level reasoning system developed by Genesereth et al. [4] lets the designer write Prolog-like control clauses that specify how inference rules should be prioritized. Other works have applied metareasoning, reasoning about reasoning, to deal with real-world complexity. For instance, Russell and Wefald [5] sought to develop a general framework for metareasoning based on probability and decision theory.
They introduced the idea of rational metareasoning, wherein computations are mental actions. Other approaches have focused on learning control rules to reduce search or speed up processing [6]. In [7], explanation-based learning techniques were combined with inductive logic programming ideas to learn such control conditions over inference rules. In a different approach, Cohen et al. [8] used bootstrapping to learn similar rules. These approaches assumed a query-based, top-down inference mechanism and sought to modify the logic program itself to achieve performance gains. In [9], the authors demonstrated that it is possible to learn to guide inference efficiently via reinforcement learning. One learning method that has received much attention since the early 1990s is relational reinforcement learning (RRL), which combines reinforcement learning with relational learning or inductive logic programming [10]. In the RRL algorithm that we apply in this work, similar to Russell and Wefald, actions are internal to the agent. In recent years, there has been increasing interest in applying learning methods to reasoning systems under

resource constraints. An efficient approach to speeding up inference is the Adaptive Logic Interpreter reasoning system [1], which employs a variation of RRL to control the inference process. In this system, reasoning is focused on the most rewarding parts of the knowledge base, and hence performs better under time and computational resource constraints [1]. This approach assumes a disembodied agent interacting with its surrounding world. However, when the time constraint is extremely strict or the environment changes too rapidly, it does not provide a satisfying performance gain. This is partly because RRL training is time consuming, especially when the state space is large, and partly because the system does not maintain its beliefs, making inferences regardless of whether they have already been drawn in previous steps. To tackle these drawbacks, in this paper a justification-based TMS (JTMS) is applied to enhance the system and make it efficient in real-time situations. Our approach combines the desirable features of both the JTMS and the adaptive logic interpreter and diminishes their disadvantages. It differs from the base system in two primary respects. First, we employ a JTMS to maintain the beliefs and their dependencies. Second, the JTMS helps RRL reduce (prune) the space of actions considered during learning. To the best of our knowledge, this is the first learning method combining RRL and a JTMS. This paper is organized as follows. In Sections 2 and 3, the adaptive logic interpreter and TMS approaches are briefly discussed. Our new approach combining the two is presented in Section 4. Experiments and performance results are discussed in Section 5. Finally, some concluding remarks are made in Section 6.

II. ADAPTIVE LOGIC INTERPRETER REASONING SYSTEM

The adaptive logic interpreter is an online learning algorithm that enables an agent architecture to acquire a controlled inference strategy adapted to the environment [1].
It is built upon ICARUS, a reactive agent architecture that supports reasoning and decision making [13]. ICARUS controls a physical agent in an environment composed of collections of objects whose attributes and mutual relations change over time. The agent's knowledge about the domain, in terms of concepts and skills, is stored in long-term memory. Concepts describe situations in the environment, and skills describe how to respond to those situations. The agent's dynamic beliefs are stored in short-term conceptual memory. These beliefs are specific instances of concept definitions that can be inferred from the perceptual buffer. Table I shows three concept definitions from the blocks-world environment, with variables indicated by question marks. Each concept can have five optional fields: :percepts (the perceived entities), :positives (the lower-level concepts that must match), :negatives (the lower-level concepts that must not match), :tests (the numeric relations that must hold), and :reward (the internal reward function for the matched concept). ICARUS operates in a cyclic fashion. In each cycle, the perception of the agent changes depending on its field of view, and its perceptual memory is updated with descriptions of all objects visible in the environment. An inference mechanism revises the agent's beliefs based on the perceptions and its domain knowledge. ICARUS infers all the matched instances of concepts in the hierarchy in a breadth-first, bottom-up fashion. Finally, based on the belief state, the agent finds all applicable skills and selects the skill with the highest utility to execute.

TABLE I. SOME CONCEPT DEFINITIONS FROM BLOCKS-WORLD.

(is-block (?b) :percepts ((block ?b xpos ?x)) :reward .)
(left-of (?b1 ?b2) :percepts ((block ?b1 xpos ?x1) (block ?b2 xpos ?x2)) :positives ((is-block ?b1) (is-block ?b2)) :tests ((< ?x1 ?x2)) :reward .)
(between (?b1 ?b2 ?b3) :percepts ((block ?b1 xpos ?x1) (block ?b2 xpos ?x2) (block ?b3 xpos ?x3)) :positives ((left-of ?b1 ?b2) (left-of ?b2 ?b3)) :tests ((< ?x3-x1 3)) :reward (*1(-3?x3?x2?x1)))

ICARUS's inference method, exhaustive inference, considers concepts in a bottom-up, breadth-first fashion with no control over the reasoning strategy. Hence, it does not perform well when the agent operates under time constraints. To overcome this drawback, the adaptive logic interpreter applies RRL to capture the important instances by estimating their values. All concepts are instantiated, but only the most important ones are inferred. The system consists of two components: a) a learning mechanism and b) a generalization mechanism. The first component uses attention-relevant values assigned to instances to determine the most rewarding subset. The second generalizes the instance-specific values that result from the reinforcement learning algorithm into value functions for the corresponding concept definitions.

A. Learning Mechanism

Consider the set of all concept instances. The agent aims to find the subset that maximizes the cumulative reward under the time constraint, and to achieve this the RRL algorithm learns values over the instances. The mental state is defined as the set of instances inferred to be true after the inference steps taken so far within the current execution cycle. The fringe is defined as the set of all inferable concept instances whose children are already included in the state. Each inference step consists of selecting an inferable instance from the fringe and checking whether it holds. An action is an inference step that infers an instance at the current step. After the inference, the reward of the action is computed by (1), where the reward function associated with the concept instance is evaluated on the attribute vector of the perceived

objects on which the instance depends. The expected value of the instance is updated by (2), where the update ranges over the set of inferable instances in the current fringe of which the instance is a child, and the learning rate in (3) is defined via visits(·), the number of updates performed on the instance. For better understanding, suppose the knowledge base consists of the three concept definitions in Table I. Fig. 1 (bottom) shows how the belief state and the fringe are updated by selecting a valid action. Once the system selects the instance (is-block C) as an action and infers it as a true belief, it adds it to the belief state. Because (is-block B) and (is-block A) are already in the belief state, the new inferable instances (left-of A C), (left-of C A), (left-of B C), and (left-of C B) are added to the fringe.

Figure 1. Top) A simple example with three blocks A, B, and C from the blocks-world. Bottom) The belief state and fringe after inferring (is-block C): bold literals belong to the belief state, blue literals to the fringe.

B. Generalization Mechanism

For each concept definition, the generalization algorithm uses a linear regression method to initialize the expected values of new candidate beliefs in the instance space. The regression model is incrementally updated by (4) and fitted to the training examples, where x(u) denotes the vector of attributes of the perceptions that appear in concept instance u, over the set of all instances derived from concept definition c.

III. TRUTH MAINTENANCE SYSTEMS (TMSs)

In 1979, Doyle introduced the TMS, a general problem-solving facility that helps an inference engine manipulate beliefs efficiently [2]. In most problem solvers, the inference engine explores alternatives, forms beliefs, and examines their consequences, but it often makes repetitive inferences. A TMS maintains a cache of all the inferences ever made.
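The belief-state and fringe update of Fig. 1 can be sketched in a few lines. The tuple representation, the function name, and the restriction to is-block/left-of instances are illustrative assumptions, not the system's actual data structures.

```python
from itertools import permutations

def update_fringe(belief_state, fringe, new_instance):
    """After inferring `new_instance` true, add it to the belief state and
    put every newly inferable left-of instance on the fringe (a left-of
    instance is inferable once both of its child is-block instances are
    believed)."""
    belief_state = belief_state | {new_instance}
    blocks = {args[0] for (pred, *args) in belief_state if pred == "is-block"}
    inferable = {("left-of", a, b) for a, b in permutations(blocks, 2)}
    fringe = (fringe | inferable) - belief_state
    return belief_state, fringe

# Blocks A and B are already believed, as are two left-of facts about them.
state = {("is-block", "A"), ("is-block", "B"),
         ("left-of", "A", "B"), ("left-of", "B", "A")}
fringe = set()

# Inferring (is-block C) makes four new left-of instances inferable.
state, fringe = update_fringe(state, fringe, ("is-block", "C"))
```

After this step the fringe holds exactly the four instances named in the text: (left-of A C), (left-of C A), (left-of B C), and (left-of C B).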
Thus inferences, once made, need not be repeated, and contradictions, once discovered, are avoided in the future. During problem solving, the inference engine and the TMS continuously interact through a well-defined protocol, as shown in Fig. 2: the inference engine passes justifications and assumptions to the TMS, which returns beliefs and contradictions. Whenever the problem solver performs an inference step, it sends a datum to the TMS. The task of the TMS is to maintain a consistent state of belief, such that beliefs are not known to be contradictory and no belief is kept without reason.

Figure 2. Structure of truth maintenance systems.

TMSs can be classified as JTMS [2], assumption-based TMS [11], and logic-based TMS. These systems differ in the way they represent dependencies between derived facts and the facts they have been derived from [12]. The JTMS, the simplest kind of TMS, is classically represented by a network of nodes together with a set of links (justifications) that represent dependencies between nodes. Every important problem-solving datum (fact, assumption, or justification) sent to the TMS is stored as a node. Assumption nodes represent data which the inference engine prefers to believe but may want to retract later. Each justification records the dependencies between a justified node and its antecedents. The current belief status of a node is represented as a label: nodes that are currently believed are labelled in, and nodes that are not are labelled out. A justification contains two lists, an inlist and an outlist, that describe which nodes must be in and which must be out for the justified node to be believed. Justifications serve two purposes. The first is to perform belief updates when the belief status of a node changes: justifications are used to find all the affected nodes, whose belief statuses are then reexamined, so that some of these nodes become in while others become out. Second, justifications are used to handle contradictions.
When a contradiction is discovered by the problem solver, a justification for the contradiction is added. The system then searches through the dependency network (using justifications and nodes) for the assumptions underlying the contradiction. It selects one assumption as the culprit and justifies one of the nodes in its outlist, thus removing the contradiction.

IV. THE TMS-BASED APPROACH

Our approach is similar to the adaptive logic interpreter in terms of learning, since both methods learn via RRL; the difference is that we
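A minimal JTMS along the lines just described might look as follows. The class names and the naive fixed-point relabelling are illustrative assumptions; a real JTMS propagates labels incrementally along justifications rather than re-scanning all nodes.

```python
class Node:
    """A JTMS node: a problem-solving datum with an in/out belief label."""
    def __init__(self, datum, assumption=False):
        self.datum = datum
        self.assumption = assumption
        self.label = "in" if assumption else "out"  # believed / not believed
        self.justifications = []                    # ways to derive this node

class Justification:
    """Records which nodes must be in (inlist) and out (outlist) for the
    justified node to be believed."""
    def __init__(self, inlist, outlist):
        self.inlist = inlist
        self.outlist = outlist

    def holds(self):
        return (all(n.label == "in" for n in self.inlist) and
                all(n.label == "out" for n in self.outlist))

def propagate(nodes):
    """Relabel non-assumption nodes until a fixed point: a node is `in`
    iff at least one of its justifications currently holds."""
    changed = True
    while changed:
        changed = False
        for n in nodes:
            if n.assumption:
                continue
            new = "in" if any(j.holds() for j in n.justifications) else "out"
            if new != n.label:
                n.label, changed = new, True

# (is-block A) and (is-block B) are assumptions; (left-of A B) is justified
# by both of them being in.
a = Node("(is-block A)", assumption=True)
b = Node("(is-block B)", assumption=True)
ab = Node("(left-of A B)")
ab.justifications.append(Justification(inlist=[a, b], outlist=[]))
propagate([a, b, ab])   # ab becomes in
a.label = "out"         # block A is deleted from the environment
propagate([a, b, ab])   # belief revision: ab becomes out again
```

Retracting the assumption for block A automatically retracts every belief that depended on it, which is exactly the dependency-directed revision the text describes.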

apply a JTMS to improve the performance. The JTMS is employed not only for maintaining the beliefs, but also for reducing RRL's state space. The schematic of the system is shown in Fig. 3. It consists of an inference engine, long-term conceptual memory, the JTMS, the learner (RRL), and perception. Unlike the base model, which stores the agent's dynamic beliefs in short-term conceptual memory, in our model beliefs and their dependencies are stored as JTMS nodes. And whereas the base system keeps the set of all concept instantiations in memory and updates it whenever a new perception is added or an old one is deleted, in our model concept instances are incrementally generated and added to the JTMS.

Figure 3. A schematic of the system: the inference engine draws on long-term conceptual memory and on perception of the environment, exchanges justifications/assumptions and beliefs/contradictions with the JTMS, and supplies inferable instances and rewards to the learner (RRL).

The set of all nodes (concept instances) in the JTMS is partitioned into three sets: the believed nodes labelled in, the unbelieved nodes labelled out, and the nodes that have not yet been inferred, whose belief status is unknown. The system works in a cyclic fashion. Each cycle consists of two steps: A) perception and belief revision, and B) the learning mechanism.

A. Perception and Belief Revision

At the beginning of each cycle, the agent perceives the environment and updates its perceptual memory. A matcher then checks which primitive concepts match the information in the perceptual buffer, generates primitive concept instances, and sends them to the inference engine. The inference engine checks these new beliefs against the JTMS. If a new belief has not been seen before by the JTMS, it is stored as an assumption node. All dependencies between the new belief and other beliefs are sent to the JTMS and added to the network.
If the system perceives an object as deleted, it sends a justification for this contradiction to the JTMS, and the JTMS automatically revises its beliefs by searching through the dependency network (using justifications and nodes) for the assumptions underlying the contradiction and changing their labels from in to out.

B. Learning Mechanism

This step is similar to the base learning algorithm. The only difference is the way the JTMS helps RRL update the fringe. Whenever RRL selects an inferable instance u from the fringe and sends it to the inference engine, the inference engine asks the JTMS whether this datum has already been seen. If u is a new inferable concept instance, it is inferred by the inference engine and stored in the JTMS. If it has been inferred before, it does not need to be inferred again. Once instance u is inferred true, all JTMS nodes with label in or unknown that have u as a child are added to the fringe. In particular, the fringe never contains nodes with label out. This reduces the size of the fringe and prevents the selection of concept instances that have already been inferred false on previous cycles. The TMS-based algorithm is summarized in Table II.

TABLE II. ALGORITHM FOR THE TMS-BASED APPROACH

Repeat the following steps for N execution cycles:
1. Perception and belief revision
1.1. Initialize the state. If a new object is perceived, store a new assumption node labelled in in the JTMS, and save all inferable instances whose children are included in that node and other assumption nodes, with value zero. If a perceived object is deleted, add a justification for this contradiction to the JTMS. Initialize the fringe with all JTMS assumption nodes.
2. Learning mechanism: repeat the following steps until the time runs out or there is no inferable instance in the fringe:
2.1. Select an instance u from the fringe.
2.2. Query the JTMS for the status of u.
2.3. If u is already labelled in, add it to the state and return its reward.
2.4. If the status of u is unknown, infer u. If u is true, add it to the state, compute its reward, and change its label to in.
Add to the fringe all inferable instances whose children are included in the nodes of the state and that do not already exist in the JTMS. If u is false, set its reward to zero and send a justification for this contradiction to the JTMS. Remove u from the fringe.
2.7. Update the expected values using (2), (3), and (4).

For a better understanding, consider the simple blocks-world example shown in Fig. 1 (top). In the first cycles there are no concept instances in the JTMS, so RRL generates all possible inferable instances to populate its fringe; a sample of this updating was shown in Fig. 1 (bottom). After a few cycles, the inference engine infers that the concept instances (left-of A B), (left-of B C), and (left-of A C) are false, and these instances are stored in the JTMS with label out. RRL can apply this declarative knowledge when updating the fringe. Comparing Fig. 1 (bottom) with Fig. 4 shows that the number of members of the fringe is reduced from 6 to 3. In such a case, RRL, instead of
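The fringe pruning behind this 6-to-3 reduction can be sketched as follows. The function name is hypothetical, string literals stand in for concept-instance nodes, and which particular left-of instances come out false depends on the block positions.

```python
def build_fringe(inferable, labels):
    """Keep only candidates whose JTMS status is unknown: instances already
    believed (`in`) or already refuted (`out`) never re-enter the fringe."""
    return [u for u in inferable if labels.get(u) not in ("in", "out")]

# All left-of instances over blocks A, B, C are inferable in principle.
inferable = ["(left-of A B)", "(left-of B A)", "(left-of A C)",
             "(left-of C A)", "(left-of B C)", "(left-of C B)"]

# Plain RRL keeps no cache, so its fringe holds all 6 candidates.
plain_fringe = build_fringe(inferable, {})

# With the JTMS, three instances inferred false on earlier cycles are
# cached with label out, so only 3 candidates remain to be checked.
cached = {"(left-of A B)": "out", "(left-of B C)": "out",
          "(left-of A C)": "out"}
fringe = build_fringe(inferable, cached)
```

The cache thus converts work that plain RRL repeats every cycle into a single dictionary lookup.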

checking 6 inferable instances in the fringe, checks only 3 instances.

Figure 4. An example of belief-state and fringe updating in the TMS-based system. Bold literals belong to the belief state, blue literals to the fringe.

V. EXPERIMENTS AND RESULTS

To evaluate the performance of the proposed algorithm, we carried out a set of experiments in a known blocks-world environment [1]. This environment contains a set of blocks placed in a line, each with a name and a position specified as its distance from a reference point, which we call the origin. Three concepts for this environment are shown in Table I. Clearly, when a large number of blocks are present, the number of feasible instances of the between concept is enormous. The agent is assumed to be located at the origin and to have a tendency to interact with nearby blocks. We therefore assigned the highest-level concept (between) a reward function that is linear and favors relations whose blocks are closer to the origin. The other two concept definitions (left-of and is-block) have no assigned reward function. Several factors affect the performance of inference systems, e.g., the rate of change in the environment, the domain complexity, and the time limit for each inference cycle. In the following, we discuss four experiments that investigate the effect of these factors on the performance of five different inference systems: the TMS-based system with generalization, the TMS-based system, the base system with generalization, the base system, and exhaustive inference. Reported results are averaged over 2 independent runs.

A. Experiment I

First, we analyze the number of inferred instances for the five inference systems. The number of inferred instances refers to the number of concept instances inferred by the inference engine in one cycle.
Six blocks A, B, C, D, E, and F with positions 1, 2, 3, 4, 5, and 6 are placed in the initial world. Each inference system learns for 100 cycles under a fixed time limit of 0.4 seconds. The systems are then tested by adding and deleting blocks. Eight blocks G, H, I, J, K, L, M, and N with positions .5, 6, 1.7, 7, 1.5, .25, 8, 1, and 2.5 are added to the environment at cycles 100, 125, 150, 175, 200, 225, 250, and 275, respectively. Blocks A and D are deleted from the environment at cycles 160 and 250, respectively. Fig. 5 summarizes the number of inferred instances over the cycles. As expected, in both TMS-based systems, when a new object is added to the environment the number of inferred instances jumps for a few cycles: in these cycles the inference engine infers new instances that have not yet been inferred, and once the JTMS caches them, the number of inferred instances decreases again. In the other inference systems, the number of inferred instances in each cycle is higher than in the two TMS-based methods. This result clearly demonstrates that the JTMS plays an important role in decreasing the number of inferences in the system. In the base system, the base system with generalization, and exhaustive inference, as the number of objects increases, the number of inferred instances decreases, and when an object is deleted from the environment (at cycles 160 and 250), the number of inferences increases. This shows that the number of inferred instances per cycle is related to the complexity of the environment. Note that a good inference system should not only avoid repetitive inferences but also be able to infer new belief instances when new objects are added to the environment. The generalization mechanism, by initializing the expected values of new belief candidates, plays an important role in dealing with new inferred instances: it gives these beliefs a better chance of being selected as actions.
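Since the paper's incremental regression update (4) is elided in this copy, the following sketch uses a plain least-mean-squares step as a stand-in, just to show how a generalization model can assign a non-zero initial value to a never-before-seen instance. The feature encoding (bias plus distance from the origin), the training values, and all names are assumptions.

```python
class ConceptValueModel:
    """Per-concept linear model w.x(u) used to initialize the expected
    value of a brand-new concept instance from its perceptual attributes."""

    def __init__(self, n_features, lr=0.05):
        self.w = [0.0] * n_features
        self.lr = lr

    def predict(self, x):
        return sum(wi * xi for wi, xi in zip(self.w, x))

    def update(self, x, target):
        # Widrow-Hoff step: nudge the weights toward the observed value.
        err = target - self.predict(x)
        self.w = [wi + self.lr * err * xi for wi, xi in zip(self.w, x)]

# Hypothetical training data: instances whose learned value decreases
# linearly with the block's distance from the origin. x = (bias, distance).
data = [((1.0, 1.0), 0.9), ((1.0, 2.0), 0.8), ((1.0, 3.0), 0.7)]
model = ConceptValueModel(n_features=2)
for _ in range(1000):
    for x, v in data:
        model.update(x, v)

# A new, never-inferred instance at distance 4 starts from the generalized
# value instead of zero, so it has a fair chance of being selected.
initial = model.predict((1.0, 4.0))   # close to 0.6
```

This is the mechanism's point: after a change in the environment, new candidate beliefs inherit sensible value estimates from structurally similar, already-learned instances.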
Comparing the numbers of inferred instances of the TMS-based system and the TMS-based system with generalization, one can see that at cycles 200, 225, and 275, when objects L, M, and N are added to the environment, the TMS-based system with generalization is more sensitive to the change, although it has a smaller number of inferred instances.

Figure 5. Comparison of the number of inferred instances of the five inference systems over the cycles.

To see how successful our inference system is in guiding inference toward the most rewarding parts of the current instance space, we measure the reward accuracy in every cycle, defined as the ratio of the cumulative reward obtained by the inference system under the time constraint in a cycle to the total reward that would be accumulated in the same cycle by making all possible inferences under no time constraint. The reward accuracy of the five inference systems over the cycles is compared in Fig. 6. From cycle one to cycle 100, all inference systems achieve maximum accuracy.

This shows that the inference systems have enough time to generate all possible concept instances in each cycle. After cycle 100, however, as the number of objects in the environment increases, the size of the belief state grows and there is no longer enough time to generate all inference instances. Since exhaustive inference works in a bottom-up, breadth-first manner, it has no time to infer higher-level concept instances, so its reward accuracy remains zero after cycle 100. The only method whose reward accuracy does not decrease rapidly is the TMS-based system with generalization.

Figure 6. Accuracy of the different inference systems over the cycles.

The reward of the inference systems over the cycles is illustrated in Fig. 7; here, reward refers to the cumulative reward obtained in a particular execution cycle. Exhaustive inference performs very poorly; the base system, the base system with generalization, and the TMS-based system give better performance; and the TMS-based system with generalization performs best. This is due to the better action selection and value initialization of the system with generalization, which eventually leads to higher reward.

Figure 7. Reward comparison over the cycles.

B. Experiment II

Next, we consider the effect of the time constraint on the performance of the five inference systems. Each system learns for 200 cycles under a fixed time limit of 0.4 seconds in an initial world state with six blocks. The systems are then tested under various time limits ranging from 0.1 to 0.2 seconds in a dynamic environment: every 25 cycles, a new block is added to the environment, for a total of 100 cycles. The average inferred reward during the last 100 cycles, over decreasing amounts of available time, is plotted in Fig. 8. Average reward accuracy drops as the available time shrinks. The TMS-based system with generalization gains the best reward accuracy among all the methods, and the base system with generalization performs better than the base system and exhaustive inference.

Figure 8. Reward accuracy of the different inference systems over increasing amounts of available time for each inference cycle.

C. Experiment III

We also considered the effect of domain complexity, which refers to the size of the instance space. As in Experiment II, the five inference systems are trained in the same situation, but at cycle 200 the complexity is abruptly increased by adding p blocks, with p a random number between one and eight. As before, we measured the average inferred reward over the final 100 cycles. Fig. 9 shows the average reward as a function of the number of blocks added. As expected, when the environment becomes complex, the TMS-based systems perform better than the other methods, while the two base variants perform poorly and the exhaustive inference system does not achieve any reward.

Figure 9. Performance comparison between the five inference systems over increasingly complex blocks-world environments.

D. Experiment IV

Finally, we studied the rate of change in the environment. The training phase of this experiment is similar to

experiment II. After a 200-cycle learning period, we inserted new blocks one at a time, with a specific number of cycles between consecutive insertions, and then measured the average performance of each inference system over its last 100 cycles. Fig. 10 shows that the TMS-based system and the TMS-based system with generalization achieve the best performance among the methods.

Figure 10. Accuracy of the different inference systems over increasing rates of environmental change in the blocks-world domain.

When the environment changes rapidly, when it becomes too complex, or when the available time for each inference cycle is too limited, the performance of all inference mechanisms degrades significantly: they do not have enough time to learn the values of all possible concept instances. However, even with short learning time, the TMS-based systems perform better than the base systems, for two main reasons. First, the base system does not cache the results of its inferences. Second, in the base system, concept instances that were inferred false in previous cycles may be selected again in the current cycle; selecting these instances has no positive effect on performance and only consumes learning time.

VI. CONCLUSION

In this paper, we presented a new approach to adaptive reasoning systems that builds on the adaptive logic interpreter and applies a JTMS to improve its performance. The advantages of the JTMS are two-fold: first, it maintains the beliefs, reducing the number of inferred instances and speeding up the inference mechanism; second, it reduces the size of the state space that the RRL method needs to consider. Our experiments in a dynamic blocks-world domain showed that the TMS-based system achieves the best results in guiding the reasoning process toward high-utility parts of the knowledge base. Our results also demonstrate that it learns much faster than the base and exhaustive inference systems in a dynamic environment.
This strongly suggests that a JTMS is a good candidate for speeding up this kind of inference. For future work, we plan to investigate the role of the TMS in guiding the learning algorithm; TMSs could collaborate more closely with the learning mechanism. Whenever the environment changes, the TMS updates the truth values of beliefs, so it could guide RRL to concentrate on the belief instances whose truth values have recently been updated and need to be inferred again. Since uncertainty is inevitable in real-world applications, we also intend to extend our approach to probabilistic domains; one promising direction in this regard is applying it to more complex tasks such as visual attention and human-robot interaction. Furthermore, the JTMS is restricted in the sense that it accepts only Horn clauses as justifications. Many applications need to express more than Horn clauses, and hence we plan to investigate other types of TMS that accept general clauses.

VII. ACKNOWLEDGMENT

The authors would like to thank Nima Asgharbeygi for helpful discussions on the algorithm.

REFERENCES

[1] N. Asgharbeygi, N. Nejati, P. Langley, and S. Arai, "Guiding inference through relational reinforcement learning," in Inductive Logic Programming: 15th International Conference, Springer, Germany, 2005.
[2] J. Doyle, "A truth maintenance system," Artificial Intelligence, vol. 12, 1979.
[3] C. Forgy, "Rete: A fast algorithm for the many pattern/many object pattern match problem," Artificial Intelligence, vol. 19, 1982.
[4] M. R. Genesereth and M. L. Ginsberg, "Logic programming," Communications of the ACM, vol. 28, 1985.
[5] S. Russell and E. Wefald, "Principles of metareasoning," in Proceedings of the First International Conference on Principles of Knowledge Representation and Reasoning, San Mateo, CA, Morgan Kaufmann, 1989.
[6] S. Minton, "Quantitative results concerning the utility of explanation-based learning," Artificial Intelligence, vol. 42, 1990.
[7] J. M. Zelle and R. J. Mooney, "Combining FOIL and EBG to speed up logic programs," in Proceedings of the Thirteenth International Joint Conference on Artificial Intelligence, Chambery, France, Morgan Kaufmann, 1993.
[8] W. W. Cohen and Y. Singer, "A simple, fast, and effective rule learner," in Proceedings of the Sixteenth National Conference on Artificial Intelligence, 1999.
[9] M. E. Taylor, C. Matuszek, R. Smith, and M. J. Witbrock, "Guiding inference with policy search reinforcement learning," FLAIRS Conference, 2007.
[10] S. Dzeroski, L. De Raedt, and H. Blockeel, "Relational reinforcement learning," in Proceedings of the Fifteenth International Conference on Machine Learning, Madison, WI, Morgan Kaufmann, 1998.
[11] J. De Kleer, "An assumption-based TMS," Artificial Intelligence, vol. 28, 1986.
[12] M. Stanojevic, S. Vranes, and D. Velasevic, "Using truth maintenance systems: A tutorial," IEEE Expert: Intelligent Systems and Their Applications, vol. 9, 1994.
[13] D. Choi, M. Kaufman, P. Langley, N. Nejati, and D. Shapiro, "An architecture for persistent reactive behavior," in Proceedings of the Third International Joint Conference on Autonomous Agents and Multi Agent Systems, New York, ACM Press, 2004.


Utility Maximization and Bounds on Human Information Processing Topics in Cognitive Science (2014) 1 6 Copyright 2014 Cognitive Science Society, Inc. All rights reserved. ISSN:1756-8757 print / 1756-8765 online DOI: 10.1111/tops.12089 Utility Maximization and Bounds

More information

Emotions in Intelligent Agents

Emotions in Intelligent Agents From: FLAIRS-02 Proceedings. Copyright 2002, AAAI (www.aaai.org). All rights reserved. Emotions in Intelligent Agents N Parameswaran School of Computer Science and Engineering University of New South Wales

More information

Recognizing Ambiguity

Recognizing Ambiguity Recognizing Ambiguity How Lack of Information Scares Us Mark Clements Columbia University I. Abstract In this paper, I will examine two different approaches to an experimental decision problem posed by

More information

Objectives. Quantifying the quality of hypothesis tests. Type I and II errors. Power of a test. Cautions about significance tests

Objectives. Quantifying the quality of hypothesis tests. Type I and II errors. Power of a test. Cautions about significance tests Objectives Quantifying the quality of hypothesis tests Type I and II errors Power of a test Cautions about significance tests Designing Experiments based on power Evaluating a testing procedure The testing

More information

Dynamic Rule-based Agent

Dynamic Rule-based Agent International Journal of Engineering Research and Technology. ISSN 0974-3154 Volume 11, Number 4 (2018), pp. 605-613 International Research Publication House http://www.irphouse.com Dynamic Rule-based

More information

Inductive Learning of Simple Diagnostic Scores

Inductive Learning of Simple Diagnostic Scores Inductive Learning of Simple Diagnostic Scores Martin Atzmueller, Joachim Baumeister, and Frank Puppe University of Würzburg, 97074 Würzburg, Germany Department of Computer Science Phone: +49 931 888-6746,

More information

COHERENCE: THE PRICE IS RIGHT

COHERENCE: THE PRICE IS RIGHT The Southern Journal of Philosophy Volume 50, Issue 1 March 2012 COHERENCE: THE PRICE IS RIGHT Paul Thagard abstract: This article is a response to Elijah Millgram s argument that my characterization of

More information

What Is Science? Lesson Overview. Lesson Overview. 1.1 What Is Science?

What Is Science? Lesson Overview. Lesson Overview. 1.1 What Is Science? Lesson Overview 1.1 What Science Is and Is Not What are the goals of science? One goal of science is to provide natural explanations for events in the natural world. Science also aims to use those explanations

More information

How do you design an intelligent agent?

How do you design an intelligent agent? Intelligent Agents How do you design an intelligent agent? Definition: An intelligent agent perceives its environment via sensors and acts rationally upon that environment with its effectors. A discrete

More information