Introduction to Deep Reinforcement Learning and Control


Carnegie Mellon School of Computer Science, Deep Reinforcement Learning and Control. Introduction to Deep Reinforcement Learning and Control. Lecture 1, CMU 10703. Katerina Fragkiadaki.

Logistics: 3 assignments and a project. Russ will announce those in the next lecture!

Goal of the Course: how to build agents that learn behaviors in a dynamic world (as opposed to agents that execute preprogrammed behaviors in a static world). Behavior: a sequence of actions with a particular goal.

Behaviors are Important: "The brain evolved, not to think or feel, but to control movement." (Daniel Wolpert, in a nice TED talk.) Sea squirts digest their own brain when they decide not to move anymore. Learning behaviors that adapt to a changing environment is considered the hallmark of human intelligence (though definitions of intelligence are not easy).

Learning Behaviors: learning to map sequences of observations to actions, for a particular goal g_t.

Supervision: What supervision does an agent need to learn purposeful behaviors in dynamic environments? Rewards: sparse feedback from the environment on whether the desired behavior is achieved, e.g., the game is won, the car has not crashed, the agent is out of the maze, etc. Demonstrations: experts demonstrate the desired behavior, e.g., by kinesthetic teaching, teleoperation, or visual imitation (e.g., instructional YouTube videos). Specifications/attributes of good behavior: e.g., for driving, such attributes would be to respect the lane, keep an adequate distance from the car in front, etc. (DeepDriving: Learning Affordance for Direct Perception in Autonomous Driving, Chen et al.)

Behavior: High Jump (scissors vs. Fosbury flop). 1. Learning from rewards. Reward: jump as high as possible; it took years for athletes to find the right behavior to achieve this. 2. Learning from demonstrations: it was far easier for athletes to perfect the jump once someone showed the right general trajectory. 3. Learning from specifications of optimal behavior: for novices, it is much easier to replicate this behavior if additional guidance is provided in the form of specifications: where to place the foot, how to time yourself, etc.

Learning Behaviors: how is learning behaviors different from other machine learning paradigms, e.g., learning to detect objects in images? The agent's actions affect the data she will receive in the future: the data the agent receives are sequential in nature, not i.i.d., and standard supervised learning approaches lead to compounding errors (An Invitation to Imitation, Drew Bagnell).
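
To make the compounding-errors point concrete, here is a tiny hypothetical simulation (not from the lecture): a cloned driving policy matches the expert up to a small per-step error rate, but once it drifts off the demonstrated states it has no data telling it how to recover.

    import random

    random.seed(0)
    EPS, T, TRIALS = 0.05, 50, 10_000

    # On the demonstrated lane, the clone matches the expert except for an
    # eps-rate mistake. Off the lane it has never seen training data, so it
    # cannot recover: one early error corrupts every remaining step.
    total_off = 0
    for _ in range(TRIALS):
        on_lane = True
        for t in range(T):
            if on_lane and random.random() < EPS:
                on_lane = False          # a single supervised-learning error
            total_off += not on_lane
    print(f"avg off-lane steps per rollout: {total_off / TRIALS:.1f} of {T}")
    # An i.i.d. view predicts ~eps*T = 2.5 bad steps per rollout; the
    # sequential setting yields far more (Ross & Bagnell: the expected cost
    # can grow as O(eps * T^2) rather than O(eps * T)).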

Learning to Drive a Car: Supervised Learning

Learning to Race a Car: Interactive Learning (DAgger). This assumes you can actively access an expert during training! (A Reduction of Imitation Learning and Structured Prediction to No-Regret Online Learning, Stephane Ross, Geoffrey J. Gordon, J. Andrew Bagnell.)
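
The DAgger loop itself is short. Below is a sketch in a toy linear-control setting (the environment and the linear policy are illustrative assumptions, not the paper's setup): roll out the current policy, have the expert label the states the policy actually visits, aggregate, and retrain.

    import numpy as np

    rng = np.random.default_rng(1)

    def expert_action(x):
        # The expert steers back toward the lane center with gain -0.8.
        return -0.8 * x

    def rollout(w, steps=50):
        # Collect the states visited by the *current* policy a = w * x.
        x, states = 0.0, []
        for _ in range(steps):
            states.append(x)
            x += w * x + 0.1 * rng.normal()
        return states

    # DAgger: aggregate expert labels on the learner's own state distribution.
    X, Y, w = [], [], 0.0
    for it in range(5):
        states = rollout(w)
        X += states
        Y += [expert_action(s) for s in states]   # expert relabels our states
        w = np.polyfit(X, Y, 1)[0]                # supervised refit (slope)
    print(f"learned gain w = {w:.2f} (expert: -0.80)")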

Learning to Drive a Car: Supervised Learning. Policy network: a mapping from observations to actions.
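
As a concrete picture of such a policy network, a minimal PyTorch sketch mapping observation vectors to actions (the layer sizes and the two-dimensional action are illustrative assumptions, not from the lecture):

    import torch
    import torch.nn as nn

    # Policy network: observations in, actions out.
    policy = nn.Sequential(
        nn.Linear(64, 128), nn.ReLU(),
        nn.Linear(128, 128), nn.ReLU(),
        nn.Linear(128, 2),             # e.g., steering and throttle
    )
    obs = torch.randn(1, 64)           # a dummy observation vector
    action = policy(obs)
    print(action.shape)                # torch.Size([1, 2])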

Learning Behaviors: how is learning behaviors different from other machine learning paradigms?
1) The agent's actions affect the data she will receive in the future.
2) The reward (whether the goal of the behavior is achieved) is far in the future: temporal credit assignment, i.e., it is hard to know which actions were important and which were not (see the sketch below).
3) Actions take time to carry out in the real world, which may limit the amount of experience. We can use simulated experience and tackle sim2real transfer, or we can buy many robots.
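
To make the credit-assignment issue in point 2 concrete: discounted returns G_t = r_t + gamma * r_{t+1} + ... are the standard way RL spreads a delayed reward back over the actions that preceded it. A minimal generic sketch (not a specific method from the slides):

    # Discounted returns: a delayed reward is credited (with discount) to
    # the actions that preceded it.
    def discounted_returns(rewards, gamma=0.9):
        G, returns = 0.0, []
        for r in reversed(rewards):
            G = r + gamma * G
            returns.append(G)
        return returns[::-1]

    # Sparse reward: nothing until the goal is reached at the last step.
    print(discounted_returns([0, 0, 0, 0, 1.0]))
    # -> approximately [0.6561, 0.729, 0.81, 0.9, 1.0]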

Supersizing Self-Supervision Supersizing Self-supervision: Learning to Grasp from 50K Tries and 700 Robot Hours, Pinto and Gupta

Google s Robot Farm

Successes of behavior learning

Backgammon: the high branching factor due to dice rolls prohibits brute-force deep search such as in chess.

Backgammon: TD-Gammon vs. Neuro-Gammon. TD-Gammon: temporal-difference learning; developed by Gerald Tesauro in 1992 at IBM's research center; a neural network that trains itself to be an evaluation function by playing against itself, starting from random weights; using features from Neuro-Gammon, it beat the world's champions. Neuro-Gammon: learning from human experts, i.e., supervised learning. "There is no question that its positional judgement is far better than mine. Its technique is less than perfect in such things as building up a board without opposing contact, when the human can often come up with a better play by calculating it out." (Kit Woolsey)
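
The core update behind TD-Gammon is temporal-difference learning. A generic tabular TD(0) sketch (TD-Gammon used a neural network rather than a table, and generated its own data through self-play):

    # TD(0): nudge V(s) toward the bootstrapped target r + gamma * V(s').
    def td0_update(V, s, r, s_next, alpha=0.1, gamma=1.0):
        V[s] += alpha * (r + gamma * V[s_next] - V[s])

    V = {"mid_game": 0.5, "winning": 0.9}
    td0_update(V, "mid_game", r=0.0, s_next="winning")
    print(V["mid_game"])   # ~0.54: mid_game looks better after the update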

Locomotion Optimization and learning for rough terrain legged locomotion, Zucker et al.

Self-Driving Cars. Pomerleau: https://www.youtube.com/watch?v=ilp4apdtbpe Behavior cloning, with data augmentation to deal with compounding errors and online adaptation (interactive learning). ALVINN (Autonomous Land Vehicle In a Neural Network): Efficient Training of Artificial Neural Networks for Autonomous Navigation, Pomerleau 1991.

Self-Driving Cars today: computer vision, Velodyne sensors, object detection, 3D pose estimation, trajectory prediction.

Atari: DeepMind 2014+, deep Q-learning.
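
Deep Q-learning replaces the table of classic Q-learning with a neural network; the underlying one-step update looks like this (a generic sketch, not DeepMind's code):

    from collections import defaultdict

    # One-step Q-learning: move Q(s, a) toward r + gamma * max_b Q(s', b).
    def q_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.99):
        target = r + gamma * max(Q[(s_next, b)] for b in actions)
        Q[(s, a)] += alpha * (target - Q[(s, a)])

    Q = defaultdict(float)   # in DQN, a network stands in for this table
    q_update(Q, "frame0", "fire", 1.0, "frame1", ["left", "right", "fire"])
    print(Q[("frame0", "fire")])   # 0.1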

Go

AlphaGo: Monte Carlo Tree Search, with learned policy and value networks for pruning the search tree; trained from expert demonstrations and self-play. The policy net is trained to mimic expert moves, then fine-tuned using self-play. The value network is trained with regression to predict the game outcome, using self-play data from the best policy. At test time, the policy and value nets guide an MCTS to select stronger moves by deep lookahead. Runs on Google's Tensor Processing Units (TPUs).
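
Inside AlphaGo's MCTS, moves are chosen by the PUCT rule, which trades off the value estimate Q against the policy prior P, discounted by visit counts. A sketch with plain dicts (variable names are illustrative):

    import math

    def select_move(moves, Q, P, N, c_puct=1.0):
        # PUCT: argmax_a Q(s,a) + c * P(s,a) * sqrt(sum_b N(s,b)) / (1 + N(s,a))
        total = sum(N[m] for m in moves)
        return max(moves,
                   key=lambda m: Q[m] + c_puct * P[m] * math.sqrt(total) / (1 + N[m]))

    moves = ["a", "b"]
    Q = {"a": 0.5, "b": 0.4}; P = {"a": 0.3, "b": 0.7}; N = {"a": 10, "b": 2}
    print(select_move(moves, Q, P, N))   # "b": strong prior, few visits so far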

AlphaGoZero: no human supervision! MCTS is used to select strong moves during both training and testing.


AlphaGo versus the real world: how is the world of AlphaGo different from the real world?
1. Known environment (known entities and dynamics) vs. unknown environment (unknown entities and dynamics).
2. The need for behaviors to transfer across environmental variations, since the real world is very diverse.
3. Discrete vs. continuous actions.
4. One goal vs. many goals.
5. Rewards come automatically vs. rewards that themselves need to be detected.

Regarding points 1 and 2: state estimation. To be able to act, you first need to be able to see: detect the objects you interact with, and detect whether you achieved your goal.

State estimation: most works sit between two extremes. (1) Assuming the world model is known (object locations, shapes, and physical properties obtained via AR tags or manual tuning), use planners to search for the action sequence that achieves a desired goal (Rearrangement Planning via Heuristic Search, Jennifer E. King, Siddhartha S. Srinivasa). (2) Do not attempt to detect any objects, and learn to map RGB images directly to actions (End to End Learning for Self-Driving Cars, NVIDIA).

State estimation: recent works have shown that to transfer behaviors across environment variations, factorizing the world state into entities, their attributes, and their dynamics is important. These works assume perfect entity/attribute detection and attempt to learn the dynamics of the game; they then infer actions by chaining the dynamics forward to achieve desired goals. Schema Networks can handle environmental variation because the dynamics do not change (only the entities do); thus structured state representations in terms of objects and attributes are important. In other words, behavior learning is difficult because state estimation is difficult, i.e., because Computer Vision is difficult.

AlphaGo versus the real world, revisited, with possible remedies:
1. Known environment (known entities and dynamics) vs. unknown environment (unknown entities and dynamics).
2. The need for behaviors to transfer across environmental variations, since the real world is very diverse.
3. Discrete vs. continuous actions (curriculum learning: progressively add degrees of freedom).
4. One goal vs. many goals (generalized policies parametrized by the goal; Hindsight Experience Replay, see the sketch below).
5. Rewards come automatically vs. rewards that themselves need to be detected (learning perceptual rewards; use Computer Vision to detect success).
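
For point 4, the sketch promised above: Hindsight Experience Replay relabels a failed episode with the goal it actually achieved, so even failures produce useful reward signal. A minimal generic illustration (the real method stores these relabeled transitions in a replay buffer for an off-policy RL algorithm):

    # HER: an episode that failed to reach goal g still "succeeded" at the
    # state it did reach, so store a copy relabeled with that achieved goal.
    def her_relabel(episode):
        achieved = episode[-1]["achieved"]        # goal reached in hindsight
        out = []
        for step in episode:
            new = dict(step, goal=achieved)
            new["reward"] = 1.0 if step["achieved"] == achieved else 0.0
            out.append(new)
        return out

    episode = [{"obs": 0, "achieved": 0, "goal": 5, "reward": 0.0},
               {"obs": 1, "achieved": 2, "goal": 5, "reward": 0.0}]
    print(her_relabel(episode)[-1]["reward"])   # 1.0: success in hindsight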

Learning Behaviors in the Real World: be multi-modal, be incremental, be physical, explore, be social, learn a language. (The Development of Embodied Cognition: Six Lessons from Babies, Linda Smith and Michael Gasser.)