Cost-Sensitive Learning for Biological Motion

1 Olivier Sigaud
Université Pierre et Marie Curie, Paris 6
October 5

2 Table of contents
The problem of movement time
  Statement of the problem
  A discounted reward approach
  Emergent time of motion: Lionel Rigoux's work
Policy compilation: Jeremie Decock's work
  General idea: supervised learning of planned trajectories
  Generalisation results
  Improving control between trials
Neural implementation of the model
Motor sequences
  Biological background
  Kohonen maps
  GRS model

3 The problem of movement time: Statement of the problem
Variable time movements
In the case of a force field, the time of motion changes.

4 The problem of movement time: Statement of the problem
Limitation of the standard model
The optimal feedback control (OFC) framework is the leading explanation for motor control.
In standard OFC ([Todorov & Jordan, 2002]), the movement time t_f is given:
V = C_f + \int_{t_0}^{t_f} (x^T Q x + u^T R u) \, dt
The motivation to reach the goal is represented as a cost for not being there: a compromise between state cost and control cost.
[Guigon et al., 2008]'s proposal: the movement is realized whatever the cost:
V = \int_{t_0}^{t_f} \|u\|^2 \, dt
Reaching the goal (at time t_f) is a constraint.
The remaining movement depends on the remaining time.
If, due to a force field, the hand drifts away, it must come back very fast.

5 The problem of movement time: A discounted reward approach
General idea
Can the movement time emerge from the problem?
Reaching a goal produces a reward (represented as a scalar).
Irrespective of movement cost, we try to reach the goal as fast as possible:
  because after that, we can look for another reward;
  because in a dynamic world the source of reward may not stay there.
Considering movement cost tends to favor slow motion.
The movement time emerges as an equilibrium between these two contradictory pressures.

6 The problem of movement time: A discounted reward approach
Emergence of movement time
The subjective reward is maximum at minimal time.

7 The problem of movement time: A discounted reward approach
Emergence of movement time
The shorter the movement, the more expensive it is.
There is a physical limitation.

8 The problem of movement time: A discounted reward approach
Emergence of movement time
A global optimum of the value emerges.
Predictions:
  If the global value is negative, it's not worth it: no movement.
  If the reward increases, the movement time decreases.
  If the cost increases, the movement time increases.
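A minimal numeric sketch of this equilibrium. The exponentially discounted reward R e^{-γT} and the effort cost c/T for a movement of duration T are illustrative assumptions, not the slides' exact model:

```python
import numpy as np

# Sketch only: reward R*exp(-gamma*T) and effort cost c/T are
# illustrative stand-ins for the model's actual cost of movement.
def optimal_duration(R, c, gamma=1.0):
    T = np.linspace(0.05, 5.0, 1000)           # candidate durations (s)
    value = R * np.exp(-gamma * T) - c / T     # global value of the movement
    if value.max() < 0:
        return None                            # not worth it: no movement
    return T[np.argmax(value)]

print(optimal_duration(R=10.0, c=1.0))   # baseline optimal duration
print(optimal_duration(R=20.0, c=1.0))   # larger reward -> shorter movement
print(optimal_duration(R=10.0, c=2.0))   # larger cost -> longer movement
print(optimal_duration(R=0.1,  c=1.0))   # value negative everywhere -> None
```

The three predictions above fall out of the scan: the argmax shifts earlier with larger R, later with larger c, and disappears when the value never becomes positive.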

9 The problem of movement time: A discounted reward approach
Temporary goal
The model deals with the case of a reward available between t_1 and t_2.
If t_1 is too small or t_2 is too large, the subject won't move.

10 The problem of movement time: A discounted reward approach
Removing t_f: intuition from Dynamic Programming
Optimal value function V*(s): represents the optimal cumulated utility one can get from s.
Bellman equation: V^\pi(s) = R(s, \pi(s)) + \gamma \sum_{s'} p(s' | s, \pi(s)) V^\pi(s')
The agent reaches the reward as fast as possible from anywhere.
If we add immediate costs, we may get rid of \gamma.

11 The problem of movement time: A discounted reward approach
Removing t_f: the Dynamic Programming trick
An infinite horizon is considered.
In Dynamic Programming with an infinite horizon, the global utility can be written:
V = E[\sum_{t=0}^{\infty} \gamma^t r_t], with \gamma \in \, ]0, 1]
Thanks to the infinite horizon trick, there is no need to specify a t_f.
\gamma^t favors reaching the goal quickly: whatever \gamma's value, it is always better to reach it faster.
A small/large \gamma favors an immediate/far-away goal.
The time of remaining movement depends on the state, not on the elapsed time.
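A small value-iteration sketch of this trick (the 1-D chain, rewards and γ are illustrative assumptions). No t_f appears anywhere in the computation; the value, and hence the time-to-go, depends on the state alone:

```python
import numpy as np

# Value iteration on a 1-D chain: reward 1 at the goal, small step cost,
# infinite horizon with discount gamma.
n_states, goal, gamma, step_cost = 20, 10, 0.95, 0.01
V = np.zeros(n_states)
for _ in range(500):
    for s in range(n_states):
        if s == goal:
            V[s] = 1.0                                  # reward at the goal
        else:
            moves = (max(s - 1, 0), min(s + 1, n_states - 1))
            V[s] = max(-step_cost + gamma * V[s2] for s2 in moves)

# V decreases with distance to the goal, so the greedy policy heads
# for the goal as fast as possible from anywhere.
print(np.round(V, 3))
```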

12 The problem of movement time: Emergent time of motion (Lionel Rigoux's work)
General approach
Lionel Rigoux uses the following discounted cost function:
J(x(t)) = \int_t^{\infty} e^{-\gamma_0 (s - t)} [\alpha \|u(s)\|^2 - r(x(s))] \, ds
with r(x) = \delta(x - x^*), where x^* is the target state.
A deterministic calculus-of-variations approach is used to find the optimal trajectory under this cost function.
Add noise and replan after every step.
No learning is involved, but a simulator of the system is required; it could be learned as a forward model (not studied yet).
[Shadmehr et al., 2010]: same idea.
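The plan/act/replan loop this describes, as a sketch. The `plan` function below is a hypothetical placeholder for the variational optimizer minimizing J; the dynamics, gains and noise level are illustrative:

```python
import numpy as np

def plan(x, x_star, horizon=10):
    # Hypothetical stand-in for the calculus-of-variations planner:
    # returns an open-loop control sequence from the current state.
    return np.tile(0.3 * (x_star - x), (horizon, 1))

def step(x, u, noise_std=1e-3):
    # Toy forward model with execution noise (the real system would be
    # the arm simulator mentioned above).
    return x + u + noise_std * np.random.randn(*x.shape)

x, x_star = np.zeros(2), np.array([1.0, 0.5])
for t in range(100):
    u_seq = plan(x, x_star)        # deterministic optimal plan from x
    x = step(x, u_seq[0])          # execute only the first control
    if np.linalg.norm(x - x_star) < 1e-2:
        print(f"goal reached at step {t}")  # arrival time emerges
        break
```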

13 The problem of movement time: Emergent time of motion (Lionel Rigoux's work)
Reproduction of force field experiments
The global aspect of the trajectory matches.

14 The problem of movement time: Emergent time of motion (Lionel Rigoux's work)
Other phenomena
Displacement of the target during motion.

15 The problem of movement time: Emergent time of motion (Lionel Rigoux's work)
Discussion
In Lionel Rigoux's work, the time of motion emerges and most motor control properties still hold.
But:
1. Calculus of variations is very expensive and considers very unlikely solutions.
2. Calculus of variations is deterministic, whereas motion is probably stochastic (inherent noise).
3. Planning must be performed each time, even in a well-known situation.
Solution to 1: task space to joint space constraints.
Solution to 2 and 3: next section.

16 Policy compilation (Jeremie Decock's work): General idea, supervised learning of planned trajectories
Block diagram description of the approach
[Block diagram: the planner K maps the state x_t and the goal x^* to a control u; noise is added; the arm (model) produces x_{t+1}; XCSF learns the mapping ũ = f(x, x^*).]
XCSF learns associations between estimated state, goal and current action.

17 Policy compilation (Jeremie Decock's work): General idea, supervised learning of planned trajectories
Block diagram description of the approach
[Same block diagram: the trained XCSF controller now produces ũ in place of the planner K.]
The XCSF controller is trained with planned trajectories.
The XCSF controller is used instead of planning once the actions are learned.
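A self-contained sketch of policy compilation. XCSF is the approximator in the actual work; a plain least-squares regressor stands in here, and the "planner" generating the training pairs is a hypothetical linear rule:

```python
import numpy as np

rng = np.random.default_rng(0)
# Training set harvested from "planned trajectories": rows are
# [state (2 dims), goal (2 dims)], targets are the planner's controls.
X = rng.uniform(-1.0, 1.0, (5000, 4))
U = 0.3 * (X[:, 2:] - X[:, :2])               # hypothetical planner output
W, *_ = np.linalg.lstsq(X, U, rcond=None)     # compiled policy parameters

def compiled_policy(x, x_star):
    # Used instead of planning once trained: maps (state, goal) to action.
    return np.concatenate([x, x_star]) @ W

print(compiled_policy(np.zeros(2), np.array([0.5, -0.2])))
```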

18 Policy compilation (Jeremie Decock's work): General idea, supervised learning of planned trajectories
Experimental set-up: arm
A 2-dof planar arm with gravity.

19 Policy compilation (Jeremie Decock's work): General idea, supervised learning of planned trajectories
Experimental set-up: muscles
The arm has 6 muscles.

20 Policy compilation (Jeremie Decock's work): Generalisation results
Trajectories with planning

21 Policy compilation (Jeremie Decock's work): Generalisation results
Trajectories with generalization

22 Policy compilation (Jeremie Decock's work): Generalisation results
Corresponding trajectories

24 Policy compilation (Jeremie Decock's work): Generalisation results
Planned trajectories on a larger training set

25 Policy compilation (Jeremie Decock's work): Generalisation results
Generalization capabilities
Generalization to other goals.

26 Policy compilation (Jeremie Decock's work): Improving control between trials
Limitations of the model
Performance improves between trials; the level of improvement depends on the inter-trial time.
Three ways to implement that process (a sketch of the first follows):
1. Perform planning experiments in the head of the agent (using a learned forward model); that is a model-based RL process ([Sutton, 1990]).
2. Improve the compiled policy with a model-free RL process.
3. Eventually, combine both (as suggested by [Daw et al., 2005]).
An internship on that in...
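A Dyna-style sketch of option 1 ([Sutton, 1990]). The tabular representation, action set and update rule are generic RL illustrations, not the specifics of this model:

```python
import random

Q, model = {}, {}     # model[(s, a)] = (r, s_next), filled during real trials
ACTIONS = (0, 1)

def q(s, a):
    return Q.get((s, a), 0.0)

def td_update(s, a, r, s2, alpha=0.1, gamma=0.95):
    target = r + gamma * max(q(s2, b) for b in ACTIONS)
    Q[(s, a)] = q(s, a) + alpha * (target - q(s, a))

def between_trials(n_replays):
    # "Planning in the head": replay learned transitions from the model.
    # More inter-trial time allows more replays, hence more improvement.
    for _ in range(n_replays):
        (s, a), (r, s2) = random.choice(list(model.items()))
        td_update(s, a, r, s2)

model[(0, 1)] = (1.0, 1)   # example transition recorded during a real trial
between_trials(100)
```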

27 Neural implementation of the model
Relevant neural properties
Implementation of the temporal difference algorithm in the striatal dopaminergic neurons (basal ganglia) ([Schultz et al., 1997]).
The basal ganglia are believed to implement an actor-critic architecture (see [Joel et al., 2002]).
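The corresponding textbook actor-critic update, with the TD error δ playing the role ascribed to the dopaminergic signal. The tabular form, sizes and learning rates are illustrative:

```python
import numpy as np

n_states, n_actions = 10, 4
V = np.zeros(n_states)                    # critic (value function)
theta = np.zeros((n_states, n_actions))   # actor (action preferences)

def actor_critic_step(s, a, r, s2, alpha_v=0.1, alpha_p=0.05, gamma=0.95):
    delta = r + gamma * V[s2] - V[s]   # TD error ([Schultz et al., 1997])
    V[s] += alpha_v * delta            # critic update
    theta[s, a] += alpha_p * delta     # actor: reinforce the chosen action
    return delta
```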

28 Neural implementation of the model
Model-based architecture
[Daw et al., 2005] (within an action selection context): provides a cue on when to perform reoptimization (on-line replanning), based on outcome uncertainty.
Integration of a Bayesian inference view in this architecture?

29 Motor sequences: Biological background
Graziano et al. (1)
[Graziano et al., 2005]: exciting specific neurons in the intraparietal sulcus results in specific, ecologically relevant postures.

30 Motor sequences: Biological background
Graziano et al. (2)
The intraparietal sulcus areas are organized with gradients: somatotopic map of the effector, ecological meaning, target of the effector.

31 Motor sequences: Kohonen maps
Kohonen maps model
[Aflalo & Graziano, 2006b]: abstract encoding of the dimensions.
Shows that the gradients emerge (no manikin simulation).

32 Motor sequences: GRS Model
Goal of the study
[Gabalda et al., 2007]: reproduce [Aflalo & Graziano, 2006a], showing that ecological postures can emerge from the interaction with the environment.

33 Motor sequences: GRS Model
Example of sequence
To each gesture corresponds a rewarded area.

34 Motor sequences: GRS Model
Kohonen map initialization
The Kohonen map codes for motor goals.
It is trained on 2 million random postures.
As a result, a few cells code for rewarded postures.
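A minimal Kohonen-map sketch of that initialization. Dimensions, learning rate and sample count are illustrative, and the neighbourhood updates a full SOM would include are omitted:

```python
import numpy as np

rng = np.random.default_rng(0)
n_units, dim = 384, 4                        # 384 potential goals (next slide)
W = rng.uniform(-1.0, 1.0, (n_units, dim))   # unit prototypes (postures)

for _ in range(20_000):                      # the model uses 2 million samples
    x = rng.uniform(-1.0, 1.0, dim)          # random posture
    winner = np.argmin(np.linalg.norm(W - x, axis=1))
    W[winner] += 0.05 * (x - W[winner])      # pull the winner toward the sample
```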

35 Motor sequences: GRS Model
Links from contexts to goals
The sequence contains four contexts.
The algorithm must associate the correct goals to contexts among 384 potential goals.

36 Motor sequences: GRS Model
Goal selection
Choosing a goal in a context is an action in that context.
The active goal is the one whose link to the context is the strongest.

37 Motor sequences: GRS Model
Posture for target reaching
A task space goal corresponds to a joint configuration.

38 Motor sequences: GRS Model
Moving towards the goal
Low-level control drives the manikin towards its goal.

39 Motor sequences: GRS Model
Reinforcing the goal
When the goal is reached, a reward is received and the link to the cell coding for the current posture is strengthened.
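Goal selection (previous slides) and reward-driven strengthening together, as a sketch. The array sizes come from the slides; the update rule itself is an assumption:

```python
import numpy as np

n_contexts, n_goals = 4, 384
links = np.zeros((n_contexts, n_goals))      # context -> goal link strengths

def select_goal(context):
    # The active goal is the one with the strongest link to the context.
    return int(np.argmax(links[context]))

def reinforce(context, goal, reward, lr=0.1):
    # On success, strengthen the link to the cell coding the reached posture.
    links[context, goal] += lr * reward
```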

40 Motor sequences: GRS Model
Learning the map
When a reward is received, the current cell is trained: it extends its domain.

41 Motor sequences: GRS Model
Learnt map
One can see the emergence of zones coding for relevant postures.

42 Motor sequences: GRS Model
Global view
We get a hierarchical architecture.
Linking contexts and goals is an RL problem.

43 Motor sequences: GRS Model
Other topics
Motor synergies: used to reduce the size of the optimisation problem.
Motor primitives: a repertoire of ready-to-use simple controllers.

44 Motor sequences: GRS Model
Final discussion
RL tools are used at the action selection level because RL theory was about discrete choices.
By contrast, OC tools are used at the motor control level, which is considered continuous.
Actor-critic (AC) methods provide a potential unification (continuous RL methods), and thus the possibility of considering a unique neural substrate (taking the multiple BG loops into account).

45 Motor sequences: GRS Model
Messages
Much progress in maths since the naive early proposals (NAC, eNAC and iNAC).
That progress did not propagate to biological modelling.
Actor-critic methods are efficient for motor control modelling.
Incremental versions might be biologically plausible.
Evolution towards Bayesian inference models.
Model-based actor-critic architectures might integrate cost-sensitive learning and planning.

46 Motor sequences: GRS Model
Any questions?

47 References
Aflalo, T. N. & Graziano, M. S. A. (2006a). Possible origins of the complex topographic organization of motor cortex: reduction of a multidimensional space onto a two-dimensional array. Journal of Neuroscience, 26(23).
Aflalo, T. N. & Graziano, M. S. A. (2006b). Relationship between unconstrained arm movements and single neuron firing in the macaque motor cortex. Journal of Neuroscience, 27(11).
Daw, N., Niv, Y., & Dayan, P. (2005). Uncertainty-based competition between prefrontal and dorsolateral striatal systems for behavioural control. Nature Neuroscience, 8.
Gabalda, B., Rigoux, L., & Sigaud, O. (2007). Learning postures through sensorimotor training: a human simulation case study. In Proceedings of the Seventh International Conference on Epigenetic Robotics.
Graziano, M. S., Aflalo, T. N. S., & Cooke, D. F. (2005). Arm movements evoked by electrical stimulation in the motor cortex of monkeys. Journal of Neurophysiology, 94.
Guigon, E., Baraduc, P., & Desmurget, M. (2008). Optimality, stochasticity and variability in motor behavior. Journal of Computational Neuroscience, 24(1):57-68.
Joel, D., Niv, Y., & Ruppin, E. (2002). Actor-critic models of the basal ganglia: new anatomical and computational perspectives. Neural Networks, 15(4-6).
Schultz, W., Dayan, P., & Montague, P. R. (1997). A neural substrate for prediction and reward. Science, 275.

48 References (continued)
Shadmehr, R., Orban de Xivry, J.-J., Xu-Wilson, M., & Shih, T.-Y. (2010). Temporal discounting of reward and the cost of time in motor control. Journal of Neuroscience, 30(31).
Sutton, R. S. (1990). Integrating architectures for learning, planning, and reacting based on approximating dynamic programming. In Proceedings of the Seventh International Conference on Machine Learning (ICML'90), San Mateo, CA. Morgan Kaufmann.
Todorov, E. & Jordan, M. I. (2002). Optimal feedback control as a theory of motor coordination. Nature Neuroscience, 5(11).
