Intelligence moderates reinforcement learning: a mini-review of the neural evidence

Articles in PresS. J Neurophysiol (September 3, 2014). doi:10.1152/jn.00600.2014

Neuro Forum

Chong Chen
Department of Psychiatry, Hokkaido University Graduate School of Medicine, Sapporo 060-8638, Japan. Tel: (81)11-706-5973; Fax: (81)11-706-5081; E-mail: cchen@med.hokudai.ac.jp

Keywords: Intelligence; Reinforcement learning; Prediction error; Model-based

Acknowledgements: I thank Peter Dayan and Nathaniel Daw for their shared ideas and discussion, and Atsuhito Toyomaki for his comments on the previous manuscript.

Copyright 2014 by the American Physiological Society.

Abstract: Our understanding of the neural basis of reinforcement learning and of intelligence, two key factors contributing to human strivings, has progressed significantly in recent years. However, the overlap of these two lines of research, namely how intelligence affects neural responses during reinforcement learning, remains largely uninvestigated. A mini-review of three existing studies suggests that higher IQ (especially fluid IQ) may enhance the neural signal of positive prediction error in the dorsolateral prefrontal cortex, dorsal anterior cingulate cortex, and striatum, several brain substrates of reinforcement learning or intelligence.

The computational theory of reinforcement learning (RL) has been one of the most frequently used instruments for explaining decision making and learning (Daw et al. 2011; Smittenaar et al. 2013). RL, the process of optimizing reward under the reinforcement of prediction error (PE), i.e., the difference between received and expected value, constitutes a major form of experiential learning essential for all human strivings. Dopamine neurons have been found to signal PE: their activity is enhanced by positive PE and suppressed by negative PE. Subjects use these PE signals to update value functions and guide their future behavior. This type of RL is termed model-free or habitual/automatic RL, since subjects rely entirely on PE, i.e., on the history of trial-and-error experience, to guide learning. In contrast, in model-based or goal-directed/controlled RL, subjects use, in addition to PE signals, higher-order cognitive maps or learned models to predict future outcomes, and can thus update value functions more flexibly (Daw et al. 2011; Smittenaar et al. 2013).
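As an illustration of the PE-driven updating described above, the following minimal Python sketch applies a simple delta rule. The function name `update_value`, the learning rate of 0.1, and the example reward sequence are assumptions introduced here for illustration; they are not taken from the reviewed studies, which fitted their own task-specific models.

```python
def update_value(value, reward, learning_rate=0.1):
    """One model-free (delta-rule) update.

    The prediction error (PE) is the received value minus the expected
    value; the expectation is then moved a fraction of the way toward
    the outcome. The learning rate is an illustrative assumption.
    """
    prediction_error = reward - value
    new_value = value + learning_rate * prediction_error
    return prediction_error, new_value


if __name__ == "__main__":
    value = 0.0  # initial expectation (assumed)
    for reward in [1.0, 1.0, 0.0]:  # hypothetical outcome sequence
        pe, value = update_value(value, reward)
        sign = "positive" if pe > 0 else "negative" if pe < 0 else "zero"
        print(f"reward={reward:.1f}  PE={pe:+.3f} ({sign})  value={value:.3f}")
```

A positive PE (outcome better than expected) raises the stored value and a negative PE lowers it. A model-based learner would additionally consult a learned model of the task, which this sketch deliberately omits.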
Recent neuroimaging studies have greatly elucidated the neural basis of RL: model-free RL typically involves the dopamine-rich striatum, whereas model-based RL also extends to the dorsolateral prefrontal cortex (PFC) (Smittenaar et al. 2013), medial PFC (Daw et al. 2011), anterior cingulate cortex (ACC), and orbitofrontal cortex (Rushworth et al. 2012), among other regions. Notably, most of these areas have also been linked to intelligence, another key factor contributing to human strivings. The volume of the left striatum is positively correlated with IQ (MacDonald et al. 2014), while the dorsolateral and medial PFC and the ACC, together with several other parieto-frontal areas, constitute the structural and functional substrates of intelligence (Deary et al. 2010).

Indeed, further investigation does reveal a promising association between intelligence and RL. It has been suggested that RL may reflect a stable individual difference, i.e., a trait (Cohen 2007). It is most likely that intelligence accounts for this stable trait, given that intelligence consistently predicts performance in almost all fields of human strivings, especially learning and education (Nisbett 2009). Further, intelligence is commonly perceived as consisting of crystallized intelligence (the whole of stored knowledge that can be used to solve problems) and fluid intelligence (the acquired ability to solve novel problems that depend little on stored knowledge) (Nisbett 2009), both of which could be used to construct cognitive maps and learned models, thus contributing to model-based RL. In light of the above reasoning, the purpose of this article is to review three functional magnetic resonance imaging studies (Van den Bos et al. 2012; Schlagenhauf et al. 2013; Hawes et al. 2014) that have examined the link between intelligence and RL.

General findings

Van den Bos et al. (2012) studied a sample of adolescents (age 13-16 years, mean=14.39; 23 male, 22 female; IQ 70-130, as measured by the Similarities and Block Design subscales of the Wechsler Intelligence Scale for Children) using a probabilistic learning task. The task included 50 AB and 50 CD trials, and the feedback was probabilistic: choosing A led to positive feedback in 80% of AB trials, whereas choosing B did so in only 20%. Similarly, choosing C and D led to positive feedback in 70% and 30% of CD trials, respectively. Since choosing A or C led to positive feedback more often, receiving negative feedback after choosing A or C would be unexpected and generate a negative PE. As choosing B or D led to negative feedback more often, receiving positive feedback after choosing B or D would generate a positive PE. The authors then examined the blood oxygen level-dependent (BOLD) signals after positive and negative PEs. They found that higher IQ was related to accentuated activation in the right dorsolateral PFC and dorsal ACC following positive PEs; however, this was not the case for negative PEs.
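The labeling of feedback as a positive or negative PE in this probabilistic task can be sketched as follows. This is a hypothetical illustration of the logic summarized above, not the authors' analysis code: the names `REWARD_PROB`, `classify_feedback`, and `simulate_trial` are introduced here, and treating any feedback probability above 0.5 as "expected positive" is an assumption.

```python
import random

# Probability of positive feedback for each stimulus, as described in the text.
REWARD_PROB = {"A": 0.8, "B": 0.2, "C": 0.7, "D": 0.3}


def classify_feedback(choice, positive):
    """Label one outcome as 'positive PE', 'negative PE', or 'expected'."""
    expects_positive = REWARD_PROB[choice] > 0.5  # assumed threshold
    if positive and not expects_positive:
        return "positive PE"  # unexpected positive feedback (after B or D)
    if not positive and expects_positive:
        return "negative PE"  # unexpected negative feedback (after A or C)
    return "expected"


def simulate_trial(choice, rng):
    """Draw probabilistic feedback for a choice and classify it."""
    positive = rng.random() < REWARD_PROB[choice]
    return positive, classify_feedback(choice, positive)


if __name__ == "__main__":
    rng = random.Random(0)  # seeded only to make the demo repeatable
    for choice in ["A", "B", "C", "D"]:
        positive, label = simulate_trial(choice, rng)
        print(choice, "positive" if positive else "negative", "->", label)
```

Under this labeling, positive feedback after choosing B or D counts as a positive PE, and negative feedback after choosing A or C counts as a negative PE, matching the description of the task above.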
Further, educational level moderated the correlation: dorsolateral PFC activity was correlated with IQ in both prevocational (mean IQ 91.3±2.4; r=0.54) and pre-university (mean IQ 107±2.0; r=0.59) subjects, but dorsal ACC activity was correlated with IQ only in pre-university subjects (r=0.44). Behaviorally, the authors observed that IQ was positively correlated with the number of correct choices in the last 60 trials (r=0.48). In addition, IQ was positively correlated with the win-stay choice strategy after expected positive feedback (r=0.54) and negatively correlated with it after unexpected positive feedback (r=-0.43). Clearly, once the rules of the task have been recognized, staying with the same choice after expected positive feedback and shifting the choice after unexpected positive feedback is the more optimal strategy.

Schlagenhauf et al. (2013) used a reversal learning task, which is frequently used to generate PEs. Subjects (age 22-61, mean 36.9±12.4; 28 male) were asked to choose one of two abstract targets on each of 200 trials; reward and loss were determined by three types of rules. In rule type 1, a reward was delivered if less than 80% of the chosen target had been rewarded, and a punishment was delivered otherwise. In rule type 2 the probability reversed to 20%, while in rule type 3 it switched to 50%. The rule type changed after 16 trials, or after 10 trials if subjects had made 70% correct choices. A temporal sequence of PEs was calculated for each subject, and the BOLD signal was modeled with trial-by-trial PE as the modulator. In this way the authors extracted the mean neural PE signals from the bilateral ventral striatum. Fluid IQ was derived from a factor analysis of nine tests targeting cognitive speed, attention and executive function, working memory, episodic memory, and reasoning. Crystallized IQ was measured with a verbal knowledge test. The hidden rules of the complex reversal learning task may have prevented subjects with higher IQ from fulfilling their potential and achieving better performance, as the authors found no association between IQ scores and correct responses. However, they did find a significant positive correlation between fluid IQ and the mean BOLD PE signal in the bilateral ventral striatum, even after controlling for age. Further analysis revealed that attention and reasoning underlay this correlation.
Since the authors extracted the mean PE signals without differentiating positive from negative PEs, this result might arise from a positive correlation between IQ and striatal signals of positive PE, and/or a negative correlation between IQ and striatal signals of negative PE.

Unlike the previous two studies, Hawes et al. (2014) employed a manipulated task in which each subject (18-38 years old, mean 22; 94 males; IQ 95.5-148.0, measured by the Wechsler Abbreviated Scale of Intelligence) underwent the same reinforcement history. Subjects had to guess whether an upcoming number would be low (1, 2, 3) or high (4, 5, 6), and received a reward of $2 for each correct guess and a punishment of -$1 for each incorrect guess. Unknown to the subjects, however, the computer responded such that every subject received the same sequence of 20 rewards and 20 losses.

The authors found that the %BOLD change in the striatum (caudate) after loss was positively predicted by IQ (beta=0.87), i.e., the %BOLD change was less negative with higher IQ, and this did not change significantly after controlling for caudate volume, age, and working memory performance. Further, the %BOLD change following loss was correlated with IQ in the posterior cingulate cortex (r=0.20, p<0.051, which should be taken with caution because of statistical issues), whereas the %BOLD changes following reward and loss were both correlated with IQ in the ventromedial PFC and left inferior frontal cortex (r ranging from 0.30 to 0.36).

Given the pseudorandom nature of the task, we cannot know what predictions subjects made. Nevertheless, since subjects would have been expecting reward before receiving feedback, reward feedback would generate a positive PE, and loss feedback would generate a negative PE. The findings of Hawes et al. (2014) could thus be interpreted as showing that more intelligent subjects had enhanced neural signals of positive PE in the ventromedial PFC and left inferior frontal cortex, and lessened neural signals of negative PE in the ventromedial PFC, left inferior frontal cortex, striatum, and posterior cingulate cortex.

Alternatively, these BOLD signals might reflect subjects' emotional responses. Both the ventromedial PFC and the striatum are involved in the receipt of reward and loss, and their activity may reflect the perceived magnitude of value as well as the resulting subjective experience (Diekhof et al. 2012). Moreover, the posterior cingulate cortex is also implicated in happy and sad mood (Nielsen et al. 2005). It is likely that intelligence enhances positive emotions following reward and buffers negative emotions following loss.
The latter is consistent with the observation that lower IQ exposes individuals to a higher risk of developing posttraumatic stress disorder (Bomyea et al. 2012). However, this interpretation in terms of emotion does not conflict with the interpretation in terms of PE, as emotion and PE may coexist.

Behaviorally, Hawes et al. (2014) demonstrated that higher-IQ subjects considered more historical information. Specifically, more intelligent subjects were influenced by feedback one and two periods back, i.e., they tended to treat three-trial combinations (high-high-high, high-high-low, etc.) as the potential rule of the task, whereas less intelligent subjects were primarily influenced by feedback only one period back, i.e., they tended to treat two-trial combinations (high-high, high-low, low-high, low-low) as the rule.

IQ & RL: a summary of the neural findings

Despite employing different tasks in different populations, the three studies provide pioneering insights into the role of intelligence in RL. Specifically, Van den Bos et al. (2012) demonstrated that higher IQ was associated with accentuated activation in the right dorsolateral PFC and dorsal ACC following positive PEs, especially in more intelligent subjects, while Schlagenhauf et al. (2013) and Hawes et al. (2014) suggested that higher IQ may be associated with enhanced neural signals following positive PE in the striatum, ventromedial PFC, and left inferior frontal cortex, and with lessened neural signals of negative PE (less reduced activation) in the striatum, ventromedial PFC, left inferior frontal cortex, and posterior cingulate cortex.

Explanation & implication

Given the tasks used by these studies, we cannot differentiate model-free from model-based RL in the above findings (Daw et al. 2011). Further, although it was originally proposed that striatal PE signals reflect exclusively model-free RL, more recent research has revealed that the striatum encodes model-based RL as well (Daw et al. 2011). In contrast, signals in the dorsolateral and medial PFC and dorsal ACC may indicate model-based RL (Daw et al. 2011; Rushworth et al. 2012; Smittenaar et al. 2013). Thus the findings reviewed above suggest that intelligence enhances model-based RL, though it may also improve model-free RL, confirming the observation that RL reflects a stable individual difference as a trait (Cohen 2007). In other words, the enhanced brain activation following positive PEs and the less reduced activation following negative PEs may reflect the fact that higher-IQ subjects, especially those with higher fluid IQ, were actively processing information to construct cognitive maps using model-based RL.
This is especially true in the context of the following literature. The dorsolateral PFC contributes to working memory, while the ACC monitors conflict (Deary et al. 2010; Van den Bos et al. 2012). The left inferior frontal cortex is related to semantic search and selection among competing representations (Bookheimer 2002), whereas the posterior cingulate cortex encodes and retrieves episodic memory (Nielsen et al. 2005). This explanation fits well with the fact that, when facing complex and difficult problems, people with higher IQ generally make more effort and show higher brain activation (Deary et al. 2010). It is also in line with, and may well explain, the behaviors observed by Van den Bos et al. (2012), in which more intelligent subjects achieved better performance and showed more optimal shifting behavior after receiving positive feedback, and by Hawes et al. (2014), in which more intelligent subjects considered more historical information.

Moreover, based on the association between IQ and striatal signals, it is also likely that intelligence amplifies model-free RL and thus both positive and negative PEs, but that the enhanced negative PEs were buffered or even overridden by model-based RL, resulting in the distinct findings for positive and negative PEs. Finally, a recent study (Lee et al. 2014) showed that the inferior frontal cortex encodes the reliability of PE signals and may act as an arbitrator determining whether model-free or model-based RL takes control. Consequently, the accentuated activation of the left inferior frontal cortex in more intelligent subjects found by Hawes et al. (2014) suggests the possibility that intelligence enhances this arbitration process.

Outlook

Though limited, converging evidence supports a promising role of intelligence in RL. Future studies should confirm and further elucidate this role. As a major concern, to clarify and differentiate the effects of intelligence on model-free and model-based RL, future studies should use tasks that dissociate these two types of RL (Daw et al. 2011). Further, since intelligence may affect positive and negative PEs differently, it is also preferable to analyze them separately.
Finally, as the reviewed studies were correlational in nature, future research could employ experimental manipulation of neural signals or intelligence training (Nisbett 2009) to establish causality.

References

Bomyea J, Risbrough V, Lang AJ. A consideration of select pre-trauma factors as key vulnerabilities in PTSD. Clin Psychol Rev 32(7):630-41, 2012

Bookheimer S. Functional MRI of language: new approaches to understanding the cortical organization of semantic processing. Annu Rev Neurosci 25:151-88, 2002

Cohen MX. Individual differences and the neural representations of reward expectation and reward prediction error. Soc Cogn Affect Neurosci 2(1):20-30, 2007

Daw ND, Gershman SJ, Seymour B, Dayan P, Dolan RJ. Model-based influences on humans' choices and striatal prediction errors. Neuron 69(6):1204-15, 2011

Deary IJ, Penke L, Johnson W. The neuroscience of human intelligence differences. Nat Rev Neurosci 11(3):201-11, 2010

Diekhof EK, Kaps L, Falkai P, Gruber O. The role of the human ventral striatum and the medial orbitofrontal cortex in the representation of reward magnitude - an activation likelihood estimation meta-analysis of neuroimaging studies of passive reward expectancy and outcome processing. Neuropsychologia 50(7):1252-66, 2012

Hawes DR, DeYoung CG, Gray JR, Rustichini A. Intelligence moderates neural responses to monetary reward and punishment. J Neurophysiol 111(9):1823-32, 2014

Lee SW, Shimojo S, O'Doherty JP. Neural computations underlying arbitration between model-based and model-free learning. Neuron 81(3):687-99, 2014

MacDonald PA, Ganjavi H, Collins DL, Evans AC, Karama S. Investigating the relation between striatal volume and IQ. Brain Imaging Behav 8(1):52-9, 2014

Nielsen FA, Balslev D, Hansen LK. Mining the posterior cingulate: segregation between memory and pain components. Neuroimage 27(3):520-32, 2005

Nisbett RE. Intelligence and how to get it: why schools and cultures count. New York: W.W. Norton & Co., 2009

Rushworth MF, Kolling N, Sallet J, Mars RB. Valuation and decision-making in frontal cortex: one or many serial or parallel systems? Curr Opin Neurobiol 22(6):946-55, 2012

Schlagenhauf F, Rapp MA, Huys QJ, Beck A, Wüstenberg T, Deserno L, Buchholz HG, Kalbitzer J, Buchert R, Bauer M, Kienast T, Cumming P, Plotkin M, Kumakura Y, Grace AA, Dolan RJ, Heinz A. Ventral striatal prediction error signaling is associated with dopamine synthesis capacity and fluid intelligence. Hum Brain Mapp 34(6):1490-9, 2013

Smittenaar P, FitzGerald TH, Romei V, Wright ND, Dolan RJ. Disruption of dorsolateral prefrontal cortex decreases model-based in favor of model-free control in humans. Neuron 80(4):914-9, 2013

Van den Bos W, Crone EA, Guroglu B. Brain function during probabilistic learning in relation to IQ and level of education. Developmental Cognitive Neuroscience 2(Suppl.), S78-S89, 2012