NSCS 344, Week 4

Reinforcement Learning

In this section of the class we are going to switch our focus from perceptual decision making to learning from reward and punishments - aka reinforcement learning.
The classic example of reinforcement learning comes from one of the most famous experiments in all of Psychology - the classical conditioning experiments of Pavlov and his dogs.
Picture1.png
In this experiment Pavlov paired the ringing of a bell with the presentation of food (a reward). After a while, the dogs learned to associate the food reward with the bell and would salivate upon hearing the bell, even if no food was actually given.
Indeed, this experiment is so famous that one of the best descriptions of classical conditioning comes from a TV show. Check out this clip from "The Office" (https://vimeo.com/35754924).
Picture2.png
This week we are going to model learning in a task like this and show how a very simple model can explain (some of) the firing of dopaminergic neurons in the VTA - probably the most important breakthrough in computational cognitive neuroscience in the last 25 years.

Modeling learning from slot machines

In Pavlov's experiments, during the learning phase, the food reward was reliably delivered on every trial. However, in more realistic situations where animals learn from feedback, the reward is less predictable. For example, consider a cheetah hunting ...
Picture3.png
Only about 40-50% of hunts end successfully (for the cheetah), so the reward feedback is probabilistic. In addition, the success rate varies depending on the type of prey the cheetah is hunting and how close it can get to the prey before starting its attack. By learning from feedback, the cheetah can learn to identify the types of prey and the types of situations that are most likely to end in a meal.
We will model this situation in the simplest possible way by making an analogy between an animal hunting and a person playing a (very simple) slot machine, or "one-armed bandit".
Picture4.png
In its simplest form, when you play a slot machine you put in some money, pull the lever (the "one-armed" part) and then get a reward delivered with some probability. For simplicity, we will assume that the reward is either $0 or $1 and that the goal is to learn how rewarding the slot machine is on average.

Modeling the slot machine

First let's model the slot machine. We will assume that the true probability of winning is $p_{win}$, and for demonstration purposes we will assume that this win rate is 0.4. In code, we start our script with ...
clear
% probability of winning
p_win = 0.4;
Then, let's imagine playing the slot machine once. We need to generate a reward of 1 with probability $p_{win}$ and 0 with probability $1 - p_{win}$. We can do this using the binornd function just like we did when we modeled the dots task ...
% simulate one play of the bandit
r = binornd(1, p_win)
r = 0
However, we don't just want one reward, we want to simulate a whole bunch of plays. Let's assume we have T plays and let's set $T = 10$ just to get started. We'll set T at the start of our script and write a for loop to compute the rewards. We will also store the rewards we generate in a vector so we can refer back to them later.
Note: This storing part is important because it will allow us to compare the performance of different learning algorithms on the exact same rewards.
So now our script looks like this ...
clear
% probability of winning
p_win = 0.4;
% number of plays
T = 10;
% simulate playing the bandit
for i = 1:T
r(i) = binornd(1, p_win);
end
r
r = 1×10
0 0 1 0 0 0 1 1 1 1
So now, r is a vector of rewards from the 10 trials.
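One small optional tweak: because r grows by one element on each pass through the loop, Matlab's editor may flag the r(i) = ... line with a warning. Preallocating the vector before the loop avoids this (and speeds things up when T is large) ...
% preallocate the reward vector before the loop (optional, but good practice)
r = zeros(1, T);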
The last thing, just to make things even cleaner, is to create a function that we can call to play T plays of a slot machine with reward probability $p_{win}$. This function looks like this ...
function r = playSlotMachine(p_win, T)
% simulate playing the bandit
for i = 1:T
r(i) = binornd(1, p_win);
end
and it saves us two lines in our script (not very much, but it's good to get practice making functions) ...
clear
% probability of winning
p_win = 0.4;
% number of plays
T = 10;
% simulate playing the bandit
r = playSlotMachine(p_win, T)
r = 1×10
0 0 1 0 0 1 1 0 1 1
Running this gives us a new set of 10 reward outcomes. We can also plot these reward outcomes over time like this
clf;
plot(r, '.', 'markersize', 30)
xlabel('play number')
ylabel('reward')
set(gca, 'fontsize', 18)
This clearly shows that the outcomes are either 0 or 1.
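As one more optional aside (just a sketch, not something the rest of the code relies on), binornd can generate all T outcomes in a single call if you pass it an output size, so the whole function can be written without a loop ...
function r = playSlotMachine(p_win, T)
% simulate T plays of the bandit in one call, with no loop
r = binornd(1, p_win, 1, T);
Either version behaves the same way; the loop version just makes the trial-by-trial structure more explicit.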

A simple model of learning

So now that we have a bunch of reward outcomes, how can we learn from them?
One very simple thing we could do is to compute the average reward. We can write this mathematically as
$$V = \frac{1}{T} \sum_{t=1}^{T} r_t$$
In Matlab, we could implement this formula explicitly like this
V = sum(r) / T
V = 0.5000
Or we could be lazy and simply use the mean function
V = mean(r)
V = 0.5000
Both give exactly the same answer because they are doing exactly the same thing.
So far so good. But it would be nice to compute what the model has learned as a function of time so we can see how the value changes over time. Specifically, I'd like to know what the value $V_t$ is after I've seen t data points, where t goes from 1 to 10. Written mathematically, $V_t$ is
$$V_t = \frac{1}{t} \sum_{i=1}^{t} r_i$$
To get this in Matlab I need to figure out how to take the mean over a subset of the elements of r. For example, to compute $V_3$ I need to take the mean over the first three rewards. One way to do this is like this ...
t = 3;
V(t) = mean(r(1:t));
where the 1:t lets us pull out the elements of r from time 1 to time t (which in this case I set to 3)
r(1:t)
ans = 1×3
0 0 1
Now I just need to repeat this computation for every t up to the length of r (which is also equal to T). I can do this with a for loop
% compute running average
for t = 1:length(r)
V(t) = mean(r(1:t));
end
V
V = 1×10
0 0 0.3333 0.2500 0.2000 0.3333 0.4286 0.3750 0.4444 0.5000
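As an optional aside (not something you need for the assignment), the same running average can be computed without a loop, because the mean of the first t rewards is just their cumulative sum divided by t ...
% running average without a loop: cumulative sum divided by the trial number
V = cumsum(r) ./ (1:length(r));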
Let's quickly turn the loop version into a function just to make our final code a little cleaner ...
function V = simpleModel(r)
% compute running average
for t = 1:length(r)
V(t) = mean(r(1:t));
end
So now my script looks like this ...
clear
% probability of winning
p_win = 0.4;
% number of plays
T = 10;
% simulate playing the bandit
r = playSlotMachine(p_win, T)
r = 1×10
1 1 1 0 1 0 0 1 0 0
% compute values from simple model
V_simple = simpleModel(r);
Finally I can plot the rewards and values on the same plot like this ...
clf;
plot(r, '.', 'markersize', 30)
hold on
plot(V_simple, '-', 'linewidth', 3)
legend({'rewards' 'V_simple'}, 'interpreter', 'none')
xlabel('play number')
ylabel('reward')
set(gca, 'fontsize', 18)
Note the
'interpreter', 'none'
part in the legend changes how the underscore in V_simple is rendered. Without it, Matlab treats the underscore as TeX markup and the 's' in 'simple' gets rendered as a subscript.
Looking at this plot you can see how the learned value changes over time. It starts off at whatever the first reward is, then if there's a win, V_simple goes up, and if there's a loss V_simple goes down. Over time it moves towards the average payoff of 0.4. To really see this, let's do a longer simulation ...
clear
% probability of winning
p_win = 0.4;
% number of plays
T = 1000;
% simulate playing the bandit
r = playSlotMachine(p_win, T)
r = 1×1000
0 1 0 0 0 0 0 1 0 1 0 0 0 1 0 0 1 0 0 0 0 0 1 0 1 0 0 0 1 0 0 0 1 0 0 1 1 1 0 0 0 0 1 0 0 1 0 0 0 0
% compute values from simple model
V_simple = simpleModel(r);
clf;
plot(r, '.', 'markersize', 30)
hold on
plot(V_simple, '-', 'linewidth', 3)
plot([0 T], [1 1]*p_win, 'k--')
legend({'rewards' 'V_simple' 'p_win'}, 'interpreter', 'none')
xlabel('play number')
ylabel('reward')
set(gca, 'fontsize', 18)
Here I've also added a dashed line at the true reward probability. Obviously in this plot you can't really make out the individual rewards anymore! But hopefully you can see that the model gets (on average) closer and closer to the true reward probability over time.

Taking the average in a different way

OK, so we've modeled a simple learning algorithm and it seems to do OK at estimating the mean payoff (as it should, since computing the mean is all it's doing!). However, there's a big conceptual problem with the model as it's currently written. Even if humans and animals are computing the mean of the rewards, it is unlikely they are doing it using this equation
$$V_t = \frac{1}{t} \sum_{i=1}^{t} r_i$$
Before going on, see if you can see the problem with this equation. Specifically, think about how many terms are in the sum when $t = 1000$. How easy do you think it would be to compute a sum like that directly?
Did you get it? The problem with our current equation for $V_t$ is that we have to keep track of all the rewards we've seen so far to compute it. So when $t = 1000$ we have to keep track of 1000 rewards. If $t = 1{,}000{,}000$ we need to keep track of a million rewards. This doesn't seem plausible. As smart as we are, no one can remember a random string of a million 0s and 1s (the world record for remembering binary digits in 30 minutes is 7,485 by Ryu Song in 2019).
Just on this basis we can already throw out the simplest form of our model. However, it turns out that we can tweak it slightly to perform the exact same calculation while only keeping track of three numbers.
What I'm going to do now is show you the math for this. It's quite a bit of algebra and if you're not a math fan it can get a bit heavy. However, please try to follow along as much as you can because it's really beautiful how the mean of a series of numbers can be computed iteratively (i.e. one step at a time) in this way ...
To get our new algorithm I'm going to start with the equation for our old one
$$V_t = \frac{1}{t} \sum_{i=1}^{t} r_i$$
What I'm then going to do is rearrange this expression for $V_t$ in terms of three things: the reward on the last trial, $r_t$, the time $t$, and the value I computed after $t-1$ trials, $V_{t-1}$. Note that $V_{t-1}$ can be written as
$$V_{t-1} = \frac{1}{t-1} \sum_{i=1}^{t-1} r_i$$
OK, so to begin I'm going to start from my expression for $V_t$ and expand out the sum ...
$$V_t = \frac{1}{t} \left( r_1 + r_2 + \dots + r_{t-1} + r_t \right)$$
I'm also going to expand out the sum for $V_{t-1}$
$$V_{t-1} = \frac{1}{t-1} \left( r_1 + r_2 + \dots + r_{t-1} \right)$$
Note the similarity between these two expressions. Both have a sum over a bunch of reward values, but the sum for $V_{t-1}$ doesn't include $r_t$.
This allows me to replace the sum over most of the reward terms in the equation for $V_t$ with $(t-1) V_{t-1}$, because
$$r_1 + r_2 + \dots + r_{t-1} = (t-1) V_{t-1}$$
So ...
$$V_t = \frac{1}{t} \left( (t-1) V_{t-1} + r_t \right)$$
This has already done the job and I've gotten down to three terms: $t$, $V_{t-1}$ and $r_t$. However, I can go one step further to really get something cool ...
In particular, I can expand out the $(t-1) V_{t-1}$ term and divide through by $t$ like this ...
$$V_t = \frac{t-1}{t} V_{t-1} + \frac{1}{t} r_t = V_{t-1} - \frac{1}{t} V_{t-1} + \frac{1}{t} r_t$$
If we rearrange this slightly we get this ***HUGELY IMPORTANT EQUATION***
$$V_t = V_{t-1} + \frac{1}{t} \left( r_t - V_{t-1} \right)$$
This is a super important equation and gives a ton of intuition about how our model works. First, we have written $V_t$ in terms of just three things, $t$, $r_t$ and $V_{t-1}$, so we don't have to be a memory champion to implement this model.
Second, the form of the equation tells us how the values change over time. The first term on the right hand side of this equation is $V_{t-1}$: we can think of this as saying we start our estimate for the value on this trial from the value we computed for the last trial. We then update this last value according to $\frac{1}{t} \left( r_t - V_{t-1} \right)$.
The update itself has two components. The $\frac{1}{t}$ bit tells us that the size of the update gets smaller and smaller as $t$ gets bigger. This term is called the learning rate because it scales the overall size of the update. Roughly speaking, the bigger it is, the faster we learn.
The $\left( r_t - V_{t-1} \right)$ bit tells us that the amount that we update the value should scale with the prediction error, i.e. the difference between the value we predicted based on the evidence so far ($V_{t-1}$) and the actual reward ($r_t$).
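To make this concrete, here's a quick worked example with made-up numbers. Suppose the first two rewards were 0 and 1, so $V_2 = 0.5$, and the third play is a win, $r_3 = 1$. The prediction error is $1 - 0.5 = 0.5$, the learning rate is $\frac{1}{3}$, and the update gives $V_3 = 0.5 + \frac{1}{3}(1 - 0.5) = 0.6667$, which is exactly the mean of 0, 1 and 1. A two-line check in Matlab (just illustrative) ...
V2 = mean([0 1])            % value after two made-up rewards, 0 and 1
V3 = V2 + 1/3 * (1 - V2)    % prediction error update with r_3 = 1, gives 0.6667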

Implementing the prediction error model

Let's implement the prediction error model in Matlab. This is almost as simple as translating the equation into Matlab form. So our update for value will be
V(t) = V(t-1) + 1/t * (r(t) - V(t-1));
and we want to do this for all times so we might put a for loop around it to get ...
for t = 1:length(r)
V(t) = V(t-1) + 1/t * (r(t) - V(t-1));
end
However, if you try to run this you will get an error. The error comes on the first run through the loop, when $t = 1$. The issue is that when $t = 1$ the code says to get the value of V(t-1), which is V(0). Unfortunately, Matlab doesn't let us index the 0th element of a vector - the zeroth element simply doesn't exist. So instead we need to tell Matlab to do something slightly different on the first run through the loop. We can do this like this ...
clear
% probability of winning
p_win = 0.4;
% number of plays
T = 10;
% simulate playing the bandit
r = playSlotMachine(p_win, T);
for t = 1:length(r)
if t == 1
V(t) = r(t);
else
V(t) = V(t-1) + 1/t * (r(t) - V(t-1));
end
end
Note: I've also filled in the early elements of the script just so it's all in one place.
Note also that the line setting V(t) = r(t) for the first time point is what we would get if the predicted value on the first trial (V(0), if it existed) were zero: plugging V(0) = 0 into the update gives V(1) = 0 + (1/1)*(r(1) - 0) = r(1).
Now let's put our model code in a function and compare the performance of the prediction error model with the averaging model (can you guess how they will compare?)
function V = predictionErrorModel1(r)
% prediction error form of the averaging model
for t = 1:length(r)
if t == 1
V(t) = r(t);
else
V(t) = V(t-1) + 1/t * (r(t) - V(t-1));
end
end
Now our script becomes
clear
% probability of winning
p_win = 0.4;
% number of plays
T = 10;
% simulate playing the bandit
r = playSlotMachine(p_win, T);
% simulate learning with the averaging model
V_simple = simpleModel(r);
% simulate learning with the prediction error version of the averaging
% model
V_predictionError = predictionErrorModel1(r);
clf;
plot(r, '.', 'markersize', 30)
hold on;
plot(V_simple, 'linewidth', 3)
plot(V_predictionError, 'linewidth', 3)
xlabel('trial number')
ylabel('values and rewards')
set(gca, 'fontsize', 18)
legend({'rewards' 'V_simple' 'V_predictionError'}, 'interpreter', 'none')
But wait! Where is the red line for V_simple?
Well, it's underneath the yellow line for V_predictionError! These two models give exactly the same values because these two models are just mathematical rearrangements of each other! We can see this if we make the linewidth for V_simple a bit bigger ...
clf;
plot(r, '.', 'markersize', 30)
hold on;
plot(V_simple, 'linewidth', 10)
plot(V_predictionError, 'linewidth', 3)
xlabel('trial number')
ylabel('values and rewards')
set(gca, 'fontsize', 18)
legend({'rewards' 'V_simple' 'V_predictionError'}, 'interpreter', 'none')
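If you want to check this numerically (an optional sanity check), you can look at the largest absolute difference between the two sets of values - up to floating point rounding error it should be zero ...
% largest difference between the two models' values (should be ~0)
max(abs(V_simple - V_predictionError))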

The problem with averaging

Unfortunately, there's another problem with the averaging model that the rearrangement into the prediction error model can't solve - at least not without a tweak. To see this, let's imagine a slightly different learning situation, one that involves a bait and then a switch ...
Let's imagine playing a slot machine that pays out reasonably well - 95% of the time - for 1000 trials - this is the "bait". Then let's imagine switching it so that it pays out just 5% of the time and playing for another 100 trials. What we would like our learning model to do is adjust to this change relatively quickly. Indeed, people can detect a change like this in 10 trials or less. However, when we have our averaging model try to deal with this situation we run into trouble ...
First let's set up a series of bait and switch rewards
clear
% simulate playing the bandit
r_bait = playSlotMachine(0.95, 1000);
r_switch = playSlotMachine(0.05, 100);
r = [r_bait r_switch];
Note: the last line here concatenates the two vectors r_bait and r_switch. Basically, concatenate just means sticking them together.
If we plot the resulting rewards we get
clf
plot(r, '.', 'markersize', 30)
xlabel('trial number')
ylabel('reward')
set(gca, 'fontsize', 18)
If you look carefully at this plot, you can actually see the frequency of rewards going down after trial 1000. Now let's run our prediction error model on these rewards ...
clear
% simulate playing the bandit
r_bait = playSlotMachine(0.95, 1000);
r_switch = playSlotMachine(0.05, 100);
r = [r_bait r_switch];
% simulate learning with the prediction error version of the averaging
% model
V_predictionError = predictionErrorModel1(r);
clf
plot(r, '.', 'markersize', 30)
hold on;
plot(V_predictionError,'linewidth', 3)
plot([0 1000 1001 1100], [0.95 0.95 0.05 0.05],'k--')
legend({'rewards' 'values' 'true reward probability'}, 'location', 'west')
xlabel('trial number')
ylabel('reward')
set(gca, 'fontsize', 18)
As you can see, the model does a great job of estimating the reward probability before the change, but a really bad job of estimating the reward probability after the change. This model is susceptible to a bait-and-switch scam.

Solving the bait-and-switch problem with a constant learning rate

The cause of the bait-and-switch problem in the averaging model is that the learning rate decreases over time. Remember the learning rate scales how much the prediction error changes the value, and in the averaging model the learning rate is $\frac{1}{t}$, which gets very small as $t$ increases.
One solution to the bait-and-switch problem then involves just replacing the $\frac{1}{t}$ with a constant number between 0 and 1. By convention (at least by the convention of some people), this learning rate parameter is labeled $\alpha$. So our new update equation becomes ...
$$V_t = V_{t-1} + \alpha \left( r_t - V_{t-1} \right)$$
Now, this tweak to the model is kind of a hack, and the model is no longer equivalent to the averaging model (although it's only kind of a hack - there is a way to link it back to a kind of averaging that we will explore in the extra credit part of the assignment). However, it does a nice job of adapting to the change. Let's implement the model in Matlab to see (remember, we still need to do something special for the very first trial). Let's put this straight into a function
function V = fixedLearningRate(r, alpha)
for t = 1:length(r)
if t == 1
V(t) = r(t);
else
V(t) = V(t-1) + alpha * (r(t) - V(t-1));
end
end
and then use our script to compare the fixed learning rate model with the averaging model ...
clear
% simulate playing the bandit
r_bait = playSlotMachine(0.95, 1000);
r_switch = playSlotMachine(0.05, 100);
r = [r_bait r_switch];
% simulate learning with the prediction error version of the averaging
% model
V_predictionError = predictionErrorModel1(r);
% fixed learning rate model
alpha = 0.1;
V_fixed = fixedLearningRate(r, alpha);
clf
plot(r, '.', 'markersize', 30)
hold on;
plot(V_predictionError,'linewidth', 3)
plot(V_fixed)
plot([0 1000 1001 1100], [0.95 0.95 0.05 0.05],'k--')
legend({'rewards' 'V_predictionError' 'V_fixed' 'true reward probability'}, 'location', 'west', 'interpreter', 'none')
xlabel('trial number')
ylabel('reward')
set(gca, 'fontsize', 18)
As you can see, the new model does a much better job of adapting to the change. However, this appears to come at the cost of doing worse in the period of stability before the change. In the Assignment you will simulate the fixed learning rate model with a bunch of different learning rates to see how changing this parameter results in a tradeoff between the performance in periods of stability and performance in periods of change.

Prediction errors and dopamine

Somewhat amazingly, our super simple model of learning based on prediction errors may actually have some basis in the brain.
In a classic study, Wolfram Schultz trained monkeys to associate pictures with a juice reward. This experiment was basically a 1990s update on Pavlov's experiments with dogs. Instead of a bell, the monkeys saw an image of a fractal (chosen so that they could easily generate lots of different, but interesting images that the monkey had never seen before) and a short time after the image appeared, a drop of juice was delivered to the monkey through a feeding tube.
By switching the reward to juice, rather than food, Schultz was able to deliver the reward at a very precise time, and the monkeys could learn to predict not only that they would get a reward, but that they would get the reward at a particular time.
Then he measured the activity of their dopamine neurons at the time the reward was delivered (or when the reward should have been delivered). Doing this he found clear evidence that dopamine activity at the time of reward looks a lot like a prediction error signal ...
First, he looked at how dopamine responded to unpredicted rewards. In this case the predicted value = 0 and the reward = 1 drop of juice, so the prediction error should be 1. Looking at the response of the neurons, he found an uptick in firing of the dopamine neurons right after this unexpected reward was delivered ...
Picture5.png
Next he asked what happens after the monkey has learned to predict the reward. In this case the predicted value is 1 and the reward is also 1, so the prediction error should be zero. In line with this prediction, dopamine neurons did not change their firing at the time of reward when the reward was predicted.
Picture6.png
Finally, he asked what happens when the monkey expects a reward (predicted value of 1) but it is not delivered (reward of 0). In this case the prediction error should be negative, and indeed we see a reduction in dopamine firing at the time when the reward was expected to be delivered.
Picture7.png
In the assignment you will model this for yourself to show how the model gives a big positive prediction error to an unexpected reward, no prediction error to a fully expected reward, and a negative prediction error when an expected reward is omitted.
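As a preview of the logic (the assignment will have you build the full simulation), the sign of the dopamine response in each of Schultz's three cases falls straight out of the prediction error $r_t - V_{t-1}$; the values below are just illustrative ...
% prediction error (reward minus predicted value) in Schultz's three cases
delta_unexpected = 1 - 0    % unexpected reward: positive prediction error
delta_expected   = 1 - 1    % fully predicted reward: zero prediction error
delta_omitted    = 0 - 1    % expected reward omitted: negative prediction error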