NSCS 344, Week 7

Processing your data

Last time we introduced Expected Value theory and modeled how it might behave on the survey. Today you are going to explore your own data (assuming you did the survey) as well as data from other students who have taken this class.
To get the data you need to download the file
riskyChoiceData_2020.mat
from D2L.
Once you've downloaded the data make sure you move it into the directory you are writing your scripts in for this week (i.e. the Week_07 directory in your NSCS folder on Dropbox).
Then, from a script, you can load the data like this
clear
load riskyChoiceData_2020
This loads a bunch of variables into Matlab. Let's take a look at what we've got using the "whos" command
whos
Name        Size       Bytes  Class     Attributes

  BR        1x146       1168  double
  P         1x12          96  double
  QUS       1x12        1818  cell
  V         1x12          96  double
  rsk       146x12     14016  double

We've got 5 variables here.
Note: Because I am making this example before this year's class completed the survey, I have fewer participants in my data set (146) than you will have.

The questions, QUS

We can explore these variables more in the Command Window. Let's start with the text of the questions, which we can get by typing ...
QUS'
ans = 12×1 cell
'50% chance to win $20'
'25% chance to win $19.74'
'26% chance to win $23.12'
'58% chance to win $11.93'
'50% chance to win $16.04'
'62% chance to win $14.41'
'28% chance to win $35.52'
'58% chance to win $18.79'
'71% chance to win $16.88'
'34% chance to win $38.04'
'36% chance to win $38.51'
'75% chance to win $19.97'
Each line here corresponds to the text of the risky option in each question (e.g. question 1 had a risky option of 50% chance to win $20). Remember that the safe option in this survey was always $10 for sure.
If you want to look at the text of just one question, say question 4, you can type
QUS{4}
ans = '58% chance to win $11.93'
Note that you have to use curly brackets { and } with cell arrays.
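To see the difference the brackets make, compare the two kinds of indexing for yourself (just an illustrative aside):
QUS(4)   % round brackets give you back a 1x1 cell array containing the text
QUS{4}   % curly brackets give you back the text itself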
We're not actually going to do any more with QUS - it's mainly in here just to reorient you to the survey and in case you want to refer to the questions themselves.

Probabilities and values

Next let's take a closer look at P and V ...
P
P = 1×12
0.5000 0.2500 0.2600 0.5800 0.5000 0.6200 0.2800 0.5800 0.7100 0.3400 0.3600 0.7500
V
V = 1×12
20.0000 19.7400 23.1200 11.9300 16.0400 14.4100 35.5200 18.7900 16.8800 38.0400 38.5100 19.9700
P is a vector of 12 numbers telling you the probability of winning the risky option for each question. So
P(4)
ans = 0.5800
consistent with a 58% chance on question four.
V is a vector of 12 numbers telling you the payout if you win the risky option in each question. So
V(4)
ans = 11.9300
consistent with a winning payout of $11.93 on question 4.
Using P and V we can compute the Expected Value of the risky option for each question. One way to do this is directly with a for loop
for i = 1:length(P)
    EV_risky(i) = P(i) * V(i);
end
Or you could do it using element-wise multiplication
EV_risky = P.*V;
Or you could reuse your EVtheory_survey function from last week. If you want to do this, be sure to copy and paste the function file into this week's directory ...
for i = 1:length(P)
    [EV_safe(i), EV_risky(i), choice(i)] = EVtheory_survey(10, P(i), V(i));
end
This approach also gives you the choice that pure EV theory would make for "free."
Let's see what those Expected Values are
EV_risky'
ans = 12×1
   10.0000
    4.9350
    6.0112
    6.9194
    8.0200
    8.9342
    9.9456
   10.8982
   11.9848
   12.9336
   13.8636
   14.9775
Note: I used the transpose here to transform EV_risky from a row vector into a column vector for easier viewing.
To get an even better idea about EV_risky let's plot it as a function of question number
clf;
plot(EV_risky, '.', 'markersize', 50)
xlabel('question number')
ylabel('EV_risky', 'interpreter', 'none')
set(gca, 'fontsize', 18)
This reveals the design of the experiment. I started with an "easy" question
QUS{1}
ans = '50% chance to win $20'
which has simple numbers and an Expected Value of exactly 10. Then I used more complicated questions like
QUS{2}
ans = '25% chance to win $19.74'
These questions were designed such that EV_risky would go from about 5 to about 15 in steps of 1. Hence the (very near) linear increase in EV_risky with question number from question 2 to 12 in the plot.
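You can check this design for yourself by looking at the steps between consecutive Expected Values (just a quick sanity check):
diff(EV_risky(2:end))
Each step should come out close to 1.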
Why did I design the experiment like this? Well, I wanted a range of differences between EV_risky and EV_safe (which remember is always 10). This will allow us to compute the choice curve for people just like we computed the choice curve for the model last week. However, before we can compute the choice curve, we need to take a closer look at the choice data ...

The choices

The actual choices for each person on each question are in the matrix rsk. For me this matrix has size 146 x 12, meaning that there are 146 rows (1 per subject) and 12 columns (1 per question).
Note you will have more subjects in your data set than I do because it will also include this year's participants.
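You can check the dimensions of your own copy of the data with the size function:
[numSubjects, numQuestions] = size(rsk)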
So if I look at rsk(103, 10) I get the choice of subject number 103 on question 10
rsk(103,10)
ans = 1
A value of 1 indicates that this person chose the risky option on this question; a 0 would indicate the safe option.
If we want to look at all the choices made by participant 103 we can write
rsk(103,:)
ans = 1×12
1 0 0 0 1 1 1 1 1 1 1 1
which gives us a row vector of length 12.
Note the special use of the ":" here - it's saying "take the whole of row 103."
If we want to look at an entire column (i.e. all the responses to one question, say question 5) we can write
rsk(:,5)
ans = 146×1
     1
     0
     1
     0
     0
     0
     0
     0
     0
     0
(only the first 10 entries are shown here)
Which gives us a long column vector containing one entry per participant.
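Because the choices are coded as 0s and 1s, a quick way to summarize a whole column is to sum it - for example, to count how many people chose the risky option on question 5 (just an illustrative aside):
sum(rsk(:,5))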
Finally, we can view the whole matrix by just typing in
rsk
rsk = 146×12
     1     1     1     1     1     1     1     1     1     1     1     1
     1     0     0     0     0     0     0     0     1     1     1     1
     0     1     1     1     1     1     0     1     1     0     0     1
     0     0     0     0     0     0     0     1     1     0     0     1
     0     0     0     0     0     0     1     0     1     1     1     1
     0     0     0     0     0     0     0     1     1     0     1     1
     0     0     0     1     0     1     0     1     1     1     0     1
     0     0     0     0     0     0     0     0     1     1     1     1
     1     0     0     0     0     0     1     1     1     1     1     1
     1     0     0     0     0     1     1     0     1     0     0     1
(only the first 10 rows are shown here)
However, as useful as it is to look at individual raw data points (and I always recommend doing this if you have data of your own - for example in your project for this class), it would be nice if we could visualize things a bit better. One way to do this is to make an image of the matrix like this ...
imagesc(rsk)
xlabel('question number')
ylabel('participant number')
set(gca, 'fontsize', 18)
In this plot we get to see the whole matrix at once. Question number is on the x-axis and subject number is on the y-axis. 1s (risky choices) are yellow and 0s (safe choices) are blue. There's definitely some structure here - there's more yellow on the right of the plot and more blue towards the left. We'll explore this in more detail in a moment, but first a ghost story ...

The Matlab ghost

It turns out that Matlab is haunted by the ghost of a child. You can see this for yourself by typing imagesc without any input ...
clf;
imagesc
In fact it's even spookier than that and there are actually all sorts of creepy dogs, people, and objects hidden in this image at different scales ... here's a gif I made of some of them ...
ghost_loop.gif
You can get this for yourself (without saving as a gif) using this code ...
clear
% grab Matlab's default image data and scale it up into the integer range
defImage = pow2(get(0,'DefaultImageCData'),47);
% make a cell array to hold the different versions of the image
imgCell = repmat({zeros(size(defImage))},8,7);
% each bit-shift reveals the structure hidden at a different scale
for shift = 0:52
    imgCell{shift+1} = bitshift(defImage,shift);
end
% tile all the versions into one big montage and display it
allImages = cell2mat(imgCell.');
imshow(allImages,[min(allImages(:)) max(allImages(:))]);
% or step through the images one at a time, one per second
figure(1); clf
for i = 1:52
    i   % echo the loop index to the Command Window
    imagesc(imgCell{i})
    title(num2str(i))
    pause(1)
end
Hopefully you're not too spooked to be able to continue with Matlab! Back to the choice curves ...

Human choice curve

The next thing we'd like to do is actually compute the average choice curve over all people. Specifically, for each question we are going to compute two things:
1. the frequency with which people chose the safe option
2. the difference in Expected Values between the safe and risky options, EV_safe - EV_risky
We will then plot these things against each other to get the choice curve.
We can get the frequency with which people choose the safe option as one minus the frequency with which they choose the risky option. The latter is given by the mean of the rsk matrix over the participants, which we can compute like this
f_risk = mean(rsk);
Note: By default, mean takes the average over the first dimension of a matrix (i.e. over the rows), which for us is the subjects. We will take the mean over the second dimension (the questions) later on, when we consider how likely a given person is to choose the risky option.
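If you ever forget which dimension mean is averaging over, a quick sanity check (an illustrative aside) is to look at the size of the output:
size(mean(rsk))     % 1 x 12  - one number per question (averaged over subjects)
size(mean(rsk,2))   % 146 x 1 - one number per subject (146 for my data set)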
Then we compute f_safe as
f_safe = 1 - f_risk;
We also already computed EV_safe and EV_risky above (if you went the element-wise multiplication route, you can get EV_safe with EV_safe = 10*ones(size(P)), since the safe option is always worth $10), so we can plot the choice curve like this ...
clf;
plot(EV_safe - EV_risky, f_safe, '.', 'markersize', 50)
set(gca, 'fontsize', 18)
xlabel('EV_safe - EV_risky', 'interpreter', 'none')
ylabel('probability of choosing safe option')
This looks (roughly) like the kind of curve we had last week. Indeed, if we borrow our softmax function from last time (i.e. copy and paste it into this week's folder) we can try to superimpose the model curve on this plot, for a particular value of sigma. First, let's compute the softmax curve like this ...
sigma = 3;
Delta = [-6:0.1:6];
p_safe_softmax = softmax(sigma, Delta);
Then let's plot it on the same plot like this ...
clf;
plot(EV_safe - EV_risky, f_safe, '.', 'markersize', 50)
hold on;
plot(Delta, p_safe_softmax, 'linewidth', 3)
set(gca, 'fontsize', 18)
xlabel('EV_safe - EV_risky', 'interpreter', 'none')
ylabel('probability of choosing safe option')
As you can see, the fit isn't terrible, even though we haven't tried to tune it. Of course, if you try different values of sigma you will get better or worse fits. Figuring out what the "best" value of sigma is is something we will come back to in the next two classes. But for now this is pretty nice - people's behavior looks like it's pretty well described by a softmax choice function with a noise standard deviation of around 3.
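By the way, if you no longer have last week's softmax function, here is a minimal sketch of the kind of function being assumed here (the exact form should match whatever you wrote last week - the logistic form below is one standard choice):
function p_safe = softmax(sigma, Delta)
% p_safe : probability of choosing the safe option
% Delta  : difference in Expected Values, EV_safe - EV_risky
% sigma  : decision noise - bigger sigma gives a shallower choice curve
p_safe = 1 ./ (1 + exp(-Delta / sigma));
end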

Blink rates

Now let's take a look at the blink rates. The BR variable is just a row vector with one entry - the blinks per minute - for each participant
BR'
ans = 146×1
         0
   10.0000
       NaN
   16.5000
    6.5000
   25.0000
   17.5000
   19.0000
   27.5000
   20.0000
(only the first 10 entries are shown here)
Note that some people didn't complete this part of the experiment and so their blink rate is given as NaN - which means not a number. These NaNs are a problem if we want to compute things like the average blink rate ...
averageBR = mean(BR)
averageBR = NaN
That's because the "mean" function includes the NaNs and doesn't really know what to do with them, so it just reports NaN as the output. The way to get around this is with the function nanmean
averageBR = nanmean(BR)
averageBR = 20.0911
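Note: nanmean comes with the Statistics Toolbox. If you don't have it, newer versions of Matlab's built-in mean accept an 'omitnan' flag that should give the same answer:
averageBR = mean(BR, 'omitnan')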
This average blink rate - about 20 blinks per minute - is pretty close to values found in the literature, which gives us some confidence that this very crude way of measuring blink rates (i.e. by having you count your own blinks in a video) isn't totally crazy.
If we want to know more about the blink rates of the class we could plot the distribution of blink rates using the "hist" command
clf;
hist(BR, 20)
xlabel('blink rate')
ylabel('number of participants')
set(gca, 'fontsize', 18)
Here you get a sense of the variability in blinking across people - some people barely blink at all, while others blink more than once a second! Again this distribution is consistent with the literature so we can be reasonably confident that our measure of blink rate is good.
But why do we measure blink rate in the first place? Well, it turns out that blink rate correlates with dopamine in the striatum - with more blinking associated with higher dopamine levels.
This correlation between blinks and dopamine offers us an opportunity to do some cognitive neuroscience. Specifically, there's some literature out there to suggest that increased dopamine is associated with increased risk taking behavior. If this is true then we'd predict that, in our experiment, more blinking is associated with a greater propensity to choose the risky option. In the next section we will test whether that holds ...

The relationship between blink rate and individual differences in risk taking

If blinks = dopamine and dopamine = risk taking then we predict that blinks = risk taking. To test this hypothesis, we can ask whether there's a significant correlation between each person's blink rate and the fraction of questions on which they chose the risky option.
To get each person's average risk taking (the fraction of questions on which they chose the risky option), we can simply take the mean over the second dimension of rsk, i.e. take the mean over the questions for each subject
meanRiskTaking = mean(rsk,2);
Then we can plot the relationship between blink rate and risk taking like this ...
clf;
plot(BR, meanRiskTaking, '.', 'markersize', 50)
xlabel('blink rate')
ylabel('average risk taking')
set(gca, 'fontsize', 18)
Does this look like a real relationship? It's hard to say, so let's try adding a best fitting straight line - in Matlab you can get one with the lsline function
clf;
plot(BR, meanRiskTaking, '.', 'markersize', 50)
xlabel('blink rate')
ylabel('average risk taking')
set(gca, 'fontsize', 18)
l = lsline;
set(l, 'linewidth', 3, 'color', 'r')
Well, at least it's a line with a positive slope! But we can do some formal statistics in Matlab too. Let's use the "corr" function to compute the correlation coefficient and the statistical significance ...
[r,p] = corr(BR', meanRiskTaking)
r = NaN
p = NaN
Oops! We have to deal with the NaNs. Unfortunately there's no "nancorr" like there is a "nanmean"; instead, we will have to do it manually.
First find the indices of BR that are not NaNs ...
ind = find(~isnan(BR))
ind = 1×135
1 2 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 28 29 31 32 33 34 35 37 38 39 40 42 44 49 50 51 52 53 54 55 56 57 58 59 60
(only the first 50 entries are shown here)
That's most, but not all, of the data points. Then run the correlation on just these participants ...
[r,p] = corr(BR(ind)', meanRiskTaking(ind))
r = 0.0889
p = 0.3051
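(As an aside, the corr function does have a built-in option for handling NaNs - passing 'rows','complete' tells it to ignore any rows containing NaNs, and should give the same answer as the manual approach:
[r,p] = corr(BR', meanRiskTaking, 'rows', 'complete')
but doing it yourself with find and isnan is a good exercise.)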
That's a small correlation coefficient (r = 0.0889) and a high p value. Not strong support for our hypothesis.
If you do the Extra Credit item in the Assignment you will take another look at these data with a different analysis that provides a hint that maybe there is a relationship after all. Indeed, in another experiment in my other classes (PSY 333 and PSY 433) I've consistently found a relationship between blinks and risk taking, although the effect size is small, so it takes hundreds of participants' worth of data to detect.