Multiarmed Bandits: Chapter 2 IRL

Introduction

This post is the first in a series where I work through examples and questions from the brilliant book by Sutton and Barto (1998). You can download a PDF of the newly updated edition online; just google it. I am planning to go through each chapter and illustrate one or two examples from it.

Multi-Armed Bandits

What on earth is a multi-armed bandit? It might be easier to think of it as choosing which pokie to play down at the local. Say there are 10 pokies at the pub; you need to decide which machine to play, but you don’t know which machine pays out the most, so you need to explore and discover where the good odds are.

Let’s define a metric: return per dollar spent. That is, for every dollar you put into a machine, how much do you get back (for example 0, 1.5, or -0.5)?

Let us set up 10 pokie machines whose returns are normally distributed, but with differing mean returns.

# True mean return of each of the 10 machines, drawn from a standard normal
rewardMeans <- rnorm(10)
plot(rewardMeans)

From our little plot we can see which machine has the highest mean return; in my run it was machine 3 (your numbers will differ, since the means are drawn at random).
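
If you would rather not read it off the plot, you can ask R directly which machine has the largest mean. This just inspects the rewardMeans vector we created above:

# Index of the machine with the highest true mean return
which.max(rewardMeans)
# Its mean return per dollar
max(rewardMeans)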

Now let us set up our learning problem for our little agent to learn the best machine.

The algorithm

\[\begin{aligned}
&\text{For each of the } k \text{ actions } a \text{, initialise:} \\
&\qquad valueFunction(a) \leftarrow 0 \\
&\qquad countFunction(a) \leftarrow 0 \\
&\text{Repeat until the iteration count is reached:} \\
&\qquad action \leftarrow \begin{cases} \text{the action that maximises } valueFunction & \text{with probability } 1 - \epsilon \\ \text{a random action} & \text{with probability } \epsilon \end{cases} \\
&\qquad reward \leftarrow bandit(action) \\
&\qquad countFunction(action) \leftarrow countFunction(action) + 1 \\
&\qquad valueFunction(action) \leftarrow valueFunction(action) + \frac{1}{countFunction(action)}\,\bigl(reward - valueFunction(action)\bigr)
\end{aligned}\]
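
The 1/countFunction(action) step size makes valueFunction the running sample average of the rewards seen for that action. Here is a quick sketch (my own check of the algebra, not code from the book) showing that the incremental update and a plain mean agree:

# Incremental sample-average update: Q <- Q + (1/n) * (r - Q)
rewards <- c(1.2, -0.3, 0.7, 2.1)
Q <- 0
n <- 0
for (r in rewards) {
  n <- n + 1
  Q <- Q + (1 / n) * (r - Q)
}
Q              # incremental estimate
mean(rewards)  # identical to the plain sample mean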

Programming Example

# Average results over many independent runs
averagerewardAverage <- vector()
averageactions <- vector()

for(w in 1:500) {
  # Setup our functions
  # Number of pokies
  k <- 10
  # Estimated value and pull count for each machine, both starting at zero
  valueFunction <- rep(0, k)
  countFunction <- rep(0, k)

  # Number of steps per run (avoid calling it T, which masks TRUE in R)
  nSteps <- 1000
  rewardTotal <- vector()
  rewardAverage <- vector()
  actions <- vector()
  # Explore with probability epsilon, otherwise act greedily
  epsilon <- 0.1
  for(x in 1:nSteps) {
    if(runif(1) < epsilon) {
      action <- sample(1:k, size = 1)
    } else {
      action <- which.max(valueFunction)
    }
    actions[x] <- action
    # reward <- bandit(action): each pull is a normal draw around the machine's true mean
    reward <- rnorm(1, mean = rewardMeans[action])
    rewardTotal[x] <- reward
    rewardAverage[x] <- mean(rewardTotal)
    # Incremental sample-average update of the chosen machine's value
    countFunction[action] <- countFunction[action] + 1
    valueFunction[action] <- valueFunction[action] + (1 / countFunction[action]) * (reward - valueFunction[action])
  }

  averagerewardAverage <- rbind(averagerewardAverage, rewardAverage)
  averageactions <- rbind(averageactions, actions)
}

# Average reward and action at each step, across the 500 runs
averagerewardAverage <- colMeans(averagerewardAverage)
averageactions <- colMeans(averageactions)

Let us plot our results

plot(averagerewardAverage)

Now, considering our best machine has a mean payout of around 1.5, it seems our agent is not that great at finding the best machine. This highlights the huge problem with reinforcement learning: it is HARD.
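
One way to quantify the gap (assuming the rewardMeans and averagerewardAverage objects from the run above are still in your session) is to compare the average reward near the end of learning against the best machine's true mean:

# Average reward over the last 100 steps, averaged across the 500 runs
mean(tail(averagerewardAverage, 100))
# Best achievable expected reward per step
max(rewardMeans)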

Optimistic Initial Values with greedy epsilon = 0

The trick here is to start every value estimate well above any plausible payout. A purely greedy agent (epsilon = 0) is then pushed to try each machine, because whichever arm it pulls comes back disappointing and its estimate drops below the still-optimistic alternatives. Starting the counts at 10 as well means the optimism decays slowly instead of being wiped out by the first pull.

averagerewardAverage <- vector()
averageactions <- vector()

for(w in 1:500) {
  # Setup our functions
  # Number of pokies
  k <- 10
  # Optimistic initial values: start every estimate (and its count) at 10
  valueFunction <- rep(10, k)
  countFunction <- rep(10, k)

  # Number of steps per run
  nSteps <- 1000
  rewardTotal <- vector()
  rewardAverage <- vector()
  actions <- vector()
  # Purely greedy: never explore at random
  epsilon <- 0
  for(x in 1:nSteps) {
    if(runif(1) < epsilon) {
      action <- sample(1:k, size = 1)
    } else {
      action <- which.max(valueFunction)
    }
    actions[x] <- action
    # reward <- bandit(action)
    reward <- rnorm(1, mean = rewardMeans[action])
    rewardTotal[x] <- reward
    rewardAverage[x] <- mean(rewardTotal)
    countFunction[action] <- countFunction[action] + 1
    valueFunction[action] <- valueFunction[action] + (1 / countFunction[action]) * (reward - valueFunction[action])
  }

  averagerewardAverage <- rbind(averagerewardAverage, rewardAverage)
  averageactions <- rbind(averageactions, actions)
}

averagerewardAverage <- colMeans(averagerewardAverage)
averageactions <- colMeans(averageactions)

Let us plot our results

plot(averagerewardAverage)
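
Since we also kept the actions, we can see which machine the greedy agent settles on. averageactions is the mean of the chosen machine index at each step across the 500 runs, so if it flattens out near the index of the best machine, most runs have locked onto that arm:

plot(averageactions)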

Constant Step Size

Instead of the 1/countFunction(action) step size (a sample average), here we update with a fixed step size of 0.05. This weights recent rewards more heavily than old ones, which is handy when the machines' payouts drift over time.
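
For reference, this is the standard constant-step-size update (writing \(\alpha\) for the step size, which is 0.05 in the code below); unrolling it shows that old rewards are discounted geometrically:

\[\begin{aligned}
Q_{n+1} &= Q_n + \alpha\,(R_n - Q_n) \\
        &= (1-\alpha)^n Q_1 + \sum_{i=1}^{n} \alpha\,(1-\alpha)^{n-i} R_i
\end{aligned}\]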

averagerewardAverage <- vector()
averageactions <- vector()

for(w in 1:500) {
  # Setup our functions
  # Number of pokies
  k <- 10
  valueFunction <- rep(0, k)
  countFunction <- rep(0, k)

  # Number of steps per run
  nSteps <- 1000
  rewardTotal <- vector()
  rewardAverage <- vector()
  actions <- vector()
  # 10% exploration, as in the first experiment
  epsilon <- 0.1
  for(x in 1:nSteps) {
    if(runif(1) < epsilon) {
      action <- sample(1:k, size = 1)
    } else {
      action <- which.max(valueFunction)
    }
    actions[x] <- action
    # reward <- bandit(action)
    reward <- rnorm(1, mean = rewardMeans[action])
    rewardTotal[x] <- reward
    rewardAverage[x] <- mean(rewardTotal)
    countFunction[action] <- countFunction[action] + 1
    # Constant step size of 0.05 instead of 1/countFunction(action)
    valueFunction[action] <- valueFunction[action] + 0.05 * (reward - valueFunction[action])
  }

  averagerewardAverage <- rbind(averagerewardAverage, rewardAverage)
  averageactions <- rbind(averageactions, actions)
}

averagerewardAverage <- colMeans(averagerewardAverage)
averageactions <- colMeans(averageactions)

Let us plot our results

plot(averagerewardAverage)
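
Because each experiment overwrites averagerewardAverage, comparing the three methods on one chart means snapshotting each learning curve before moving on. The variable names below are my own, not from the code above; it is just a sketch of how you might do it:

# Run each experiment, then save its curve before starting the next one:
#   sampleAverageCurve <- averagerewardAverage   # after the epsilon-greedy block
#   optimisticCurve    <- averagerewardAverage   # after the optimistic block
#   constantStepCurve  <- averagerewardAverage   # after the constant-step block
plot(sampleAverageCurve, type = "l", xlab = "Step", ylab = "Average reward")
lines(optimisticCurve, col = "red")
lines(constantStepCurve, col = "blue")
legend("bottomright", c("epsilon-greedy", "optimistic greedy", "constant step size"),
       col = c("black", "red", "blue"), lty = 1)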

Conclusion

Pretty cool. Next post will be on Markov Decision Processes.

Sutton, Richard S., and Andrew G. Barto. 1998. Reinforcement Learning: An Introduction. Vol. 1. Cambridge, MA: MIT Press.
