Finite Markov Decision Processes: Chapter 3 IRL

Introduction

In a Markov Decision Process you have:

* Agent: The decision maker / learner. The agent sends an action to the environment.
* Environment: Everything that is not the agent. The environment sends a reward (and the next state) back to the agent.
* Reward: The signal that the agent tries to maximize.
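The interaction between the three pieces can be sketched as a loop. This is only an illustration: `toyEnvironmentStep`, `toyPolicy`, and the rule "action 2 earns a reward of 1" are made-up names and assumptions, not part of the grid world used later.

```r
set.seed(1)

# Hypothetical environment: takes the current state and an action,
# returns the next state and a reward (assumption: action 2 is rewarded).
toyEnvironmentStep <- function(state, action) {
  reward <- if (action == 2) 1 else 0
  list(state = state + 1, reward = reward)
}

# Hypothetical agent: picks one of two actions uniformly at random.
toyPolicy <- function(state) sample.int(2, 1)

toyState <- 0
totalReward <- 0
for (t in 1:10) {
  action <- toyPolicy(toyState)                     # agent -> environment
  outcome <- toyEnvironmentStep(toyState, action)   # environment -> agent
  totalReward <- totalReward + outcome$reward
  toyState <- outcome$state
}
```

After the loop, `totalReward` holds the sum of the rewards the agent collected over the ten steps.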

Example GridWorld

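Let's say we have a 5x5 grid, with points written as (row, column). There are four possible actions: left, right, up, and down. If you are at point A, (1,2), and move in any direction, you receive a reward of 10 and are moved to the point (5,2). We can include another point, B, at (1,4): moving in any direction from B gives a reward of 5 and moves you to the point (3,4). In the code below, a move that would leave the grid keeps you where you are and gives a reward of -1; every other move gives a reward of 0.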

Value Function / Solving The Bellman Equation

The Bellman equation for the state-value function under a policy \(\pi\):

\[\begin{equation} v_{\pi}(s) = \sum_{a} \pi(a|s)\sum_{s',r}p(s',r|s,a)[r+\gamma v_{\pi}(s')] \end{equation}\]
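As a worked special case, assume the policy picks each of the four actions with probability 1/4 and that every state-action pair leads to exactly one next state \(s'_a\) with one reward \(r_a\) (both assumptions match the grid world described above; the subscripted symbols are notation introduced here). The inner sum then collapses to a single term per action:

\[ v_{\pi}(s) = \sum_{a} \frac{1}{4}\,[r_a + \gamma v_{\pi}(s'_a)] \]

For point A at (1,2), all four actions give a reward of 10 and lead to (5,2), so with \(\gamma = 0.9\):

\[ v_{\pi}(1,2) = 4 \cdot \frac{1}{4}\,[10 + 0.9\, v_{\pi}(5,2)] = 10 + 0.9\, v_{\pi}(5,2) \]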

gridWorld <- matrix(data=0,nrow=5, ncol=5)
state <- c(1,1)
nextState <- c(1,1)
discountRate <- 0.9

policyFunction <- function(state) {
  # action: 1=left,2=right,3=up,4=down
  # sample.int takes a single integer n and draws from 1..n
  action <- sample.int(4, 1)
  return(action)
}

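stateTransitionFunction <- function(state, action) {
  # state is c(row, column); the result may lie off the grid,
  # which stateRewardFunction handles afterwards
  if(all(state == c(1,2))) {
    state <- c(5,2)  # point A: any action jumps to (5,2)
  } else if(all(state == c(1,4))) {
    state <- c(3,4)  # point B: any action jumps to (3,4)
  } else if(action==1) {
    state[2] <- state[2] - 1  # left: column - 1
  } else if(action==2) {
    state[2] <- state[2] + 1  # right: column + 1
  } else if(action==3) {
    state[1] <- state[1] - 1  # up: row - 1
  } else if(action==4) {
    state[1] <- state[1] + 1  # down: row + 1
  }
  return(state)
}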

# Returns c(reward, row, column), where (row, column) is the state actually reached
stateRewardFunction <- function(state, nextState) {
  if(nextState[1] < 1 || nextState[1] > 5 || nextState[2] < 1 || nextState[2] > 5) {
    return(c(-1, state))
  } else if(all(nextState == c(5,2)) && all(state == c(1,2))) {
    return(c(10, nextState))
  } else if(all(nextState == c(3,4)) && all(state == c(1,4))) {
    return(c(5, nextState))
  } else {
    return(c(0, nextState))
  }
}

valueFunction <- function(state) {
  # One Bellman backup for `state` under the uniform random policy,
  # reading the current estimates from the global stateValue
  value <- 0
  for(i in 1:4) {
    candidate <- stateTransitionFunction(state, i)
    outcome <- stateRewardFunction(state, candidate)  # c(reward, row, column)
    reward <- outcome[1]
    nextState <- outcome[2:3]
    s <- (nextState[1] - 1) * 5 + nextState[2]
    # Calling valueFunction(nextState) here would recurse forever, because each
    # value is defined in terms of other values. Use the current estimate instead.
    value <- value + 0.25 * (reward + discountRate * stateValue[s])
  }
  return(value)
}

stateValue <- matrix(data=0, nrow=25)
# Iterative policy evaluation: rather than solving the 25 linear equations
# directly, sweep over all states repeatedly (100 sweeps here)
for(x in 1:100) {
  for(s in 1:25) {
    state <- c((s-1) %/% 5 + 1, (s-1) %% 5 + 1)
    stateValue[s] <- valueFunction(state)
  }
}
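Because the Bellman equation is linear in the values, the 25 equations can also be handed to R's `solve()` instead of being iterated. The following is a minimal, self-contained sketch under the same assumptions as the code above (uniform random policy, discount 0.9, reward -1 for leaving the grid). The names `P`, `r`, `stateIndex`, `moves`, and `valueGrid` are introduced here for illustration. It builds the transition matrix `P` and expected-reward vector `r` under the policy, then solves (I - gamma * P) v = r.

```r
nRows <- 5; nCols <- 5
nStates <- nRows * nCols
gamma <- 0.9  # assumed equal to discountRate in the code above

# Row-major index, same convention as the post: s = (row - 1) * 5 + col
stateIndex <- function(row, col) (row - 1) * nCols + col

# Row/column offsets for up, down, left, right
moves <- list(c(-1, 0), c(1, 0), c(0, -1), c(0, 1))

P <- matrix(0, nStates, nStates)  # P[s, s2]: probability of s -> s2 under the policy
r <- numeric(nStates)             # r[s]: expected immediate reward in s

for (row in 1:nRows) {
  for (col in 1:nCols) {
    s <- stateIndex(row, col)
    for (move in moves) {
      if (row == 1 && col == 2) {          # point A: jump to (5,2), reward 10
        nextRow <- 5; nextCol <- 2; reward <- 10
      } else if (row == 1 && col == 4) {   # point B: jump to (3,4), reward 5
        nextRow <- 3; nextCol <- 4; reward <- 5
      } else {
        nextRow <- row + move[1]; nextCol <- col + move[2]; reward <- 0
        if (nextRow < 1 || nextRow > nRows || nextCol < 1 || nextCol > nCols) {
          nextRow <- row; nextCol <- col; reward <- -1  # off-grid: stay put
        }
      }
      s2 <- stateIndex(nextRow, nextCol)
      P[s, s2] <- P[s, s2] + 0.25      # each action has probability 1/4
      r[s] <- r[s] + 0.25 * reward
    }
  }
}

# Solve (I - gamma * P) v = r for the state values
v <- solve(diag(nStates) - gamma * P, r)
valueGrid <- matrix(v, nrow = nRows, byrow = TRUE)
print(round(valueGrid, 1))
```

Reshaping with `byrow = TRUE` lays the values out like the grid itself, so `valueGrid[1, 2]` is the value of point A. The same reshaping, `matrix(stateValue, nrow = 5, byrow = TRUE)`, makes the iterative result easier to compare against.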
