Introduction
In a Markov Decision Process (MDP) you have:
* Agent: The decision maker / learner. The agent sends an action to the environment.
* Environment: Everything that is not the agent. The environment sends a reward (and the next state) back to the agent.
* Reward: The signal that the agent tries to maximize.
Example GridWorld
Let's say we have a 5x5 grid. There are four possible actions: left, right, up, and down. If you are at point A, (1,2), any action gives you a reward of 10 and moves you to the point (5,2). We can include another point, B, at (1,4): any action taken there gives you a reward of 5 and moves you to the point (3,4). In the code below, a move that would leave the grid keeps you in place with a reward of -1, and every other move gives a reward of 0.
Value Function / Solving The Bellman Equation
The equation:
\[\begin{equation} v_{\pi}(s) = \sum_{a} \pi(a|s)\sum_{s',r}p(s',r|s,a)[r+\gamma v_{\pi}(s')] \end{equation}\]
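As a sketch of how this applies to the grid: the code uses an equiprobable random policy (each of the four actions has probability 0.25) and every move has a single deterministic outcome, so the inner sum collapses to one term per action. Writing \(r(s,a)\) and \(s'(s,a)\) for the reward and successor state of action \(a\) in state \(s\) (shorthand introduced here, not part of the original notation), the equation becomes:
\[ v_{\pi}(s) = \sum_{a} \frac{1}{4}\left[ r(s,a) + \gamma v_{\pi}(s'(s,a)) \right] \]
For instance, at the point (1,2) all four actions give a reward of 10 and lead to (5,2), so:
\[ v_{\pi}((1,2)) = 10 + \gamma v_{\pi}((5,2)) \]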
gridWorld <- matrix(data=0,nrow=5, ncol=5)
state <- c(1,1)
nextState <- c(1,1)
discountRate <- 0.9
policyFunction <- function(state) {
  # Equiprobable random policy; action: 1=left, 2=right, 3=up, 4=down
  # sample.int() takes the number of items to draw from, not a vector
  action <- sample.int(4, 1)
  return(action)
}
stateTransitionFunction <- function(state, action) {
  # Special points: any action jumps from (1,2) to (5,2) and from (1,4) to (3,4)
  if(all(state == c(1,2))) {
    state <- c(5,2)
  } else if(all(state == c(1,4))) {
    state <- c(3,4)
  } else if(action==1) {
    state[1] <- state[1] - 1
  } else if(action==2) {
    state[1] <- state[1] + 1
  } else if(action==3) {
    state[2] <- state[2] - 1
  } else if(action==4) {
    state[2] <- state[2] + 1
  }
  # The result may lie off the grid; stateRewardFunction handles that case
  return(state)
}
stateRewardFunction <- function(state, nextState) {
  # Returns c(reward, next state)
  if(nextState[1] < 1 || nextState[1] > 5 || nextState[2] < 1 || nextState[2] > 5) {
    # Off the grid: stay in place with a reward of -1
    return(c(-1, state))
  } else if(all(nextState == c(5,2)) && all(state == c(1,2))) {
    return(c(10, nextState))
  } else if(all(nextState == c(3,4)) && all(state == c(1,4))) {
    return(c(5, nextState))
  } else {
    return(c(0, nextState))
  }
}
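valueFunction <- function(state) {
  # One Bellman backup for `state` under the equiprobable random policy,
  # reading successor values from the global vector stateValue
  value <- 0
  for(i in 1:4) {
    nextState <- stateTransitionFunction(state, i)
    # outcome = c(reward, next state); off-grid moves are mapped back to `state`
    outcome <- stateRewardFunction(state, nextState)
    reward <- outcome[1]
    nextState <- outcome[2:3]
    s <- (nextState[1] - 1) * 5 + nextState[2]
    # Calling valueFunction(nextState) recursively does not work: the states
    # depend on each other in cycles, so the recursion never terminates
    # value <- value + 0.25 * (reward + discountRate * valueFunction(nextState))
    value <- value + 0.25 * (reward + discountRate * stateValue[s])
  }
  return(value)
}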
stateValue <- matrix(data=0, nrow=25)
# Need to iterate, as I don't know how to solve the system of linear equations directly on the computer
for(x in 1:100) {
  for(s in 1:25) {
    # Decode the index s back into a grid position, the inverse of s = (state[1] - 1) * 5 + state[2]
    state <- c((s-1) %/% 5 + 1, (s-1) %% 5 + 1)
    stateValue[s] <- valueFunction(state)
  }
}
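The system of linear equations mentioned in the comment above can also be solved directly. The following is a minimal sketch, not a replacement for the iteration: it assumes the same equiprobable random policy, the same rewards, and the same state numbering, and it stacks the 25 Bellman equations as (I - gamma * P) v = r before handing them to base R's `solve()`. The names `stepFunction`, `P`, `r`, and `v` are introduced here for illustration and do not appear in the original code.

```r
# Sketch: solve the 25 Bellman equations as one linear system.
# Assumes the equiprobable random policy and the state numbering
# s = (state[1] - 1) * 5 + state[2] used above.
gamma <- 0.9
nStates <- 25

# Hypothetical helper: returns c(reward, index of successor state)
# for a (state, action) pair, mirroring the rules of the grid.
stepFunction <- function(state, action) {
  if (all(state == c(1, 2))) return(c(10, (5 - 1) * 5 + 2))  # (1,2) -> (5,2)
  if (all(state == c(1, 4))) return(c(5, (3 - 1) * 5 + 4))   # (1,4) -> (3,4)
  # Same coordinate changes as actions 1..4 in stateTransitionFunction
  moves <- rbind(c(-1, 0), c(1, 0), c(0, -1), c(0, 1))
  nextState <- state + moves[action, ]
  if (any(nextState < 1) || any(nextState > 5)) {
    return(c(-1, (state[1] - 1) * 5 + state[2]))  # off the grid: stay put
  }
  c(0, (nextState[1] - 1) * 5 + nextState[2])
}

P <- matrix(0, nrow = nStates, ncol = nStates)  # transition probabilities under the policy
r <- numeric(nStates)                           # expected one-step reward per state
for (s in 1:nStates) {
  state <- c((s - 1) %/% 5 + 1, (s - 1) %% 5 + 1)
  for (a in 1:4) {
    out <- stepFunction(state, a)
    r[s] <- r[s] + 0.25 * out[1]
    P[s, out[2]] <- P[s, out[2]] + 0.25
  }
}

# v = r + gamma * P v  is equivalent to  (I - gamma * P) v = r
v <- solve(diag(nStates) - gamma * P, r)

# Show the values laid out on the grid, one row per value of state[1]
print(round(matrix(v, nrow = 5, byrow = TRUE), 1))
```

If both approaches are run, `v` and `stateValue` should agree closely, since the iteration converges to the solution of this same system.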