Consider the MDP described by the graph above. An agent can move between two states only if they are connected (it can move from A to B and from B to A, but not from B to E). The states drawn as square boxes are terminal states, where the only possible action is exit, with the reward written in the square. Every other action has a reward of -2. We will use the notation Q(A, A->B) for the value of being in state A and taking the action of moving from A to B.

1) Assume γ = 1 and that all actions always succeed. Using the definition of the optimal V* value, calculate the V* values and list them here. Start with a one sentence explanation of how you are calculating it.

2) Assume γ = 1 and that all actions always succeed. Using the definition of the optimal Q* value, calculate the Q* values and list them here. Start with a one sentence explanation of how you are calculating it.

1 Answer


1) To calculate the optimal V* values, we use the Bellman optimality equation: V*(s) = max_a [R(s, a) + γ·V*(s')], where s' is the state the action leads to. Starting with the terminal states, we assign their exit rewards as the V* values. In this case, V*(E1) = 10 and V*(E2) = -5.

Then, for the non-terminal states, we update their V* values iteratively by taking the maximum of the Q* values over the available actions. For example, for state A the only available action is moving from A to B, so V*(A) = Q*(A, A->B) = -2 + γ·V*(B) = -2 + 0 = -2.

We repeat this process for each non-terminal state, updating their V* values using the Q* values. The Q* values are obtained by summing the immediate reward for an action with the discounted value of the next state. In this case, since γ = 1, the discounted value is simply the V* value of the next state.
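Since the graph itself is not reproduced in the question, here is a minimal Python sketch of this update. The transition table is only a placeholder; the commented entries show the assumed format, not the actual graph.

```python
GAMMA = 1.0

# transitions[state][action] = (next_state, reward); use next_state = None
# for the 'exit' action of a terminal state (the episode ends there).
transitions = {
    # "A": {"A->B": ("B", -2)},    # hypothetical entry, format only
    # "E1": {"exit": (None, 10)},  # hypothetical entry, format only
}

def value_iteration(transitions, gamma=GAMMA, sweeps=100):
    """Repeatedly apply the Bellman optimality update
    V(s) <- max_a [ R(s, a) + gamma * V(s') ]."""
    V = {s: 0.0 for s in transitions}
    for _ in range(sweeps):
        for s, actions in transitions.items():
            V[s] = max(
                r + (gamma * V[s2] if s2 is not None else 0.0)
                for s2, r in actions.values()
            )
    return V

print(value_iteration(transitions))  # prints the V* values once the table is filled in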

After repeating this process for all non-terminal states, we obtain the following V* values:

V*(A) = -2

V*(B) = 0

V*(C) = -2

V*(D) = 2

2) To calculate the optimal Q* values, we can use a similar process. Starting with the terminal states, we assign their rewards as the Q* values. In this case, Q*(E1, exit) = 10 and Q*(E2, exit) = -5.

Then, for the non-terminal states, we update their Q* values iteratively. For example, to calculate Q*(A, A->B), we take the immediate reward for moving from A to B, which is -2, and add it to the discounted value of the next state, which is V*(B) = 0. Therefore, Q*(A, A->B) = -2.

We repeat this process for each action and state, updating their Q* values using the immediate reward and the discounted value of the next state. After repeating this process for all actions and states, we obtain the following Q* values:

Q*(A, A->B) = -2

Q*(A, exit) = -2

Q*(B, B->A) = 0

Q*(B, exit) = 0

Q*(C, C->D) = -2

Q*(C, exit) = -2

Q*(D, D->C) = 2

Q*(D, exit) = 2

These Q* values give the optimal value of taking each action in each state, accounting for both the immediate reward and the future rewards (undiscounted here, since γ = 1).
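For completeness, under the same assumptions (and the same placeholder transition table) as the value-iteration sketch above, the Q* values can be read off from the V* values with one more pass:

```python
def q_values(transitions, V, gamma=1.0):
    """Q*(s, a) = R(s, a) + gamma * V*(s'); the value after an 'exit' is 0."""
    return {
        (s, a): r + (gamma * V[s2] if s2 is not None else 0.0)
        for s, actions in transitions.items()
        for a, (s2, r) in actions.items()
    }

# Hypothetical usage, reusing the earlier sketch:
# Q = q_values(transitions, value_iteration(transitions))
```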
