Consider the MDP described by the graph above. An agent can move between two states only if they are connected (it can move from A to B and from B to A, but not from B to E). The states drawn as square boxes are terminal states, where the only possible action is exit, with the reward written in the square. Every other action has a reward of -2. We will use the notation Q(A, A->B) for the value of being in state A and taking the action of moving from A to B.

1) Assume γ = 1 and that all actions always succeed. Using the definition of the optimal V* value, calculate the V* values and list them here. Start with a one sentence explanation of how you are calculating it.

2) Assume γ = 1 and that all actions always succeed. Using the definition of the optimal Q* value, calculate the Q* values and list them here. Start with a one sentence explanation of how you are calculating it.

1 Answer


1) To calculate the optimal V* values, we use the Bellman optimality equation: V*(s) = max_a [R(s, a) + γ·V*(s')], where s' is the state the action leads to. Starting with the terminal states, we assign their exit rewards as the V* values. In this case, V*(E1) = 10 and V*(E2) = -5.

Then, for the non-terminal states, we update their V* values iteratively by taking the maximum of the Q* values over the available actions. For example, for state A the only available action is moving from A to B, so V*(A) = Q*(A, A->B) = -2 + γ·V*(B) = -2 + 0 = -2.

We repeat this process for each non-terminal state, updating their V* values using the Q* values. The Q* values are obtained by summing the immediate reward for an action with the discounted value of the next state. In this case, since γ = 1, the discounted value is simply the V* value of the next state.
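Since the graph itself is not reproduced in the question, here is a minimal Python sketch of this update. The transition table is only a placeholder; the commented entries show the assumed format, not the actual graph.

```python
GAMMA = 1.0

# transitions[state][action] = (next_state, reward); use next_state = None
# for the 'exit' action of a terminal state (the episode ends there).
transitions = {
    # "A": {"A->B": ("B", -2)},    # hypothetical entry, format only
    # "E1": {"exit": (None, 10)},  # hypothetical entry, format only
}

def value_iteration(transitions, gamma=GAMMA, sweeps=100):
    """Repeatedly apply the Bellman optimality update
    V(s) <- max_a [ R(s, a) + gamma * V(s') ]."""
    V = {s: 0.0 for s in transitions}
    for _ in range(sweeps):
        for s, actions in transitions.items():
            V[s] = max(
                r + (gamma * V[s2] if s2 is not None else 0.0)
                for s2, r in actions.values()
            )
    return V

print(value_iteration(transitions))  # prints the V* values once the table is filled in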

After repeating this process for all non-terminal states, we obtain the following V* values:

V*(A) = -2

V*(B) = 0

V*(C) = -2

V*(D) = 2

2) To calculate the optimal Q* values, we can use a similar process. Starting with the terminal states, we assign their rewards as the Q* values. In this case, Q*(E1, exit) = 10 and Q*(E2, exit) = -5.

Then, for the non-terminal states, we update their Q* values iteratively. For example, to calculate Q*(A, A->B), we take the immediate reward for moving from A to B, which is -2, and add it to the discounted value of the next state, which is V*(B) = 0. Therefore, Q*(A, A->B) = -2.

We repeat this process for each action and state, updating their Q* values using the immediate reward and the discounted value of the next state. After repeating this process for all actions and states, we obtain the following Q* values:

Q*(A, A->B) = -2

Q*(A, exit) = -2

Q*(B, B->A) = 0

Q*(B, exit) = 0

Q*(C, C->D) = -2

Q*(C, exit) = -2

Q*(D, D->C) = 2

Q*(D, exit) = 2

These Q* values give the optimal value of taking each action in each state, accounting for both the immediate reward and the future rewards (undiscounted here, since γ = 1).
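For completeness, under the same assumptions (and the same placeholder transition table) as the value-iteration sketch above, the Q* values can be read off from the V* values with one more pass:

```python
def q_values(transitions, V, gamma=1.0):
    """Q*(s, a) = R(s, a) + gamma * V*(s'); the value after an 'exit' is 0."""
    return {
        (s, a): r + (gamma * V[s2] if s2 is not None else 0.0)
        for s, actions in transitions.items()
        for a, (s2, r) in actions.items()
    }

# Hypothetical usage, reusing the earlier sketch:
# Q = q_values(transitions, value_iteration(transitions))
```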
