1) To calculate the optimal V* values, we can use the Bellman equation. Starting with the terminal states, we assign their rewards as the V* values. In this case, V*(E1) = 10 and V*(E2) = -5.
Then, for the non-terminal states, we update their V* values iteratively. For example, to calculate V*(A), we take the maximum of the Q* values over the actions available from A. Both moving from A to B and exiting from A have a Q* value of -2 (the move's reward of -2 plus V*(B) = 0), so V*(A) = -2.
We repeat this process for each non-terminal state, updating its V* value from the Q* values of its actions. Each Q* value is obtained by adding the immediate reward for an action to the discounted value of the next state. In this case, since γ = 1, the discounted value is simply the V* value of the next state.
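Written out, the two relations used in this process are the Bellman optimality equations for a deterministic MDP (here R(s, a) denotes the immediate reward for taking action a in state s, and s' the resulting state; this notation is an illustrative choice, since the original problem statement is not reproduced here):

\[
V^*(s) = \max_a Q^*(s, a),
\qquad
Q^*(s, a) = R(s, a) + \gamma \, V^*(s')
\]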
After repeating this process for all non-terminal states, we obtain the following V* values:
V*(A) = -2
V*(B) = 0
V*(C) = -2
V*(D) = 2
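For reference, the same computation can be sketched as a short value-iteration loop. This is only a sketch: the problem's full transition and reward table is not reproduced in the text, so the MDP is passed in as a dict rather than hard-coded, and the function name and spec format are assumptions made for illustration.

```python
# Minimal value-iteration sketch for a deterministic MDP of the kind described above.
# mdp[state][action] = (reward, next_state); exit actions point at a terminal state
# whose value is fixed in terminal_values. The dict format is an illustrative choice.
def value_iteration(mdp, terminal_values, gamma=1.0, iterations=100):
    V = {s: 0.0 for s in mdp}  # start all non-terminal values at 0
    for _ in range(iterations):
        V = {
            s: max(
                r + gamma * terminal_values.get(s2, V.get(s2, 0.0))
                for r, s2 in actions.values()  # Q*(s, a) = R(s, a) + gamma * V(s')
            )
            for s, actions in mdp.items()
        }
    return V

# Terminal values stated above; the moves and rewards would be filled in from the
# problem's diagram (e.g. A->B with reward -2), which is not reproduced here.
terminals = {"E1": 10.0, "E2": -5.0}
# mdp = {"A": {"A->B": (-2, "B"), ...}, ...}
# print(value_iteration(mdp, terminals))
```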
2) To calculate the optimal Q* values, we can use a similar process. Starting with the terminal states, we assign their rewards as the Q* values. In this case, Q*(E1, exit) = 10 and Q*(E2, exit) = -5.
Then, for the non-terminal states, we update their Q* values iteratively. For example, to calculate Q*(A, A->B), we take the immediate reward for moving from A to B, which is -2, and add it to the discounted value of the next state, which is V*(B) = 0. Therefore, Q*(A, A->B) = -2.
We repeat this process for each action and state, updating their Q* values using the immediate reward and the discounted value of the next state. After repeating this process for all actions and states, we obtain the following Q* values:
Q*(A, A->B) = -2
Q*(A, exit) = -2
Q*(B, B->A) = 0
Q*(B, exit) = 0
Q*(C, C->D) = -2
Q*(C, exit) = -2
Q*(D, D->C) = 2
Q*(D, exit) = 2
These Q* values represent the optimal values for each action in each state, accounting for both the immediate rewards and the discounted future rewards.
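The Q* table can be recovered from V* with the same kind of sketch, reusing the hypothetical mdp / terminal_values format assumed in the value-iteration sketch above:

```python
# Sketch: Q*(s, a) = R(s, a) + gamma * V*(s'), with terminal values held fixed.
def q_values(mdp, V, terminal_values, gamma=1.0):
    return {
        (s, a): r + gamma * terminal_values.get(s2, V.get(s2, 0.0))
        for s, actions in mdp.items()
        for a, (r, s2) in actions.items()
    }
```

Taking the maximum of these Q* values over each state's actions gives back the V* values from part 1.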