Consider a simple MDP with discount factor γ = 1. The MDP has three states (x, y, and z) with rewards -1, -2, and 0, respectively; state z is a terminal state. In states x and y there are two possible actions, a1 and a2, with the following transition model:

- In state x, action a1 moves the agent to state y with probability 0.85 and leaves it in place with probability 0.15.
- In state y, action a1 moves the agent to state x with probability 0.85 and leaves it in place with probability 0.15.
- In either state x or state y, action a2 moves the agent to state z with probability 0.15 and leaves it in place with probability 0.85.

Draw a picture of the MDP.
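Before drawing, it can help to write the transition model down explicitly. A minimal Python sketch of the MDP from the question (the variable names here are my own, not part of the question):

```python
# Encode the MDP from the question: rewards per state, terminal set,
# and transition probabilities for each (state, action) pair.
GAMMA = 1.0
REWARDS = {"x": -1, "y": -2, "z": 0}
TERMINALS = {"z"}

# transitions[state][action] = {next_state: probability}
transitions = {
    "x": {
        "a1": {"y": 0.85, "x": 0.15},
        "a2": {"z": 0.15, "x": 0.85},
    },
    "y": {
        "a1": {"x": 0.85, "y": 0.15},
        "a2": {"z": 0.15, "y": 0.85},
    },
}

# Sanity check: the outgoing probabilities for every action sum to 1.
for state, actions in transitions.items():
    for action, dist in actions.items():
        assert abs(sum(dist.values()) - 1.0) < 1e-9
```

Each entry of this table corresponds to one labelled arrow in the drawing, so the picture can be checked edge-by-edge against the dictionary.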

asked by Yalei Du

1 Answer


Final answer:

The Markov Decision Process described involves states x, y, and z with rewards -1, -2, and 0, actions a1 and a2, and transition probabilities between the states as well as to a terminal state z. The student should draw the states with arrows depicting the actions and transition probabilities, associating each state with its specific reward.

Step-by-step explanation:

The student is asking about a Markov Decision Process (MDP), which is a mathematical framework for modeling sequential decision-making situations where outcomes are partly random and partly under the control of a decision maker. The MDP is defined by states, actions, a transition model, and rewards. In this instance, there are three states (x, y, and z), two actions (a1 and a2), and associated rewards and transition probabilities. A terminal state (z) is a state that ends the process.

For the MDP described, the drawing shows states x, y, and z as nodes, with a directed edge for each action labelled with its transition probability. From state x, action a1 has an arrow to state y labelled 0.85 and a self-loop on x labelled 0.15; the symmetric arrangement holds for action a1 in state y. From both x and y, action a2 has an arrow to the terminal state z labelled 0.15 and a self-loop labelled 0.85. Finally, each state is annotated with its reward: -1 for x, -2 for y, and 0 for z.
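One way to produce such a picture programmatically is to emit Graphviz DOT text from the transition table and render it with any DOT viewer. A sketch of this idea, where the function name and layout choices are illustrative assumptions rather than anything from the question:

```python
# Transition table and rewards for the MDP in the question.
transitions = {
    ("x", "a1"): {"y": 0.85, "x": 0.15},
    ("x", "a2"): {"z": 0.15, "x": 0.85},
    ("y", "a1"): {"x": 0.85, "y": 0.15},
    ("y", "a2"): {"z": 0.15, "y": 0.85},
}
rewards = {"x": -1, "y": -2, "z": 0}

def mdp_to_dot(transitions, rewards):
    """Build a Graphviz DOT string: one node per state (labelled with
    its reward) and one edge per (state, action, next-state) triple,
    labelled with the action and its probability."""
    lines = ["digraph MDP {"]
    for state, r in rewards.items():
        lines.append(f'  {state} [label="{state}\\nR={r}"];')
    for (state, action), dist in transitions.items():
        for nxt, p in dist.items():
            lines.append(f'  {state} -> {nxt} [label="{action}: {p}"];')
    lines.append("}")
    return "\n".join(lines)

print(mdp_to_dot(transitions, rewards))
```

Feeding the printed text to `dot -Tpng` (or any online Graphviz renderer) yields exactly the diagram described above: two self-looping states x and y connected by a1 edges, with a2 edges from each into the terminal state z.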

answered by DeJaVo