Recall from lecture the value iteration update rule: Vk+1(s) = max a[Σs'T(s, a, s') (R(s, a, s') + lamdaVk*(s'))] where Vk(s) is the expected reward from sta…

Question

asked Sep 24, 2024 25.0k views

Recall from lecture the value iteration update rule:

$$V_{k+1}^*(s) = \max_a \left[ \sum_{s'} T(s, a, s') \left( R(s, a, s') + \lambda V_k^*(s') \right) \right]$$

where Vk*(s) is the expected reward from state s after acting optimally for k steps.

Recall the example discussed in the lecture.

[Figure: one-dimensional grid of 5 cells, labeled with the agent's starting state and the +1 reward.]

An agent is trying to navigate a one-dimensional grid consisting of 5 cells. At each step, the agent has only one action to choose from, i.e., it moves to the cell on the immediate right.

Note: The reward function is defined to be R(s, a, s') = R(s), with R(s = 5) = 1 and R(s) = 0 otherwise. Note that we get the reward when we are leaving the current state. When the agent reaches the rightmost cell, it stays for one more time step, then receives a reward of +1 and comes to a halt.

Let V*(i) denote the value function of state i, the i-th cell starting from the left.

Let Vk* (i) denote the value function estimate at state i at the kth step of the value iteration algorithm. Let V0* (i) denote the initialization of this estimate.

Use the discount factor λ = 0.5.

We will write the functions Vk* as arrays below, i.e. as [Vk* (1) Vk* (2) Vk* (3) Vk* (4) Vk* (5)].

Initialize by setting V0* (i) = 0 for all i:

V0* = [0 0 0 0 0]

Then, using the value iteration update rule, we get

V1* = [0 0 0 0 1]

V2* = [0 0 0 0.5 1]

Note that as soon as the agent takes the first action to reach cell 5, it stays for one more step, halts, and does not take any more actions, so we set Vk*(5) = V1*(5) = 1 for all k > 1.
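The two updates above can be reproduced with a few lines of code. The following is a minimal sketch, not the lecture's implementation: the function name `chain_value_iteration` and the 0-indexed list layout are assumptions made for illustration. It applies the update rule to the single "move right" action, uses R(s) = 0 for cells 1 to 4, and pins the value of cell 5 at 1 as described in the note.

```python
def chain_value_iteration(n=5, lam=0.5, iters=2):
    """Value iteration on the deterministic 'always move right' grid.

    Hypothetical helper: cells are 0-indexed, so index n-1 is cell 5.
    Returns the list [V0*, V1*, ..., V_iters*].
    """
    V = [0.0] * n              # V0*(i) = 0 for all i
    history = [V]
    for _ in range(iters):
        # Cells 1..4: reward 0 on leaving, then discounted value of the right neighbor.
        new_V = [lam * V[s + 1] for s in range(n - 1)]
        # Cell 5: the agent collects +1 and halts, so its value is fixed at 1.
        new_V.append(1.0)
        V = new_V
        history.append(V)
    return history


if __name__ == "__main__":
    for k, V in enumerate(chain_value_iteration()):
        print(f"V{k}* = {V}")
```

Each further iteration pushes the discounted reward one cell to the left, which is why only cell 4 changes between V1* and V2*.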

Another Example of Value Iteration (Software Implementation)

Consider the same one-dimensional grid with reward values as in the first few problems in this vertical. However, consider the following change to the transition probabilities: At any given grid location, the agent can choose to either stay at the location or move to an adjacent grid location. If the agent chooses to stay at the location, such an action is successful with probability 1/2, and

- if the agent is at the leftmost or rightmost grid location, it ends up at its neighboring grid location with probability 1/2,

- if the agent is at any of the inner grid locations, it has a probability of 1/4 each of ending up at either of the neighboring locations.

If the agent chooses to move (either left or right) at any of the inner grid locations, such an action is successful with probability 1/3, and with probability 2/3 it fails to move, and

- if the agent chooses to move left at the leftmost grid location, then the outcome is exactly the same as choosing to stay, i.e., it stays at the leftmost grid location with probability 1/2 and ends up at its neighboring grid location with probability 1/2,

- if the agent chooses to move right at the rightmost grid location, then the outcome is exactly the same as choosing to stay, i.e., it stays at the rightmost grid location with probability 1/2 and ends up at its neighboring grid location with probability 1/2.

Let λ = 0.5.

Run the value iteration algorithm for 100 iterations. Use any computational software of your choice.

Enter the value of V100* as an array [V100*(1) V100*(2) V100*(3) V100*(4) V100*(5)].

(For example, type [0,2,0,3,4] for the array [0 2 0 3 4]. Type at least 4 decimal digits.)

asked by MZH (8.0k points)

1 Answer


Mathias Vonende · Answer 1 · 2024-09-29T22:04:10+0000

Final answer:

The value iteration algorithm needs to be run for 100 iterations to solve the given problem. Starting with an initialization of the value function, the algorithm updates the values at each iteration using the value iteration update rule. The final value function can be entered as an array with decimal values.

Step-by-step explanation:

To solve the given problem, we need to run the value iteration algorithm for 100 iterations. Given the transition probabilities and reward function, we can iteratively update the value function using the value iteration update rule. This involves taking the maximum expected reward over all possible actions at each state and discounting the future rewards by the discount factor lambda.

Starting with the initialization V0* = [0, 0, 0, 0, 0], we can update the value function at each iteration until reaching 100 iterations. The final value function V100* will be an array containing the estimated values for each state.

Using any computational software of your choice, you can run the value iteration algorithm according to the given transition probabilities and reward function. The resulting value function V100* can then be entered as an array with at least 4 decimal digits for each value.
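As one way to carry out these steps, here is a minimal sketch in plain Python. It is an illustration under stated assumptions rather than a reference solution: the names `transition`, `value_iteration`, and `ACTIONS` are made up for this example, states are 0-indexed, and the reward R(s) is collected when leaving state s (1 for cell 5, 0 otherwise) with no halting. The problem statement does not spell out what happens when the agent moves right from the leftmost cell or left from the rightmost cell; the sketch assumes these behave like an ordinary move (success with probability 1/3, no movement with probability 2/3).

```python
ACTIONS = ("stay", "left", "right")  # hypothetical action labels


def transition(s, a, n=5):
    """Return {next_state: probability} for 0-indexed state s and action a."""
    into_left_wall = (s == 0 and a == "left")
    into_right_wall = (s == n - 1 and a == "right")
    if a == "stay" or into_left_wall or into_right_wall:
        if s == 0:
            return {0: 0.5, 1: 0.5}
        if s == n - 1:
            return {n - 1: 0.5, n - 2: 0.5}
        return {s: 0.5, s - 1: 0.25, s + 1: 0.25}
    # Ordinary move: succeeds with probability 1/3, otherwise the agent stays put.
    # Assumption: this also covers moving away from a wall at the end cells.
    target = s - 1 if a == "left" else s + 1
    return {target: 1 / 3, s: 2 / 3}


def value_iteration(n=5, lam=0.5, iters=100):
    """Run `iters` synchronous value iteration updates starting from V0* = 0."""
    reward = [0.0] * (n - 1) + [1.0]   # R(s): 1 for the rightmost cell, else 0
    V = [0.0] * n
    for _ in range(iters):
        V = [
            max(
                sum(p * (reward[s] + lam * V[s2])
                    for s2, p in transition(s, a, n).items())
                for a in ACTIONS
            )
            for s in range(n)
        ]
    return V


if __name__ == "__main__":
    print([round(v, 4) for v in value_iteration()])
```

Because the list comprehension builds the new array entirely from the previous `V`, every state is updated from the same iterate Vk*, which matches the update rule as written in the question.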
