Final answer:
The epsilon-greedy approach balances exploration and exploitation; the incremental-mean update reduces memory usage by avoiding storage of past observations; the Monte-Carlo (MC) and temporal-difference (TD) methods estimate utilities in different ways (complete returns versus bootstrapping); TD also reduces memory usage by updating after every step rather than only at the end of an episode; and the TD Q-value update can give different results under SARSA (on-policy) and Q-Learning (off-policy).
Step-by-step explanation:
- a. The epsilon-greedy approach balances exploration and exploitation: with probability epsilon the agent picks an action at random (exploration), and with probability 1 - epsilon it picks the action with the highest current value estimate (exploitation). This lets the agent keep trying alternative actions while still favouring the ones that have paid off so far. A minimal selection rule is sketched after this list.
- b. The incremental-mean method updates a utility estimate in place using the rule U_k = U_(k-1) + (1/k) * (x_k - U_(k-1)), i.e. it nudges the current estimate toward each new observation. Memory usage is reduced because only the current estimate and a sample count are stored, not the full history of observations; see the second sketch after this list.
- c. The Monte-Carlo (MC) method estimates utility by averaging the complete returns observed over whole episodes, so it can only update once an episode has finished. The temporal-difference (TD) method instead bootstraps: after every step it updates U(s) toward the target r + gamma * U(s'), combining the observed reward with the current estimate of the next state's utility. The two update rules are contrasted in the third sketch after this list.
- d. The TD method reduces memory usage because it updates utilities online, one transition at a time, using only the current estimate and the estimated utility of the next state. It therefore never has to store an episode's full sequence of states and rewards, and it can learn from incomplete episodes instead of waiting for the end of an episode to update all utility values.
- e. The TD Q-value update can give different results under SARSA and Q-Learning. SARSA is on-policy: its target uses Q(s', a'), where a' is the next action actually chosen by the behaviour policy (e.g. epsilon-greedy). Q-Learning is off-policy: its target uses the maximum Q-value over all actions in the next state, regardless of what the agent actually does next. This difference can lead to different learned policies and can affect convergence and the quality of the policy obtained; the two update rules are compared in the last sketch after this list.
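The sketch below illustrates the epsilon-greedy selection rule from part (a). The `q_values` dictionary, the action names, and the epsilon of 0.1 are made-up placeholders rather than values from the question.

```python
import random

def epsilon_greedy(q_values, epsilon=0.1):
    """Pick a random action with probability epsilon, otherwise the greedy one."""
    if random.random() < epsilon:
        return random.choice(list(q_values))    # explore: any action, uniformly
    return max(q_values, key=q_values.get)      # exploit: highest estimated value

# With epsilon = 0.1 the agent picks "b" roughly 90% of the time
# (plus the occasions when random exploration happens to land on "b").
print(epsilon_greedy({"a": 0.2, "b": 0.9, "c": 0.5}, epsilon=0.1))
```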
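Part (b) can be made concrete with the running-mean update below; the sample values 4, 8, 6 are arbitrary and only show that the incremental form reproduces the ordinary average while storing a single number.

```python
def incremental_mean_update(current_estimate, new_sample, count):
    """U_k = U_(k-1) + (1/k) * (x_k - U_(k-1)): fold one more sample into the mean."""
    return current_estimate + (new_sample - current_estimate) / count

# Averaging the returns 4, 8, 6 one at a time gives 6.0, the same as
# (4 + 8 + 6) / 3, while keeping only the estimate and the sample count.
estimate = 0.0
for k, sample in enumerate([4.0, 8.0, 6.0], start=1):
    estimate = incremental_mean_update(estimate, sample, k)
print(estimate)  # 6.0
```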
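For parts (c) and (d), the sketch below contrasts the two update rules. The utility table `U`, the state names, the step size `alpha`, and the discount `gamma` are illustrative assumptions, not values from the original question.

```python
def mc_update(utility, state, episode_return, alpha=0.1):
    """Monte-Carlo: move U(s) toward the complete return observed at episode end."""
    utility[state] += alpha * (episode_return - utility[state])

def td0_update(utility, state, reward, next_state, alpha=0.1, gamma=0.9):
    """TD(0): bootstrap from the current estimate of the next state, so the
    update can be applied immediately after each transition."""
    target = reward + gamma * utility[next_state]
    utility[state] += alpha * (target - utility[state])

# The MC update needs the whole return for the episode, while the TD update
# only needs the single transition (state, reward, next_state).
U = {"s1": 0.0, "s2": 0.5}
mc_update(U, "s1", episode_return=1.0)
td0_update(U, "s1", reward=0.0, next_state="s2")
print(U)
```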
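Finally, the sketch below shows the two Q-value targets discussed in part (e). The two-action Q-table and its numbers are made up purely to demonstrate that the on-policy and off-policy targets can differ.

```python
def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.9):
    """On-policy: the target uses the action a_next actually chosen by the
    behaviour policy (e.g. epsilon-greedy) in the next state."""
    target = r + gamma * Q[(s_next, a_next)]
    Q[(s, a)] += alpha * (target - Q[(s, a)])

def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    """Off-policy: the target uses the maximum Q-value in the next state,
    regardless of which action the agent will actually take there."""
    target = r + gamma * max(Q[(s_next, b)] for b in actions)
    Q[(s, a)] += alpha * (target - Q[(s, a)])

actions = ["left", "right"]
Q = {("s1", "left"): 0.0, ("s1", "right"): 0.0,
     ("s2", "left"): 0.2, ("s2", "right"): 0.8}

# SARSA backs up the exploratory next action ("left", value 0.2);
# Q-Learning always backs up the greedy value in s2 (0.8).
sarsa_update(Q, "s1", "left", r=0.0, s_next="s2", a_next="left")
q_learning_update(Q, "s1", "right", r=0.0, s_next="s2", actions=actions)
print(Q[("s1", "left")], Q[("s1", "right")])  # roughly 0.018 vs 0.072
```

Because the Q-Learning target ignores exploratory actions while SARSA follows them, the two methods can converge to different policies even when trained on the same experience.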