If I use my testing set to define which decision rule should be chosen, I will naturally choose the decision rule that best fits my testing data. What is the problem?

1 Answer

Final answer:

Choosing a decision rule based solely on how well it fits the testing data leads to overfitting the test set: the selected rule may perform poorly on new, unseen data. Instead, use cross-validation or a separate validation set for model selection, and reserve the test set for a single final evaluation.

Step-by-step explanation:

The core problem with using the testing set to choose the decision rule, and naturally selecting the rule that best fits that data, is overfitting. Overfitting occurs when a model or decision rule performs very well on the specific dataset it was evaluated on but fails to generalize to new, unseen data. This happens because the rule becomes tailored to the nuances and noise of the test set rather than capturing the underlying pattern that applies to other data points. Once the test set has guided the choice of rule, its score is no longer an unbiased estimate of real-world performance.
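The effect is easy to demonstrate. Below is a minimal sketch in plain NumPy with made-up coin-flip labels, so the numbers are illustrative only: every candidate rule is truly 50% accurate, yet the one that looks best on a small test set appears much stronger.

```python
import numpy as np

rng = np.random.default_rng(0)
n_test = 50

# True labels are coin flips, so every candidate rule's real accuracy is 50%.
y_test = rng.integers(0, 2, size=n_test)

# Try 1000 candidate "decision rules" (here: random predictors) and
# keep the one that happens to fit the test set best.
best_acc = 0.0
for _ in range(1000):
    preds = rng.integers(0, 2, size=n_test)  # one candidate rule's guesses
    acc = (preds == y_test).mean()
    best_acc = max(best_acc, acc)

print(f"best accuracy on the test set: {best_acc:.0%}")  # typically ~70%
print("true accuracy of any such rule on fresh data: 50%")
```

The gap between the two numbers is exactly the selection bias introduced by choosing on the test set.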

This is akin to memorizing the answers to a specific set of practice questions without understanding the concepts being tested: when confronted with different questions on the same topic, the likelihood of answering correctly diminishes. To avoid overfitting, use techniques such as cross-validation, where the data is split into several subsets and the model is evaluated on each held-out subset in turn. Alternatively, use a completely separate validation set for model selection, preserving the test set strictly for the final evaluation of the chosen model.
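As a concrete illustration, here is a hedged sketch of that workflow using scikit-learn; the dataset, the decision-tree classifier, and the candidate depths are illustrative assumptions, not part of the original answer. Model selection happens only via cross-validation on the training split, and the test set is touched exactly once at the end.

```python
# Sketch: select a model with 5-fold cross-validation, then evaluate
# once on an untouched test set. Dataset and model are assumptions.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Hold out a test set that is used only once, at the very end.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Model selection via cross-validation on the training data only.
candidates = {depth: DecisionTreeClassifier(max_depth=depth, random_state=0)
              for depth in (2, 4, 8, None)}
cv_scores = {depth: cross_val_score(model, X_train, y_train, cv=5).mean()
             for depth, model in candidates.items()}
best_depth = max(cv_scores, key=cv_scores.get)

# Final, unbiased evaluation of the chosen model on the untouched test set.
final_model = candidates[best_depth].fit(X_train, y_train)
print(f"chosen max_depth={best_depth}, "
      f"test accuracy={final_model.score(X_test, y_test):.3f}")
```

Because the test set played no role in choosing `best_depth`, the final accuracy printed here is an honest estimate of performance on new data.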

To conclude, the chosen decision rule or model should strike a balance between fitting the current data well (low bias) and generalizing to new data (low variance), and the test set must play no part in making that choice.
