2 votes
(c) Outlier Removal (10 points)

i. Develop an algorithm (pseudocode) to remove in sequential order observations that are furthest from the data class mean.
ii. Provide the running time and total running time of your algorithm in O-notation and T(n).
iii. Implement your algorithm in your code of choice.
iv. Determine if the data contains an outlier by plotting each class individually. The key is to plot two features at a time in different combinations, e.g., feature 1 vs. feature 2, etc.
v. Provide an explanation of the results:
A. Was there any class that had obvious outliers; if so how did you determine the outlier, if not, why not?
B. What was the metric used to determine separation? Explain why the metric was chosen.

User Connexo
by
7.1k points

1 Answer

4 votes

Final answer:

To remove outliers, calculate the dataset's mean and standard deviation, remove the farthest point exceeding two standard deviations from the mean, and iterate until no outliers remain. Outliers can also be identified with graphical methods using a calculator, and their removal can lead to a better fit of the regression line and a more reliable correlation coefficient.

Step-by-step explanation:

Algorithm for Outlier Removal

To remove outliers from a dataset, we first calculate the mean and standard deviation of the data. Then, we identify any points that are more than two standard deviations away from the mean. Each outlier is removed sequentially, starting with the one farthest from the mean. The removal process continues until no outliers remain.

Pseudocode for Outlier Removal

1. Calculate the mean (μ) and standard deviation (σ) of the dataset.
2. For each data point, calculate the distance from the mean.
3. Identify any data point farther than 2σ from the mean as an outlier.
4. Remove the outlier farthest from the mean.
5. Recalculate μ and σ for the remaining data.
6. Repeat steps 2-5 until no more outliers are found.
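The steps above can be sketched in Python as follows. This is a minimal illustration for a single numeric feature, not a definitive implementation: the function name `remove_outliers`, the parameter `k`, the use of the population standard deviation (`statistics.pstdev`), and the sample values are all assumptions made for the example.

```python
from statistics import mean, pstdev


def remove_outliers(values, k=2.0):
    """Drop the point farthest from the mean, one at a time, while that
    point lies more than k standard deviations from the mean."""
    kept = list(values)
    removed = []
    while len(kept) > 2:
        mu = mean(kept)            # step 1 / step 5: (re)compute the mean
        sigma = pstdev(kept)       # population std. dev. (an assumption)
        if sigma == 0:
            break                  # all points identical, nothing to remove
        # step 2: find the point with the largest distance from the mean
        farthest = max(kept, key=lambda x: abs(x - mu))
        # step 3: stop when even the farthest point is within k * sigma
        if abs(farthest - mu) <= k * sigma:
            break
        # step 4: remove that single point, then loop (step 6)
        kept.remove(farthest)
        removed.append(farthest)
    return kept, removed


# Made-up sample data with one obviously extreme value
data = [10, 12, 11, 13, 12, 11, 10, 12, 95]
kept, removed = remove_outliers(data)
print("kept:", kept)
print("removed:", removed)
```

For data with several features per class, the same loop applies if the absolute difference is replaced by a distance to the class mean vector, for example the Euclidean distance.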

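A single pass costs O(n): calculating the mean, calculating the standard deviation, and finding the farthest point each take O(n). Because μ and σ are recalculated after every removal, the pass is repeated once per removed point. With k removals the total running time is O(k·n), which is O(n²) in the worst case and O(n) only when the number of outliers is bounded by a constant. The specific implementation will depend on the programming language used.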
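Since the mean and standard deviation are recalculated after each removal, the per-pass cost is paid once for every removed point. A worked bound for T(n) is sketched below, assuming each pass over m remaining points costs c·m for some constant c, and that k points are removed in total (so there are k + 1 passes, the last one finding no outlier):

```latex
\begin{align*}
T_{\text{pass}}(m) &= c\,m \\
T(n) &= \sum_{i=0}^{k} c\,(n - i) \;\le\; c\,(k+1)\,n = O(k\,n) \\
T(n) &\le \sum_{i=0}^{n-1} c\,(n - i) = c\,\frac{n(n+1)}{2} = O(n^2)
\quad \text{(worst case, } k \text{ proportional to } n\text{)}
\end{align*}
```

When only a constant number of points is removed, the bound reduces to O(n).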

Determining Outliers Using Graphical Methods

Outliers can be identified by plotting the data points and looking for those that lie more than two standard deviations of the residuals away from the best-fit line. Graphing calculators such as the TI-83, 83+, or 84+ make it easy to plot the data and spot these points.
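The question asks for each class to be plotted two features at a time. One possible way to do this in code rather than on a calculator is sketched below. The function names `feature_pairs` and `plot_class_pairs` are hypothetical, the data layout (one tuple of feature values per observation) is an assumption, and the plotting function assumes matplotlib is installed.

```python
from itertools import combinations


def feature_pairs(n_features):
    """Return every unordered pair of feature indices: (0, 1), (0, 2), ..."""
    return list(combinations(range(n_features), 2))


def plot_class_pairs(rows, class_name, feature_names):
    """Scatter-plot every feature pair for the observations of one class.

    rows: list of tuples, one tuple of feature values per observation.
    """
    import matplotlib.pyplot as plt  # imported lazily; assumed to be installed

    pairs = feature_pairs(len(feature_names))
    if not pairs:
        raise ValueError("need at least two features to plot pairs")
    # One subplot per feature pair, laid out in a single row
    fig, axes = plt.subplots(1, len(pairs), figsize=(4 * len(pairs), 4),
                             squeeze=False)
    for ax, (i, j) in zip(axes[0], pairs):
        ax.scatter([r[i] for r in rows], [r[j] for r in rows])
        ax.set_xlabel(feature_names[i])
        ax.set_ylabel(feature_names[j])
    fig.suptitle(f"Class: {class_name}")
    return fig


# Four features give six pairwise plots per class
print(feature_pairs(4))
```

Calling `plot_class_pairs` once per class gives one row of scatter plots per class, in which a point sitting far from the main cluster is a candidate outlier.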

Explanation of Results

Any outliers can be identified either by comparing each point's distance from the mean or graphically, as points lying more than two standard deviations from the best-fit line. The standard deviation is often used as the metric because it indicates how spread out the data is around the mean.

Effect on Regression Analysis

Removing outliers can affect the linear regression line's fit, the sum of squared errors (SSE), and the correlation coefficient (r). These changes can be observed by plotting new regression lines and conducting statistical tests to confirm improvements.
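To make the effect concrete, the sketch below fits a least-squares line before and after dropping one extreme point and reports the slope, SSE, and r for each fit. The helper `linear_fit` is a hypothetical name and the data values are made up purely for illustration.

```python
from math import sqrt


def linear_fit(xs, ys):
    """Ordinary least-squares line; returns (slope, intercept, sse, r)."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = my - slope * mx
    # Sum of squared errors of the fitted line
    sse = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    # Pearson correlation coefficient
    r = sxy / sqrt(sxx * syy)
    return slope, intercept, sse, r


# Made-up data: roughly y = 2x, with an extreme last point
xs = [1, 2, 3, 4, 5, 6]
ys = [2.1, 3.9, 6.2, 7.8, 10.1, 30.0]

before = linear_fit(xs, ys)
after = linear_fit(xs[:-1], ys[:-1])  # same data without the last point

for label, (slope, intercept, sse, r) in (("before", before), ("after", after)):
    print(f"{label}: slope={slope:.3f} SSE={sse:.3f} r={r:.4f}")
```

On this toy data the SSE drops and r moves closer to 1 once the extreme point is removed. Real data may behave differently, so each removal should be justified rather than assumed to be an improvement.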

User Miky
by
6.4k points