Customer Rating of Breakfast Cereals. The dataset Cereals.csv includes nutritional information, store display, and consumer ratings for 77 breakfast cereals.

Data Preprocessing. Remove all cereals with missing values.

a. Apply hierarchical clustering to the data using Euclidean distance to the normalized measurements. Compare the dendrograms from single linkage and complete linkage, and look at cluster centroids. Comment on the structure of the clusters and on their stability. Hint: To obtain cluster centroids for hierarchical clustering, compute the average values of each cluster's members, using the aggregate() function.

b. Which method leads to the most insightful or meaningful clusters?

c. Choose one of the methods. How many clusters would you use? What distance is used for this cutoff? (Look at the dendrogram.)

d. The elementary public schools would like to choose a set of cereals to include in their daily cafeterias. Every day a different cereal is offered, but all cereals should support a healthy diet. For this goal, you are requested to find a cluster of "healthy cereals." Should the data be normalized? If not, how should they be used in the cluster analysis?

1 Answer

Data Loading and Preprocessing:

Load the dataset Cereals.csv using pandas.

Remove cereals with missing values using dropna().

import pandas as pd

# Load the dataset
data = pd.read_csv('Cereals.csv')

# Remove cereals with missing values
data.dropna(inplace=True)

Normalization of Data:

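Normalize the measurements so that all features are on the same scale before computing Euclidean distances; otherwise, variables with large ranges dominate the distance. Two common options from scikit-learn are MinMaxScaler (rescales each column to [0, 1]) and StandardScaler (converts each column to z-scores).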

from sklearn.preprocessing import MinMaxScaler

# Select the columns to cluster on and normalize them.
# Assumption: all numeric columns are used; replace this with an
# explicit list of column names if only some variables are relevant.
columns_for_clustering = data.select_dtypes(include='number').columns
scaler = MinMaxScaler()
data_normalized = scaler.fit_transform(data[columns_for_clustering])
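As an alternative to min-max scaling, the z-score standardization mentioned above can be sketched with pandas alone. The small table is made-up illustration data, not values from Cereals.csv, and its column names are assumptions.

```python
import pandas as pd

# Made-up illustration data (NOT taken from Cereals.csv); column names are assumptions
toy = pd.DataFrame({
    "calories": [70, 110, 120, 150],
    "sodium": [130, 200, 15, 290],
    "fiber": [10.0, 1.0, 2.0, 0.0],
})

# z-score: subtract each column's mean and divide by its standard deviation.
# ddof=0 (population standard deviation) matches what StandardScaler computes.
toy_z = (toy - toy.mean()) / toy.std(ddof=0)

print(toy_z.round(2))
```

After this step every column has mean 0 and standard deviation 1, so no single variable dominates the Euclidean distance because of its units.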

Hierarchical Clustering:

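Build the cluster trees with linkage() from scipy.cluster.hierarchy, once with single linkage and once with complete linkage. Euclidean distance is the default metric. scikit-learn's AgglomerativeClustering fits the same kind of model, but dendrogram() expects the linkage matrix that SciPy returns.

Plot both dendrograms with SciPy's dendrogram() and matplotlib to visualize the clustering structures.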

from scipy.cluster.hierarchy import linkage, dendrogram
import matplotlib.pyplot as plt

# Perform hierarchical clustering with single linkage
single_linkage = linkage(data_normalized, method='single')

# Perform hierarchical clustering with complete linkage
complete_linkage = linkage(data_normalized, method='complete')

# Plot dendrogram for single linkage
plt.figure(figsize=(10, 5))
plt.title('Dendrogram - Single Linkage')
dendrogram(single_linkage)
plt.show()

# Plot dendrogram for complete linkage
plt.figure(figsize=(10, 5))
plt.title('Dendrogram - Complete Linkage')
dendrogram(complete_linkage)
plt.show()

Interpretation:

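Compare the two dendrograms and note how the clusters form. Single linkage tends to chain observations into one large cluster plus a few stragglers, while complete linkage tends to produce more compact clusters of similar size.

Calculate cluster centroids by averaging each variable over the members of each cluster. The aggregate() function in the hint is an R function; in pandas, the equivalent is groupby() on the cluster labels followed by mean().

Comment on the structure, stability, and meaningfulness of the clusters formed by both methods. One way to judge stability is to check whether cereals keep the same cluster memberships when the linkage method changes.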
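The centroid and stability checks described above can be sketched as follows. This is a minimal example on synthetic data standing in for the normalized cereal measurements; the feature names x1 and x2 and the choice of two clusters are assumptions made for illustration. It uses fcluster() from SciPy to cut each tree into flat clusters.

```python
import numpy as np
import pandas as pd
from scipy.cluster.hierarchy import linkage, fcluster

# Synthetic stand-in for the normalized measurements (assumption):
# two well-separated groups of 5 points in 2 dimensions
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=0.0, scale=0.1, size=(5, 2)),
    rng.normal(loc=3.0, scale=0.1, size=(5, 2)),
])
df = pd.DataFrame(X, columns=["x1", "x2"])  # hypothetical feature names

labels = {}
for method in ("single", "complete"):
    Z = linkage(X, method=method, metric="euclidean")
    # Cut the tree so that at most 2 flat clusters remain
    labels[method] = fcluster(Z, t=2, criterion="maxclust")

# Centroids: mean of every feature within each complete-linkage cluster
centroids = df.groupby(labels["complete"]).mean()
print(centroids)

# Stability check: cross-tabulate memberships from the two linkage methods
agreement = pd.crosstab(labels["single"], labels["complete"])
print(agreement)
```

If the two methods agree, each row of the cross-tabulation has a single nonzero cell. Rows spread over several columns indicate observations that change cluster when the linkage method changes.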

Selecting Number of Clusters:

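Decide on the number of clusters from the dendrogram by looking for the largest vertical gap between successive merges. Cutting inside that gap separates clusters that are far apart from each other.

The cutoff distance is the height on the dendrogram at which a horizontal cut yields the desired number of clusters.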
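The largest-gap rule can also be applied to the merge heights directly instead of reading them off the plot. The sketch below uses synthetic data with three tight groups as a stand-in for the normalized cereal data (an assumption). The largest-gap rule is one heuristic, not the only way to pick a cutoff.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Synthetic stand-in for the normalized data (assumption): three tight groups
rng = np.random.default_rng(1)
centers = np.array([[0.0, 0.0], [3.0, 0.0], [0.0, 3.0]])
X = np.vstack([rng.normal(loc=c, scale=0.05, size=(6, 2)) for c in centers])

Z = linkage(X, method="complete", metric="euclidean")
heights = Z[:, 2]              # merge distances, in increasing order

gaps = np.diff(heights)        # jump between consecutive merges
i = int(np.argmax(gaps))       # largest jump lies between merge i and merge i + 1
cutoff = (heights[i] + heights[i + 1]) / 2   # cut halfway inside the gap
n_clusters = len(X) - (i + 1)  # clusters left after the first i + 1 merges

# Flat clusters obtained by cutting the tree at the chosen distance
labels = fcluster(Z, t=cutoff, criterion="distance")
print(f"cutoff distance = {cutoff:.3f}, clusters = {n_clusters}")
```

The same cutoff can be drawn on the dendrogram with plt.axhline(cutoff) to confirm visually that the horizontal line crosses the expected number of branches.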

by User MinhNguyen (7.9k points)