1 vote
Choose any dataset (from any online resource). Describe a variable in the dataset to which you can apply the CLT for the mean and one to which you can apply the CLT for proportions.

2 Answers

3 votes
I have chosen the "Titanic: Machine Learning from Disaster" dataset from Kaggle. The numeric variable "Age" is a natural fit for the Central Limit Theorem (CLT) for the mean, and the binary variable "Survived", optionally broken down by age group, is a natural fit for the CLT for proportions.

For CLT for mean:
We can draw random samples of passenger ages from the dataset (dropping rows where Age is missing) and compute the mean age of each sample. If we repeat this many times with the same sample size, the distribution of the sample means will be approximately normal once the sample size is large enough (n ≥ 30 is a common rule of thumb). This holds regardless of the shape of the age distribution itself, as long as its variance is finite: the sample means center on the population mean age, and their spread (the standard error) shrinks like σ/√n as the sample size n grows.
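The repeated-sampling idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `population` array is synthetic, right-skewed data standing in for an age column (it is not the real Titanic data), and `sample_means` and `skewness` are hypothetical helper names.

```python
import numpy as np

rng = np.random.default_rng(0)

# ASSUMPTION: synthetic right-skewed values standing in for an "Age" column.
# These are NOT the real Titanic ages.
population = rng.gamma(shape=2.0, scale=15.0, size=10_000)

def sample_means(values, n, reps, rng):
    """Draw `reps` samples of size `n` (with replacement) and return their means."""
    idx = rng.integers(0, len(values), size=(reps, n))
    return values[idx].mean(axis=1)

def skewness(x):
    """Moment-based skewness; values near 0 indicate a symmetric shape."""
    centered = x - x.mean()
    return (centered ** 3).mean() / (centered ** 2).mean() ** 1.5

n = 50
means = sample_means(population, n=n, reps=5_000, rng=rng)

print("population mean:", population.mean(), "| mean of sample means:", means.mean())
print("theoretical SE:", population.std() / np.sqrt(n), "| observed SE:", means.std())
print("population skewness:", skewness(population), "| skewness of means:", skewness(means))
```

Even though the synthetic population is clearly skewed, the sample means should come out far more symmetric, with a spread close to σ/√n.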

For CLT for proportions:
Age itself is numeric, so for proportions we use the binary "Survived" variable, either overall or within an age group (for example, passengers under 18). If we take random samples of n passengers and compute the proportion of survivors in each sample, then over many repetitions the distribution of the sample proportions will be approximately normal, provided n is large enough that both np and n(1 − p) are at least about 10, where p is the population proportion of survivors. The population proportion p is a single fixed number, not a distribution; it is the sample proportions that vary from sample to sample, centering on p with standard error √(p(1 − p)/n).
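A quick simulation can illustrate the behaviour of sample proportions. This is a sketch under assumptions: `p_true = 0.4` is a made-up survival rate chosen for illustration (not computed from the Titanic data), and each sample is modelled as `n` independent yes/no outcomes.

```python
import numpy as np

rng = np.random.default_rng(1)

p_true = 0.4   # ASSUMPTION: hypothetical survival rate, not taken from real data
n = 100        # passengers per sample
reps = 5_000   # number of repeated samples

# Each sample proportion is (number of survivors in the sample) / n.
p_hat = rng.binomial(n, p_true, size=reps) / n

# Standard error predicted by the CLT for proportions.
se_theory = np.sqrt(p_true * (1 - p_true) / n)

# Under a normal approximation, about 95% of sample proportions
# should fall within 1.96 standard errors of p_true.
within = np.mean(np.abs(p_hat - p_true) <= 1.96 * se_theory)

print("mean of sample proportions:", p_hat.mean())
print("theoretical SE:", se_theory, "| observed SE:", p_hat.std())
print("share within 1.96 SE:", within)
```

If the normal approximation is reasonable, the observed standard error should be close to the theoretical one and the share within 1.96 standard errors should be near 95%.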
by User Rajukoyilandy (8.4k points)
4 votes

Here is a possible dataset with variables for which the central limit theorem could be applied:

A dataset of customer spending at different retail stores.

Variable for which CLT for mean could be applied:

Monthly spending per customer. Spending data is often right-skewed, but as the number of customers sampled at each store increases, the distribution of the sample mean spending approaches a normal distribution.

Variable for which CLT for proportions could be applied:

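Proportion of customers who spend over $200 per month. Each customer either does or does not exceed $200, so the sample proportion is approximately normally distributed once the sample is large enough, typically when the expected numbers of customers above and below the threshold are both at least about 10. With only a handful of customers per store, the approximation is poor.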
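A common rule of thumb for the normal approximation to a sample proportion is that both n·p and n·(1 − p) are at least 10 (some textbooks use 5). A minimal sketch of that check, where `normal_approx_ok` is a hypothetical helper name and the store sizes and the 10% share of big spenders are made-up numbers:

```python
def normal_approx_ok(n, p, threshold=10):
    """Return True if the expected counts of successes and failures
    both reach the rule-of-thumb threshold."""
    return n * p >= threshold and n * (1 - p) >= threshold

# ASSUMPTION: illustrative values only; suppose 10% of customers spend over $200.
small_store = normal_approx_ok(n=20, p=0.10)    # expects only 2 big spenders
large_store = normal_approx_ok(n=200, p=0.10)   # expects 20 big spenders

print("20 customers:", small_store)
print("200 customers:", large_store)
```

The check is only a heuristic: it flags samples that are clearly too small, but it does not guarantee any particular accuracy for samples that pass.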

The central limit theorem states that the sampling distributions of the sample mean and the sample proportion are approximately normal for large sample sizes, regardless of the shape of the original distribution, provided the observations are independent and the population variance is finite. Both variables from the retail dataset meet these conditions when customers are sampled at random. Please let me know if you would like any clarification or have a different dataset in mind. I can provide additional examples.

by User Stefan Radonjic (8.2k points)