10.1k views
3 votes
A dataset has 1000 records and 50 variables, with 5% of the values missing, spread randomly throughout the records and variables. An analyst decides to remove records that have missing values. About how many records would you expect to be removed?

User Untrots
by
8.9k points

2 Answers

1 vote

Final answer:

About 923 of the 1000 records would be expected to be removed. With 5% of values missing at random and independently across 50 variables, the chance that a record is complete is (0.95)^50 ≈ 0.077, so roughly 92% of records contain at least one missing value and are dropped by listwise deletion. A Monte Carlo simulation can be used to check this estimate.

Step-by-step explanation:

The student is working with a dataset consisting of 1000 records and 50 variables, with 5% of the values missing randomly. When an analyst decides to remove records with missing values, the analyst is employing a method called listwise deletion or complete case analysis. To estimate how many records would be removed, we need to consider how the missing values are distributed and acknowledge that even a single missing value can lead to the removal of a full record.

The exact number of removed records depends on the pattern of missingness. If values are missing completely at random and independently of each other, we can use probability to estimate the expected number of complete cases. The chance that a given record has no missing values is (1 - 0.05) raised to the power of the number of variables, which is (0.95)^50 ≈ 0.077. The expected number of complete records is therefore about 1000 × 0.077 ≈ 77, so about 1000 - 77 = 923 records would be removed.
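The probability argument above can be turned into a few lines of Python. This is a minimal sketch that assumes each value is missing independently with the same probability; the function name and parameters are illustrative and not part of the original question.

```python
def expected_removed(n_records: int, n_vars: int, p_missing: float) -> float:
    """Expected number of records dropped by listwise deletion.

    Assumption: every value is missing independently with probability p_missing.
    """
    # Probability that all n_vars values in one record are present.
    p_complete = (1.0 - p_missing) ** n_vars
    # A record is removed if at least one of its values is missing.
    return n_records * (1.0 - p_complete)


if __name__ == "__main__":
    for n_vars in (1, 10, 50):
        removed = expected_removed(1000, n_vars, 0.05)
        print(f"{n_vars:>2} variables: about {removed:.0f} records removed")
```

With a single variable the result is the naive 50 records; the count grows quickly as more variables are added, because each extra variable is another chance for a record to contain a missing value.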

It is tempting to reason that 5% of values missing means 5% of records removed, but 5% is the chance that any one variable is missing in a record, not the chance that a record has a missing value somewhere. With 50 variables per record there are 50 chances for a value to be missing (about 2.5 missing values per record on average), so the probability that a record has all variables present is quite low, and we'd expect far more than 5% of records to be missing at least one variable. The independence assumption is a simplification and may not hold in real datasets: if missing values cluster in the same records, fewer records would be removed. Under the question's assumption of random missingness, though, listwise deletion removes the large majority of records.

A Monte Carlo simulation or similar probabilistic model could also be used to confirm this estimate, or to handle cases where missingness is not independent and the closed-form calculation no longer applies.
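A simulation along these lines might look like the sketch below. It assumes independent missingness with probability 0.05 per value, uses only the standard library, and the function name, trial count, and seed are arbitrary choices made for illustration.

```python
import random


def simulate_removed(n_records=1000, n_vars=50, p_missing=0.05,
                     n_trials=200, seed=0):
    """Average number of records with at least one missing value.

    Assumption: each value is missing independently with probability p_missing.
    """
    rng = random.Random(seed)
    total_removed = 0
    for _ in range(n_trials):
        for _ in range(n_records):
            # A record is removed if any one of its values is missing.
            if any(rng.random() < p_missing for _ in range(n_vars)):
                total_removed += 1
    return total_removed / n_trials


if __name__ == "__main__":
    print(f"simulated average removed: {simulate_removed():.1f}")
```

The simulated average should land close to 1000 × (1 - (0.95)^50), which is roughly 923. To model missingness that is not independent, only the line that decides whether a record has a missing value would need to change.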

User Kyle Humfeld
by
7.3k points
5 votes
Answer: about 923

Explanation:

Since 5% of the values are missing, 0.05 is the probability that a single value is missing, not the probability that a record has a missing value. A record is removed if any of its 50 values is missing, so the probability that a record has a missing value is 1 - (0.95)^50 ≈ 0.923.

Hence, the number of records that are expected to be removed is given by

Expected Value = (number of records)(probability of having a missing value)
= (1000)(1 - (0.95)^50)
Expected Value ≈ 923
User Yoav Kadosh
by
8.4k points