120k views
1 vote
Coding Assignment

For the final assignment, you will be given 5 different types of data:
• education_data.csv: this is a biennial data containing the records of student debt from major US universities between 2010 and 2020.
• cost_data.csv: this data contains information on the net out-of-pocket costs that families pay for each university.
• graduates_income.csv: contains information on the income of students graduating in 2018. You will use this data for the write-up.
• data_description.csv: this data contains the descriptions of the variables in the 3 datasets described above.
• CPI_U_minneapolis_fed.csv: is the CPI data from the Federal Reserve of Minneapolis between 2000 and 2023. First, for the coding assignment, go to Canvas and download the CPI_U_minneapolis_fed.csv and the data_description.csv data. Then go to the data website for the class and download the other data corresponding to the final project. You should get 2 datasets from the website: education_data.csv and cost_data.csv. The graduates_income.csv data is also on Canvas but you will not need it for the coding part, only the write-up part.
As an analyst, main part of your job is to understand the data including its structure and variables. education_data.csv, cost_data.csv, and graduates_income.csv are described in the file called data_description.csv. Before starting, read this file to make sure that you understand the data.
Again, every question is private!

Part 1: Cleaning Education data
1. Import the CPI data (Make sure to not rename CPI_U_minneapolis_fed.csv when you download it from Canvas), and select only the first and second columns. Name it cpi_data
2. Import the education data. This data should have 11 variables. First, rename the variable Year to the lower case year. Then, rename the other 10 variables according to the rename column in data_description.csv. Then, convert the school names to lower case. Finally, convert the 5 varia listed below to numeric type (Do not worry about warning message of NAs introduced by coercion.)
Save it as education_data.
• median_debt_low_income, median_debt_med_income, median_debt_high_income
• default_rate, avg_family_income
3. Update the column called institution_type to be equal to "public" if the school is public; set
institution_type to "private' otherwise. Save this as education_data_clean.
4. Filter education_data_clean to include only the schools that predominantly offers a bachelor’s degree.
Name this data education_data_BA1
5. Merge education_data_BA1 with cpi_data by keeping only the values in education_data_BA1 and
save it as education_data_BA. Then convert the debt values and average family income to 2018
dollar values using the formula from the federal reserve of Minneapolis (You have done this in previous
homework.) You can find the formula here. Rename the variables such that:
• the 2018 dollar value of median_debt_low_income is called real_debt_low_income
• the 2018 dollar value of median_debt_med_income is called real_debt_med_income
• the 2018 dollar value of median_debt_high_income is called real_debt_high_income
• the 2018 dollar value of avg_family_income is called real_family_income
Finally, drop the 5 variables listed below which include the median variables, average family income, and
CPI values. Make sure to update education_data_BA. If you did everything right, the first few columns of
education_data_BA should look like Table 1.
• median_debt_low_income, median_debt_med_income, median_debt_high_income,
• avg_family_income, CPI

Part 2: Cleaning Cost Data
1. Import the cost dataset, name it cost_data1, and select the following 9 variables: UNITID , INSTNM , YEAR , NPT41_PUB, NPT43_PUB, NPT45_PUB, NPT41_PRIV, NPT43_PRIV, NPT45_PRIV
2. Using cost_data1, create cost_data2 by doing the following: rename YEAR to year and rename the
other 8 columns according to the rename column in data_description.csv. Then convert the school
names to lower case and convert the following variable to numeric values:
mean_cost_low_income_public , mean_cost_med_income_public, mean_cost_high_income_public
mean_cost_low_income_private, mean_cost_med_income_private, mean_cost_high_income_private

1 Answer

5 votes

So, to do the above data work, do Import the CPI data, selecting columns "Year" and "VALUE" as cpi_data. Rename and clean education_data, converting certain variables to numeric.

Also, Update institution_type based on public or private status. Filter for bachelor's degree programs. Merge with CPI data, creating real-dollar values for specific variables. Drop unnecessary columns.

Then, for cost_data, import and select relevant variables, renaming and converting school names to lowercase. Convert specified cost variables to numeric. Ensure column names match data_description.csv.

Hence, The Python code using the pandas library that can help to perform the above specified cleaning tasks for education_data and cost_data is shown below

python

import pandas as pd

# Part 1: Cleaning Education data

# 1. Import and select CPI data

cpi_data = pd.read_csv('CPI_U_minneapolis_fed.csv', usecols=[0, 1])

# 2. Import education data and rename variables

education_data = pd.read_csv('education_data.csv')

data_description = pd.read_csv('data_description.csv')

# Rename variables according to data_description

education_data = education_data.rename(columns=dict(zip(data_description['Variable'], data_description['rename'])))

# Convert school names to lowercase

education_data['school_name'] = education_data['school_name'].str.lower()

# Convert specific variables to numeric

numeric_vars = ['median_debt_low_income', 'median_debt_med_income', 'median_debt_high_income', 'default_rate', 'avg_family_income']

education_data[numeric_vars] = education_data[numeric_vars].apply(pd.to_numeric, errors='coerce')

# 3. Update institution_type column

education_data['institution_type'] = education_data['institution_type'].apply(lambda x: 'public' if x == 'public' else 'private')

# 4. Filter for schools offering predominantly bachelor’s degrees

education_data_clean = education_data[education_data['highest_degree_offered'] == "bachelor's"]

# 5. Merge with CPI data and convert values to 2018 dollars

education_data_BA1 = pd.merge(education_data_clean, cpi_data, left_on='year', right_on='year')

cpi_base = education_data_BA1['CPI'].iloc[0]

# Function to convert to 2018 dollars

def convert_to_2018_dollars(value, cpi_value, cpi_base):

return value * (cpi_base / cpi_value)

# Convert debt and income variables to 2018 dollars

for var in ['median_debt_low_income', 'median_debt_med_income', 'median_debt_high_income', 'avg_family_income']:

education_data_BA1[f'real_{var}'] = education_data_BA1.apply(lambda x: convert_to_2018_dollars(x[var], x['CPI'], cpi_base), axis=1)

# Drop unnecessary columns

education_data_BA = education_data_BA1.drop(columns=['median_debt_low_income', 'median_debt_med_income', 'median_debt_high_income', 'avg_family_income', 'CPI'])

# Part 2: Cleaning Cost Data

# 1. Import cost dataset and select relevant variables

cost_data1 = pd.read_csv('cost_data.csv', usecols=['UNITID', 'INSTNM', 'YEAR', 'NPT41_PUB', 'NPT43_PUB', 'NPT45_PUB', 'NPT41_PRIV', 'NPT43_PRIV', 'NPT45_PRIV'])

# 2. Rename variables and convert to numeric

cost_data2 = cost_data1.rename(columns=dict(zip(data_description['Variable'], data_description['rename'])))

cost_data2['INSTNM'] = cost_data2['INSTNM'].str.lower()

cost_data2[['mean_cost_low_income_public', 'mean_cost_med_income_public', 'mean_cost_high_income_public', 'mean_cost_low_income_private', 'mean_cost_med_income_private', 'mean_cost_high_income_private']] = cost_data2[['mean_cost_low_income_public', 'mean_cost_med_income_public', 'mean_cost_high_income_public', 'mean_cost_low_income_private', 'mean_cost_med_income_private', 'mean_cost_high_income_private']].apply(pd.to_numeric, errors='coerce')

User HaukurHaf
by
7.9k points