So, to do the above data work, do Import the CPI data, selecting columns "Year" and "VALUE" as cpi_data. Rename and clean education_data, converting certain variables to numeric.
Also, Update institution_type based on public or private status. Filter for bachelor's degree programs. Merge with CPI data, creating real-dollar values for specific variables. Drop unnecessary columns.
Then, for cost_data, import and select relevant variables, renaming and converting school names to lowercase. Convert specified cost variables to numeric. Ensure column names match data_description.csv.
Hence, The Python code using the pandas library that can help to perform the above specified cleaning tasks for education_data and cost_data is shown below
python
import pandas as pd
# Part 1: Cleaning Education data
# 1. Import and select CPI data
cpi_data = pd.read_csv('CPI_U_minneapolis_fed.csv', usecols=[0, 1])
# 2. Import education data and rename variables
education_data = pd.read_csv('education_data.csv')
data_description = pd.read_csv('data_description.csv')
# Rename variables according to data_description
education_data = education_data.rename(columns=dict(zip(data_description['Variable'], data_description['rename'])))
# Convert school names to lowercase
education_data['school_name'] = education_data['school_name'].str.lower()
# Convert specific variables to numeric
numeric_vars = ['median_debt_low_income', 'median_debt_med_income', 'median_debt_high_income', 'default_rate', 'avg_family_income']
education_data[numeric_vars] = education_data[numeric_vars].apply(pd.to_numeric, errors='coerce')
# 3. Update institution_type column
education_data['institution_type'] = education_data['institution_type'].apply(lambda x: 'public' if x == 'public' else 'private')
# 4. Filter for schools offering predominantly bachelor’s degrees
education_data_clean = education_data[education_data['highest_degree_offered'] == "bachelor's"]
# 5. Merge with CPI data and convert values to 2018 dollars
education_data_BA1 = pd.merge(education_data_clean, cpi_data, left_on='year', right_on='year')
cpi_base = education_data_BA1['CPI'].iloc[0]
# Function to convert to 2018 dollars
def convert_to_2018_dollars(value, cpi_value, cpi_base):
return value * (cpi_base / cpi_value)
# Convert debt and income variables to 2018 dollars
for var in ['median_debt_low_income', 'median_debt_med_income', 'median_debt_high_income', 'avg_family_income']:
education_data_BA1[f'real_{var}'] = education_data_BA1.apply(lambda x: convert_to_2018_dollars(x[var], x['CPI'], cpi_base), axis=1)
# Drop unnecessary columns
education_data_BA = education_data_BA1.drop(columns=['median_debt_low_income', 'median_debt_med_income', 'median_debt_high_income', 'avg_family_income', 'CPI'])
# Part 2: Cleaning Cost Data
# 1. Import cost dataset and select relevant variables
cost_data1 = pd.read_csv('cost_data.csv', usecols=['UNITID', 'INSTNM', 'YEAR', 'NPT41_PUB', 'NPT43_PUB', 'NPT45_PUB', 'NPT41_PRIV', 'NPT43_PRIV', 'NPT45_PRIV'])
# 2. Rename variables and convert to numeric
cost_data2 = cost_data1.rename(columns=dict(zip(data_description['Variable'], data_description['rename'])))
cost_data2['INSTNM'] = cost_data2['INSTNM'].str.lower()
cost_data2[['mean_cost_low_income_public', 'mean_cost_med_income_public', 'mean_cost_high_income_public', 'mean_cost_low_income_private', 'mean_cost_med_income_private', 'mean_cost_high_income_private']] = cost_data2[['mean_cost_low_income_public', 'mean_cost_med_income_public', 'mean_cost_high_income_public', 'mean_cost_low_income_private', 'mean_cost_med_income_private', 'mean_cost_high_income_private']].apply(pd.to_numeric, errors='coerce')