55.7k views
5 votes
For the final assignment, you will be given 5 different types of data:

• education_data.csv: this is a biennial data containing the records of student debt from major US universities between 2010 and 2020.
• cost_data.csv: this data contains information on the net out-of-pocket costs that families pay for each university.
• graduates_income.csv: contains information on the income of students graduating in 2018. You will use this data for the write-up.
• data_description.csv: this data contains the descriptions of the variables in the 3 datasets described above.
• CPI_U_minneapolis_fed.csv: is the CPI data from the Federal Reserve of Minneapolis between 2000 and 2023. First, for the coding assignment, go to Canvas and download the CPI_U_minneapolis_fed.csv and the data_description.csv data. Then go to the data website for the class and download the other data corresponding to the final project. You should get 2 datasets from the website: education_data.csv and cost_data.csv. The graduates_income.csv data is also on Canvas but you will not need it for the coding part, only the write-up part.
As an analyst, main part of your job is to understand the data including its structure and variables. education_data.csv, cost_data.csv, and graduates_income.csv are described in the file called data_description.csv. Before starting, read this file to make sure that you understand the data.
Again, every question is private!

... ...
3. Create a new column called mean_cost_low_income which is equal to mean_cost_low_income_public.
Then replace the NA values in mean_cost_low_income by mean_cost_low_income_private, otherwise
keep it to the original value of mean_cost_low_income. Do the same for median and high income cost:
• create a new column called mean_cost_med_income which is equal to mean_cost_med_income_public.
Then replace the NA values in mean_cost_med_income by mean_cost_med_income_private, otherwise
keep it to the original value of mean_cost_med_income.
• create a new column called mean_cost_high_income which is equal to mean_cost_high_income_public.
Then replace the NA values in mean_cost_high_income by mean_cost_high_income_private, other-
wise keep it to the original value of mean_cost_high_income.
Then remove remove the variables below:
• mean_cost_low_income_public, mean_cost_low_income_private, mean_cost_med_income_public,
• mean_cost_med_income_private, mean_cost_high_income_public, mean_cost_high_income_private.
Then save it as cost_data3. This should look like Table

4. Merge cost_data3 with cpi_data by keeping only the values in cost_data3 and save it as cost_data4.
Update cost_data4 by creating 3 new columns which are the cost values converted to 2018 dollar values
using the formula from the federal reserve of Minneapolis. You can find the formula here. The name
of the new columns should be:
• the 2018 dollar value of mean_cost_low_income is called real_cost_low_income
• the 2018 dollar value of mean_cost_med_income is called real_cost_med_income
• the 2018 dollar value of mean_cost_high_income is called real_cost_high_income
5. Finally, remove the mean and CPI variables from cost_data4 and save it as cost_data. If you did
everything right, the first few observations of cost_data should look like Table 3. The variables to
remove are:
• mean_cost_low_income, mean_cost_med_income, mean_cost_high_income , CPI

Part 3: Merging debt and cost data
1. Merge education_data_BA and cost_data by year and school_id and by keeping only the val-
ues in education_data_BA and save it as education_data_BA_cost. You should have 2 different
school_name columns after the merge, one called school_name.x and the other school_name.y. Use
this to verify if you merged correctly.
Then drop school_name.y in education_data_BA_cost. The first few observations of education_data_BA_cost
should look like Table 4.
2. Using education_data_BA_cost, replicate Table 5 and save is as debt_cost_sumstat_year.
3. Using debt_cost_sumstat_year, replicate Table 6, save it as debt_cost_data_by_year. This is harder
than the tables we usually make. (hint: One way to do it is to first make one table with the debt
column using the function pivot_longer(), here is a good tutorial on pivot_longer(), then make
another table with the cost column using the function pivot_longer(), then combine the two tables
by year, institution_type, and income_category using inner_join().)
4. Using the education_data_BA_cost, replicate Table 7 to 10 below:
• 4a. name table 7 as debt_sumstat_school_type
• 4b. table 8 as debt_sumstat_year
• 4c. table 9 as cost_sumstat_school_type
• 4d. table 10 as cost_sumstat_year
Note: The values in the tables below are just examples, the values in your correct tables should be different
because each student will receive a slightly different dataset.
Copyright UCSB 2023 4

1 Answer

2 votes

The coding assignment above, involves analyzing student debt data from the US and examining the relationship between debt levels, university costs, and student income.

So, The data cleaning and analysis process involves merging and manipulating multiple datasets, calculating inflation-adjusted costs, and generating various summary tables for comprehensive insights.

So, for Part 1, to do it, one need to read the data: Use the read.csv() function to load the education_data.csv, cost_data.csv, and data_description.csv files.

Also examine the data and use descriptive statistics and data visualization techniques to understand the structure and characteristics of each dataset.

Then Combine data and Merge the education_data.csv and cost_data.csv datasets by matching the year and school_id columns.

User Sagar Gautam
by
7.0k points