5 votes
Use the CDC's Oncology SEER dataset (attached below) to find the total number of occurrences of various breast cancers, separately for men and women, in four age groups (ages 0-24, 25-49, 50-74, 75+). Save the output in a .csv file with nine values per line, separated by commas: the cancer type (with any commas removed), the total number of occurrences in men aged 0-24, the total number of occurrences in women aged 0-24, and so on. Use the names from ICD-O3 for cancer types. You should print only the cancer types whose codes are found in ICD-O3; they need not be in any particular order. You need to read the SEER file and then read the ICD-O3 file to update the cancer type (so that the cancer name shows in the final output file). Submit your .py file in Canvas.

Character positions in the SEER file, counting from the left:

Sex: 24
Age at diagnosis: 25-27
Year of birth: 28-31
Histology Type ICD-O-3: 53-56
Behavior Code ICD-O-3: 57
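
The positions above map directly onto string slices. A minimal sketch, assuming the positions are 1-based and inclusive (so position 24 is index 23); the function name `parse_record` and the sample line are hypothetical and only illustrate the slicing:

```python
def parse_record(line):
    """Extract the fields of interest from one fixed-width SEER record.

    Assumes the listed positions are 1-based and inclusive, so
    position 24 is line[23] and positions 25-27 are line[24:27].
    """
    age_text = line[24:27].strip()
    return {
        "sex": line[23:24],
        # Blank or non-numeric ages are returned as None.
        "age": int(age_text) if age_text.isdigit() else None,
        "year_of_birth": line[27:31],
        "histology": line[52:56],
        "behavior": line[56:57],
    }


# Synthetic record built only to exercise the slicing; not real SEER data.
sample = " " * 23 + "2" + "045" + "1960" + " " * 21 + "8500" + "3"
record = parse_record(sample)
```

If the real file turns out to use 0-based positions, every slice shifts by one, so it is worth checking a few lines by eye first.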

User Flauwekeul
by
8.5k points

2 Answers

6 votes

Final answer:

The task involves writing a Python script to analyze the SEER dataset for breast cancer occurrences by sex and age group, and then matching these occurrences with their corresponding cancer types using the ICD-O-3 dataset. The final data must be saved in a CSV file with nine columns.

Step-by-step explanation:

The student needs to analyze the SEER dataset to count occurrences of each type of breast cancer, broken down by sex and age group. The Python script reads each SEER record and extracts the patient's sex, age at diagnosis, histology type, and behavior code. The histology codes are then matched against the ICD-O-3 dataset so that each code is replaced by the corresponding cancer type name. The result is saved to a .csv file in which each line holds nine comma-separated values: the cancer type (with commas removed) followed by eight counts, one for men and one for women in each of the four age groups.

Script Overview

  • Read the SEER dataset and extract the necessary information.
  • Match the histology codes from SEER with the cancer type names from ICD-O-3.
  • Calculate the total number of occurrences by sex and age group.
  • Save the output in the required format to a .csv file.
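
The counting step in this overview could look roughly like the sketch below. It assumes sex is coded "1" for male and "2" for female and that ages above a cutoff are placeholders for "unknown"; both are assumptions to check against the dataset's documentation. The names `age_group_index`, `count_occurrences`, and `max_age` are hypothetical.

```python
from collections import defaultdict


def age_group_index(age):
    """Map an age to 0..3 for the groups 0-24, 25-49, 50-74, 75+."""
    if age < 25:
        return 0
    if age < 50:
        return 1
    if age < 75:
        return 2
    return 3


def count_occurrences(records, male_code="1", female_code="2", max_age=120):
    """Count records per cancer code in eight columns.

    Column order: men 0-24, women 0-24, men 25-49, women 25-49, ...
    `records` is an iterable of (code, sex, age) tuples.
    The sex codes and max_age cutoff are assumptions, not SEER facts.
    """
    counts = defaultdict(lambda: [0] * 8)
    for code, sex, age in records:
        # Skip records with unknown sex or an unusable age.
        if sex not in (male_code, female_code):
            continue
        if age is None or not 0 <= age <= max_age:
            continue
        column = 2 * age_group_index(age) + (0 if sex == male_code else 1)
        counts[code][column] += 1
    return counts


# Made-up records for illustration only.
demo_records = [
    ("8500", "2", 45),
    ("8500", "2", 49),
    ("8500", "1", 80),
    ("8520", "2", 10),
    ("8500", "2", 999),  # skipped: age above max_age
    ("8500", "9", 50),   # skipped: unrecognized sex code
]
demo_counts = count_occurrences(demo_records)
```

Keeping the eight counts in one list per code means each output row is just the name followed by that list.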

User Yo
by
8.3k points
1 vote

The task involves reading the SEER file, extracting relevant information, mapping cancer types using the ICD-O3 file, calculating occurrences based on sex and age groups, and saving the results in a CSV file. The script should print only the cancer types with valid ICD-O3 codes.

To find the total number of occurrences of various breast cancers separately for men and women in four age groups, you will need to follow these steps:

1. Read the SEER file: Start by reading the SEER file, which contains the data you need, and pull each field from its specified character positions. For the counts you need the sex, age at diagnosis, histology type (ICD-O-3), and behavior code (ICD-O-3); year of birth is also in the layout, but it is not required because age at diagnosis is given directly.

2. Read the ICD-O3 file: Next, read the ICD-O3 file to update the cancer type using the histology type (ICD-O-3) from the SEER file. This will ensure that the cancer names are included in the final output file.

3. Calculate the occurrences: Group the data by sex and age groups (0-24, 25-49, 50-74, 75+), and then calculate the total number of occurrences for each cancer type within each group. To do this, you will need to count the occurrences for men and women separately for each age group.

4. Save the output: Save the output in a .csv file with nine items per line separated by commas. Each line should contain the following information: the cancer type (with commas removed), total number of occurrences in men aged 0-24, total number of occurrences in women aged 0-24, total number of occurrences in men aged 25-49, total number of occurrences in women aged 25-49, and so on.

5. Print the results: Finally, print only the cancer types whose codes are found in ICD-O3, and apply the same filter to the rows written to the file. The cancer types do not need to be in any particular order.
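
Steps 4 and 5 can be sketched together as follows. This assumes the ICD-O3 file has already been loaded into a dict mapping each code to its name (the file's layout is not given here, so the loading step is left out), and that the counts are stored as a list of eight numbers per code. The names `write_counts`, `names`, and `counts` are hypothetical.

```python
import csv
import io


def write_counts(counts, names, out):
    """Write one nine-value row per cancer code found in `names`.

    counts: dict mapping code -> list of eight counts
    names:  dict mapping code -> cancer type name (from ICD-O3)
    out:    any writable text file object
    """
    writer = csv.writer(out, lineterminator="\n")
    for code, row in counts.items():
        if code not in names:
            continue  # only codes found in ICD-O3 are written
        # Remove commas from the name so each line has exactly nine fields.
        name = names[code].replace(",", "")
        writer.writerow([name] + list(row))


# Illustrative inputs only; the counts are made up.
counts = {
    "8500": [0, 0, 0, 2, 0, 0, 1, 0],
    "9999": [1, 0, 0, 0, 0, 0, 0, 0],  # no matching name, so it is skipped
}
names = {"8500": "Infiltrating duct carcinoma, NOS"}

buffer = io.StringIO()
write_counts(counts, names, buffer)
```

To write to disk instead of an in-memory buffer, pass a file opened with `open(path, "w", newline="")`, where `path` is whatever output filename the assignment expects.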

By following these steps, you should be able to obtain the desired output containing the total number of occurrences of various breast cancers separately for men and women in four age groups.

User Zhihao Yang
by
8.3k points