0 votes
Problem 8: Batting Data - Cardinals vs. Cubs

In this problem, we will calculate the batting average and the number of home runs for the St. Louis Cardinals and the Chicago Cubs during every season since 1900. We will then compare the results.
Note that within the Lahman Dataset, the teamID for the St. Louis Cardinals is listed as 'SLN' and the teamID for the Chicago Cubs is listed as 'CHN'.
Create a DataFrame named stl_batting that contains the number of hits, at-bats, and home runs for every season of the St. Louis Cardinals since 1900. You can do this as follows:
Use loc to filter the batting DataFrame, keeping only the records for which teamID is equal to 'SLN' and for which the yearID is greater than or equal to 1900.
Use loc to select the yearID, H, AB, and HR columns. You can do this at the same time as when you are selecting the rows, or with a second use of loc.
Group the results by yearID, and then calculate grouped sums for the remaining columns.
Add a new column named BA to the stl_batting DataFrame. This new column should be calculated by dividing the values in the H column by the values in the AB column. This should be done without using a loop.
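The steps above can be sketched as follows. This is a minimal illustration, assuming pandas is imported as pd; the small batting table here is made up to stand in for the Lahman data, so its numbers are not real statistics.

```python
import pandas as pd

# Tiny made-up stand-in for the Lahman batting table (illustrative values only).
batting = pd.DataFrame({
    "playerID": ["a01", "a02", "a03", "b01", "b02", "a01"],
    "yearID":   [1899, 1900, 1900, 1900, 1901, 1901],
    "teamID":   ["SLN", "SLN", "SLN", "CHN", "CHN", "SLN"],
    "AB":       [400, 500, 300, 450, 480, 520],
    "H":        [100, 150, 90, 120, 130, 156],
    "HR":       [3, 10, 5, 7, 9, 12],
})

# Filter rows (Cardinals, 1900 onward) and select columns in one loc call.
rows = (batting["teamID"] == "SLN") & (batting["yearID"] >= 1900)
stl_batting = batting.loc[rows, ["yearID", "H", "AB", "HR"]]

# Sum H, AB, and HR over all players within each season.
stl_batting = stl_batting.groupby("yearID").sum()

# Vectorized division: no loop needed.
stl_batting["BA"] = stl_batting["H"] / stl_batting["AB"]
```

After the groupby, yearID becomes the index of stl_batting, so each row holds one season's totals.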
Create a DataFrame named chi_batting that contains the number of hits, at-bats, home runs, and batting average for every season of the Chicago Cubs since 1900. The process is the same as above, except that you will use the teamID of 'CHN' to select records corresponding to the Cubs.
If you want to temporarily display stl_batting and chi_batting to check your work, you can check that in 1900 the batting average for the Cardinals was 0.291163 and the batting average for the Cubs was 0.260037. Please remove the code for displaying these DataFrames prior to submitting your work.
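Because the Cubs table follows the same recipe with a different teamID, one option is to wrap the steps in a small helper. The function name team_batting and its start_year parameter are hypothetical, not part of the assignment, and the batting table below is again made-up illustration data rather than the Lahman dataset.

```python
import pandas as pd

def team_batting(batting, team_id, start_year=1900):
    """Season totals of H, AB, HR plus batting average for one team (hypothetical helper)."""
    rows = (batting["teamID"] == team_id) & (batting["yearID"] >= start_year)
    totals = batting.loc[rows, ["yearID", "H", "AB", "HR"]].groupby("yearID").sum()
    totals["BA"] = totals["H"] / totals["AB"]
    return totals

# Made-up stand-in for the Lahman batting table (illustrative values only).
batting = pd.DataFrame({
    "yearID": [1899, 1900, 1900, 1901, 1901],
    "teamID": ["CHN", "CHN", "CHN", "CHN", "SLN"],
    "AB":     [410, 450, 350, 480, 520],
    "H":      [105, 120, 80, 130, 156],
    "HR":     [2, 7, 3, 9, 12],
})

chi_batting = team_batting(batting, "CHN")
```

With the real data, chi_batting.loc[1900, "BA"] is the value to compare against the 0.260037 check given in the problem.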
Create a figure with two side-by-side line plots. Both plots should display two lines. The plot on the left should display the batting averages for the two teams for each year since 1900, and the one on the right should display the total number of home runs for the two teams for each year since 1900. Create the figure according to the following specifications:
• Set the figure size to [12,4].
• Select a single named color to use for the Cardinals in both plots. Select a different named color to use for the Cubs in both plots.
• The x-axis should be labeled "Year", and should show tick marks corresponding to years since 1900.
• The y-axes of the two plots should be labeled "Batting Average" and "Home Runs".
• The titles should be "Batting Average By Year" and "Home Runs by Year".
• Both plots should include a legend with two items: "Cardinals" and "Cubs".
• Display the figure using plt.show().
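One way to build a figure matching these specifications is sketched below. It assumes stl_batting and chi_batting are indexed by yearID, as they are after a groupby; the two small tables are made-up placeholders, and the helper name make_figure and the colors "red" and "blue" are arbitrary choices, not requirements.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Made-up placeholder tables indexed by yearID (illustrative values only).
years = pd.Index([1900, 1901, 1902, 1903], name="yearID")
stl_batting = pd.DataFrame({"HR": [15, 12, 20, 11], "BA": [0.30, 0.25, 0.27, 0.26]}, index=years)
chi_batting = pd.DataFrame({"HR": [7, 12, 9, 10], "BA": [0.26, 0.27, 0.25, 0.28]}, index=years)

def make_figure(stl, chi):
    """Two side-by-side line plots: batting average (left), home runs (right)."""
    fig, axes = plt.subplots(1, 2, figsize=[12, 4])
    panels = [
        ("BA", "Batting Average", "Batting Average By Year"),
        ("HR", "Home Runs", "Home Runs by Year"),
    ]
    for ax, (col, ylabel, title) in zip(axes, panels):
        # Same color per team in both panels; the index supplies the years.
        ax.plot(stl.index, stl[col], color="red", label="Cardinals")
        ax.plot(chi.index, chi[col], color="blue", label="Cubs")
        ax.set_xlabel("Year")
        ax.set_ylabel(ylabel)
        ax.set_title(title)
        ax.legend()
    return fig, axes

fig, axes = make_figure(stl_batting, chi_batting)
# In a script or notebook, follow this with plt.show() to display the figure.
```

Plotting against the DataFrame index puts the actual years on the x-axis, so the tick marks correspond to seasons rather than row positions.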
Use np.mean() along with an array comparison between two columns of stl_batting and chi_batting to determine the proportion of years since 1900 in which the Cardinals had a higher batting average than the Cubs. Display the result rounded to four decimal places.
Use np.mean() along with an array comparison between two columns of stl_batting and chi_batting to determine the proportion of years since 1900 in which the Cardinals had more home runs than the Cubs. Display the result rounded to four decimal places.
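The comparison works because a boolean array averages to the fraction of True values. A minimal sketch, using made-up placeholder tables and assuming both teams have a row for the same seasons in the same order:

```python
import numpy as np
import pandas as pd

# Made-up placeholder tables indexed by yearID (illustrative values only).
years = pd.Index([1900, 1901, 1902, 1903], name="yearID")
stl_batting = pd.DataFrame({"HR": [15, 12, 20, 11], "BA": [0.30, 0.25, 0.27, 0.26]}, index=years)
chi_batting = pd.DataFrame({"HR": [7, 12, 9, 10], "BA": [0.26, 0.27, 0.25, 0.28]}, index=years)

# Element-wise comparison gives True/False per season; the mean of that
# boolean array is the proportion of seasons where the Cardinals were ahead.
ba_prop = np.mean(stl_batting["BA"].to_numpy() > chi_batting["BA"].to_numpy())
hr_prop = np.mean(stl_batting["HR"].to_numpy() > chi_batting["HR"].to_numpy())

print(round(ba_prop, 4))
print(round(hr_prop, 4))
```

Converting to NumPy arrays with to_numpy() compares the columns position by position. Comparing the Series directly also works when the two indexes are identical, but pandas raises an error if they are not.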

User N K
by
7.9k points

1 Answer

3 votes

Final answer:

The question deals with analysis of baseball batting data: calculating season-by-season batting averages and home run totals for the St. Louis Cardinals and Chicago Cubs since 1900, plotting the two teams side by side, and using np.mean() on array comparisons to find the proportion of seasons in which the Cardinals outperformed the Cubs.

Step-by-step explanation:

The subject involves using statistical methods to analyze baseball batting data, requiring knowledge of both descriptive and inferential statistics, such as means, standard deviations, hypothesis tests, and likely some visualization techniques like line plots, histograms, and boxplots. Various statistical techniques are discussed and applied to baseball scenarios, such as comparing team averages, conducting hypothesis tests for differences in batting averages and home runs, and determining probability distributions for sports data.

The exercise tasks involve comparing the performance of the St. Louis Cardinals and the Chicago Cubs in terms of batting averages and home runs since the year 1900. It includes data manipulation within a DataFrame, plotting the data using line plots, and calculating statistical measures such as the mean and proportions.

In addition to comparing team performances, statistical analysis is also conducted to compare individual players' batting averages to those of their teams, using the league's mean and standard deviation as parameters. One-way ANOVA and hypothesis testing are mentioned as methods for assessing the significance of differences in averages and home runs between teams and league wins.

User Deyanira
by
8.5k points