3 votes
Consider the birthwt data from R package MASS. We will investigate the relationship between low birthweight and the predictors in the birthwt data using logistic regression and discriminant analysis.

a. Investigate the relationship between variables in the birthwt dataset. Do you see anything surprising? Use both numeric and visual summaries. Create and comment on visualizations specifically between the outcome variable and predictor/independent variables. Also, notice that qualitative/categorical variables should be visualized in an alternative manner, not just scatterplots/correlations as in the case of quantitative variables.
b. Fit a logistic regression model using methods discussed in class/the book, similar to problem 1). Be careful to understand each variable in birthwt so that you do not include variables that are not logically acceptable in the model.
c. What do you notice regarding the variables ptl and ftv? What is your logistic regression model in b) (perhaps before performing variable selection) implicitly assuming regarding these variables' effects on the log odds of giving birth to a low weight baby? Are these assumptions realistic?
d. Create a new variable for ptl named ptl2 which is more useful for analysis. Keep in mind that with very small sample sizes, it may be worthwhile to collapse multiple categories.
e. Create a new variable for ftv named ftv2 which is more useful for analysis. Keep in mind that with very small sample sizes, it may be worthwhile to collapse multiple categories. Also, it may be helpful to form tables which summarize low birthweight probabilities by levels of the variable in order to better understand the relationship between probability of low birthweight and the newly created variable.
f. Using the newly created variables in d) and e), reassess the logistic regression model arrived at in b) (use ftv2 and ptl2 in the modeling). Comment on what you find: are the new versions of these variables important in predicting low birthweight?
g. In a manner similar to the approach used in the book, split the birthwt data into a training and test set, where the test set is about 20% the size of the entire dataset. Then, using variables that are justifiable for inclusion in discriminant analysis, fit LDA and QDA models to the training set, form confusion matrices, and calculate the sensitivity, specificity, and accuracy of each method using the test set. Do the same for the logistic regression models built in f) and b). Which model performs the best? Remember, you MUST set a seed.
h. Using your final model from f), interpret all estimates for all covariates. Answer h).

User Xerri
by
7.0k points

1 Answer

1 vote

Final answer:

To study the relationship between two variables, identify the independent and dependent variables, draw a scatter plot, calculate the least-squares line, find the correlation coefficient, and use the line to make predictions while assessing the fit and outliers.

Step-by-step explanation:

To explore the relationship between two variables, first determine the independent variable and the dependent variable. Typically, the independent variable is the one you think might influence the other, while the dependent variable is the one that is affected. Next, you would draw a scatter plot to visually inspect the relationship between the two variables. By looking at the scatter plot, you can surmise whether there is an apparent relationship.

After this, you would calculate the least-squares line, sometimes referred to as the line of best fit, and express it in the form ŷ = a + bx. This equation predicts the value of the dependent variable from the value of the independent variable. Following the line calculation, you should find the correlation coefficient, which measures the strength and direction of the linear relationship between the two variables. A coefficient near +1 means the two variables tend to increase together, a coefficient near -1 means one tends to decrease as the other increases, and a coefficient near 0 indicates little or no linear relationship.
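
As a rough illustration of the calculation described above, here is a minimal Python sketch that computes the intercept a, the slope b, and the correlation coefficient from the usual sums of squares. The function name `least_squares` and the five data points are made up for the example; they are not taken from the birthwt data.

```python
from math import sqrt


def least_squares(x, y):
    """Return (a, b, r): intercept, slope, and correlation for y-hat = a + b*x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sums of squares and cross-products around the means.
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b = sxy / sxx              # slope
    a = mean_y - b * mean_x    # intercept
    r = sxy / sqrt(sxx * syy)  # correlation coefficient
    return a, b, r


# Hypothetical toy data, for illustration only.
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

a, b, r = least_squares(x, y)
print(f"y-hat = {a:.2f} + {b:.2f}x, r = {r:.3f}")
```

The sign of r always matches the sign of the slope b, since both share the numerator `sxy`.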

Once the line of best fit is established, you can use it to make predictions for given values of the independent variable. Finally, assessing the fit of the data to the line and checking for outliers is crucial. Outliers can distort the relationship and may need special consideration. The slope of the least-squares line has a specific interpretation: it represents the expected change in the dependent variable for each one-unit change in the independent variable.
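
The prediction and outlier-checking steps can be sketched in the same way. The Python example below fits the line, predicts at a new x value, and flags points whose residual is more than two residual standard deviations from the line. The two-standard-deviation cutoff is a common rule of thumb assumed here, not something stated in the answer, and the helper names and data are hypothetical.

```python
from math import sqrt


def fit_line(x, y):
    """Return (a, b) for the least-squares line y-hat = a + b*x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b = sxy / sxx
    return mean_y - b * mean_x, b


def predict(a, b, x_new):
    """Predicted value of the dependent variable at x_new."""
    return a + b * x_new


def flag_outliers(x, y, a, b, k=2.0):
    """Indices of points whose residual exceeds k residual standard deviations.

    The cutoff k = 2 is an assumed rule of thumb, not a fixed standard.
    """
    residuals = [yi - predict(a, b, xi) for xi, yi in zip(x, y)]
    # Residual standard deviation with n - 2 degrees of freedom.
    s = sqrt(sum(e ** 2 for e in residuals) / (len(x) - 2))
    return [i for i, e in enumerate(residuals) if abs(e) > k * s]


# Hypothetical toy data: y = 2x everywhere except one unusual point at x = 5.
x = [1, 2, 3, 4, 5, 6, 7, 8, 9]
y = [2, 4, 6, 8, 25, 12, 14, 16, 18]

a, b = fit_line(x, y)
print("prediction at x = 10:", predict(a, b, 10))
print("flagged indices:", flag_outliers(x, y, a, b))
```

A single extreme point can also inflate the residual standard deviation enough to hide itself from a cutoff like this, so inspecting a residual plot alongside the numeric check is usually worthwhile.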

User Denisvm
by
7.0k points