53,606 views
9 votes
An ______ may help predict or explain changes in a response variable.

User Rahim Rahimov
by
3.0k points

1 Answer

28 votes

Answer:

Ordinary least squares (OLS) regression is a statistical method that estimates the relationship between one or more independent variables and a dependent variable. It does so by fitting a straight line that minimizes the sum of the squared differences between the observed and predicted values of the dependent variable. In this entry, OLS regression is discussed in the context of a bivariate model, that is, a model in which there is only one independent variable ( X ) predicting a dependent variable ( Y ). However, the logic of OLS regression is easily extended to the multivariate model, in which there are two or more independent variables.

Social scientists are often concerned with questions about the relationship between two variables. These include the following: Among women, is there a relationship between education and fertility? Do more-educated women have fewer children, and less-educated women have more children? Among countries, is there a relationship between gross national product (GNP) and life expectancy? Do countries with higher levels of GNP have higher levels of life expectancy, and countries with lower levels of GNP, lower levels of life expectancy? Among countries, is there a positive relationship between employment opportunities and net migration? Among people, is there a relationship between age and values of baseline systolic blood pressure? (Lewis-Beck 1980; Vittinghoff et al. 2005).

As Michael Lewis-Beck notes, these examples are specific instances of the common query, “What is the relationship between variable X and variable Y ?” (1980, p. 9). If the relationship is assumed to be linear, bivariate regression may be used to address this issue by fitting a straight line to a scatterplot of observations on variable X and variable Y. The simplest statement of such a relationship between an independent variable, labeled X, and a dependent variable, labeled Y, may be expressed as a straight line in this formula:

$$Y = a + bX + e$$

where a is the intercept and indicates where the straight line intersects the Y -axis (the vertical axis); b is the slope and indicates the degree of steepness of the straight line; and e represents the error.
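To make the roles of the intercept, slope, and error concrete, here is a minimal Python sketch. The values of `a` and `b`, the three (X, Y) pairs, and the helper names `predict` and `errors` are all hypothetical, chosen only for illustration; none of them come from the sources cited in this answer.

```python
# Hypothetical intercept and slope (illustration only, not from the source).
a = 8.0    # intercept: predicted Y when X = 0
b = -0.4   # slope: change in predicted Y for a one-unit change in X

# Hypothetical (X, observed Y) pairs, e.g. years of education and children.
observations = [(4, 7), (10, 3), (16, 2)]


def predict(x, a, b):
    """Return the value of Y on the straight line a + b*x."""
    return a + b * x


def errors(observations, a, b):
    """Return e = observed Y minus predicted Y for each observation."""
    return [y - predict(x, a, b) for x, y in observations]


residuals = errors(observations, a, b)

# Each observed Y equals the line's prediction plus its error term.
for (x, y), e in zip(observations, residuals):
    print(f"X={x:>2}  observed Y={y}  predicted={predict(x, a, b):.1f}  e={e:+.1f}")
```

Because the errors are not all zero, the line does not reproduce the observed values exactly, which is what the error term in the formula expresses.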

The error term indicates that the relationship predicted in the equation is not perfect. That is, the straight line does not perfectly predict Y. This lack of a perfect prediction is common in the social sciences. For instance, in terms of the education and fertility relationship mentioned above, we would not expect all women with exactly sixteen years of education to have exactly one child, and women with exactly four years of education to have exactly eight children. But we would expect that a woman with a lot of education would have fewer children than a woman with a little education. Stated in another way, the number of children born to a woman is likely to be a linear function of her education, plus some error. Actually, in low-fertility societies, Poisson and negative binomial regression methods are preferred over ordinary least squares regression methods for the prediction of fertility (Poston 2002; Poston and McKibben 2003).

We first introduce a note about the notation used in this entry. In the social sciences we almost always undertake research with samples drawn from larger populations, say, a 1 percent random sample of the U.S. population. Greek letters like α and β are used to denote the parameters (i.e., the intercept and slope values) representing the relationship between X and Y in the larger population, whereas lowercase Roman letters like a and b are used to denote the corresponding estimates computed from the sample.

When postulating relationships in the social sciences, linearity is often assumed, but this may not always be the case. Indeed, many relationships are not linear. When one hypothesizes the form of a relationship between two variables, one needs to be guided both by the theory being used and by an inspection of the data.

But given that we wish to use a straight line for relating variable Y, the dependent variable, with variable X, the independent variable, there is a question about which line to use. In any scatterplot of observations of X and Y values (see Figure 1), there would be an infinite number of straight lines that might be used to represent the relationship. Which line is the best line?

The chosen straight line needs to be the one that minimizes the amount of error between the predicted values of Y and the actual values of Y. Specifically, if one were to square the difference between the observed and predicted values of Y for each of the n observations in the sample, and then sum these squared differences, the best line would have the lowest sum of squared errors (SSE), represented as follows:

$$SSE = \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2$$

where Y_i is the observed value and \hat{Y}_i = a + bX_i is the predicted value of Y for the i th observation.
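The least-squares criterion described above can be sketched in a few lines of Python. The data below are made up for illustration, and the function names `sse` and `ols_fit` are hypothetical. The closed-form expressions used for the slope and intercept are the standard bivariate least-squares solution, which is not derived in the text quoted here.

```python
def sse(xs, ys, a, b):
    """Sum of squared differences between observed Y and the line a + b*X."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))


def ols_fit(xs, ys):
    """Return (a, b) for the bivariate least-squares line.

    Uses the standard closed form:
      b = sum((x - mean_x) * (y - mean_y)) / sum((x - mean_x)^2)
      a = mean_y - b * mean_x
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


# Hypothetical sample (illustration only): X = years of education, Y = children.
xs = [4, 8, 10, 12, 16]
ys = [7, 5, 3, 3, 2]

a_hat, b_hat = ols_fit(xs, ys)
best = sse(xs, ys, a_hat, b_hat)

# Any other straight line through the same points has an SSE at least as large.
for a_alt, b_alt in [(a_hat + 1.0, b_hat), (a_hat, b_hat - 0.1), (9.0, -0.5)]:
    print(f"a={a_alt:.3f} b={b_alt:.3f} SSE={sse(xs, ys, a_alt, b_alt):.3f} (best={best:.3f})")
```

Among the infinitely many lines that could be drawn through the scatterplot, the fitted pair is the one whose SSE the alternative candidates cannot beat.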


User Ezhilan Mahalingam
by
2.8k points