Lecture Notes

Business Statistics (QMB 2100)

Regression Analysis

Introduction

This chapter introduces regression analysis, studies in which we look for relationships between variables. In particular, we will be considering linear regression. Rather than just deciding if there is or is not a long-run affect of a treatment, regression analysis attempts to determine the mathematical relationship between the variables. In the case of linear regression, this relationship is expressed in the the form of the equation of a line.

Some definitions:

Independent variable
The variable believed to affect the other (in the previous chapter, the treatment or factor)
Dependent variable
The variable affected by the independent variable
Data pair
Two numbers representing one value of the the independent variable and the corresponding value of the dependent variable
Scatter Plot
A plot of all the data pairs; the pattern of the plot will give us a visual indication of whether or not the two variables are related

One important difference between ANOVA experimental designs and regression analysis is that with regression analysis, both the independent and dependent variables must be quantitative; that is, that can be measured and described using numbers.

Here is an example as an illustration of the above terms: You work at a large marina that offers covered, out-of-water storage for boats up to 30 feet in length. An employee comments that you should charge smaller boats more  per month, because they are used more (requiring their being removed from and returned to storage for each day used). You investigate this by selecting 8 boats at random and reviewing the usage records for last year. Here is the data:

Boat Usage Regression Study
Boat Length Days Used
18
28
22
24
25
23
20
17
31
12
22
20
21
16
45
32

In this study, the independent variable is boat length, and days used will be the dependent variable. The data pairs are the eight pairs of boat length and related days used (18, 31 for example). Here is the scatter plot of the data pairs (note independent variable is always plotted on the x-axis, and the dependent variable on the y-axis):

Does there appear to be a relationship between boat length and days used? If so, is the relationship linear? If so, is the relationship positive (increasing the independent variable increases the dependent variable) or negative (increasing the independent variable decreases the depended variable).


Running Regression Studies

 You can conduct a regression study either by:

  • Sampling (as in the example above)
  • Experiment

In the case of experiment, you will be controlling the independent variable, so you will have to choose the limits (high and low) and the number of observations (and hence the number of data pairs) you will have for the analysis.

For an experiment, you should select the minimum and maximum level of the independent variable you feel are realistic for the situation. In the text "Training vs. Sales" example, less than 4 hours of training (the independent variable) or more than 32 were considered unrealistic, so these were chosen as the limits. Eight equally spaced levels of the independent variable were chosen within this interval. The level of the the dependent variable (sales) was measured at each level to obtain 8 data pairs.


A Quick Visual Analysis – the Circle or the Ellipse

The first step is to plot the data pairs on a scatter plot.

Study Data
Store x (Hours
Training per
Employee
y (Monthly
Sales/Store)
1
2
3
4
5
6
7
8
4
8
12
16
20
24
28
32
10
12
15
15
19
18
23
24

It  appears there is a positive, linear relationship between the amount of training employees receive and the monthly store sales. But as with confidence intervals and analysis of variance, we need a more quantitative and specific way to make inductive inferences from the data: How sure are we there is a relationship? And what is the specific relationship?


Determining the Best Fitting Straight Line

Now that we have plotted the data, we want to find the best fitting line that will represent the data. You can do this visually, drawing a line with a ruler on top of the scatter plot:

What is the equation of this line? From algebra, you might remember the slope-intercept equation of a line (where m is the slope, and b is the y-intercept:

 The slope m can be calculated from any two points on the line, whose coordinates we can assume are (x1, y1) and (x2, y2):

Given our rough line-fitting, it looks like the y-intercept is at about 8, so one point on the line would be (0,8). Another point on the line, at x = 36, appears to be about (36, 26). So we can calculate the slope as:

m = (26 - 8) / (36 - 0) = 18 / 36 = 0.5

And the equation of the line is then approximately:

y = 0.5x + 8

Is this the best-fitting line? The best-fitting line would be the one that minimizes the differences between the value of the dependent variable (the y-coordinate of each point) and that predicted by the equation of the line. For example, for 20 hours of training, the measured value f the dependent variable is 19. Our equation would predict this would be 0.5(20) + 8, or 18. Our actual value deviates from this by 1. Here are the rest: 

Training - Sales Study Data
Store x (Hours
Training per
Employee
y (Monthly
Sales/Store)
Predicted by
y = 0.5x + 8
Deviation
(actual -
predicted)
Deviation
Squared
1
2
3
4
5
6
7
8
4
8
12
16
20
24
28
32
10
12
15
15
19
18
23
24
10
12
14
16
18
20
22
24
0
0
1
-1
1
-2
1
0
0
0
1
1
1
4
1
0

Sum:   

0 8

The best-fitting line is the one that would minimize the sum of the squared deviations. (Note: we need to square the deviations so that we have a non-zero number.) We could try vary both the slope and y-intercept in the equation of the line and see if it reduces this number. In fact this problem has a general solution. The best-fitting line is given by:

This is quite a bit to calculate by hand. Fortunately, most spreadsheets and many calculators contain software that can do this for us. We will look at this in the next section. It turns out that the best-fitting line is:

y = 0.494x + 8.1

This is very close to our "eye-ball" equation, but yields a sum of squared deviations slightly lower: 7.9848.


Inductive Inference and the Analysis of Variance

We can always come up with a linear relationship between two variables. In fact, using the equations above, we can determine the equation of the best-fitting line. The question is, does this really describe a long-run relationship between the variables? How sure are we? What remains unexplained? To answer these questions we can use the ANOVA techniques we developed in the last chapter

The Sum of Squares Decomposition and the Analysis of Variance. Using the values for the observations of the dependent variable, we can calculate SST from the data (y = 17):

SST = (10 - 17)2 + (12 - 17)2 + ... + (24 - 17)2 = 172

Of these 172 units of variation, we have explained all but 8 of them using our "eye-ball" linear relationship (or 7.9848 using the best-fitting line). This must be the variation due to extraneous factors (factors other than our treatment of hours of training).  Here is the resulting ANOVA table:

ANOVA Table - Training-Sales Study
Sources of
Variation
Sum of
Squares
Degrees of
Freedom
Variance Variance
Ratio
Treatment
(SSTR)
164.02 1 164.02 123.33
Extraneous
(SSEF)
7.98 6 1.33  
Total (SST) 172 7    

Using the Fisher table, we can see see that the variance ratio (123.33) is much greater than the Fisher critical value (F[1, 6, 99%] = 13.74), so we would reject Ho and conclude there is a long-run relationship between hours of training and sales.

But exactly what is the nature of this relationship? Our equation for the best-fitting line depicting this relationship (also called the regression line) is y = 0.494x + 8.1. This suggests that for every hour increase in training (x), we should expect a 0.494 increase in monthly sales (in thousands, or $494 per month). This number has some uncertainty associated with it, however, due to the extraneous factors our regression line does not explain. So we will need to compute a prediction interval, analogous to the confidence intervals we have predicted for estimating population parameters, like the mean.


Making Predictions

One we have the regression line, and have verified that there is a long-run relationship between independent and dependent variables, we can make predictions. To do this, simply substitute a specific value for the independent variable in the regression equation. For example, what level of sales can we expect if all stores receive 15 hours of training?

y = 0.494x + 8.1 = (0.494)(15) + 8.1 = 15.51, or $15,510

The point estimate of $15,510 is called the most likely value (MLV). But it is very unlikely that a store will realize exactly this amount of monthly sales given 15 hours of training. We are interested in determining a prediction interval in which we can have a stated level of confidence the actual sales will fall. The formula for a prediction interval for a linear regression can be calculated from this formula:

In this formula, MLV is the most likely value calculated from the regression equation, t is obtained from the Student t Table for the desired level of confidence and degrees of freedom, n is the sample size, se is the standard error (the square root of the variation due to extraneous factors), and xp the value of x for which we wish to predict y. (Note that this equation is slightly different from, but equivalent to that given in the text on pg. 254.)

Evaluating this for our data at 99% confidence (t[99%, 6 degrees of freedom] = 3.71):

We can now state we are 99% confident that a store with 15 hours of monthly training per employee should have monthly sales between $10,947 and $20,073. (Note that this wide prediction interval is due to both the relatively small sample size and the high degree of confidence we are assigning it.)

Reducing the Width of Prediction Intervals. There are three ways to reduce the prediction interval from a regression study:

  1. Increase the sample size (number of observations to which the regression line is fitted).
  2. Use a multiple regression study (section 6.7 in the text) that takes into account more than one factor.
  3. Obtain observations that have values of the independent variable are at the extremes of range for which you want to make predictions.

When Predictions Are Risky Business. One common risk is using the regression study to make predictions outside the range of the independent variable on which the regression line is based. There are both logical and statistical problems with doing so. Bottom line: the further outside the range of data used in the experiment you get, the more likely you are to be wrong with your predictions.


Using Computer Software for Regression Studies

The techniques shown above are perfectly adequate for small regression studies where the calculations needed can be made without too much effort using just a calculator. But when we have more than a few observations (as is usually the case), we need a more powerful tool.

Many high-end calculators (like the TI-83 and above), computer spreadsheets applications (like MS Excel), and dedicated statistical programs (like SPSS or MiniTab) all contain the software necessary for regression studies. But all these produce results in different formats, which have to be interpreted by the analyst.

I will describe how to use one package – MS Excel 2007 – for simple linear regression analysis. Please keep in mind that there are many, many options at each step, which I encourage you to explore, but which may get you lost until you are familiar with the procedures, menus, tabs, and choices.

Here are the steps to use Excel to analyze the Training-Sales study data given above:

  1. Start Excel.
  2. Make sure the Data Analysis Add-In is installed: Click the Data tab. If "Data Analysis" appears in the "Analysis" pane to the far right, it is installed. If it is not installed, follow these steps:
    1. Click the Office Button (top left).
    2. Click the Excel Options button (at the bottom)
    3. Click the Add-Ins tab (on the left)
    4. In the "Manage" drop-down menu (at the bottom left) select "Excel Add-ins" and click the Go button
    5. Check "Analysis Tool Pack" and click OK.
  3. Pick any starting cell on the spreadsheet (say A1) and enter the data to be analyzed in two columns, with the column headers at the top. Format the cells anyway you want, but you don't really need to take time to do this unless it is for presentation purposes. Here is the data on the spreadsheet:

  1. Select the range containing the labels and data.
  2. Click on Insert, then in the Charts pane click on Scatter and select the top left charting option. A chart will appear on the the spreadsheet. You can do a lot to format this chart (title, axis, etc.), but for now click the Layout tab and in the Analysis pane choose Trendline, then Linear Trendline. You should get a chart that looks like the one below. This should give you a good visual check if there is a strong linear relationship or not.

  1. Click the Data tab; in the Analysis pane (top left) click Data Analysis
  2. In the Data Analysis window, select Regression from the list and click OK
  3. In the Regression dialog box:
    1. select the "Input Y Range" and the "Input X Range" by clicking in the text box, then clicking and dragging on the spreadsheet to select the sales data then the training data, respectively.
    2. If the range you selected for each includes column headings, make sure the "Labels" box is checked.
    3. Click OK
  4. The regression output will be shown on a separate sheet (see the bats at the bottom of the spreadsheet. Click on the tab to view it. Change the column widths to see the entire values or row headers. The summary output contains a variety of statistics, only a few of which apply to us:

  1. The ANOVA table in the middle shows the SSTR (C12), the SSEF (C13), and the SST (C14), with the corresponding degrees of freedom (d.f.). The variances (D12 and D13) and their ratio (E12) are also calculated. You will still need to look up the required Fisher critical value.

  2. The coefficients of the the regression equation are shown in the bottom section: y-intercept (B17) and slope (B18)

  3. To construct prediction intervals, the standard error is also calculated (B7)

To test your understanding of these procedures, try them using the Boat Usage Study data given at the beginning of these notes.


Association and Cause and Effect

Just because two variables are related does not necessarily mean one is the cause of the other. Consider both the Training-Sales and Boat Usage studies. You may believe that increased training causes increased sales, but is this necessarily true? Do you believe just as strongly that increased boat length causes reduced usage?

Three factors would strengthen the case for causation:

  1. The association between the two variables is found in many studies done by different investigators
  2. You can provide a reasonable (scientific) argument as to how changes in one variable could cause changes in the other.
  3. There are no other plausible factors that explain the connection between the two variables.

As Brightman concludes (pg 266):

Regression focuses on detecting associations. Assessing causation is beyond statistics. Causation requires association and explanation.

 

 Updated