Lecture Notes
Business Statistics (QMB 2100)
Regression Analysis
Introduction
This chapter introduces regression analysis, studies
in which we look for relationships between variables. In
particular, we will be considering linear regression.
Rather than just deciding if there is or is not a long-run
affect of a treatment, regression analysis attempts to determine
the mathematical relationship between the variables. In the case
of linear regression, this relationship is expressed in the the
form of the equation of a line.
Some definitions:
- Independent variable
- The variable believed to affect the
other (in the previous chapter, the
treatment or factor)
- Dependent variable
- The variable affected by the independent
variable
- Data pair
- Two numbers representing one value of
the the independent variable and the
corresponding value of the dependent
variable
- Scatter Plot
- A plot of all the data pairs; the
pattern of the plot will give us a visual
indication of whether or not the two
variables are related
|
One important difference between ANOVA experimental designs
and regression analysis is that with regression analysis, both
the independent and dependent variables must be quantitative;
that is, that can be measured and described using numbers.
Here is an example as an illustration of the above terms: You
work at a large marina that offers covered, out-of-water storage
for boats up to 30 feet in length. An employee comments that you
should charge smaller boats more per month, because they
are used more (requiring their being removed from and returned
to storage for each day used). You investigate this by selecting
8 boats at random and reviewing the usage records for last year.
Here is the data:
| Boat Usage Regression Study |
| Boat Length |
Days Used |
18
28
22
24
25
23
20
17 |
31
12
22
20
21
16
45
32 |
In this study, the independent variable is
boat length, and days used will be the dependent variable.
The data pairs are the eight pairs of boat length and
related days used (18, 31 for example). Here is the scatter
plot of the data pairs (note independent variable is always
plotted on the x-axis, and the dependent variable on
the y-axis):

Does there appear to be a relationship
between boat length and days used? If so, is the
relationship linear? If so, is the relationship positive
(increasing the independent variable increases the dependent
variable) or negative (increasing the independent
variable decreases the depended variable).
Running Regression Studies
You can conduct a regression study either by:
- Sampling (as in the example above)
- Experiment
In the case of experiment, you will be controlling the
independent variable, so you will have to choose the limits
(high and low) and the number of observations (and hence the
number of data pairs) you will have for the analysis.
For an experiment, you should select the minimum and maximum
level of the independent variable you feel are realistic for the
situation. In the text "Training vs. Sales" example, less than 4
hours of training (the independent variable) or more than 32
were considered unrealistic, so these were chosen as the limits.
Eight equally spaced levels of the independent variable were
chosen within this interval. The level of the the dependent
variable (sales) was measured at each level to obtain 8 data
pairs.
A Quick Visual Analysis – the Circle or the Ellipse
The first step is to plot the data pairs on a scatter plot.
| Study Data |
| Store |
x (Hours
Training per
Employee |
y (Monthly
Sales/Store) |
1
2
3
4
5
6
7
8 |
4
8
12
16
20
24
28
32 |
10
12
15
15
19
18
23
24 |
|
 |
It appears there is a positive, linear
relationship between the amount of training employees receive
and the monthly store sales. But as with confidence intervals
and analysis of variance, we need a more quantitative and
specific way to make inductive inferences from the data: How
sure are we there is a relationship? And what is
the specific relationship?
Determining the Best Fitting Straight Line
Now that we have plotted the data, we want to
find the best fitting line that will represent the data. You can
do this visually, drawing a line with a ruler on top of the
scatter plot:

What is the equation of this line? From algebra,
you might remember the slope-intercept equation of a line
(where m is the slope, and b is the y-intercept:
The
slope m can be calculated from any two points on the
line, whose coordinates we can assume are (x1, y1)
and (x2, y2):
Given our rough line-fitting, it looks like the y-intercept
is at about 8, so one point on the line would be (0,8). Another
point on the line, at x = 36, appears to be about (36, 26). So
we can calculate the slope as:
m = (26 - 8) / (36 - 0) = 18 / 36 = 0.5
And the equation of the line is then approximately:
y = 0.5x + 8
Is this the best-fitting line? The best-fitting line
would be the one that minimizes the differences between the
value of the dependent variable (the y-coordinate of each
point) and that predicted by the equation of the line. For
example, for 20 hours of training, the measured value f the
dependent variable is 19. Our equation would predict this would
be 0.5(20) + 8, or 18. Our actual value deviates from
this by 1. Here are the rest:
| Training - Sales Study Data |
| Store |
x (Hours
Training per
Employee |
y (Monthly
Sales/Store) |
Predicted by
y = 0.5x + 8 |
Deviation
(actual -
predicted) |
Deviation
Squared |
1
2
3
4
5
6
7
8 |
4
8
12
16
20
24
28
32 |
10
12
15
15
19
18
23
24 |
10
12
14
16
18
20
22
24 |
0
0
1
-1
1
-2
1
0 |
0
0
1
1
1
4
1
0 |
|
Sum:
|
0 |
8 |
The best-fitting line is the one
that would minimize the sum of the squared
deviations. (Note: we need to square the
deviations so that we have a non-zero number.)
We could try vary both the slope and y-intercept
in the equation of the line and see if it
reduces this number. In fact this problem has a
general solution. The best-fitting line is given
by:
This is quite a bit to calculate
by hand. Fortunately, most spreadsheets and many
calculators contain software that can do this
for us. We will look at this in the next
section. It turns out that the best-fitting line
is:
y = 0.494x +
8.1
This is very close to our
"eye-ball" equation, but yields a sum of squared
deviations slightly lower: 7.9848.
Inductive Inference and the
Analysis of Variance
We can always come up with a
linear relationship between two variables. In
fact, using the equations above, we can
determine the equation of the best-fitting line.
The question is, does this really describe a
long-run relationship between the variables? How
sure are we? What remains unexplained? To answer
these questions we can use the ANOVA techniques
we developed in the last chapter
The Sum of Squares
Decomposition and the Analysis of Variance.
Using the values for the observations of the
dependent variable, we can calculate SST from
the data (y
= 17):
SST = (10 - 17)2
+ (12 - 17)2 + ... + (24 - 17)2
= 172
Of these 172 units of variation,
we have explained all but 8 of them using our
"eye-ball" linear relationship (or 7.9848 using
the best-fitting line). This must be the
variation due to extraneous factors
(factors other than our treatment of
hours of training). Here is the resulting
ANOVA table:
| ANOVA Table - Training-Sales Study |
Sources of
Variation |
Sum of
Squares |
Degrees of
Freedom |
Variance |
Variance
Ratio |
Treatment
(SSTR) |
164.02 |
1 |
164.02 |
123.33 |
Extraneous
(SSEF) |
7.98 |
6 |
1.33 |
|
| Total (SST) |
172 |
7 |
|
|
Using the Fisher table, we can
see see that the variance ratio (123.33) is much
greater than the Fisher critical value (F[1, 6,
99%] = 13.74), so we would reject Ho and
conclude there is a long-run relationship
between hours of training and sales.
But exactly what is the nature
of this relationship? Our equation for the
best-fitting line depicting this relationship
(also called the regression line) is y
= 0.494x + 8.1. This suggests that for
every hour increase in training (x), we
should expect a 0.494 increase in monthly sales
(in thousands, or $494 per month). This number
has some uncertainty associated with it,
however, due to the extraneous factors our
regression line does not explain. So we will
need to compute a prediction interval,
analogous to the confidence intervals we have
predicted for estimating population parameters,
like the mean.
Making Predictions
One we have the regression line, and have
verified that there is a long-run relationship
between independent and dependent variables, we can
make predictions. To do this, simply substitute a
specific value for the independent variable in the
regression equation. For example, what level of
sales can we expect if all stores receive 15 hours
of training?
y = 0.494x + 8.1 = (0.494)(15)
+ 8.1 = 15.51, or $15,510
The point estimate of $15,510 is called the
most likely value (MLV). But it is very
unlikely that a store will realize exactly
this amount of monthly sales given 15 hours of
training. We are interested in determining a
prediction interval in which we can have a
stated level of confidence the actual sales will
fall. The formula for a prediction interval for a
linear regression can be calculated from this
formula:
In this formula, MLV is the most likely
value calculated from the regression equation, t
is obtained from the Student
t Table for the desired level of confidence
and degrees of freedom, n is the sample
size, se is the standard error
(the square root of the variation due to
extraneous factors), and xp the
value of
x for which we wish to predict y. (Note
that this equation is slightly different from, but
equivalent to that given in the text on pg. 254.)
Evaluating this for our data at 99% confidence (t[99%,
6 degrees of freedom] = 3.71):

We can now state we are 99% confident that a
store with 15 hours of monthly training per employee
should have monthly sales between $10,947 and
$20,073. (Note that this wide prediction interval is
due to both the relatively small sample size and the
high degree of confidence we are assigning it.)
Reducing the Width of Prediction Intervals.
There are three ways to reduce the prediction
interval from a regression study:
- Increase the sample size (number of
observations to which the regression line is
fitted).
- Use a multiple regression study (section 6.7
in the text) that takes into account more than
one factor.
- Obtain observations that have values of the
independent variable are at the extremes of
range for which you want to make predictions.
When Predictions Are Risky Business. One
common risk is using the regression study to make
predictions outside the range of the independent
variable on which the regression line is based.
There are both logical and statistical problems with
doing so. Bottom line: the further outside the range
of data used in the experiment you get, the more
likely you are to be wrong with your predictions.
Using Computer Software for Regression Studies
The techniques shown above are perfectly adequate for small
regression studies where the calculations needed can be made
without too much effort using just a calculator. But when we
have more than a few observations (as is usually the case), we
need a more powerful tool.
Many high-end calculators (like the TI-83 and above),
computer spreadsheets applications (like MS Excel), and
dedicated statistical programs (like SPSS or MiniTab) all
contain the software necessary for regression studies. But all
these produce results in different formats, which have to be
interpreted by the analyst.
I will describe how to use one package
– MS Excel 2007 – for simple linear
regression analysis. Please keep in mind that there are many,
many options at each step, which I encourage you to explore, but
which may get you lost until you are familiar with the
procedures, menus, tabs, and choices.
Here are the steps to use Excel
to analyze the Training-Sales study data given above:
- Start Excel.
- Make sure the Data Analysis
Add-In is installed: Click the Data tab. If "Data Analysis"
appears in the "Analysis" pane to the far right, it is
installed. If it is not installed, follow these steps:
- Click the Office Button
(top left).
- Click the Excel Options
button (at the bottom)
- Click the Add-Ins tab
(on the left)
- In the "Manage"
drop-down menu (at the bottom left) select "Excel
Add-ins" and click the Go button
- Check "Analysis Tool
Pack" and click OK.
- Pick any starting cell on
the spreadsheet (say A1) and enter the data to be analyzed
in two columns, with the column headers at the top. Format
the cells anyway you want, but you don't really need to take
time to do this unless it is for presentation purposes. Here
is the data on the spreadsheet:

- Select the range containing
the labels and data.
- Click on Insert, then in
the Charts pane click on Scatter and select the top left
charting option. A chart will appear on the the spreadsheet.
You can do a lot to format this chart (title, axis, etc.),
but for now click the Layout tab and in the Analysis pane
choose Trendline, then Linear Trendline. You should get a
chart that looks like the one below. This should give you a
good visual check if there is a strong linear relationship
or not.

- Click the Data tab; in the
Analysis pane (top left) click Data Analysis
- In the Data Analysis
window, select Regression from the list and click OK
- In the Regression dialog
box:
- select the "Input Y
Range" and the "Input X Range" by clicking in the text
box, then clicking and dragging on the spreadsheet to
select the sales data then the training data,
respectively.
- If the range you
selected for each includes column headings, make sure
the "Labels" box is checked.
- Click OK
- The regression output will
be shown on a separate sheet (see the bats at the bottom of
the spreadsheet. Click on the tab to view it. Change the
column widths to see the entire values or row headers. The
summary output contains a variety of statistics, only a few
of which apply to us:

-
The ANOVA table in the middle shows the
SSTR (C12), the SSEF (C13), and the SST (C14), with the
corresponding degrees of freedom (d.f.). The
variances (D12 and D13) and their ratio (E12) are also
calculated. You will still need to look up the required
Fisher critical value.
-
The coefficients of the the regression
equation are shown in the bottom section: y-intercept
(B17) and slope (B18)
-
To construct prediction intervals, the
standard error is also calculated (B7)
To test your understanding of these procedures, try them
using the Boat Usage Study data given at the beginning of these
notes.
Association and Cause and Effect
Just because two variables are related does not necessarily
mean one is the cause of the other. Consider both the
Training-Sales and Boat Usage studies. You may believe that
increased training causes increased sales, but is this
necessarily true? Do you believe just as strongly that increased
boat length causes reduced usage?
Three factors would strengthen the case for causation:
- The association between the two variables is found in
many studies done by different investigators
- You can provide a reasonable (scientific) argument as to
how changes in one variable could cause changes in the
other.
- There are no other plausible factors that explain the
connection between the two variables.
As Brightman concludes (pg 266):
| Regression focuses on detecting associations.
Assessing causation is beyond statistics. Causation
requires association and explanation. |
|