Lecture Notes
Business Statistics (QMB 2100)
Experimentation and Analysis of Variance
Introduction
This chapter introduces experimentation to make
inductive inferences, as opposed to sampling. Basically:
- Sampling
  - Take a simple random sample
  - Count or measure the characteristic under study
  - Compute sample statistics (like the sample mean)
  - Make inductive inferences (using confidence intervals or a hypothesis test)
- Experimentation
  - Design an experiment to vary the treatment (factor) under study
  - Conduct the experiment and observe the result
  - Compute sample statistics (sum of squares)
  - Make inductive inferences (using ANOVA and the Fisher Table)
Important terminology introduced in this chapter:
- Experimental units: The units on which the experiment is done. When the units are people, we call them subjects.
- Experimental treatment (or factor): What is varied in the experiment.
- Dependent variable: What is measured as we vary the factor.
Three types of experimental design are discussed in this
chapter:
- One factor, completely random design
- One factor, randomized block design
- Multifactor design
The One Factor Completely Random Design
Three brands of gasoline are tested to determine which, if
any, delivers the highest miles per gallon. In this experiment
we have:
- Experimental unit: One car and one driver
- Factor: The brand of gasoline
- Dependent variable: Mileage (miles per gallon)
- Null hypothesis: There is no difference in the long-run
mpg of the three brands
- Alternative hypothesis: There is a difference
Here are two possible results (Data Set 1 and Data Set 2) of
the experiment (see pg. 177):
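Gasoline Mileage Experiment Results (miles per gallon)

|         | Set 1: Brand 1 | Set 1: Brand 2 | Set 1: Brand 3 | Set 2: Brand 1 | Set 2: Brand 2 | Set 2: Brand 3 |
|---------|----------------|----------------|----------------|----------------|----------------|----------------|
|         | 20             | 25             | 28             | 18             | 27             | 17             |
|         | 22             | 27             | 28             | 24             | 20             | 37             |
|         | 21             | 26             | 27             | 17             | 29             | 29             |
|         | 22             | 26             | 29             | 22             | 31             | 21             |
|         | 20             | 26             | 28             | 24             | 23             | 36             |
| Average | 21             | 26             | 28             | 21             | 26             | 28             |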
Notice the average mileage for each brand is the
same in both data sets (e.g., Brand 1 = 21 mpg); which
Data Set would lead you to:
- Fail to reject the null hypothesis (conclude there is no difference in the long-run mpg of the three brands)
- Reject the null hypothesis (conclude there is a difference)
The difference between the data sets is not
their means, but the variation within them.
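The point can be checked numerically. The short sketch below is an illustration rather than part of the text: it uses Python's standard `statistics` module, and the variable names are chosen here for convenience. It computes each brand's mean and sample variance in both data sets.

```python
from statistics import mean, variance

# Mileage results (mpg) copied from the table above.
data_set_1 = {
    "Brand 1": [20, 22, 21, 22, 20],
    "Brand 2": [25, 27, 26, 26, 26],
    "Brand 3": [28, 28, 27, 29, 28],
}
data_set_2 = {
    "Brand 1": [18, 24, 17, 22, 24],
    "Brand 2": [27, 20, 29, 31, 23],
    "Brand 3": [17, 37, 29, 21, 36],
}


def summarize(data):
    """Return {brand: (mean, sample variance)} for one data set."""
    return {brand: (mean(obs), variance(obs)) for brand, obs in data.items()}


summary_1 = summarize(data_set_1)
summary_2 = summarize(data_set_2)

for brand in data_set_1:
    m1, v1 = summary_1[brand]
    m2, v2 = summary_2[brand]
    print(f"{brand}: mean {m1} vs {m2}, variance {v1:.2f} vs {v2:.2f}")
```

The means agree brand by brand, while the within-brand variances in Data Set 2 are many times larger than those in Data Set 1.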
A Quick Visual Analysis for a One Factor
Completely Random Design. The author uses spread charts
to help you visualize and compare the variation between data
sets. Here is another visualization that depicts the same thing:

The data from Data Set 1 look stable; we have no reason to expect future results to differ from the mean any more than these five results did. There is an obvious variation between the brand averages, but not much difference within each brand. We are not uncomfortable rejecting the null hypothesis and concluding there is a difference in the long-run mpg of the three brands.
The data from Data Set 2 look unstable –
we are not sure what to expect in the future except that the
variation in results will be high. A few more high or low
results for any brand would significantly change the mean. Any
variation between brands' averages is almost lost in the
variation within each brand. We would not reject the null
hypothesis, and conclude there is no difference in the long-run
mpg of the three brands.
Here is the "visual rule of thumb" for
determining if an experimental factor does or does not affect
the dependent variable (see pg. 181):
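ANOVA "Rule of Thumb"

| Spread between observation averages | Little spread within a set of observations | Much spread within a set of observations |
|-------------------------------------|--------------------------------------------|------------------------------------------|
| Little                              | Hard to tell                               | Factor has no impact; don't reject Ho    |
| Much                                | Factor has an impact; reject Ho            | Hard to tell                             |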
An Intuitive Sum of Squares Decomposition for the One Factor Completely Random Design. We need a more formal way to test hypotheses using experiments. The way we will do it for the one factor, completely random design is to focus on the variation. We can quantify the variation ($s^2$) using this formula:

$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$$
Note the numerator of the fraction above is called
the sum of squares, and the denominator the
degrees of freedom. For Data Set 1 above, the
overall mean of the 15 observations is 25 mpg. We could
calculate the sum of squares total (SST) as:
$$SST = (20 - 25)^2 + (22 - 25)^2 + \dots + (28 - 25)^2 = 138 \text{ units of variation}$$
These 138 units of variation can be attributed to
only one of two causes:
- Sum of squares due to the treatment (SSTR): This is the variation between the treatment groups (here, the brands) that we would expect if the treatment did have an impact on the dependent variable (mileage).
- Sum of squares due to extraneous factors (SSEF): This is the variation within each treatment group attributable to all other possible factors (driver, time-of-day, etc.) that may have had an impact on the dependent variable.
In short:

$$SST = SSTR + SSEF$$
We will first compute the SSEF for each brand by calculating the sum of squares for that brand using that brand's average (not the average for all 15 observations):

$$SSEF(\text{Brand 1}) = (20 - 21)^2 + (22 - 21)^2 + \dots + (20 - 21)^2 = 4$$
$$SSEF(\text{Brand 2}) = (25 - 26)^2 + (27 - 26)^2 + \dots + (26 - 26)^2 = 2$$
$$SSEF(\text{Brand 3}) = (28 - 28)^2 + (28 - 28)^2 + \dots + (28 - 28)^2 = 2$$

So SSEF (Total) = 4 + 2 + 2 = 8, which means SSTR = 138 - 8 = 130.
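The decomposition for Data Set 1 can be reproduced step by step. The following is a minimal sketch in plain Python (the variable names are illustrative, not from the text). It also computes SSTR directly from the brand averages, as a cross-check on the subtraction.

```python
# Data Set 1 mileage results (mpg), one list per brand.
brands = {
    "Brand 1": [20, 22, 21, 22, 20],
    "Brand 2": [25, 27, 26, 26, 26],
    "Brand 3": [28, 28, 27, 29, 28],
}

all_obs = [x for obs in brands.values() for x in obs]
grand_mean = sum(all_obs) / len(all_obs)  # mean of all 15 observations

# SST: squared deviations of every observation from the overall mean.
sst = sum((x - grand_mean) ** 2 for x in all_obs)

# SSEF: squared deviations of each observation from its own brand's mean.
ssef_by_brand = {}
for brand, obs in brands.items():
    brand_mean = sum(obs) / len(obs)
    ssef_by_brand[brand] = sum((x - brand_mean) ** 2 for x in obs)
ssef = sum(ssef_by_brand.values())

# SSTR two ways: by subtraction, and directly from the brand means.
sstr = sst - ssef
sstr_direct = sum(
    len(obs) * (sum(obs) / len(obs) - grand_mean) ** 2 for obs in brands.values()
)

print(f"SST = {sst}, SSEF = {ssef}, SSTR = {sstr} (direct: {sstr_direct})")
```

Both routes give the same SSTR, which is what the identity SST = SSTR + SSEF requires.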
The degrees of freedom is always the relevant sample
size minus one. For the SST, this is 14 (15 total
observations - 1). For the SSTR, the relevant sample
size is 3 (3 different brands of gasoline are the
treatment), so degrees of freedom is 2. For SSEF, this
leaves 12 degrees of freedom (also calculated as 5
observations for each brand - 1 = 4 degrees of freedom
for each brand, times 3 brands = 12). We will put this all together in an
ANOVA Table (see pg. 185):
ANOVA Table: Gasoline Mileage Experiment (Data Set 1)

| Sources of Variation | Sum of Squares | Degrees of Freedom | Variance | Variance Ratio |
|----------------------|----------------|--------------------|----------|----------------|
| Treatment (SSTR)     | 130            | 2                  | 65.0     | 97.59          |
| Extraneous (SSEF)    | 8              | 12                 | 0.666    |                |
| Total (SST)          | 138            | 14                 |          |                |
This table makes it easy to see that the
variation due to the treatment (brand) is very much greater
than the variation due to extraneous factors. In fact, it is
97.59 times greater (65.0 / 0.666). This suggests that the
brand of gasoline probably does make a difference in the
long-run mpg, and we reject Ho.
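The same table can be built for either data set with a small helper. This is a sketch under the definitions above, and `one_way_anova` is a name invented here, not a library function. Note that carrying full precision gives a variance ratio of 97.5 for Data Set 1. The table's 97.59 comes from dividing by the rounded value 0.666.

```python
def one_way_anova(groups):
    """Return the one-factor ANOVA quantities for {label: observations}."""
    all_obs = [x for obs in groups.values() for x in obs]
    grand_mean = sum(all_obs) / len(all_obs)

    # Between-group (treatment) and within-group (extraneous) sums of squares.
    sstr = sum(
        len(obs) * (sum(obs) / len(obs) - grand_mean) ** 2 for obs in groups.values()
    )
    ssef = sum(
        (x - sum(obs) / len(obs)) ** 2 for obs in groups.values() for x in obs
    )

    df_tr = len(groups) - 1             # number of treatments minus one
    df_ef = len(all_obs) - len(groups)  # total d.f. minus treatment d.f.
    var_tr = sstr / df_tr
    var_ef = ssef / df_ef
    return {
        "sstr": sstr, "ssef": ssef, "sst": sstr + ssef,
        "df_tr": df_tr, "df_ef": df_ef,
        "var_tr": var_tr, "var_ef": var_ef,
        "ratio": var_tr / var_ef,
    }


table_1 = one_way_anova({
    "Brand 1": [20, 22, 21, 22, 20],
    "Brand 2": [25, 27, 26, 26, 26],
    "Brand 3": [28, 28, 27, 29, 28],
})
table_2 = one_way_anova({
    "Brand 1": [18, 24, 17, 22, 24],
    "Brand 2": [27, 20, 29, 31, 23],
    "Brand 3": [17, 37, 29, 21, 36],
})

print(f"Data Set 1 variance ratio: {table_1['ratio']:.2f}")
print(f"Data Set 2 variance ratio: {table_2['ratio']:.2f}")
```

Data Set 2 has the same treatment sum of squares (130) but a far larger extraneous sum of squares, so its variance ratio is small. This matches the earlier visual judgment that Data Set 2 gives no grounds to reject Ho.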
The Fisher Distribution. Just as sample means have a sampling distribution, so do sample variances. But, unlike the sample means, the sampling distribution for variances is not the normal distribution. We are interested in the distribution of the variance ratio:

$$F = \frac{\text{variance due to treatment}}{\text{variance due to extraneous factors}} = \frac{SSTR / df_{TR}}{SSEF / df_{EF}}$$

The larger this ratio, the more likely we are to reject Ho (and conclude the treatment did make a difference). The Fisher distribution quantifies what this ratio must exceed (the critical value) for us to reject Ho, at different confidence levels and degrees of freedom. Like the Student t distribution, the Fisher distribution is really a family of distributions, each with its own curve. A Fisher Table provides the critical values of the ratio of the variances at different confidence levels. Here is a small one (pg. 188).
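Table of Critical Fisher Values

| d.f. for Denominator (EF) | Degree of Confidence (%) | Numerator (TR) d.f. = 1 | Numerator (TR) d.f. = 2 | Numerator (TR) d.f. = 3 |
|---------------------------|--------------------------|-------------------------|-------------------------|-------------------------|
| 2                         | 90                       | 8.53                    | 9.00                    | 9.16                    |
| 2                         | 95                       | 18.51                   | 19.00                   | 19.16                   |
| 2                         | 99                       | 98.50                   | 99.00                   | 99.17                   |
| 4                         | 90                       | 4.54                    | 4.32                    | 4.19                    |
| 4                         | 95                       | 7.71                    | 6.94                    | 6.59                    |
| 4                         | 99                       | 21.20                   | 18.00                   | 16.69                   |
| 6                         | 90                       | 3.78                    | 3.46                    | 3.29                    |
| 6                         | 95                       | 5.99                    | 5.14                    | 4.76                    |
| 6                         | 99                       | 13.74                   | 10.92                   | 9.78                    |
| 9                         | 90                       | 3.36                    | 3.01                    | 2.81                    |
| 9                         | 95                       | 5.12                    | 4.26                    | 3.86                    |
| 9                         | 99                       | 10.56                   | 8.02                    | 6.99                    |
| 12                        | 90                       | 3.18                    | 2.81                    | 2.61                    |
| 12                        | 95                       | 4.75                    | 3.89                    | 3.49                    |
| 12                        | 99                       | 9.33                    | 6.93                    | 5.95                    |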
To use this table, find the Fisher value
for the appropriate degrees of freedom and desired
confidence level. In our gasoline mileage example, the
degrees of freedom are 2 in the numerator (TR) and 12
in the denominator (EF). At 99% confidence, the Critical Fisher
value is 6.93. This means that, if there were no
long-run difference in the mpg of the three brands,
we would expect to obtain a variance ratio higher than
this less than 1% of the time. Our variance ratio is
97.59, far in excess of this, so we reject Ho, and
conclude there is a difference.
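The lookup-and-compare step can be written out explicitly. The sketch below simply stores the small table above in a dictionary. The function names `critical_value` and `reject_null` are hypothetical helpers introduced for this illustration, and the table covers only the degrees of freedom and confidence levels listed in these notes.

```python
# Critical Fisher values copied from the small table above.
# Key: (denominator d.f., confidence %) -> values for numerator d.f. 1, 2, 3.
CRITICAL_F = {
    (2, 90): (8.53, 9.00, 9.16),
    (2, 95): (18.51, 19.00, 19.16),
    (2, 99): (98.50, 99.00, 99.17),
    (4, 90): (4.54, 4.32, 4.19),
    (4, 95): (7.71, 6.94, 6.59),
    (4, 99): (21.20, 18.00, 16.69),
    (6, 90): (3.78, 3.46, 3.29),
    (6, 95): (5.99, 5.14, 4.76),
    (6, 99): (13.74, 10.92, 9.78),
    (9, 90): (3.36, 3.01, 2.81),
    (9, 95): (5.12, 4.26, 3.86),
    (9, 99): (10.56, 8.02, 6.99),
    (12, 90): (3.18, 2.81, 2.61),
    (12, 95): (4.75, 3.89, 3.49),
    (12, 99): (9.33, 6.93, 5.95),
}


def critical_value(df_num, df_den, confidence):
    """Look up the critical Fisher value; raises KeyError/IndexError if not tabled."""
    return CRITICAL_F[(df_den, confidence)][df_num - 1]


def reject_null(variance_ratio, df_num, df_den, confidence):
    """Reject Ho when the observed variance ratio exceeds the critical value."""
    return variance_ratio > critical_value(df_num, df_den, confidence)


# Gasoline example: 2 and 12 degrees of freedom at 99% confidence.
print(critical_value(2, 12, 99))
print(reject_null(97.59, 2, 12, 99))
```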
Example: You are conducting an educational
research experiment. You have 14 students' scores on the QRA
(Quick Reading Assessment). Each of the students had been
taught that year by one of two teachers: Mr. Smith or Ms.
Tanzig. Here are the results:
Reading Experiment Results

|         | Mr. Smith | Ms. Tanzig |
|---------|-----------|------------|
|         | 87        | 91         |
|         | 56        | 88         |
|         | 76        | 78         |
|         | 78        | 88         |
|         | 74        | 85         |
|         | 85        | 72         |
|         | 83        | 79         |
| Average | 77        | 83         |
At 95% confidence, do we expect Ms. Tanzig's students
will perform better in the long-run on the QRA than Mr. Smith's? Here is the setup
for analysis:
- Experimental unit: 14 students and one test (the QRA)
- Factor: The teacher
- Dependent variable: QRA score
- Null hypothesis: There is no long-run difference in the performance of Ms. Tanzig's students and Mr. Smith's students
- Alternative hypothesis: There is a difference
First we compute SST. The overall mean from the 14
observations is 80, so:
$$SST = (87 - 80)^2 + (56 - 80)^2 + \dots + (79 - 80)^2 = 1{,}058$$

We could compute either SSTR or SSEF next, but SSTR looks shorter. Here is the SSTR for the experiment:

$$SSTR = 7 \times (77 - 80)^2 + 7 \times (83 - 80)^2 = 126$$
And SSEF = SST - SSTR = 1,058 - 126 = 932; so here is the ANOVA table for the experiment:
ANOVA Table: Reading Experiment

| Sources of Variation | Sum of Squares | Degrees of Freedom | Variance | Variance Ratio |
|----------------------|----------------|--------------------|----------|----------------|
| Treatment (SSTR)     | 126            | 1                  | 126      | 1.622          |
| Extraneous (SSEF)    | 932            | 12                 | 77.667   |                |
| Total (SST)          | 1,058          | 13                 |          |                |
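The entries in this table can be rechecked from the raw scores. The sketch below follows the same shortcut used above (SSTR from the two teacher averages, then SSEF by subtraction). The variable names are illustrative only.

```python
# QRA scores copied from the Reading Experiment table.
scores = {
    "Mr. Smith": [87, 56, 76, 78, 74, 85, 83],
    "Ms. Tanzig": [91, 88, 78, 88, 85, 72, 79],
}

all_scores = [x for group in scores.values() for x in group]
grand_mean = sum(all_scores) / len(all_scores)

# Total variation around the overall mean.
sst = sum((x - grand_mean) ** 2 for x in all_scores)

# Treatment variation: group size times squared gap between group mean and overall mean.
sstr = sum(
    len(group) * (sum(group) / len(group) - grand_mean) ** 2
    for group in scores.values()
)

# Whatever is left over is attributed to extraneous factors.
ssef = sst - sstr

df_tr = len(scores) - 1               # 2 teachers - 1
df_ef = len(all_scores) - len(scores)  # 14 scores - 2 teachers
variance_ratio = (sstr / df_tr) / (ssef / df_ef)

print(f"SST = {sst}, SSTR = {sstr}, SSEF = {ssef}")
print(f"Variance ratio = {variance_ratio:.3f}")
```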
The critical Fisher value from the table (95% confidence, 1 and 12 degrees of freedom) is 4.75. Since the variance ratio does not exceed the critical Fisher value, we do not reject Ho, and conclude there is no difference in the performance of Ms. Tanzig's students and Mr. Smith's students in the long-run. A visual check helps confirm this:

Running a Valid Experiment. A valid
experiment is one that isolates the factor being
tested. Other factors that could affect the dependent
variable must be ruled out. One way to do this is by
randomizing these extraneous factors across both groups by,
for example, assigning drivers randomly to each run for each
car in the gasoline study. Another way to do this is by
minimizing an extraneous factor by blocking, the
subject of the next section.
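Randomization of this kind is easy to sketch. The example below is only an illustration, not the procedure from the text: it assumes 15 run slots (each slot standing for whatever driver, car and time of day happen to be scheduled), and the slot labels and the fixed seed are hypothetical choices made so the sketch is reproducible.

```python
import random
from collections import Counter

BRANDS = ["Brand 1", "Brand 2", "Brand 3"]
RUNS_PER_BRAND = 5


def randomize_runs(run_slots, seed=None):
    """Assign a brand to every run slot at random, keeping 5 runs per brand."""
    rng = random.Random(seed)
    treatments = BRANDS * RUNS_PER_BRAND  # 15 labels, 5 of each brand
    if len(run_slots) != len(treatments):
        raise ValueError("need exactly one run slot per treatment label")
    rng.shuffle(treatments)
    return list(zip(run_slots, treatments))


# Hypothetical run slots; in practice each would name a driver and a time.
run_slots = [f"Run {i}" for i in range(1, 16)]
plan = randomize_runs(run_slots, seed=2100)

for slot, brand in plan:
    print(slot, "->", brand)

brand_counts = Counter(brand for _, brand in plan)
```

Because the brand for each slot is chosen by chance, no brand is systematically paired with a particular driver or time of day, so those influences end up in the extraneous variation rather than masquerading as a treatment effect.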
When to Consider a One Factor Randomized
Block Design
When designing an experiment, it may happen
that some extraneous factors cannot be ruled out by
randomization. Then we need to try to minimize them by
extracting the impact of an extraneous factor from the
sum of squares. We can do this with a technique called
two-way ANOVA, which is used with data partitioned into
categories according to two factors. One factor is the
treatment we want to test, the other factor is the one whose
effects we want to minimize, called the blocking factor.
Example: We are interested in the impact of
two different television advertisements we have shown on
television in three different markets. The ads were
shown for a one-week period one month apart, at the same
time on the same days. Their
effect was judged based on the number of sales calls
generated during the week they were shown. Is there a
long-run difference in the effectiveness of the two
advertisements? Here are the results:
Television Advertisement Experiment (sales calls)

| Market    | Advertisement A | Advertisement B | Average |
|-----------|-----------------|-----------------|---------|
| Orlando   | 656             | 710             | 683     |
| Norfolk   | 732             | 812             | 772     |
| San Diego | 328             | 386             | 357     |
| Average   | 572             | 636             | 604     |
Here is the experimental setup for the one-factor analysis:
- Experimental unit: Three weekly airings of each advertisement
- Factor: Advertisement type (A or B)
- Dependent variable: Sales calls received
- Null hypothesis: There is no difference in the long-run effectiveness of advertisement A or B on sales
- Alternative hypothesis: There is a difference
Here is the single-factor ANOVA table
(check and see if you get the same results):
Single-factor ANOVA Table: Advertisement Experiment

| Sources of Variation | Sum of Squares | Degrees of Freedom | Variance | Variance Ratio |
|----------------------|----------------|--------------------|----------|----------------|
| Treatment (SSTR)     | 6,144          | 1                  | 6,144    | 0.129          |
| Extraneous (SSEF)    | 191,144        | 4                  | 47,786   |                |
| Total (SST)          | 197,288        | 5                  |          |                |
The variance ratio is 0.129; from the Fisher table, the critical value (at 95% confidence, 1 and 4 degrees of freedom) is 7.71, so we do not reject Ho, and conclude there is no long-run difference in the effectiveness of advertisement A or B on sales.
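For readers who want to check the single-factor figures, here is a minimal sketch that treats the three markets simply as three observations per advertisement. The variable names are illustrative, not from the text.

```python
# Sales calls per advertisement, in the order Orlando, Norfolk, San Diego.
calls = {
    "A": [656, 732, 328],
    "B": [710, 812, 386],
}

all_calls = [x for obs in calls.values() for x in obs]
grand_mean = sum(all_calls) / len(all_calls)

sst = sum((x - grand_mean) ** 2 for x in all_calls)
sstr = sum(
    len(obs) * (sum(obs) / len(obs) - grand_mean) ** 2 for obs in calls.values()
)
ssef = sst - sstr  # everything not explained by the advertisement

df_tr = len(calls) - 1               # 2 advertisements - 1
df_ef = len(all_calls) - len(calls)  # 6 observations - 2 advertisements
var_tr = sstr / df_tr
var_ef = ssef / df_ef
variance_ratio = var_tr / var_ef

print(f"SST = {sst}, SSTR = {sstr}, SSEF = {ssef}")
print(f"Extraneous variance = {var_ef}, variance ratio = {variance_ratio:.3f}")
```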
But looking at the data, it certainly seems that advertisement B is more effective than A, and in all three test markets. Why doesn't our ANOVA calculation bear this out? Because our experiment is poorly designed. One of the extraneous factors that could affect the results is the test market. Sales calls for both advertisements vary quite a bit between test markets. This variation within groups (the SSEF) overwhelms the variation between groups (the SSTR). This is easily seen with a profile chart:

This chart clearly shows that much of the variation is due to one extraneous factor: the test market. The lines are:
- Widely spaced, indicating there is a substantial effect of location on sales
- Parallel, indicating the factor (advertisement A or B) has a similar effect in all three test markets.
Given these profiles, we can correct for the variation
due to location in this case by blocking.
An Intuitive Sum of Squares Decomposition
for a Randomized Block Design. What we would like to do
is to decompose SSEF into two parts: the SSEF due to
blocking (the test market effect), and all other extraneous
factors. We have already computed SST and SSTR in the ANOVA
table above. We want to calculate SSBL (sum of squares due
to the blocking factor):
$$SSBL = 2 \left[ (683 - 604)^2 + (772 - 604)^2 + (357 - 604)^2 \right] = 190{,}948$$
Here is the resulting ANOVA table that
summarizes this:
Two-way ANOVA Table: Advertisement Experiment

| Sources of Variation | Sum of Squares | Degrees of Freedom | Variance | Variance Ratio |
|----------------------|----------------|--------------------|----------|----------------|
| Block (SSBL)         | 190,948        | 2                  | --       |                |
| Treatment (SSTR)     | 6,144          | 1                  | 6,144    | 62.69          |
| Extraneous (SSEF)    | 196            | 2                  | 98       |                |
| Total (SST)          | 197,288        | 5                  |          |                |
Now the variance ratio is 62.69,
which greatly exceeds the critical Fisher value of 18.51
(95% confidence, 1 and 2 degrees of freedom), so we reject Ho and conclude there is a
long-run difference in the effectiveness of advertisement A
and B on sales. We have minimized the effect of an
extraneous factor by blocking it, which let us compare the
variation due to treatment with variation due to all other
(non-blocked) extraneous factors.
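The blocked decomposition can be sketched the same way as the one-factor case, with one extra term for the markets. This is an illustration under the definitions above, with one observation per market and advertisement. The variable names are chosen here and are not from the text.

```python
# Sales calls by market (block) and advertisement (treatment).
calls = {
    "Orlando": {"A": 656, "B": 710},
    "Norfolk": {"A": 732, "B": 812},
    "San Diego": {"A": 328, "B": 386},
}
ads = ["A", "B"]
markets = list(calls)

values = [calls[m][a] for m in markets for a in ads]
grand_mean = sum(values) / len(values)
sst = sum((v - grand_mean) ** 2 for v in values)

# Treatment: each advertisement mean is based on 3 markets.
ad_means = {a: sum(calls[m][a] for m in markets) / len(markets) for a in ads}
sstr = len(markets) * sum((ad_means[a] - grand_mean) ** 2 for a in ads)

# Block: each market mean is based on 2 advertisements.
market_means = {m: sum(calls[m].values()) / len(ads) for m in markets}
ssbl = len(ads) * sum((market_means[m] - grand_mean) ** 2 for m in markets)

# What remains after removing treatment and block effects.
ssef = sst - sstr - ssbl

df_tr = len(ads) - 1      # 1
df_bl = len(markets) - 1  # 2
df_ef = df_tr * df_bl     # 5 total - 1 - 2 = 2
variance_ratio = (sstr / df_tr) / (ssef / df_ef)

print(f"SSBL = {ssbl}, SSTR = {sstr}, SSEF = {ssef}")
print(f"Variance ratio = {variance_ratio:.2f}")
```

Nearly all of what the single-factor analysis counted as extraneous variation is absorbed by the block term, which is why the variance ratio rises so sharply.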
What Can Go Wrong When You Choose the Randomized Block Design. If you construct the profiles for each significant factor, you may find profiles that are not widely spaced and parallel. In this case, blocking may not reduce the sum of squares due to extraneous factors, but it certainly will decrease the degrees of freedom. In the advertisement experiment, for example, the degrees of freedom for non-blocked extraneous factors decreased from 4 to 2. Since this value is the denominator of the extraneous variance, reducing it increases that variance, which is what we were trying to correct for in the first place!
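The trade-off can be made concrete with numbers already in this section. The first two lines below restate the advertisement experiment (95% confidence, 1 numerator degree of freedom). The third line is a purely hypothetical limiting case, added here for illustration, in which blocking removes none of the extraneous sum of squares.

```latex
\begin{align*}
\text{Unblocked:} \quad
  & s^2_{EF} = \frac{191{,}144}{4} = 47{,}786,
  & F_{\text{crit}}(1, 4) &= 7.71 \\
\text{Blocked (actual):} \quad
  & s^2_{EF} = \frac{196}{2} = 98,
  & F_{\text{crit}}(1, 2) &= 18.51 \\
\text{Blocked (hypothetical, SSEF unchanged):} \quad
  & s^2_{EF} = \frac{191{,}144}{2} = 95{,}572,
  & F_{\text{crit}}(1, 2) &= 18.51
\end{align*}
```

Blocking paid off here only because SSEF fell far more than the degrees of freedom did. Had SSEF stayed put, the extraneous variance would have doubled while the critical value also rose from 7.71 to 18.51.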
When Is the Multifactor Design Appropriate?
A multifactor design has at least two
experimental factors and does not use blocking, relying
instead on randomization to minimize the impact of
extraneous factors. Formal definitions:
- Experimental factors: The factors or treatments that we wish to test so that we can make timely decisions.
- Extraneous factors: All other factors that could affect the dependent variable, but which we choose to ignore.
Use a multifactor design when:
- We are interested in obtaining information (and making decisions) about two or more experimental factors; or
- We believe there will be an interaction between all or some of the experimental factors (because their profiles are non-parallel).
Here is the example from our text (pg. 210): Determine whether the type of teaching approach or demonstrated math ability has an effect on final exam scores. The setup:
- Experimental unit: 24 students, four teaching approaches
- Factors: Teaching approach and SAT math score
- Dependent variable: Final exam grade
- Null hypothesis: There is no long-run difference in the math final exam scores of students due either to teaching approach or SAT math score
- Alternative hypothesis: There is a difference
Here is the data (pg. 215):
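Math Study: final exam scores, two students per cell. Factor A (columns) is the teaching approach; Factor B (rows) is the SAT math score.

| Factor B | Lecture | Discussion | Discovery | Classroom | Average |
|----------|---------|------------|-----------|-----------|---------|
| 400-500  | 58, 62  | 60, 60     | 64, 66    | 59, 61    | 61.25   |
| 500-550  | 99, 99  | 89, 91     | 80, 80    | 75, 75    | 86      |
| 600-650  | 74, 76  | 80, 80     | 85, 85    | 94, 96    | 83.75   |
| Average  | 78      | 76.67      | 76.67     | 76.67     | 77      |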
Let us first take a look at the factor
profiles:

The profiles are widely spaced, indicating there may be some effect of teaching approach on final exam grades. But the profiles are not parallel, suggesting that there may be an interaction between the two factors. (It looks like classroom may be the best teaching approach for the students with the highest math ability, but lecture would be best for students of average ability. For students of lowest ability, all teaching approaches seem to work equally well.)
Quantitatively, this is borne out by the
ANOVA table for the results shown above:
ANOVA for the Math Study

| Sources of Variation   | Sum of Squares | Degrees of Freedom | Variance | Variance Ratio |
|------------------------|----------------|--------------------|----------|----------------|
| Treatment A (approach) | 7.96           | 3                  | 2.65     | 2.65           |
| Treatment B (ability)  | 2,997          | 2                  | 1,498.5  | 1,498.5        |
| Interaction            | 1,157.04       | 6                  | 192.8    | 192.8          |
| Extraneous             | 12             | 12                 | 1        |                |
| Total                  | 4,174          | 23                 |          |                |
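A two-factor decomposition with two observations per cell can be sketched as follows. This is an illustrative calculation from the scores as printed in these notes, not the text's own worked solution, and the variable names are chosen here. On those scores it reproduces the total (4,174) and the ability sum of squares (2,997). The approach term comes to 8.0 (the table's 7.96 reflects rounded column averages). The within-cell term comes to 18 rather than the table's 12, which moves the interaction term to 1,151. None of these differences changes the conclusions drawn from the table.

```python
APPROACHES = ["Lecture", "Discussion", "Discovery", "Classroom"]

# Final exam scores as printed above: one row per ability band,
# one (student 1, student 2) pair per teaching approach.
SCORES = {
    "400-500": [(58, 62), (60, 60), (64, 66), (59, 61)],
    "500-550": [(99, 99), (89, 91), (80, 80), (75, 75)],
    "600-650": [(74, 76), (80, 80), (85, 85), (94, 96)],
}


def avg(values):
    values = list(values)
    return sum(values) / len(values)


all_scores = [x for row in SCORES.values() for cell in row for x in cell]
grand_mean = avg(all_scores)
sst = sum((x - grand_mean) ** 2 for x in all_scores)

# Factor B (ability): each row mean is based on 8 scores.
ss_b = sum(
    len(row) * len(row[0]) * (avg(x for cell in row for x in cell) - grand_mean) ** 2
    for row in SCORES.values()
)

# Factor A (approach): each column mean is based on 6 scores.
ss_a = 0.0
for j in range(len(APPROACHES)):
    column = [x for row in SCORES.values() for x in row[j]]
    ss_a += len(column) * (avg(column) - grand_mean) ** 2

# Extraneous: variation of the two students inside each cell.
ss_within = sum(
    (x - avg(cell)) ** 2 for row in SCORES.values() for cell in row for x in cell
)

# Interaction: whatever the two main effects and the within-cell term leave over.
ss_interaction = sst - ss_a - ss_b - ss_within

print(f"SST = {sst:.2f}, A = {ss_a:.2f}, B = {ss_b:.2f}")
print(f"Interaction = {ss_interaction:.2f}, Extraneous = {ss_within:.2f}")
```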
What would we conclude based on these
results?
- First, there is no best teaching approach for all students. We do not reject Ho, because the variance ratio for this treatment is 2.65, which is less than the Fisher critical value (5.95 at 3 and 12 degrees of freedom at 99% confidence; it is also below the 95% value of 3.49).
- Second, the study bears out that math ability (treatment B) accounts for about 72% of the variation in the final exam grade (2,997 of 4,174). But this is really not something we can control!
- Third, there does seem to be a difference in result when the interaction between teaching approach and math ability is taken into account. The interaction between these two factors yields a variance ratio of 192.8, far exceeding the Fisher critical value of 4.82 (6 and 12 degrees of freedom at 99% confidence): The highest ability students will, in the long-run, learn the most with a classroom teaching approach, while students of average math ability will learn the most with a lecture teaching approach.