Lecture Notes

Business Statistics (QMB 2100)

Experimentation and Analysis of Variance

Introduction

This chapter introduces experimentation to make inductive inferences, as opposed to sampling. Basically:

  • Sampling
    • Take a simple random sample
    • Count or measure the characteristic under study
    • Compute sample statistics (like the sample mean)
    • Make inductive inferences (using confidence intervals or a hypothesis test)
  • Experimentation
    • Design an experiment to vary the treatment (factor) under study
    • Conduct the experiment and observe the result
    • Compute sample statistics (sum of squares)
    • Make inductive inferences (using ANOVA and the Fisher Table)

Important terminology introduced in this chapter:

Experimental units
The units on which the experiment is done. When the units are people, we call them subjects.
Experimental treatment (or factor)
What is varied in the experiment.
Dependent variable
What is measured as we vary the factor.

Three types of experimental design are discussed in this chapter:

  1. One factor, completely random design
  2. One factor, randomized block design
  3. Multifactor design

The One Factor Completely Random Design

Three brands of gasoline are tested to determine which, if any, delivers the highest miles per gallon. In this experiment we have:

  • Experimental unit: One car and one driver
  • Factor: The brand of gasoline
  • Dependent variable: Mileage (miles per gallon)
  • Null hypothesis: There is no difference in the long-run mpg of the three brands
  • Alternative hypothesis: There is a difference

Here are two possible results (Data Set 1 and Data Set 2) of the experiment (see pg. 177):

Gasoline Mileage Experiment Results

               Data Set 1                     Data Set 2
        Brand 1  Brand 2  Brand 3      Brand 1  Brand 2  Brand 3
          20       25       28           18       27       17
          22       27       28           24       20       37
          21       26       27           17       29       29
          22       26       29           22       31       21
          20       26       28           24       23       36
avg =     21       26       28           21       26       28

Notice that the average mileage for each brand is the same in both data sets (e.g., Brand 1 = 21 mpg). Which data set would lead you to:

  • Fail to reject the null hypothesis (conclude there is no difference in the long-run mpg of the three brands)
  • Reject the null hypothesis (conclude there is a difference)

The difference between the data sets is not their means, but the variation within them.
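To make this concrete, here is a minimal sketch in Python (the notes themselves use no code, so the variable and function names are assumptions chosen for illustration). It computes the mean and the sample variance of each brand in both data sets.

```python
from statistics import mean, variance

# Mileage results (mpg) copied from the table above, one list per brand.
data_set_1 = {
    "Brand 1": [20, 22, 21, 22, 20],
    "Brand 2": [25, 27, 26, 26, 26],
    "Brand 3": [28, 28, 27, 29, 28],
}
data_set_2 = {
    "Brand 1": [18, 24, 17, 22, 24],
    "Brand 2": [27, 20, 29, 31, 23],
    "Brand 3": [17, 37, 29, 21, 36],
}

def summarize(data_set):
    """Return {brand: (mean, sample variance)} for one data set."""
    return {brand: (mean(obs), variance(obs)) for brand, obs in data_set.items()}

summary_1 = summarize(data_set_1)
summary_2 = summarize(data_set_2)
for brand in data_set_1:
    print(brand, "Data Set 1:", summary_1[brand], "Data Set 2:", summary_2[brand])
```

The means agree brand by brand (21, 26, 28), but the within-brand variances are 1, 0.5, and 0.5 in Data Set 1 versus 11, 20, and 79 in Data Set 2.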

A Quick Visual Analysis for a One Factor Completely Random Design. The author uses spread charts to help you visualize and compare the variation between data sets. Here is another visualization that depicts the same thing:

The data from Data Set 1 look stable; we have no reason to expect future results to differ from the mean any more than these five results did. There is an obvious variation between the brand averages, but not much difference within each brand. We are not uncomfortable rejecting the null hypothesis and concluding there is a difference in the long-run mpg of the three brands.

The data from Data Set 2 look unstable – we are not sure what to expect in the future except that the variation in results will be high. A few more high or low results for any brand would significantly change the mean. Any variation between brands' averages is almost lost in the variation within each brand. We would not reject the null hypothesis, and conclude there is no difference in the long-run mpg of the three brands.

Here is the "visual rule of thumb" for determining if an experimental factor does or does not affect the dependent variable (see pg. 181):

ANOVA "Rule of Thumb"

Spread between            Spread within a set of observations
observation averages      Little                               Much
Little                    Hard to tell                         Factor has no impact: don't reject Ho
Much                      Factor has an impact: reject Ho      Hard to tell

An Intuitive Sum of Squares Decomposition for the One Factor Completely Random Design. We need a more formal way to test hypotheses using experiments. The way we will do it for the one factor, completely random design is to focus on the variation. We can quantify the variation (s²) using the sample variance formula:

s² = Σ (x - x̄)² / (n - 1)

Note the numerator of the fraction above is called the sum of squares, and the denominator the degrees of freedom. For Data Set 1 above, the overall mean of the 15 observations is 25 mpg. We could calculate the sum of squares total (SST) as:

SST = (20 - 25)² + (22 - 25)² + ... + (28 - 25)² = 138 units of variation
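The SST arithmetic can be checked with a few lines of Python. This is only an illustrative sketch; the variable names are assumptions.

```python
# All 15 observations of Data Set 1 (mpg).
observations = [20, 22, 21, 22, 20,   # Brand 1
                25, 27, 26, 26, 26,   # Brand 2
                28, 28, 27, 29, 28]   # Brand 3

grand_mean = sum(observations) / len(observations)

# Numerator of the variance formula: the sum of squares total (SST).
sst = sum((x - grand_mean) ** 2 for x in observations)

# Denominator: degrees of freedom (sample size minus one).
degrees_of_freedom = len(observations) - 1
sample_variance = sst / degrees_of_freedom

print(grand_mean, sst, degrees_of_freedom, round(sample_variance, 3))
```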

These 138 units of variation can be attributed to only one of two causes:

  • Sum of squares due to the treatment (SSTR): This is the variation between the brand averages, which we would expect if the treatment (brand) did have an impact on the dependent variable (mileage).
  • Sum of squares due to extraneous factors (SSEF): This is the variation within each brand's observations, attributable to all other possible factors (driver, time-of-day, etc.) that may have had an impact on the dependent variable.

In short:

SST = SSTR + SSEF

We will first compute the SSEF for each brand by calculating the sum of squares for that brand around that brand's average (not the average for all 15 observations):

SSEF(Brand 1) = (20 - 21)² + (22 - 21)² + ... + (20 - 21)² = 4
SSEF(Brand 2) = (25 - 26)² + (27 - 26)² + ... + (26 - 26)² = 2
SSEF(Brand 3) = (28 - 28)² + (28 - 28)² + ... + (28 - 28)² = 2

So SSEF (Total) = 4 + 2 + 2 = 8, which means SSTR = 130 (138 - 8).
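As a cross-check, the sketch below computes SSTR directly from the brand averages instead of by subtraction, and confirms that SST = SSTR + SSEF for Data Set 1. The helper name sum_of_squares is an assumption made for this illustration.

```python
brands = {
    "Brand 1": [20, 22, 21, 22, 20],
    "Brand 2": [25, 27, 26, 26, 26],
    "Brand 3": [28, 28, 27, 29, 28],
}

def sum_of_squares(values, center):
    """Sum of squared deviations of the values from a given center."""
    return sum((x - center) ** 2 for x in values)

all_obs = [x for obs in brands.values() for x in obs]
grand_mean = sum(all_obs) / len(all_obs)

# Total variation: every observation against the overall mean.
sst = sum_of_squares(all_obs, grand_mean)

# Within-brand variation: each observation against its own brand's mean.
ssef_by_brand = {
    brand: sum_of_squares(obs, sum(obs) / len(obs))
    for brand, obs in brands.items()
}
ssef = sum(ssef_by_brand.values())

# Between-brand variation: each brand mean against the overall mean,
# weighted by the number of observations for that brand.
sstr = sum(
    len(obs) * (sum(obs) / len(obs) - grand_mean) ** 2
    for obs in brands.values()
)

print(ssef_by_brand, ssef, sstr, sst)
```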

The degrees of freedom is always the relevant sample size minus one. For the SST, this is 14 (15 total observations - 1). For the SSTR, the relevant sample size is 3 (3 different brands of gasoline are the treatment), so degrees of freedom is 2. For SSEF, this leaves 12 degrees of freedom (also calculated as 5 observations for each brand - 1 = 4 degrees of freedom for each brand, times 3 brands = 12).

We will put this all together in an ANOVA Table (see pg. 185):

ANOVA Table
Gasoline Mileage Experiment (Data Set 1)

Sources of Variation    Sum of Squares    Degrees of Freedom    Variance    Variance Ratio
Treatment (SSTR)              130                  2             65.0           97.59
Extraneous (SSEF)               8                 12              0.666
Total (SST)                   138                 14

This table makes it easy to see that the variation due to the treatment (brand) is very much greater than the variation due to extraneous factors. In fact, it is 97.59 times greater (65.0 / 0.666). This suggests that the brand of gasoline probably does make a difference in the long-run mpg, and we reject Ho.
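The Variance and Variance Ratio columns follow mechanically from the sums of squares and the degrees of freedom. Here is a small sketch of that arithmetic, with variable names that are assumptions for illustration.

```python
# Sums of squares for Data Set 1, taken from the decomposition above.
sstr, ssef = 130, 8
n_observations, n_brands = 15, 3

# Degrees of freedom for the treatment and for the extraneous factors.
df_treatment = n_brands - 1                   # 2
df_extraneous = n_observations - n_brands     # 12

# Variance = sum of squares / degrees of freedom.
variance_treatment = sstr / df_treatment
variance_extraneous = ssef / df_extraneous

variance_ratio = variance_treatment / variance_extraneous
print(variance_treatment, round(variance_extraneous, 3), round(variance_ratio, 2))
```

Carried at full precision the ratio is 97.5. The 97.59 shown in the table comes from dividing by the rounded value 0.666, and the conclusion is the same either way.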

The Fisher Distribution. Just as sample means have a sampling distribution, so do sample variances. But, unlike the sample means, the sampling distribution for variances is not the normal distribution. We are interested in the distribution of the variance ratio:

Variance ratio = (variance due to treatment) / (variance due to extraneous factors) = (SSTR / d.f. for TR) / (SSEF / d.f. for EF)

The larger this ratio, the more likely we are to reject Ho (and conclude the treatment did make a difference). The Fisher distribution quantifies what this ratio must exceed (the critical value) for us to reject Ho, at different confidence levels and degrees of freedom. Like the Student t distribution, the Fisher distribution is really a family of distributions, each with its own curve. A Fisher Table provides the critical values of the ratio of the variances at different confidence levels. Here is a small one (pg. 188); for a much more complete table, see here.

Table of Critical Fisher Values

d.f. for      Degree        d.f. for Numerator (TR)
Denom (EF)    Conf (%)         1         2         3
     2           90          8.53      9.00      9.16
                 95         18.51     19.00     19.16
                 99         98.50     99.00     99.17
     4           90          4.54      4.32      4.19
                 95          7.71      6.94      6.59
                 99         21.20     18.00     16.69
     6           90          3.78      3.46      3.29
                 95          5.99      5.14      4.76
                 99         13.74     10.92      9.78
     9           90          3.36      3.01      2.81
                 95          5.12      4.26      3.86
                 99         10.56      8.02      6.99
    12           90          3.18      2.81      2.61
                 95          4.75      3.89      3.49
                 99          9.33      6.93      5.95

To use this table, find the Fisher value for the appropriate degrees of freedom and desired confidence level. In our gasoline mileage example, the degrees of freedom are 2 in the numerator (TR) and 12 in the denominator (EF). At 99% confidence, the Critical Fisher value is 6.93. This means that, if there were no long-run difference in the mpg of the three brands, we would expect to obtain a variance ratio higher than this less than 1% of the time. Our variance ratio is 97.59, far in excess of this, so we reject Ho, and conclude there is a difference.
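The lookup-and-compare step can be sketched as a small Python function. The dictionary below simply transcribes the small table above; the names FISHER_CRITICAL and reject_null are assumptions for this illustration, and a real analysis would normally use a fuller table or statistical software.

```python
# Critical Fisher values transcribed from the small table above.
# Key: (numerator d.f., denominator d.f.) -> {confidence %: critical value}
FISHER_CRITICAL = {
    (1, 2):  {90: 8.53, 95: 18.51, 99: 98.50},
    (2, 2):  {90: 9.00, 95: 19.00, 99: 99.00},
    (3, 2):  {90: 9.16, 95: 19.16, 99: 99.17},
    (1, 4):  {90: 4.54, 95: 7.71,  99: 21.20},
    (2, 4):  {90: 4.32, 95: 6.94,  99: 18.00},
    (3, 4):  {90: 4.19, 95: 6.59,  99: 16.69},
    (1, 6):  {90: 3.78, 95: 5.99,  99: 13.74},
    (2, 6):  {90: 3.46, 95: 5.14,  99: 10.92},
    (3, 6):  {90: 3.29, 95: 4.76,  99: 9.78},
    (1, 9):  {90: 3.36, 95: 5.12,  99: 10.56},
    (2, 9):  {90: 3.01, 95: 4.26,  99: 8.02},
    (3, 9):  {90: 2.81, 95: 3.86,  99: 6.99},
    (1, 12): {90: 3.18, 95: 4.75,  99: 9.33},
    (2, 12): {90: 2.81, 95: 3.89,  99: 6.93},
    (3, 12): {90: 2.61, 95: 3.49,  99: 5.95},
}

def reject_null(variance_ratio, df_numerator, df_denominator, confidence):
    """Return True when the variance ratio exceeds the tabled critical value."""
    critical = FISHER_CRITICAL[(df_numerator, df_denominator)][confidence]
    return variance_ratio > critical

# Gasoline example: ratio 97.59 with 2 and 12 degrees of freedom at 99%.
gasoline_decision = reject_null(97.59, 2, 12, 99)
print("Reject Ho" if gasoline_decision else "Do not reject Ho")
```

Requesting a combination that is not in the small table raises a KeyError, which is a deliberate choice here: it is safer than silently guessing a critical value.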

Example: You are conducting an educational research experiment. You have 14 students' scores on the QRA (Quick Reading Assessment). Each of the students had been taught that year by one of two teachers: Mr. Smith or Ms. Tanzig. Here are the results:

Reading Experiment Results

         Mr. Smith    Ms. Tanzig
            87            91
            56            88
            76            78
            78            88
            74            85
            85            72
            83            79
avg =       77            83

At 95% confidence, do we expect Ms. Tanzig's students will perform better in the long-run on the QRA than Mr. Smith's? Here is the setup for analysis:

  • Experimental unit: 14 students and one test (the QRA)
  • Factor: The teacher
  • Dependent variable: QRA score
  • Null hypothesis: There is no long-run difference in the performance of Ms. Tanzig's students and Mr. Smith's students
  • Alternative hypothesis: There is a difference

First we compute SST. The overall mean from the 14 observations is 80, so:

SST = (87 - 80)² + (56 - 80)² + ... + (79 - 80)² = 1,058

We could compute either SSTR or SSEF next, but SSTR looks shorter. Here is the SSTR for the experiment:

SSTR = 7 x (77 - 80)² + 7 x (83 - 80)² = 126

And SSEF = SST - SSTR = 1,058 - 126 = 932; so here is the ANOVA table for the experiment:

ANOVA Table - Reading Experiment

Sources of Variation    Sum of Squares    Degrees of Freedom    Variance    Variance Ratio
Treatment (SSTR)              126                  1            126             1.622
Extraneous (SSEF)             932                 12             77.667
Total (SST)                 1,058                 13
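The entries in this table can be reproduced from the raw scores. The function below is a minimal sketch of the one factor, completely random design calculation; the name one_way_anova and the returned dictionary keys are assumptions for illustration.

```python
def one_way_anova(groups):
    """One-factor ANOVA quantities for a list of groups of observations."""
    all_obs = [x for group in groups for x in group]
    grand_mean = sum(all_obs) / len(all_obs)

    sst = sum((x - grand_mean) ** 2 for x in all_obs)
    # Between-group variation, weighted by each group's size.
    sstr = sum(
        len(group) * (sum(group) / len(group) - grand_mean) ** 2
        for group in groups
    )
    ssef = sst - sstr

    df_tr = len(groups) - 1
    df_ef = len(all_obs) - len(groups)
    return {
        "sst": sst, "sstr": sstr, "ssef": ssef,
        "df_tr": df_tr, "df_ef": df_ef,
        "variance_ratio": (sstr / df_tr) / (ssef / df_ef),
    }

smith = [87, 56, 76, 78, 74, 85, 83]
tanzig = [91, 88, 78, 88, 85, 72, 79]

result = one_way_anova([smith, tanzig])
print(result)
```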

The critical Fisher value from the table (95% confidence, 1 and 12 degrees of freedom) is 4.75. Since the variance ratio (1.622) does not exceed the critical Fisher value, we do not reject Ho, and conclude there is no difference in the performance of Ms. Tanzig's students and Mr. Smith's students in the long-run. A visual check helps confirm this:

Running a Valid Experiment. A valid experiment is one that isolates the factor being tested. Other factors that could affect the dependent variable must be ruled out. One way to do this is by randomizing these extraneous factors across both groups by, for example, assigning drivers randomly to each run for each car in the gasoline study. Another way to do this is by minimizing an extraneous factor by blocking, the subject of the next section.
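The randomization idea can be sketched in a few lines of Python. Everything specific here is hypothetical: the driver labels, the choice of five runs per brand, and the function name randomize_runs are assumptions for illustration, not part of the textbook experiment.

```python
import random

def randomize_runs(brands, runs_per_brand, drivers, seed=None):
    """Return a randomized schedule of (run number, brand, driver)."""
    rng = random.Random(seed)  # a seed makes the schedule reproducible

    # One entry per planned run, then shuffle so run order is random.
    runs = [brand for brand in brands for _ in range(runs_per_brand)]
    rng.shuffle(runs)

    # Pick a driver at random for each run, so driver effects are
    # spread across brands by chance rather than by design.
    return [(i + 1, brand, rng.choice(drivers)) for i, brand in enumerate(runs)]

schedule = randomize_runs(
    brands=["Brand 1", "Brand 2", "Brand 3"],
    runs_per_brand=5,
    drivers=["Driver A", "Driver B", "Driver C"],  # hypothetical drivers
    seed=2100,
)
for run in schedule:
    print(run)
```

Each brand still receives exactly five runs; only the order of the runs and the driver for each run are left to chance.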


When to Consider a One Factor Randomized Block Design

When designing an experiment, it may happen that some extraneous factors cannot be ruled out by randomization. Then we need to try to minimize them by extracting the impact of an extraneous factor from the sum of squares. We can do this with a technique called two-way ANOVA, which is used with data partitioned into categories according to two factors. One factor is the treatment we want to test, the other factor is the one whose effects we want to minimize, called the blocking factor.

Example: We are interested in the impact of two different advertisements we have shown on television in three different markets. The ads were shown for a one-week period one month apart, at the same time on the same days. Their effect was judged based on the number of sales calls generated during the week they were shown. Is there a long-run difference in the effectiveness of the two advertisements? Here are the results:

Television Advertisement Experiment

                Advertisement
Market           A       B      Average
Orlando         656     710       683
Norfolk         732     812       772
San Diego       328     386       357
Average         572     636       604

Here is the experimental setup for the one-factor analysis:

  • Experimental unit: Three weekly airings of each advertisement
  • Factor: Advertisement type (A or B)
  • Dependent variable: Sales calls received
  • Null hypothesis: There is no difference in the long-run effectiveness of advertisement A or B on sales
  • Alternative hypothesis: There is a difference

Here is the single-factor ANOVA table (check and see if you get the same results):

Single-factor ANOVA Table
Advertisement Experiment

Sources of Variation    Sum of Squares    Degrees of Freedom    Variance    Variance Ratio
Treatment (SSTR)            6,144                  1              6,144         0.129
Extraneous (SSEF)         191,144                  4             47,786
Total (SST)               197,288                  5

The variance ratio is 0.129; from the Fisher table, the critical value (at 95% confidence, 1 and 4 degrees of freedom) is 7.71, so we do not reject Ho, and conclude there is no long-run difference in the effectiveness of advertisement A or B on sales.
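To check the single-factor results yourself, the following sketch treats the three markets as three plain observations of each advertisement and ignores the market entirely. The variable names are assumptions for illustration.

```python
# Sales calls per market (Orlando, Norfolk, San Diego) for each advertisement.
ad_a = [656, 732, 328]
ad_b = [710, 812, 386]

all_obs = ad_a + ad_b
grand_mean = sum(all_obs) / len(all_obs)

sst = sum((x - grand_mean) ** 2 for x in all_obs)
sstr = sum(
    len(ad) * (sum(ad) / len(ad) - grand_mean) ** 2
    for ad in (ad_a, ad_b)
)
ssef = sst - sstr

df_tr = 2 - 1                # two advertisements
df_ef = len(all_obs) - 2     # six observations minus two groups

variance_tr = sstr / df_tr
variance_ef = ssef / df_ef
variance_ratio = variance_tr / variance_ef

print(sst, sstr, ssef, variance_ef, round(variance_ratio, 3))
```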

But looking at the data, it certainly seems that advertisement B is more effective than A, and in all three test markets. Why doesn't our ANOVA calculation bear this out? Because our experiment is poorly designed. One of the extraneous factors that could affect the results is the test market. Sales calls for both advertisements vary quite a bit between test markets. This variation within groups (the SSEF) overwhelms the variation between groups (the SSTR). This is easily seen with a profile chart:

This chart clearly shows that much of the variation is due to one extraneous factor: the test market. The lines are:

  • Widely spaced, indicating there is a substantial effect of location on sales
  • Parallel, indicating the factor (advertisement A or B) has a similar effect in all three test markets.

Given these profiles, we can correct for the variation due to location in this case by blocking.

An Intuitive Sum of Squares Decomposition for a Randomized Block Design. What we would like to do is to decompose SSEF into two parts: the SSEF due to blocking (the test market effect), and all other extraneous factors. We have already computed SST and SSTR in the ANOVA table above. We want to calculate SSBL (sum of squares due to the blocking factor):

SSBL = 2 [ (683 - 604)² + (772 - 604)² + (357 - 604)² ] = 190,948

Here is the resulting ANOVA table that summarizes this:

Two-way ANOVA Table
Advertisement Experiment

Sources of Variation    Sum of Squares    Degrees of Freedom    Variance    Variance Ratio
Block (SSBL)              190,948                  2               --
Treatment (SSTR)            6,144                  1              6,144         62.69
Extraneous (SSEF)             196                  2                 98
Total (SST)               197,288                  5

Now the variance ratio is 62.69, which greatly exceeds the critical Fisher value of 18.51 (95% confidence, 1 and 2 degrees of freedom), so we reject Ho and conclude there is a long-run difference in the effectiveness of advertisement A and B on sales. We have minimized the effect of an extraneous factor by blocking it, which let us compare the variation due to treatment with variation due to all other (non-blocked) extraneous factors.
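The blocked decomposition can be reproduced from the raw sales-call table. This is a minimal sketch for the case of one observation per market and advertisement; the data layout and variable names are assumptions for illustration.

```python
# Sales calls by market (the blocking factor) and advertisement (the treatment).
sales_calls = {
    "Orlando":   {"A": 656, "B": 710},
    "Norfolk":   {"A": 732, "B": 812},
    "San Diego": {"A": 328, "B": 386},
}
markets = list(sales_calls)
ads = ["A", "B"]

all_obs = [sales_calls[m][a] for m in markets for a in ads]
grand_mean = sum(all_obs) / len(all_obs)

def mean(values):
    return sum(values) / len(values)

sst = sum((x - grand_mean) ** 2 for x in all_obs)

# Treatment: each advertisement's average against the overall average.
sstr = len(markets) * sum(
    (mean([sales_calls[m][a] for m in markets]) - grand_mean) ** 2 for a in ads
)
# Block: each market's average against the overall average.
ssbl = len(ads) * sum(
    (mean([sales_calls[m][a] for a in ads]) - grand_mean) ** 2 for m in markets
)
# Whatever is left belongs to the non-blocked extraneous factors.
ssef = sst - sstr - ssbl

df_tr = len(ads) - 1                        # 1
df_bl = len(markets) - 1                    # 2
df_ef = len(all_obs) - 1 - df_tr - df_bl    # 2

variance_ratio = (sstr / df_tr) / (ssef / df_ef)
print(ssbl, sstr, ssef, df_ef, round(variance_ratio, 2))
```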

What Can Go Wrong When You Choose the Randomized Block Design. If you construct the profiles for each significant factor, you may find profiles that are not widely spaced and parallel. In this case, blocking may not reduce the sum of squares due to extraneous factors, but it certainly will decrease the degrees of freedom. In the advertisement experiment, for example, the degrees of freedom for non-blocked extraneous factors decreased from 4 to 2. Since this value is in the denominator, reducing it increases the variance, which is what we were trying to correct for in the first place!
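To see the cost in numbers, the sketch below reuses the advertisement sums of squares but imagines a purely hypothetical worst case in which blocking removes none of the variation (SSBL = 0). That scenario is an assumption made for illustration; it is not what happened in the actual experiment.

```python
sstr, df_tr = 6144, 1

# Unblocked analysis from the single-factor ANOVA table.
ssef_unblocked, df_unblocked = 191144, 4

# Hypothetical: the blocking factor explains nothing, so SSEF is unchanged,
# but the block still uses up 2 degrees of freedom.
ssbl_hypothetical = 0
ssef_blocked = ssef_unblocked - ssbl_hypothetical
df_blocked = df_unblocked - 2

ratio_unblocked = (sstr / df_tr) / (ssef_unblocked / df_unblocked)
ratio_blocked = (sstr / df_tr) / (ssef_blocked / df_blocked)

# Critical Fisher values at 95% confidence, from the small table:
critical_unblocked = 7.71    # 1 and 4 degrees of freedom
critical_blocked = 18.51     # 1 and 2 degrees of freedom

print(round(ratio_unblocked, 3), critical_unblocked)
print(round(ratio_blocked, 3), critical_blocked)
```

In this hypothetical case the variance ratio is cut in half while the critical value it must exceed more than doubles, so an ineffective block makes it harder, not easier, to detect a treatment effect.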


When Is the Multifactor Design Appropriate?

A multifactor design has at least two experimental factors and does not use blocking, relying instead on randomization to minimize the impact of extraneous factors. Formal definitions:

Experimental factors
The factors or treatments that we wish to test so that we can make timely decisions
Extraneous factors
All other factors that could affect the dependent variable, but which we choose to ignore.

 Use a multifactor design when:

  1. We are interested in obtaining information (and making decisions) about two or more experimental factors; or
  2. We believe there will be an interaction between all or some of the experimental factors (because their profiles are non-parallel).

Here is the example from our text (pg. 210): Determine whether the type of teaching approach or demonstrated math ability has an effect on final exam scores. The setup:

  • Experimental unit: 24 students, four teaching approaches
  • Factors: Teaching approach and SAT math score
  • Dependent variable: Final exam grade
  • Null hypothesis: There is no long-run difference in the math final exam scores of students due either to teaching approach or SAT math score
  • Alternative hypothesis: There is a difference

Here is the data (pg. 215):

Math Study

Factor B                   Factor A (teaching approach)
(SAT math)     Lecture    Discussion    Discovery    Classroom    Average
400-500        58, 62     60, 60        64, 66       59, 61       61.25
500-550        99, 99     89, 91        80, 80       75, 75       86
600-650        74, 76     80, 80        85, 85       94, 96       83.75
Average        78         76.67         76.67        76.67        77

Let us first take a look at the factor profiles:

The profiles are widely spaced, indicating there may be some effect of teaching approach on final exam grades. But the profiles are not parallel, suggesting that there may be an interaction between the two factors. (It looks like classroom may be the best teaching approach for the students with the highest math ability, but lecture would be best for students of average ability. For students of lowest ability, all teaching approaches seem to work equally well.)
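The points plotted in a profile chart are the cell averages, which can be computed from the data table. Here is a minimal sketch, with the dictionary layout and names chosen as assumptions for illustration.

```python
# Final exam scores by SAT math band (Factor B) and teaching approach (Factor A).
scores = {
    "400-500": {"Lecture": [58, 62], "Discussion": [60, 60],
                "Discovery": [64, 66], "Classroom": [59, 61]},
    "500-550": {"Lecture": [99, 99], "Discussion": [89, 91],
                "Discovery": [80, 80], "Classroom": [75, 75]},
    "600-650": {"Lecture": [74, 76], "Discussion": [80, 80],
                "Discovery": [85, 85], "Classroom": [94, 96]},
}

# Average of the two students in each cell: one profile point per cell.
cell_means = {
    band: {approach: sum(obs) / len(obs) for approach, obs in cells.items()}
    for band, cells in scores.items()
}

for band, means in cell_means.items():
    print(band, means)
```

The cell averages are 60, 60, 65, 60 for the lowest band, 99, 90, 80, 75 for the middle band, and 75, 80, 85, 95 for the highest band. The middle and highest bands run in opposite directions across the four approaches, which is what non-parallel profiles look like in numbers.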

Quantitatively, this is borne out by the ANOVA table for the results shown above:

ANOVA for the Math Study

Sources of Variation       Sum of Squares    Degrees of Freedom    Variance    Variance Ratio
Treatment A (approach)           7.96                 3                2.65          2.65
Treatment B (ability)        2,997                    2            1,498.5       1,498.5
Interaction                  1,157.04                 6              192.8         192.8
Extraneous                      12                   12                1
Total                        4,174                   23

What would we conclude based on these results?

  • First, there is no best teaching approach for all students. We do not reject Ho, because the variance ratio for this treatment is 2.65, which is less than the Fisher critical value for 3 and 12 degrees of freedom (3.49 at 95% confidence, 5.95 at 99% confidence).
  • Second, the study bears out that math ability (treatment B) accounts for about 72% of the variation in the final exam grade (2,997 of 4,174 units). But this is really not something we can control!
  • Third, there does seem to be a difference in result when the interaction between teaching approach and math ability is taken into account. The interaction between these two factors yields a variance ratio of 192.8, much exceeding the Fisher critical value of 4.82 (6 and 12 degrees of freedom at 99% confidence): The highest ability students will, in the long-run, learn the most with a classroom teaching approach, while students of average math ability will learn the most with a lecture teaching approach.
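These conclusions can be traced back to the ANOVA table with a short calculation. The sketch below works only from the sums of squares and degrees of freedom printed in the table (it does not recompute them from the raw scores); the dictionary layout and names are assumptions for illustration.

```python
# (sum of squares, degrees of freedom) as printed in the ANOVA table above.
sources = {
    "Treatment A (approach)": (7.96, 3),
    "Treatment B (ability)": (2997, 2),
    "Interaction": (1157.04, 6),
}
ss_extraneous, df_extraneous = 12, 12
ss_total = 4174

variance_extraneous = ss_extraneous / df_extraneous

variance_ratios = {}
shares_of_total = {}
for name, (sum_sq, df) in sources.items():
    # Variance ratio: this source's variance over the extraneous variance.
    variance_ratios[name] = (sum_sq / df) / variance_extraneous
    # Share of the total variation attributed to this source.
    shares_of_total[name] = sum_sq / ss_total

for name in sources:
    print(name, round(variance_ratios[name], 2), f"{shares_of_total[name]:.1%}")
```

Because the extraneous variance is exactly 1 here, each variance ratio equals the corresponding variance, which is why the last two columns of the table are identical.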

 

 Updated 04.22.2010