Lecture Notes

Business Statistics (QMB 2100)

Introduction and Descriptive Statistics

A Note to the Reader (Student)

The author (Harvey Brightman) tries hard to avoid formulas, relying on pictures and "plain English" to describe the basic concepts of probability and statistics.

Most textbooks use a more traditional approach, with mathematics, Greek symbols and proofs, which provide the foundation necessary for further study. Brightman's approach is to give you a good understanding of the most useful statistical methods, so you will be ready to use them in your everyday thinking and work. I've added a few formulas and symbols because they are often the most succinct way of expressing a concept or procedure, but I've kept these to a minimum. Just because the text is less mathematical than most, don't skim over the concepts – they are just as deep and useful as they would be if presented in more formal, mathematical detail.
 

Be an active learner

  • Actually write down the answers to the questions in the space provided!
  • Do all the exercises and check your work
  • Don't leave a section until you feel confident that you have met the objectives stated at the beginning of that section

The best advice (from Harvey Brightman):

If you don't let the math get in your way, you'll find that statistics is not sadistics. The key is to translate what you are learning into your own words. (p. 2)

Introduction

There are three areas of statistics:

  • Collecting data
  • Organizing and summarizing data
  • Analyzing data to make decisions

Collecting data means obtaining a representative sample of the population under study; much of statistics is about understanding how to do this in different situations:

  • To predict election results
    • Sampling procedures (Chapter 4)
  • To estimate a city's average income level
    • Inductive inference (Chapter 4)
  • To evaluate the effect of vitamin C on health
    • Designing formal experiments (Chapters 5 and 6)

Organizing and summarizing data means reducing the raw data collected into descriptive statistics, or developing charts or other pictures that help us understand the data:

  • A sample proportion is determined from a representative sample:
  • Average (mean) income and the sample standard deviation give us some insight into the typical income and the variability of income; we'll see how to calculate these shortly

  • A frequency table can help organize the data for further analysis; here is an example of the results of a study of the pulse rates (beats per minute) of 10 conditioned athletes:
Pulse Rates of Conditioned Athletes

Pulse Rate Classes    Frequency (# of athletes)
30.0 to 34.9          2
35.0 to 39.9          5
40.0 to 44.9          2
45.0 to 49.9          1
Total observations    10

Note: (1) All classes have the same width; (2) All observations fall into one and only one class; (3) Classes do not overlap

  • A histogram can provide visual clues about the distribution of the data; here is a histogram of the above frequency table
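As a rough illustration, here is a minimal Python sketch that builds the frequency table above from the raw pulse rates (listed in the Central Tendency section) and prints a simple text histogram. The function name, the constants, and the choice to treat each class as "lower limit included, upper limit excluded" are assumptions made for this example only.

```python
# Raw pulse rates (beats per minute) from the athlete study in these notes.
pulse_rates = [31, 33, 36, 36, 37, 38, 39, 41, 44, 47]

# Assumed class layout: equal widths of 5, lower limit inclusive and
# upper limit exclusive, so the classes cannot overlap.
LOWER_LIMITS = [30, 35, 40, 45]
CLASS_WIDTH = 5


def frequency_table(data, lower_limits, width):
    """Return a (lower, upper, count) tuple for each class."""
    rows = []
    for lower in lower_limits:
        upper = lower + width
        count = sum(1 for x in data if lower <= x < upper)
        rows.append((lower, upper, count))
    return rows


table = frequency_table(pulse_rates, LOWER_LIMITS, CLASS_WIDTH)

# Every observation must fall into one and only one class.
assert sum(count for _, _, count in table) == len(pulse_rates)

# Text histogram: one asterisk per athlete in the class.
for lower, upper, count in table:
    label = f"{lower:.1f} to {upper - 0.1:.1f}"
    print(f"{label:<14}{'*' * count}")
```

The longest row of asterisks marks the class holding the most athletes, which is the same visual clue a drawn histogram gives.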

Analyzing data means making educated guesses based on the data; this is officially called inferential statistics. "Educated" means applying what we will learn of probability and statistics theory; "guess" means that no matter how educated our estimates are, there is still a chance we will be wrong. Here are two techniques we will use to make educated guesses:

  • Confidence intervals. Despite our best efforts at obtaining a representative sample, it is not very likely that exactly 16.7% of all cereal eaters prefer Cheerios®; but we can use an inference procedure to construct a confidence interval – a range in which the true proportion probably lies. For example, applying these procedures (which we will cover in Chapters 4 and 5), we might determine that we are 95% confident the true proportion of cereal eaters who prefer Cheerios is between 15.0% and 18.4%.

  • Correlation and regression analysis. Determining the relationship between two variables requires inference procedures called correlation and regression analysis. For example, based on these data, would you conclude that taking Vitamin C reduced the number of colds?
Summary of Vitamin C Sample Data

Amt of Vitamin C per Day (mg)    Avg Number of Colds
0                                4.5
1,000                            3.0
2,000                            2.5
3,000                            2.0
4,000                            0.5

The procedures to determine correlation between two variables are covered in Chapter 6.


Central Tendency

  • Calculating the mean from raw data. One of the most common and useful statistics we compute is the average (mean) of a set of data; it tells us the value around which the data are centered.

For example, here are the raw data for the recorded pulse rates of a sample of conditioned athletes:

Pulse Rates of Conditioned Athletes
(beats per minute)
31
33
36
36
37
38
39
41
44
47

To compute the mean, add up all the observations and divide by the number of observations; the mean is called x-bar (usually written x̄); in this example:

   x̄ = (31 + 33 + 36 + 36 + 37 + 38 + 39 + 41 + 44 + 47) / 10
     = 382 / 10
     = 38.2 beats per minute

Round x̄ to one decimal place more than the data used to calculate it.

The calculation of the mean is usually given as a formula, where n is the number of observations, each individual observation is called x, and the Greek symbol Σ means summation (to add up):

   x̄ = Σx / n
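For readers who like to check the arithmetic by computer, the mean calculation can be sketched in a few lines of Python. The function name `mean` and the variable names are choices made for this example, not part of the text.

```python
# Pulse rates (beats per minute) of the 10 conditioned athletes.
pulse_rates = [31, 33, 36, 36, 37, 38, 39, 41, 44, 47]


def mean(data):
    """Add up the observations and divide by the number of observations."""
    return sum(data) / len(data)


x_bar = mean(pulse_rates)

# Round to one decimal place more than the (whole-number) data.
print(round(x_bar, 1))
```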

  • Computing the mean from tabled data. If you don't have the raw data available, you can still come up with a good approximation of the mean by using a histogram or frequency table.
  • If you have a histogram, try to visualize the point where the histogram would "balance" (see the histogram above; high 30's?)
  • If you have a frequency table, you can approximate the mean by multiplying the midpoint of each class by the number of values in that class, adding them up, and dividing by the total number of observations:
Pulse Rates of Conditioned Athletes
(estimating the mean)

Classes         Frequency    Class Midpoint    Frequency x Midpoint
30.0 to 34.9    2            32.5              65.0
35.0 to 39.9    5            37.5              187.5
40.0 to 44.9    2            42.5              85.0
45.0 to 49.9    1            47.5              47.5
Totals          10                             385.0

[Photo: Miguel Indurain (1964- )]

Using this technique, we estimate the mean to be 385.0 / 10 = 38.5 bpm, which is slightly higher than the actual mean (38.2 bpm) and probably pretty close to what you estimated from visually balancing the histogram.
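The midpoint technique can be sketched in Python as follows. The tuple layout (lower limit, upper limit, frequency) and the function name `grouped_mean` are assumptions for this example; each class is treated as running from its lower limit up to the next class's lower limit, which gives the midpoints 32.5, 37.5, 42.5 and 47.5 used in the table.

```python
# (lower class limit, upper class limit, frequency) for each class.
classes = [
    (30.0, 35.0, 2),
    (35.0, 40.0, 5),
    (40.0, 45.0, 2),
    (45.0, 50.0, 1),
]


def grouped_mean(classes):
    """Approximate the mean from a frequency table using class midpoints."""
    n = sum(freq for _, _, freq in classes)
    weighted_total = sum(
        ((lower + upper) / 2) * freq for lower, upper, freq in classes
    )
    return weighted_total / n


estimated_mean = grouped_mean(classes)
print(round(estimated_mean, 1))
```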

  • Calculating the median from raw data. Another measure of central tendency is the median. The median is the "middle" value of the data set, the value that divides the data into two equal parts if the observations were listed in order.

There is no formula for computing the median, but there is a procedure:

  1. Sort the data into ascending order; for the 10 pulse rate observations:

31, 33, 36, 36, 37, 38, 39, 41, 44, 47

  2. If there is an odd number of observations (not the case here), the median is the middle value, the one that divides the data set into two equal parts.
  3. If there is an even number of observations (the case above), the median is the average of the two middle values, the pair that divides the data set into two equal parts. Here that pair is 37, 38, so the median is the average of these two values: (37 + 38) / 2 or 37.5
  4. As with the mean, round the median to one decimal place more than the data from which it was calculated

The median is often a better indicator of the "middle" than the mean when the data set is skewed, that is, when it has a few values that are very much higher or lower than the rest. These values are outliers. For example, if the highest observation in the pulse rate study was 72 (rather than 47), the mean would increase to 40.7 (from 38.2), but the median would be unchanged at 37.5.
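Here is a small Python sketch of the median procedure, followed by the outlier comparison described above (replacing 47 with 72). The function and variable names are illustrative only.

```python
def median(data):
    """Middle value of the sorted data; average of the two middle
    values when the number of observations is even."""
    ordered = sorted(data)
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:
        return ordered[middle]
    return (ordered[middle - 1] + ordered[middle]) / 2


pulse_rates = [31, 33, 36, 36, 37, 38, 39, 41, 44, 47]

# Same data, but with the highest observation replaced by an outlier.
with_outlier = pulse_rates[:-1] + [72]

for label, data in [("original", pulse_rates), ("with outlier", with_outlier)]:
    data_mean = sum(data) / len(data)
    print(label, "mean:", round(data_mean, 1), "median:", median(data))
```

The mean moves when the outlier is introduced, while the median stays where it was.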

The median is often used for demographic data (household income, for example), because a few outliers (like Bill Gates) can greatly affect the mean.

If there are no outliers, the mean and median will be very close; but be wary of presentations that use one or the other of these measures of central tendency to deliberately distort your perception of "middle".

If the data is skewed, its histogram will not be very symmetric; here are two possibilities:

 

Data sets that are skewed left are said to have a long left-hand tail; those skewed right have a long right-hand tail.

  • Calculating the median from tabled data. If you only have tabled (not raw) data, you can still approximate the median; here is the procedure:
  1. Determine the total number of observations (10 for the pulse rate study)
  2. Determine which observation number(s) would represent the median (numbers 5 and 6 in our case)
  3. Determine in which class these observations would fall (for us, there are 2 observations in the first class – 1 and 2 – and 5 in the second – 3, 4, 5, 6 and 7 – so observations 5 and 6 fall into the class 35.0 to 39.9)
  4. Divide the class into intervals of equal length, so that there are as many intervals as observations in that class (5 for our class)
  5. Assume each observation is exactly at the midpoint of its interval (see Figure 1-5, p.13)
  6. Calculate the median using the appropriate observation number(s) and its assumed value (for us, the average of observation 5 and 6 is 38.0)
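The steps above can be sketched in Python. This is one possible reading of the procedure, not the textbook's own code: every class is split into as many equal intervals as it has observations, each observation is assumed to sit at the midpoint of its interval, and the ordinary median is taken of those assumed values. The names `classes` and `assumed_values` are hypothetical.

```python
import statistics

# (lower class limit, upper class limit, frequency) for each class.
classes = [
    (30.0, 35.0, 2),
    (35.0, 40.0, 5),
    (40.0, 45.0, 2),
    (45.0, 50.0, 1),
]


def assumed_values(classes):
    """Place each observation at the midpoint of its own equal-length
    interval within its class."""
    values = []
    for lower, upper, freq in classes:
        interval = (upper - lower) / freq
        for i in range(freq):
            values.append(lower + (i + 0.5) * interval)
    return values


values = assumed_values(classes)
grouped_median = statistics.median(values)
print(grouped_median)
```

For the class 35.0 to 39.9, the five assumed values are 35.5, 36.5, 37.5, 38.5 and 39.5, so observations 5 and 6 are taken to be 37.5 and 38.5.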

Measures of Spread, Dispersion, or Variability

Measures of central tendency (mean and median) are important for describing a set of data, but measures of dispersion are just as important. They illustrate another dimension: how varied the data are within the data set. The measures of dispersion described below are range, variance, and, the most useful, standard deviation.

  • The range. The range of a data set is the difference between its minimum and maximum observation.

Assume you have been asked to review different departments' behavior with regard to hours of sick leave taken. Consider these data:

Annual Hours of Sick Leave Taken
(by employee)

Department A      Department B
25                74
0                 68
57                86
102               52
89                78
72                72
145               60

x̄ = 70.0          x̄ = 70.0
median = 72.0     median = 72.0

Although the measures of central tendency are the same for both departments, it appears that the employees in Department A are much less consistent in the hours of sick leave they take annually than those in Department B.

Calculating the range for each department bears this out:

Department A range = 145 - 0 = 145
Department B range = 86 - 52 = 34

Range is useful because it can be quickly calculated, but it ignores most of the data – it only takes into account two observations in a data set.
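A short Python sketch of the range calculation for the two departments follows; the function name `data_range` is just a label chosen for this example.

```python
# Annual hours of sick leave taken, by employee.
dept_a = [25, 0, 57, 102, 89, 72, 145]
dept_b = [74, 68, 86, 52, 78, 72, 60]


def data_range(data):
    """Difference between the maximum and minimum observations."""
    return max(data) - min(data)


print("Department A range:", data_range(dept_a))
print("Department B range:", data_range(dept_b))
```

Note that only two observations from each list (the largest and the smallest) affect the result.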

  • The variance and the standard deviation for raw data. The variance is the mean of the squared deviations of the observations from the mean. For the above data, here are the calculations for variance:

Calculating Variance in Annual Hours of Sick Leave Taken by Department

Department A (x̄ = 70.0)               Department B (x̄ = 70.0)
x      x - x̄    (x - x̄)²              x      x - x̄    (x - x̄)²
25     -45      2,025                 74     4        16
0      -70      4,900                 68     -2       4
57     -13      169                   86     16       256
102    32       1,024                 52     -18      324
89     19       361                   78     8        64
72     2        4                     72     2        4
145    75       5,625                 60     -10      100

Σ(x - x̄)² = 14,108                    Σ(x - x̄)² = 768
Σ(x - x̄)² / n = 2,015.4               Σ(x - x̄)² / n = 109.7

Round the variance to one more decimal place than the data used to calculate it.

The calculations of variance bear out our observations with the range of each data set: the observations in Department A are much more dispersed than those in Department B (Why might this be?).
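The variance calculation for the two departments can be sketched in Python. This version divides by n, treating each department as a complete population; the function name is a choice made for this example.

```python
# Annual hours of sick leave taken, by employee.
dept_a = [25, 0, 57, 102, 89, 72, 145]
dept_b = [74, 68, 86, 52, 78, 72, 60]


def population_variance(data):
    """Mean of the squared deviations from the mean (divides by n)."""
    n = len(data)
    mean = sum(data) / n
    squared_deviations = [(x - mean) ** 2 for x in data]
    return sum(squared_deviations) / n


print("Department A variance:", round(population_variance(dept_a), 1))
print("Department B variance:", round(population_variance(dept_b), 1))
```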

Variance is a better measure of dispersion than range, because:

  • all of the data are used in its calculation
  • it measures dispersion of the data about the mean, not just the spread of the data

At this point we need to distinguish between a population and a sample:

Population
The complete collection of all objects in some set under study (all U.S. citizens, all FSCJ students, all GE light bulbs). If the objects are people, they are usually called subjects.
Sample
A selected subset of a population (1,000 U.S. citizens selected at random, FSCJ students whose SSNs end in 2, every 100th light bulb off the GE production line).

The calculations above are for populations: we had the data from every object (subject) in the collection (every employee in Department A and every employee in Department B). When that is the case, the procedure shown is appropriate. But if we are calculating the variance of a sample (a representative selection of employees from each department), called s², we divide by n - 1 rather than n:

   s² = Σ(x - x̄)² / (n - 1)

We will use this formula for the calculation of the variance, since we will almost always be working with a sample rather than an entire population.

The standard deviation is the positive square root of the variance. So, the standard deviation for Departments A and B would be 44.9 and 10.5, respectively. Round standard deviation to one more decimal place than the data used to calculate it.
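Python's standard `statistics` module offers both versions of these calculations: `pvariance` and `pstdev` divide by n (population), while `variance` and `stdev` divide by n - 1 (sample). The sketch below applies both to the sick leave data; treating the same seven employees first as a population and then as a sample is done here only to show the difference between the two divisors.

```python
import statistics

# Annual hours of sick leave taken, by employee.
dept_a = [25, 0, 57, 102, 89, 72, 145]
dept_b = [74, 68, 86, 52, 78, 72, 60]

for name, data in [("A", dept_a), ("B", dept_b)]:
    # Population versions divide by n.
    pop_sd = statistics.pstdev(data)
    # Sample versions divide by n - 1, so they come out a little larger.
    sample_sd = statistics.stdev(data)
    print(
        f"Department {name}: population sd = {pop_sd:.1f}, "
        f"sample sd = {sample_sd:.1f}"
    )
```

The population figures match the 44.9 and 10.5 quoted above; the sample figures are slightly larger because the same sum of squared deviations is divided by a smaller number.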

The sample standard deviation is called s; its formula is:

   s = √[ Σ(x - x̄)² / (n - 1) ]

The standard deviation plays a critical role in inferential statistics: generally, observations that are more than two standard deviations from the mean are considered unusual.

Here is a table describing a "rule of thumb" (also called a heuristic) we can use for data sets that are bell-shaped and symmetric (those that approximate a normal distribution, which we will discuss in much greater detail later):

Percent of Data within n Standard Deviations of the Mean
(for distributions approximately normal)

n Percent
1.0 about 68 %
1.5 about 87 %
2.0 about 95 %
3.0 about 99.7 %

From this table, you can see that an "unusual" value – one that is more than two standard deviations from the mean – occurs less than 5% of the time. This is the standard statistical sense of the word "unusual."
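For the curious, the percentages in the rule-of-thumb table can be reproduced for an exactly normal distribution. The sketch below uses the error function `math.erf` from Python's standard library; the relationship "percent within n standard deviations = 100 · erf(n / √2)" is a standard property of the normal distribution and is not derived in these notes, so treat this as a preview rather than something you need to know yet.

```python
import math


def percent_within(n_sd):
    """Percent of a normal distribution lying within n_sd standard
    deviations of its mean."""
    return 100 * math.erf(n_sd / math.sqrt(2))


for n_sd in [1.0, 1.5, 2.0, 3.0]:
    print(f"{n_sd:.1f}  about {percent_within(n_sd):.1f} %")
```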

Here is how to set up the calculation for the standard deviation of the pulse rate data (the mean was calculated above as 38.2):

Pulse Rates of Conditioned Athletes
(calculating the standard deviation)

x      x - x̄                (x - x̄)²
31     31 - 38.2 = -7.2     51.84
33     -5.2                 27.04
36     -2.2                 4.84
36     -2.2                 4.84
37     -1.2                 1.44
38     -0.2                 0.04
39     0.8                  0.64
41     2.8                  7.84
44     5.8                  33.64
47     8.8                  77.44

Σ(x - x̄)² = 209.6

s² = Σ(x - x̄)² / (n - 1) = 23.3

s = 4.8

Are any of the observations unusual? They would be if they fell outside the range [x̄ - 2s, x̄ + 2s] = [38.2 - 2(4.8), 38.2 + 2(4.8)] = [28.6, 47.8].
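The whole calculation, including the check for unusual observations, can be sketched in Python as follows. Variable names are illustrative. Because the sketch uses the unrounded value of s, its interval endpoints differ slightly in the second decimal place from the hand calculation above, which used s rounded to 4.8.

```python
import math

# Pulse rates (beats per minute) of the 10 conditioned athletes.
pulse_rates = [31, 33, 36, 36, 37, 38, 39, 41, 44, 47]

n = len(pulse_rates)
x_bar = sum(pulse_rates) / n

# Sample variance: sum of squared deviations divided by n - 1.
sum_sq_dev = sum((x - x_bar) ** 2 for x in pulse_rates)
sample_variance = sum_sq_dev / (n - 1)
s = math.sqrt(sample_variance)

# "Unusual" means more than two standard deviations from the mean.
low, high = x_bar - 2 * s, x_bar + 2 * s
unusual = [x for x in pulse_rates if x < low or x > high]

print("s^2 =", round(sample_variance, 1), " s =", round(s, 1))
print("unusual observations:", unusual)
```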

Another way to obtain a quick estimate of the standard deviation is to divide the range by 4:

   s ≈ range / 4 = (maximum - minimum) / 4

For the above data, this yields an estimate of (47 - 31) / 4 = 4.0; quick, but it noticeably underestimates the actual value of 4.8.
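A brief Python sketch comparing the quick range-based estimate with the sample standard deviation computed from all the data (the function name `quick_sd_estimate` is hypothetical):

```python
import statistics

pulse_rates = [31, 33, 36, 36, 37, 38, 39, 41, 44, 47]


def quick_sd_estimate(data):
    """Rough estimate of the standard deviation: range divided by 4."""
    return (max(data) - min(data)) / 4


quick = quick_sd_estimate(pulse_rates)
actual = statistics.stdev(pulse_rates)  # sample version, divides by n - 1
print("quick estimate:", quick, " actual:", round(actual, 1))
```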

  • Variance and standard deviation for tabled data. If you have only tabled, rather than raw, data you can still calculate the standard deviation by following these steps:

  1. Determine the mean for tabled data as given above

  2. Use the class midpoint to represent all the observations in that class; for each class, subtract the mean from the midpoint, square the result, and multiply by the number of observations in that class

  3. Add up all the weighted squared deviations calculated in step 2

  4. Divide by n - 1 to get an estimate of the variance

  5. Take the square root of the variance to estimate the standard deviation

Here is how to calculate standard deviation for the pulse rate study, using the frequency table (x̄ has already been estimated as 38.5):

Pulse Rates of Conditioned Athletes
(estimating the standard deviation)

Classes         Freq    Class       Dev =                  Dev²    Dev² x Freq
                        Midpoint    Midpoint - x̄
30.0 to 34.9    2       32.5        32.5 - 38.5 = -6.0     36.0    72.0
35.0 to 39.9    5       37.5        -1.0                   1.0     5.0
40.0 to 44.9    2       42.5        4.0                    16.0    32.0
45.0 to 49.9    1       47.5        9.0                    81.0    81.0
Totals          10                                                 190.0

Estimated variance = 190.0 / (10 - 1) = 21.1

Estimated standard deviation = √21.1 = 4.6

The standard deviation calculated by this method is close to the actual calculation of the standard deviation using the raw data (4.6 vs. 4.8).
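The tabled-data procedure can be sketched in Python. As before, the (lower limit, upper limit, frequency) layout and the function name are assumptions made for this example, and each class is taken to run up to the next class's lower limit so that the midpoints match the table.

```python
import math

# (lower class limit, upper class limit, frequency) for each class.
classes = [
    (30.0, 35.0, 2),
    (35.0, 40.0, 5),
    (40.0, 45.0, 2),
    (45.0, 50.0, 1),
]


def grouped_variance(classes):
    """Estimate the sample variance from a frequency table."""
    n = sum(freq for _, _, freq in classes)
    midpoints = [((lower + upper) / 2, freq) for lower, upper, freq in classes]
    # Step 1: estimated mean from the class midpoints.
    mean = sum(mid * freq for mid, freq in midpoints) / n
    # Steps 2 and 3: squared deviation of each midpoint, weighted by frequency.
    weighted_sq_dev = sum(freq * (mid - mean) ** 2 for mid, freq in midpoints)
    # Step 4: divide by n - 1.
    return weighted_sq_dev / (n - 1)


estimated_variance = grouped_variance(classes)
# Step 5: the square root of the variance.
estimated_sd = math.sqrt(estimated_variance)
print(round(estimated_variance, 1), round(estimated_sd, 1))
```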


Tchebycheff's Theorem: A Few Numbers in Place of Many

The above discussion of dispersion makes the assumption that we are looking at data from a population whose histogram would be approximately bell-shaped and symmetrical. While this is true of many populations, the Russian statistician Tchebycheff (1821-1894) proved a theorem that is true for any population, no matter how the data is actually distributed. His theorem is:

At least 100 - 100 / h² percent of the observations must lie within h standard deviations of the mean.

This yields these results for some common values of h:

Percent of Data within h Standard Deviations of the Mean
(Tchebycheff's Theorem)

h at least
1.0 0.0 %
1.5 55.5 %
2.0 75.0 %
3.0 88.9 %
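The theorem is easy to evaluate for any h; here is a minimal Python sketch (the function name is hypothetical). The printed values are rounded to one decimal place, so the h = 1.5 entry may differ in its last digit from the table above.

```python
def tchebycheff_bound(h):
    """Minimum percent of observations within h standard deviations
    of the mean, for any distribution: 100 - 100 / h^2."""
    return 100 - 100 / h**2


for h in [1.0, 1.5, 2.0, 3.0]:
    print(f"{h:.1f}  at least {tchebycheff_bound(h):.1f} %")
```

Notice that the bound says nothing useful at h = 1.0 and only becomes informative as h grows.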

Here are the ages of the ten Oscar winners for Best Actress at the time they won their awards (1998-2007):

34, 26, 25, 33, 35, 35, 28, 30, 29, 61

Using the methods above, you can compute x̄ = 33.6 and s = 10.3. Here are the Tchebycheff estimates of the dispersion of the data set, compared to the "rule of thumb" estimates and the actual percentages:

Percent of Actresses' Ages within h Standard Deviations of the Mean

h      x̄ ± h · s       Tchebycheff    "Rule-of-Thumb"    Actual
                       Estimate       Estimate           Percentage
1.0    [23.3, 43.9]    ≥ 0.0 %        ~ 68 %             90 %
1.5    [18.2, 49.1]    ≥ 55.5 %       ~ 87 %             90 %
2.0    [13.0, 54.2]    ≥ 75.0 %       ~ 95 %             90 %
3.0    [2.7, 64.5]     ≥ 88.9 %       ~ 99.7 %           100 %
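The "Actual Percentage" column can be checked with a short Python sketch like the one below. The helper name `percent_within` is hypothetical, and the sketch uses the unrounded mean and sample standard deviation, so its interval endpoints may differ in the last digit from the hand-rounded ones in the table.

```python
import statistics

# Ages of the Best Actress winners, 1998-2007.
ages = [34, 26, 25, 33, 35, 35, 28, 30, 29, 61]

x_bar = statistics.mean(ages)
s = statistics.stdev(ages)  # sample standard deviation (divides by n - 1)


def percent_within(data, center, spread, h):
    """Percent of observations within h * spread of center."""
    low, high = center - h * spread, center + h * spread
    inside = sum(1 for x in data if low <= x <= high)
    return 100 * inside / len(data)


actual = [percent_within(ages, x_bar, s, h) for h in [1.0, 1.5, 2.0, 3.0]]
print("mean =", round(x_bar, 1), " s =", round(s, 1))
print("actual percentages:", actual)
```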

Are any of the ages in the data set unusual? In what way?


Summary: Terminology of some descriptive statistics

Here are the common symbols for population parameters and sample statistics that we will be working with in this course.

                      Population             Sample
                      Parameter              Statistic
Size                  N                      n
Mean                  μ (mu)                 x̄ (x-bar)
Variance              σ² (sigma-squared)     s² (s-squared)
Standard Deviation    σ (sigma)              s (s)
 Updated 10.28.2009