Lecture Notes
Business Statistics (QMB 2100)
Introduction and Descriptive Statistics
A Note to the Reader (Student)
The author (Harvey Brightman)
tries hard to avoid formulas, relying on pictures and
"plain English" to describe the basic concepts of
probability and statistics.
Most textbooks use a more traditional approach, with
mathematics, Greek symbols and proofs, which provide the foundation
necessary for further study. Brightman's approach is to provide you a good understanding of
the most useful statistical methods, so you will be ready to use
them in your everyday thinking and work. I've added a few formulas
and symbols because they are often the most succinct
way of expressing a concept or procedure, but I've
kept these to a minimum. Just because the text is less
mathematical than most, don't
skim over the concepts – they are just as deep and useful as they
would be if presented in more formal, mathematical detail.
Be an active learner –
- Actually write down the answers to the questions in the
space provided!
- Do all the exercises and
check your work
- Don't leave a section until you feel confident that you have
met the objectives stated at the beginning of that section
The best advice (from Harvey Brightman):
| If you don't let the math get in your way, you'll find
that statistics is not sadistics. The key is to translate what
you are learning into your own words. (p. 2) |
Introduction
There are three areas of statistics:
- Collecting data
- Organizing and summarizing data
- Analyzing data to make decisions
Collecting data means obtaining a representative
sample of the population under study; much of statistics is about
understanding how to do this in different situations:
- To predict election results: sampling procedures (Chapter 4)
- To estimate a city's average income level: inductive inference (Chapter 4)
- To evaluate the effect of vitamin C on health: designing formal experiments (Chapters 5 and 6)
Organizing and summarizing data means reducing the raw
data collected to descriptive statistics, or developing charts
or other pictures that help us understand the data
- A sample proportion is determined from a
representative sample:

| Pulse Rates of Conditioned Athletes |
| Pulse Rate Classes | Frequency (# of athletes) |
| 30.0 to 34.9 | 2 |
| 35.0 to 39.9 | 5 |
| 40.0 to 44.9 | 2 |
| 45.0 to 49.9 | 1 |
| Total observations | 10 |
Note: (1) All classes have the same width; (2) every
observation falls into one and only one class; (3) classes do not overlap.
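As a rough sketch of how such a frequency table can be built, the Python below counts observations into equal-width classes. It uses the ten raw pulse rates listed in the Central Tendency section of these notes; the function name `frequency_table` and its parameters are my own illustrative choices, not something from the text.

```python
# Ten pulse rates (bpm) from the raw-data table in the Central Tendency section.
pulse_rates = [31, 33, 36, 36, 37, 38, 39, 41, 44, 47]


def frequency_table(data, start, width, n_classes):
    """Count how many observations fall into each equal-width class.

    Class k runs from start + k * width up to (but not including) the
    lower limit of the next class, so the classes cannot overlap.
    """
    counts = [0] * n_classes
    for x in data:
        k = int((x - start) // width)
        if not 0 <= k < n_classes:
            raise ValueError(f"{x} falls outside every class")
        counts[k] += 1
    return counts


counts = frequency_table(pulse_rates, start=30.0, width=5.0, n_classes=4)
for k, count in enumerate(counts):
    lower = 30.0 + 5.0 * k
    print(f"{lower:.1f} to {lower + 4.9:.1f}: {count}")
print("Total observations:", sum(counts))
```

Because every observation is assigned to exactly one class index, the three conditions in the note above hold by construction.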
- A histogram can provide visual clues about the
distribution of the data; here is a histogram of the above
frequency table

Analyzing data means making educated guesses
based on the data; this is officially called inferential statistics. "Educated" means applying what we will learn
of probability and statistics theory; "guess" means that no matter how
educated our estimates are, there is still a chance we will be
wrong. Here are two techniques we will use to make educated guesses:
- Confidence intervals. Despite our best efforts at obtaining a representative
sample, it is not very likely that exactly 16.7% of all
cereal eaters prefer Cheerios®; but
we can use an inference procedure to construct a confidence interval
– a range in which the true proportion probably
lies. For example, applying these procedures (which
we will cover in chapters 4 and 5), we might
determine that we are 95% confident the true
proportion of cereal-eaters who prefer Cheerios is
between 15.0% and 18.4%.
- Correlation and regression analysis. Determining the relationship between two variables
requires inference procedures called correlation and
regression analysis. For example, based on this data, would
you conclude that taking Vitamin C reduced the number of colds?
| Summary of Vitamin C Sample Data |
| Amt of Vitamin C per Day (mg) | Avg Number of Colds |
| 0 | 4.5 |
| 1,000 | 3.0 |
| 2,000 | 2.5 |
| 3,000 | 2.0 |
| 4,000 | 0.5 |
The procedures to determine the correlation between two variables are
covered in Chapter 6.
Central Tendency
- Calculating the mean from raw data. One of the most
common and useful statistics we compute is the average (mean)
of a set of data; this gives us an idea around what value the data are centered.
For example, here are the raw data for the
recorded pulse rates
of a sample of conditioned athletes:
Pulse Rates of
Conditioned Athletes (beats per minute) |
31 33 36 36 37 |
38 39 41 44 47 |
To compute the mean, add up the observations and divide
by the number of observations; the mean is called x-bar
(usually written x̄); in this example:
x̄ = (31 + 33 + 36 + 36 + 37 + 38 + 39 + 41 + 44 + 47) / 10 = 382 / 10 = 38.2 beats per minute
Round x̄ to one decimal place more than the data used to
calculate it.
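For readers who like to check arithmetic by machine, here is a minimal Python sketch of the same calculation. The variable names are my own; nothing here comes from the textbook.

```python
# Raw pulse rates (beats per minute) from the table above.
pulse_rates = [31, 33, 36, 36, 37, 38, 39, 41, 44, 47]

n = len(pulse_rates)      # number of observations
total = sum(pulse_rates)  # add up every observation
mean = total / n          # divide by the number of observations

# The data are whole numbers, so report one decimal place.
print(f"sum = {total}, n = {n}, mean = {mean:.1f} bpm")
```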
The calculation of the mean is usually given as a
formula, in which n means the number of observations, each
individual observation is called x, and
the Greek symbol Σ means
summation (to add up):
x̄ = Σx / n
- Computing the
mean from tabled data. If you don't have the raw data
available, you can still come up with a good approximation of
the mean by using a histogram or frequency table.
- If you
have a histogram, try to visualize the point where the histogram would
"balance" (see the histogram above; high 30's?)
- If you have a
frequency table, you can approximate the mean by multiplying
the midpoint of each class by the number of values in that
class, adding them up, and dividing by the total number of observations:
| Pulse Rates of Conditioned Athletes (estimating the mean) |
| Classes | Frequency | Class Midpoint | Frequency x Midpoint |
| 30.0 to 34.9 | 2 | 32.5 | 65.0 |
| 35.0 to 39.9 | 5 | 37.5 | 187.5 |
| 40.0 to 44.9 | 2 | 42.5 | 85.0 |
| 45.0 to 49.9 | 1 | 47.5 | 47.5 |
| Totals | 10 | | 385.0 |
[Photo: Miguel Indurain (1964- )]
Using this technique, we estimate the mean to be 385.0 /
10 = 38.5 bpm. This is slightly higher than the actual mean (38.2
bpm), and is probably pretty close to what you estimated from
visually balancing the histogram.
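The frequency-table estimate can be sketched in a few lines of Python. I am assuming each class is described by its lower limit and that the midpoint is the lower limit plus half the class width (which gives the 32.5, 37.5, ... used in the table); the names are illustrative only.

```python
# (lower class limit, frequency) for each class in the frequency table.
classes = [(30.0, 2), (35.0, 5), (40.0, 2), (45.0, 1)]
width = 5.0  # every class has the same width

n = sum(freq for _, freq in classes)

# Multiply each class midpoint by the number of values in that class,
# then add the products up.
weighted_total = sum((lower + width / 2) * freq for lower, freq in classes)

estimated_mean = weighted_total / n
print(f"{weighted_total} / {n} = {estimated_mean:.1f} bpm")
```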
- Calculating the median from raw data. Another measure of central
tendency is the median. The median is the "middle" value
of the data set, the value that divides the data into two equal
parts if the observations were listed in order.
There is no formula for computing the median, but there
is a procedure:
- Sort the data into ascending order; for the 10 pulse
rate observations:
31, 33, 36, 36, 37, 38, 39, 41, 44, 47
- If there are an odd number of observations (not the
case here), the median is the middle value, the one that
divides the data set into two equal parts.
- If there are an even number of observations (the
case above), the median is the average of the two middle
values, the pair that divides the data set into two
equal parts. Here that pair is 37, 38, so the median is
the average of these two values: (37 + 38) / 2 or 37.5
- As with the mean, round the median to one decimal
place more than the data from which it was calculated
The median is often a better indicator of the "middle"
than the mean when the data set is skewed, having
a few values that are very much higher than or lower
than the mean. These values are outliers. For
example, if the highest observation in the pulse rate study
was 72 (rather than 47), the mean would increase to 40.7
(from 38.2), but the median would be unchanged at 37.5.
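The procedure for the median, and the outlier example just described, can be sketched in Python as follows. The function `median` below is written out by hand to mirror the steps in these notes; the variable names are my own.

```python
def median(data):
    """Middle value of the sorted data (average of the middle pair if n is even)."""
    ordered = sorted(data)  # step 1: ascending order
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:          # odd n: the single middle value
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2  # even n: average the middle pair


pulse_rates = [31, 33, 36, 36, 37, 38, 39, 41, 44, 47]
with_outlier = pulse_rates[:-1] + [72]  # replace the highest value (47) with 72

for label, data in (("original", pulse_rates), ("with outlier", with_outlier)):
    mean = sum(data) / len(data)
    print(f"{label}: mean = {mean:.1f}, median = {median(data):.1f}")
```

The mean moves when one value changes, while the median only depends on which values sit in the middle of the ordered list.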
 |
The median is often used for
demographic data
(household income, for example), because a few outliers
(like Bill Gates) can greatly affect the mean. If there are no outliers, the mean and median will be
very close; but be wary of presentations that use one or the
other of these measures of central tendency to deliberately distort your perception of "middle".
|
If the data is skewed, its histogram will not be very
symmetric; here are two possibilities:

Data sets that are skewed left are said to have a
long left-hand tail; those skewed right have
a long right-hand tail.
- Calculating the median from tabled data.
If you only
have tabled (not raw) data, you can still approximate the median;
here is the procedure:
- Determine the total number of observations (10 for
the pulse rate study)
- Determine which observation number(s) would
represent the median (numbers 5 and 6 in our case)
- Determine in which class these observations would
fall (for us, there are 2 observations in the first
class – 1 and 2 – and 5 in the second – 3, 4, 5, 6 and 7
– so observations 5 and 6 fall into the class 35.0 to 39.9)
- Divide the class into intervals of equal length, so
that there are as many intervals as observations in that
class (5 for our class)
- Assume each observation is exactly at the midpoint
of its interval (see Figure 1-5, p.13)
- Calculate the median using the appropriate
observation number(s) and its assumed value (for us, the
average of observation 5 and 6 is 38.0)
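Here is one way the tabled-data procedure might be sketched in Python. For simplicity this sketch applies the "equal intervals, observation at the midpoint" assumption to every class rather than only to the class containing the median; that does not change the answer. The function name `grouped_median` and the (lower limit, frequency) layout are my own assumptions.

```python
def grouped_median(classes, width):
    """Approximate the median from (lower_limit, frequency) pairs in ascending order."""
    assumed_values = []
    for lower, freq in classes:
        if freq == 0:
            continue
        step = width / freq  # split the class into `freq` equal intervals
        for i in range(freq):
            # assume each observation sits at the midpoint of its interval
            assumed_values.append(lower + (i + 0.5) * step)

    n = len(assumed_values)
    mid = n // 2
    if n % 2 == 1:
        return assumed_values[mid]
    return (assumed_values[mid - 1] + assumed_values[mid]) / 2


classes = [(30.0, 2), (35.0, 5), (40.0, 2), (45.0, 1)]
print(grouped_median(classes, width=5.0))
```

For the pulse rate table, observations 5 and 6 are assigned the assumed values 37.5 and 38.5, which is where the 38.0 above comes from.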
Measures of Spread, Dispersion, or Variability
Measures of dispersion are just as important as measures of central
tendency (mean and median) in describing a set of data. They
illustrate another dimension: how varied
the data are within the data set. The measures of dispersion
described below are range, variance, and, the most
useful, standard deviation.
- The range. The range of a data set is the difference
between its minimum and maximum observation.
Assume you have been asked to review different departments'
behavior with regard to hours of sick leave taken. Consider this data:
| Annual Hours of Sick Leave Taken (by employee) |
| Department A | Department B |
| 25 0 57 102 89 72 145 | 74 68 86 52 78 72 60 |
| x̄ = 70.0 median = 72.0 | x̄ = 70.0 median = 72.0 |
Although the measures of
central tendency are the same for both departments, it
appears that the employees in Department A are much less
consistent in the hours of sick leave they take annually than
those in Department B.
Calculating the range for
each department bears this out:
Department A range = 145 - 0 = 145
Department B range = 86 - 52 = 34
Range is useful because it
can be quickly calculated, but ignores most of the data – it
only takes into account two observations in a data set
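A quick Python check of the two ranges, using the sick leave data above (the helper name `data_range` is my own):

```python
dept_a = [25, 0, 57, 102, 89, 72, 145]
dept_b = [74, 68, 86, 52, 78, 72, 60]


def data_range(data):
    """Difference between the maximum and minimum observation."""
    return max(data) - min(data)


print("Department A range:", data_range(dept_a))
print("Department B range:", data_range(dept_b))
```

Notice that only `max` and `min` appear in the function: the other five observations in each department play no part in the result.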
- The variance. The variance is the average of the squared
deviations of the observations from the mean:
| Calculating Variance in Annual Hours of Sick Leave Taken by Department |
| Department A (x̄ = 70.0) | | | Department B (x̄ = 70.0) | | |
| x | x - x̄ | (x - x̄)² | x | x - x̄ | (x - x̄)² |
| 25 | -45 | 2,025 | 74 | 4 | 16 |
| 0 | -70 | 4,900 | 68 | -2 | 4 |
| 57 | -13 | 169 | 86 | 16 | 256 |
| 102 | 32 | 1,024 | 52 | -18 | 324 |
| 89 | 19 | 361 | 78 | 8 | 64 |
| 72 | 2 | 4 | 72 | 2 | 4 |
| 145 | 75 | 5,625 | 60 | -10 | 100 |
| | Σ(x - x̄)² = | 14,108 | | Σ(x - x̄)² = | 768 |
| | Σ(x - x̄)² / n = | 2,015.4 | | Σ(x - x̄)² / n = | 109.7 |
Round the variance to
one more decimal place than the data used to
calculate it.
The calculations of variance
bear out our observations with the range of each data set:
that the observations in Department A are much more dispersed than those in Department B (Why might this
be?).
Variance is a better
measure of dispersion than range, because:
- all of the data are used in its calculation
- it measures dispersion
of the data about the mean, not just the spread of the data
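The variance calculation in the table above can be sketched in Python like this. It divides by n, as the table does; the function name is my own.

```python
dept_a = [25, 0, 57, 102, 89, 72, 145]
dept_b = [74, 68, 86, 52, 78, 72, 60]


def variance_dividing_by_n(data):
    """Average squared deviation from the mean (divides by n, as in the table)."""
    n = len(data)
    mean = sum(data) / n
    squared_deviations = [(x - mean) ** 2 for x in data]
    return sum(squared_deviations) / n


for name, data in (("A", dept_a), ("B", dept_b)):
    print(f"Department {name}: variance = {variance_dividing_by_n(data):,.1f}")
```

Every observation contributes a squared deviation to the sum, which is why the variance reflects all of the data rather than just the two extremes.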
At this point we need
to distinguish between a population and a sample:
- Population
- The complete collection of all
objects in some set under study (all
U.S. citizens, all FSCJ students, all GE light
bulbs). If the objects are people, they are
usually called subjects.
- Sample
- A
selected subset of a population
(1,000 U.S. citizens selected at
random, FSCJ students whose
SSNs end in 2; every 100th light bulb off the GE
production line)
|
The calculations
above are for populations: we had the data from every object
(subject) in the collection (every employee in Department A and every employee in
Department B). When that is the case, the procedure shown is
appropriate. But if we are calculating the variance of a sample
(a representative selection of
employees from each department), called s², we use:
s² = Σ(x - x̄)² / (n - 1)
We will use this formula for
the calculation of the variance, since we will almost always
be working with a sample rather than an entire population.
The standard deviation
is the positive square root of the variance.
So, the standard
deviation for Departments A and B would be 44.9 and 10.5,
respectively. Round standard deviation to one more
decimal place than the data used to calculate it.
The sample standard deviation is called s; its formula is:
s = √( Σ(x - x̄)² / (n - 1) )
The standard deviation
plays a critical role in inferential statistics: generally,
observations that are more than two standard deviations from
the mean are considered unusual.
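If you want to compare the population and sample versions side by side, Python's standard `statistics` module offers both: `pvariance` and `pstdev` divide by n, while `variance` and `stdev` divide by n - 1. The sketch below treats the department data first as a whole population (as these notes do) and then, purely for illustration, as if it were a sample.

```python
from statistics import pstdev, pvariance, stdev, variance

dept_a = [25, 0, 57, 102, 89, 72, 145]
dept_b = [74, 68, 86, 52, 78, 72, 60]

for name, data in (("A", dept_a), ("B", dept_b)):
    # Population formulas: divide the sum of squared deviations by n.
    print(f"Dept {name} as a population: "
          f"variance = {pvariance(data):.1f}, std dev = {pstdev(data):.1f}")
    # Sample formulas: divide by n - 1 instead (hypothetical use here,
    # since the notes treat each department as a complete population).
    print(f"Dept {name} as a sample:     "
          f"variance = {variance(data):.1f}, std dev = {stdev(data):.1f}")
```

Dividing by n - 1 always gives a slightly larger result than dividing by n, and the difference shrinks as the number of observations grows.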
Here is a table
describing a "rule of thumb" (also called a
heuristic) we can use for data sets that are
bell-shaped and symmetric (those
that approximate a normal distribution, which
we will discuss in much greater detail later):
| Percent of Data within n Standard Deviations of the Mean (for distributions approximately normal) |
| n | Percent |
| 1.0 | about 68 % |
| 1.5 | about 87 % |
| 2.0 | about 95 % |
| 3.0 | about 99.7 % |
From
this table, you can see that an "unusual" value
– one that is more than two standard deviations
from the mean – occurs less than 5% of the time.
This is the standard statistical sense of the
word "unusual."
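The percentages in the rule of thumb can be checked against an exact normal distribution. This sketch uses `NormalDist` from Python's standard `statistics` module (available in Python 3.8 and later); the helper name `percent_within` is my own.

```python
from statistics import NormalDist

standard_normal = NormalDist()  # mean 0, standard deviation 1


def percent_within(n):
    """Percent of a normal distribution lying within n standard deviations of the mean."""
    share = standard_normal.cdf(n) - standard_normal.cdf(-n)
    return 100 * share


for n in (1.0, 1.5, 2.0, 3.0):
    inside = percent_within(n)
    print(f"n = {n}: about {inside:.1f} % inside, {100 - inside:.1f} % outside")
```

The "outside" column for n = 2.0 is the share of values that would count as unusual in the sense described above.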
Here is how to set up the
calculation for the standard deviation of the
pulse rate data (the mean was calculated above
as 38.2):
| Pulse Rates of Conditioned Athletes (calculating the standard deviation) |
| x | x - x̄ | (x - x̄)² |
| 31 | 31 - 38.2 = -7.2 | 51.84 |
| 33 | -5.2 | 27.04 |
| 36 | -2.2 | 4.84 |
| 36 | -2.2 | 4.84 |
| 37 | -1.2 | 1.44 |
| 38 | -0.2 | 0.04 |
| 39 | 0.8 | 0.64 |
| 41 | 2.8 | 7.84 |
| 44 | 5.8 | 33.64 |
| 47 | 8.8 | 77.44 |
| | Σ(x - x̄)² = | 209.6 |
| | s² = Σ(x - x̄)² / (n - 1) = | 23.3 |
| | s = | 4.8 |
Are any of the observations
unusual? They would be if they fell outside
the range [x̄ - 2s, x̄ +
2s] = [38.2 - 2(4.8), 38.2 + 2(4.8)]
= [28.6, 47.8].
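The table and the "unusual" check can be reproduced with a short Python sketch. Like the notes, it rounds s to one decimal place before building the interval; the variable names are my own.

```python
from math import sqrt

pulse_rates = [31, 33, 36, 36, 37, 38, 39, 41, 44, 47]

n = len(pulse_rates)
mean = sum(pulse_rates) / n
squared_deviations = [(x - mean) ** 2 for x in pulse_rates]

sample_variance = sum(squared_deviations) / (n - 1)  # divide by n - 1 for a sample
s = round(sqrt(sample_variance), 1)                  # rounded as in the notes

low, high = mean - 2 * s, mean + 2 * s
unusual = [x for x in pulse_rates if x < low or x > high]

print(f"sum of squared deviations = {sum(squared_deviations):.1f}")
print(f"s^2 = {sample_variance:.1f}, s = {s}")
print(f"interval = [{low:.1f}, {high:.1f}], unusual observations: {unusual}")
```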
Another way to obtain a
quick estimate of the standard deviation is:
s ≈ range / 4 = (maximum - minimum) / 4
For the above data, this
yields an estimate of (47 - 31) / 4 = 4.0;
quick, but noticeably underestimating
the actual value of 4.8.
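A small Python sketch of this shortcut, compared with the sample standard deviation from the standard `statistics` module (the variable names are my own):

```python
from statistics import stdev

pulse_rates = [31, 33, 36, 36, 37, 38, 39, 41, 44, 47]

quick_estimate = (max(pulse_rates) - min(pulse_rates)) / 4  # range divided by 4
actual = stdev(pulse_rates)                                 # divides by n - 1

print(f"quick estimate = {quick_estimate:.1f}, actual s = {actual:.1f}")
```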
To estimate the standard deviation from tabled data:
1. Determine the mean for tabled data as given above
2. Use the class midpoint to represent all the observations in
that class; for each class, subtract the mean from the midpoint,
square the result, and multiply by the number of
observations in that class
3. Add up all the squared deviations calculated in step 2
4. Divide by n - 1 to get an estimate of the variance
5. Take the square root of the variance to estimate the standard
deviation
Here is how to calculate the
standard deviation for the pulse rate study,
using the frequency table (x̄
has already been estimated as 38.5):
| Pulse Rates of Conditioned Athletes (estimating the standard deviation) |
| Classes | Freq | Class Midpoint | Dev = Midpoint - x̄ | Dev² | Dev² x Freq |
| 30.0 to 34.9 | 2 | 32.5 | 32.5 - 38.5 = -6.0 | 36.0 | 72.0 |
| 35.0 to 39.9 | 5 | 37.5 | -1.0 | 1.0 | 5.0 |
| 40.0 to 44.9 | 2 | 42.5 | 4.0 | 16.0 | 32.0 |
| 45.0 to 49.9 | 1 | 47.5 | 9.0 | 81.0 | 81.0 |
| Totals | 10 | | | | 190.0 |
| Estimated variance = 190.0 / (10 - 1) = | 21.1 |
| Estimated standard deviation = | 4.6 |
The standard deviation
calculated by this method is close to the
actual calculation of the standard deviation
using the raw data (4.6 vs. 4.8).
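The tabled-data estimate can be sketched in Python as follows, with each class represented by its midpoint and frequency. The names and the (midpoint, frequency) layout are my own choices.

```python
from math import sqrt

# (class midpoint, frequency) for each class in the frequency table.
classes = [(32.5, 2), (37.5, 5), (42.5, 2), (47.5, 1)]

n = sum(freq for _, freq in classes)

# Estimate the mean from the midpoints.
estimated_mean = sum(mid * freq for mid, freq in classes) / n

# Squared deviation of each midpoint, weighted by the class frequency.
weighted_squares = sum(freq * (mid - estimated_mean) ** 2 for mid, freq in classes)

estimated_variance = weighted_squares / (n - 1)  # divide by n - 1
estimated_sd = sqrt(estimated_variance)          # square root of the variance

print(f"sum of Dev^2 x Freq = {weighted_squares}")
print(f"estimated variance = {estimated_variance:.1f}")
print(f"estimated standard deviation = {estimated_sd:.1f}")
```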
Tchebycheff's Theorem: A Few Numbers in Place of Many
The rule of thumb above assumes that we are looking at data from a
population whose histogram would be approximately
bell-shaped and symmetrical. While this is true of many
populations, the Russian statistician Tchebycheff
(1821-1894) proved a theorem that is
true for any population, no matter how the data is
actually distributed. His theorem is:
| At least 100 - 100 / h² percent of the observations must lie within h standard deviations of the mean. |
This yields these results for some common values of h:
| Percent of Data within h Standard Deviations of the Mean (Tchebycheff's Theorem) |
| h | at least |
| 1.0 | 0.0 % |
| 1.5 | 55.5 % |
| 2.0 | 75.0 % |
| 3.0 | 88.9 % |
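The theorem is simple enough to write as a one-line function. This is only a sketch; the function name `tchebycheff_min_percent` is my own, and I have clamped the result at zero so that values of h below 1 do not produce a negative percentage.

```python
def tchebycheff_min_percent(h):
    """Minimum percent of observations within h standard deviations of the mean."""
    if h <= 0:
        raise ValueError("h must be positive")
    return max(0.0, 100 - 100 / h ** 2)


for h in (1.0, 1.5, 2.0, 3.0):
    print(f"h = {h}: at least {tchebycheff_min_percent(h):.2f} %")
```

Printed to two decimal places, the values agree with the table above to within rounding in the last digit.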
Here are the ages of ten Oscar
winners for Best Actress when they won their
award (1998-2007):
34, 26, 25, 33, 35, 35, 28, 30, 29, 61
Using the methods above, you can
compute x̄ = 33.6 and s = 10.3. Here are the Tchebycheff estimates of the dispersion of the
data set, compared to the "rule of thumb"
estimates and the actual percentages:
| Percent of Actresses' Ages within h Standard Deviations of the Mean |
| h | x̄ ± h · s | Tchebycheff Estimate | "Rule-of-Thumb" Estimate | Actual Percentage |
| 1.0 | [23.3, 43.9] | ≥ 0.0 % | ~ 68 % | 90 % |
| 1.5 | [18.2, 49.1] | ≥ 55.5 % | ~ 87 % | 90 % |
| 2.0 | [13.0, 54.2] | ≥ 75.0 % | ~ 95 % | 90 % |
| 3.0 | [2.7, 64.5] | ≥ 88.9 % | ~ 99.7 % | 100 % |
Are any of the ages in the data
set unusual? In what way?
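To check the "Actual Percentage" column yourself, a sketch like the following counts how many ages fall inside each interval. It uses `mean` and `stdev` from Python's standard `statistics` module and the unrounded value of s, so interval endpoints may differ from the table in the last digit; the helper name `percent_within` is my own.

```python
from statistics import mean, stdev

ages = [34, 26, 25, 33, 35, 35, 28, 30, 29, 61]

x_bar = mean(ages)
s = stdev(ages)  # sample standard deviation (divides by n - 1)


def percent_within(data, h):
    """Percent of observations within h sample standard deviations of the mean."""
    centre, spread = mean(data), stdev(data)
    low, high = centre - h * spread, centre + h * spread
    inside = sum(low <= x <= high for x in data)
    return 100 * inside / len(data)


print(f"mean = {x_bar:.1f}, s = {s:.1f}")
for h in (1.0, 1.5, 2.0, 3.0):
    low, high = x_bar - h * s, x_bar + h * s
    print(f"h = {h}: [{low:.1f}, {high:.1f}] contains {percent_within(ages, h):.0f} %")
```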
Summary: Terminology of some
descriptive statistics
Here are the common symbols for
population parameters and sample statistics that we
will be working with in this course.
| | Population Parameter | Sample Statistic |
| Size | N | n |
| Mean | μ (mu) | x̄ (x-bar) |
| Variance | σ² (sigma-squared) | s² (s-squared) |
| Standard Deviation | σ (sigma) | s (s) |