Private: Learning Math: Data Analysis, Statistics, and Probability

Describing Distributions Part B: Histograms (30 minutes)

In This Part: Constructing a Histogram

Like the line plot we explored in Session 2, the stem and leaf plot is a useful device for illustrating variation in data for small data sets (up to about 100 values). For larger data sets, though, the stem and leaf plot is not a practical way to organize data. Instead, you might want to use a histogram. See Note 3 below.

Let’s start with the stem and leaf plot for a new data set: 52 estimates collected in answer to the question “How long is a minute?”:

If the stem and leaf plot is rotated 90° counterclockwise, it looks like this:

To create a histogram for this data, first replace each “leaf” (second digit) with a dot:

While a histogram is similar to a line plot, the numbers along the horizontal axis mean something different. In a line plot, each number represents a single data value. In the plot above, the numbers across the bottom are the stems from the original stem and leaf plot, so each number represents an entire interval of values.

For instance, the “3” denotes the stem for all values in the 30s — that is, the interval (range) of values from 30 up to (but not including) 40. For the purposes of a histogram, it is useful to label this interval “30 to less than 40” (30 to < 40) to remind us that 30 is included but 40 is not.

The “4” denotes the stem for all values in the 40s — that is, the interval (range) of values from 40 up to (but not including) 50. Again, it is helpful to label this interval “40 to < 50” to remind us that 40 is included but 50 is not.
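The labeling convention described above can be sketched in a few lines of Python. This is only an illustration and not part of the session materials; the function name `interval_label` and the default interval width of 10 are assumptions chosen to match the stems.

```python
def interval_label(value, width=10):
    """Return the label of the interval that contains value.

    The lower bound is included and the upper bound is excluded,
    matching the "30 to < 40" convention.
    """
    lower = (value // width) * width   # e.g. 37 -> 30
    upper = lower + width              # e.g. 30 -> 40
    return f"{lower} to < {upper}"


print(interval_label(30))  # 30 to < 40
print(interval_label(39))  # 30 to < 40
print(interval_label(40))  # 40 to < 50
```

Note that 40 lands in the "40 to < 50" interval rather than "30 to < 40", because each interval includes its lower bound and excludes its upper bound.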

If we re-label the horizontal axis to show these intervals (groups) of data, the graph below is produced. Again, this graph is similar to a line plot except that the horizontal axis indicates intervals of data values instead of individual data values:

A grouped frequency table can be determined from this display in the following manner:

  • There are four dots over the first group in the interval 30 to < 40. This group has frequency 4.
  • There is one dot over the second group in the interval 40 to < 50. This group has frequency 1.

If we continue this process for the other groups, we produce the following grouped frequency table:

Remember that this table describes ranges of data values rather than specific data values. For instance, we can see that there are seven responses in the interval 70 to < 80, but we have no idea what the actual values are for those responses.
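As a rough sketch of how a grouped frequency table is tallied, the following Python counts how many values fall in each interval. The list `sample_estimates` is hypothetical data made up for illustration; it is not the set of 52 estimates from this session. The helper name `grouped_frequencies` is also an assumption.

```python
from collections import Counter

# Hypothetical estimates in seconds, made up for illustration only.
# These are NOT the 52 values from the session.
sample_estimates = [33, 38, 45, 52, 55, 58, 60, 60, 61, 64, 67, 72, 75, 88]


def grouped_frequencies(values, width=10):
    """Count how many values fall in each interval [lower, lower + width)."""
    counts = Counter((v // width) * width for v in values)
    # Sort by lower bound so the table reads from smallest to largest.
    return {lower: counts[lower] for lower in sorted(counts)}


table = grouped_frequencies(sample_estimates)
for lower, freq in table.items():
    print(f"{lower} to < {lower + 10}: {freq}")
```

Once the values are tallied this way, only the counts remain: the table alone cannot tell you which individual values fell in each interval. This sketch also omits any interval that contains no values.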


In This Part: Completing the Histogram

You are now ready to complete the histogram based on the line plot you created:

  • Draw a rectangle over each interval on the horizontal axis, with a height corresponding to the frequency of that interval:

(Note that the frequency of each interval on the horizontal axis is still indicated by the number of dots within each rectangle.)

  • Remove the dots, shade the rectangles, and add a vertical scale to indicate the frequency of each interval on the horizontal scale:

You have just created a frequency histogram!
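The same idea can be sketched as a text-only histogram in Python. The `frequencies` dictionary below is hypothetical and does not reproduce the session's table, and `text_histogram` is an illustrative helper name. The bars here run sideways rather than upward, but each bar's length still equals the frequency of its interval.

```python
# Hypothetical grouped frequencies (interval lower bound -> frequency).
# Made up for illustration; not the session's actual table.
frequencies = {30: 2, 40: 1, 50: 3, 60: 5, 70: 2, 80: 1}


def text_histogram(freqs, width=10):
    """Return one line per interval, with a bar as long as its frequency."""
    lines = []
    for lower in sorted(freqs):
        label = f"{lower} to < {lower + width}"
        bar = "#" * freqs[lower]
        lines.append(f"{label:>11} | {bar} ({freqs[lower]})")
    return lines


for line in text_histogram(frequencies):
    print(line)
```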


The following Interactive Activity reviews the transitions between the various displays of data you’ve worked with so far in this session, using the 26 data values from Part A. (This interactive has been disabled.)


Problem B1
What advantages does a histogram have over a stem and leaf plot? What are the disadvantages of a histogram?


Just as with groupings in a stem and leaf plot, you can change the size of the intervals in a histogram depending on the situation. Your goal is to organize your representation so that you can present the data in the most meaningful way.
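To see how the choice of interval size changes the picture, the sketch below groups the same values with three different widths. The data are hypothetical (not the session's estimates), and the helper `grouped_frequencies` is an illustrative name rather than anything from the course.

```python
from collections import Counter

# Hypothetical estimates in seconds, made up for illustration only.
sample_estimates = [33, 38, 45, 52, 55, 58, 60, 60, 61, 64, 67, 72, 75, 88]


def grouped_frequencies(values, width):
    """Count values per interval [lower, lower + width), keyed by lower bound."""
    counts = Counter((v // width) * width for v in values)
    return {lower: counts[lower] for lower in sorted(counts)}


# Narrow intervals keep more detail; wide intervals give a coarser summary.
for width in (5, 10, 20):
    print(f"width {width}: {grouped_frequencies(sample_estimates, width)}")
```

With a width of 5, most intervals hold only one or two values. With a width of 20, the 14 values collapse into four groups. Which grouping is most meaningful depends on what you want the display to show.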


In This Part: Interpreting a Histogram

The histogram and grouped frequency table you just created offer different ways to present your data (the time estimates) and provide different ways to answer our original question, “How well do people judge when a minute has elapsed?”


Problem B2
Using only the histogram and grouped frequency table, give two descriptive statements that provide an answer to this question. (Since the goal is to estimate when a minute has elapsed, it would make sense to again consider how close the estimates are to the correct response, which is 60 seconds.)


Problem B3
a. According to the histogram and grouped frequency table, how many people’s estimates were outside the interval from 50 to less than 70 seconds? That is, how many estimates were less than 50 seconds or 70 seconds or more?

b. How many estimates were within the interval from 50 to less than 70 seconds?

c. How many estimates were outside the interval from 40 to less than 80 seconds?

d. In Problem A5, only nine people’s estimates were more than 10 seconds away from one minute. Does your answer to question (a) of this problem imply that the people in this group were not as good at estimating a minute’s time? If so, why? If not, how could you make a fairer comparison between the two sets?

The second data set comes from a group of 52 time estimates. How many were in the first group?

Notes

Note 3
You will take an evolutionary approach to developing the histogram. The objective of this activity is for you to see the relationship between the line plot, the stem and leaf plot, and the histogram. The line plot shows the frequencies of the rows, but not the actual data values. The stem and leaf plot contains more detailed information than the histogram in that all of the data values are shown. And finally, the relative frequency histogram shows the relative sizes of the frequencies for each interval, although it does not explicitly show those frequencies.

Solutions

Problem B1
A histogram offers a better graphical perspective on an entire data set. One disadvantage is that the actual data values cannot be determined from a histogram, only the number of values within intervals.

Problem B2
There are many descriptive statements that could provide an answer to this question. Here are some things you may have noted:

  • All estimates are between 30 seconds and 100 seconds. The range is 70 seconds, which indicates a lot of variation in the estimates.
  • There is a concentration of estimates between 50 seconds and 70 seconds. Thirty-five of the 52 estimates (or 35 / 52 = 67.3%) fall within this interval. The range of this interval is only 20 seconds.
  • You may have noticed that because the histogram does not indicate individual pieces of data, we cannot look for a single number that represents the data.

Problem B3
a. There are five estimates below 50 seconds and 12 estimates of 70 seconds or higher. In total, 17 of 52 estimates were outside the interval from 50 to less than 70 seconds.

b. Since 17 estimates were outside this interval, the remaining 35 of 52 estimates were within the interval.

c. There are four estimates below 40 seconds and five estimates of 80 seconds or higher. In total, nine of 52 estimates were outside this interval.

d. No. The original group had only 26 responses, so the raw counts cannot be compared directly. The proportion for this group, 17 / 52 = 32.7%, is only slightly better than the proportion for the original group, 9 / 26 = 34.6%, which suggests that the two groups were roughly in line. Effective comparisons between groups of different sizes must be relative comparisons.
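The arithmetic in Problem B3 can be checked with a short Python sketch. The counts below come from the text and solutions in this part; the dictionary layout and variable names are illustrative assumptions. Two rows are combined (50 to < 70, and 80 or more) because only their totals are given in the text.

```python
# Counts taken from the text and solutions above (52 estimates in total).
# The "50 to < 70" and "80 or more" rows are combined because only
# their totals are stated.
counts = {
    "below 40": 4,
    "40 to < 50": 1,
    "50 to < 70": 35,
    "70 to < 80": 7,
    "80 or more": 5,
}
total = sum(counts.values())

inside_50_70 = counts["50 to < 70"]                         # part (b)
outside_50_70 = total - inside_50_70                        # part (a)
outside_40_80 = counts["below 40"] + counts["80 or more"]   # part (c)

# Part (d): compare proportions, not raw counts.
this_group = outside_50_70 / total   # 17 / 52
first_group = 9 / 26                 # from Problem A5

print(total, outside_50_70, inside_50_70, outside_40_80)
print(f"{this_group:.1%} vs {first_group:.1%}")
```

The raw counts (17 versus 9) make the second group look worse, but the proportions are nearly the same, which is why the comparison has to be relative.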

Credits

Produced by WGBH Educational Foundation. 2001.
  • Closed Captioning
  • ISBN: 1-57680-481-X
