Learning Math: Data Analysis, Statistics, and Probability


Mathematics

K-2, 3-5, 6-8

Bivariate Data and Analysis Part D: Fitting Lines to Data (60 minutes)

In This Part: Trend Lines
In Parts A and B, you confirmed that there is a strong positive association between height and arm span — short people tend to have short arms, and tall people tend to have long arms. In Part C, you investigated the nature of the relationship between height and arm span by graphing the line Height = Arm Span on a scatter plot of collected data. In Part D, using the same data you’ve been working with, you will investigate the use of other lines as potential models for describing the relationship between height and arm span, and you will explore various criteria for selecting the best line.

Again, here is the scatter plot of the 24 people’s data:

Problem D1
Describe the trend in the data points — in other words, how would you describe the general positioning of the points in the scatter plot? What does this trend tell you about the relationship between height and arm span?

Now let’s take another look at the scatter plot with the line Height = Arm Span graphed:

Problem D2
a. Does this line generally provide an accurate description of the trend in the scatter plot?
b. Do you think there might be a better line for describing this trend?

Let’s consider two other lines for describing the relationship between Height and Arm Span:
Height = Arm Span + 1
Height = Arm Span – 1

The following scatter plot includes the graphs of all three lines:

Problem D3
Based on a visual inspection, which of these three lines does the best job of describing the trend in the data points? Explain why you chose this line.

In This Part: Error
You should have decided in Problem D3 that two of the three lines are better candidates for describing the trend in the data points. The line Height = Arm Span has nine points that are above the line, three that are on the line, and 12 that are below the line. The line Height = Arm Span – 1 has 12 points that are above the line, four that are on the line, and eight that are below the line.

So which of these lines is “better” at describing the relationship? While personal judgement is useful, statisticians prefer to use more objective methods. To develop criteria for identifying the “better” line, we’ll use a concept developed in Part C: the vertical distance from a point to a line.

Person 11, whose arm span is 173 cm and whose height is 185 cm, is represented by the point (173, 185) in the scatter plot. If you were to use the line Height = Arm Span to predict Person 11’s height from his or her arm span, the predicted value would be represented by the point (173, 173), which lies on the line. The scatter plot thus far looks like this:

The difference between the actual observed height (Y) and the corresponding hypothetical, predicted height (on the line) is called the error. If we use YL (Y on the line) to designate the Y coordinate that represents the predicted height, then we can calculate the error as follows:

Error = Y – YL

In other words, Error = Actual Observed Height – Predicted Height (on the line).

Finally, the vertical distance between an observed height and a predicted height can be expressed as:

Distance = |Y – YL| = |Error|
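
As a quick check of these definitions, the short Python sketch below computes the error and the vertical distance for Person 11 against the line Height = Arm Span. The function names are illustrative choices and are not part of the course materials.

```python
def predicted_height(arm_span):
    """Predicted height YL on the line Height = Arm Span (YL = X)."""
    return arm_span


def error(arm_span, height, predict=predicted_height):
    """Error = Y - YL: observed height minus predicted height."""
    return height - predict(arm_span)


def vertical_distance(arm_span, height, predict=predicted_height):
    """Distance = |Y - YL| = |Error|."""
    return abs(error(arm_span, height, predict))


# Person 11 from the text: arm span 173 cm, height 185 cm.
person_11_error = error(173, 185)
person_11_distance = vertical_distance(173, 185)
print(person_11_error, person_11_distance)  # prints: 12 12
```

The error is positive because Person 11's point lies above the line. Passing a different `predict` function, such as `lambda x: x - 1`, evaluates the same point against another candidate line.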

Let’s see how this works for the line Height = Arm Span (i.e., YL = X).

The following table shows the arm span (X), the actual observed height (Y), the predicted height based on the line Height = Arm Span (i.e., YL = X), the error, and the vertical distance between the person’s observed height (Y) and predicted height (YL) for Persons 1 through 6 in our study:

Problem D4
Complete this table for the remaining 18 people.

Here are some observations about this table:

•	A point above the line is indicated by a positive value of (Y – YL); this is called a positive error.
•	A point below the line is indicated by a negative value of (Y – YL); this is called a negative error.
•	A point is on the line when (Y – YL) equals 0, and there is no error.
•	The vertical distance from a point to the line YL = X is the absolute value of the error. The smaller this distance is, the closer the actual data point is, vertically, to the line.

One measure of how well a particular line describes the trend in bivariate data is the total of the vertical distances. When comparing two lines, the line with the smaller total of the vertical distances is the “better” line in terms of how well it describes the linear relationship between the two variables. For the line Height = Arm Span (i.e., YL = X), this is the sum of the sixth column in the above tables combined, which is 100.
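
The comparison described above can be sketched in Python as follows. The data here are a small made-up sample used only to show the mechanics: apart from (173, 185), which is Person 11 from the text, the points are hypothetical and are not the session's 24-person data set, so the totals will not match the 100 reported above.

```python
# Hypothetical (arm span, height) pairs in cm, made up for illustration.
# Only (173, 185) comes from the text (Person 11).
sample = [(160, 158), (165, 165), (170, 172), (173, 185), (180, 177)]


def total_vertical_distance(points, predict):
    """Sum of |Y - YL| over all points for the line given by predict(X)."""
    return sum(abs(y - predict(x)) for x, y in points)


def count_positions(points, predict):
    """Count how many points lie above, on, and below the line."""
    errors = [y - predict(x) for x, y in points]
    above = sum(1 for e in errors if e > 0)   # positive errors
    on = sum(1 for e in errors if e == 0)     # no error
    below = sum(1 for e in errors if e < 0)   # negative errors
    return above, on, below


total_identity = total_vertical_distance(sample, lambda x: x)       # YL = X
total_minus_one = total_vertical_distance(sample, lambda x: x - 1)  # YL = X - 1
print(total_identity, total_minus_one)          # prints: 19 20
print(count_positions(sample, lambda x: x))     # prints: (2, 1, 2)
```

For this made-up sample, the line YL = X has the smaller total and would be judged the better of the two by this measure.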

But perhaps people aren’t really “square.” Might a better prediction be that height is one centimeter shorter than arm span? Let’s see how well the line Height = Arm Span – 1 (i.e., YL = X – 1) describes the trend.

The following table shows the arm span (X), the actual observed height (Y), the predicted height based on the line YL = X – 1, the error, and the vertical distance between the person’s observed height (Y) and predicted height (YL) for Persons 1 through 6 in our study:

Problem D5
Complete the table for the remaining 18 people. Then compute the total vertical distance for the line Height = Arm Span – 1, and compare the result to the total vertical distance for the line Height = Arm Span. Based on your calculations, which line provides the better fit?

In This Part: The SSE
Another way to see how close an individual’s data point is to a line is to square the error. This is similar to how you calculated the variance in Session 5, where you squared the distances from the mean. Like the absolute value, a squared error is never negative, and it equals 0 only when the point lies on the line. Again, for each individual point, the smaller the squared error, the closer the actual data point is to the line. Here are the squared errors for Persons 1 through 12:

Problem D6
Complete the table to find the squared error for the remaining 12 people.

Another measure of how well a particular line describes the relationship in bivariate data is the total of the squared errors. When comparing two lines, the line with the smaller total of the squared errors is the “better” line in terms of how well it describes the linear relationship between the two variables. For the line Height = Arm Span, this is the sum of the sixth column in the above table, which is 784.

This quantity, the sum of squared errors (SSE), is what statisticians prefer to use when comparing different lines for potential fit. Among all possible lines, the one with the smallest SSE is called the least squares line; it may also be referred to as the line of best fit.
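
A minimal Python sketch of the SSE and of the least squares line follows. It uses the standard closed-form formulas for the least squares slope and intercept, which this session does not derive, and a small made-up sample. Apart from (173, 185), which is Person 11, the points are hypothetical and are not the session's data, and the function names are illustrative.

```python
# Hypothetical (arm span, height) pairs in cm, made up for illustration.
# Only (173, 185) comes from the text (Person 11).
sample = [(160, 158), (165, 165), (170, 172), (173, 185), (180, 177)]


def sse(points, slope, intercept):
    """Sum of squared errors for the line YL = slope * X + intercept."""
    return sum((y - (slope * x + intercept)) ** 2 for x, y in points)


def least_squares_line(points):
    """Slope and intercept of the line that minimizes the SSE.

    Standard closed form (not derived in this session):
    slope = sum((x - mean_x)(y - mean_y)) / sum((x - mean_x)^2),
    intercept = mean_y - slope * mean_x.
    """
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


best_slope, best_intercept = least_squares_line(sample)
best_sse = sse(sample, best_slope, best_intercept)
print(sse(sample, 1, 0))  # SSE for YL = X on this sample; prints: 161
print(best_sse <= sse(sample, 1, 0))  # prints: True
```

No other line can have a smaller SSE on the same points than the one returned by `least_squares_line`, which is why the final comparison prints `True`.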

Before we determine the SSE for the line Height = Arm Span – 1 (i.e., YL = X – 1), let’s take a look at Person 1 and the line YL = X – 1:

Problem D8
a. Judging on the basis of the SSE, which is the best line? Which is the worst?
b. What other ways could we change the line equation in an attempt to further reduce the SSE?
c. Is it possible to reduce the SSE to 0? Why or why not?

We have examined several lines that have yielded different SSEs. The lines, however, had one thing in common: they all had a slope of 1, so they were all parallel. Keep in mind that the slope of a line is often described as the ratio of rise to run. The formula for slope is: slope = (change in Y) / (change in X). Now, let’s investigate a line with a different slope to describe the trend in the data.

One such line, with slope 0.75, passes through (164, 164) and (188, 182) and near many of the other data points; its equation is YL = 0.75X + 41. Let’s compare this line to the line YL = X – 0.7, which is the best fit we have found so far.
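
The slope and intercept of this line can be checked from the two points it passes through. The sketch below applies the rise-over-run formula and then solves for the intercept; the helper name `line_through` is an illustrative choice.

```python
def line_through(p1, p2):
    """Slope and intercept of the line through two points with different X."""
    (x1, y1), (x2, y2) = p1, p2
    slope = (y2 - y1) / (x2 - x1)   # (change in Y) / (change in X)
    intercept = y1 - slope * x1     # solve y1 = slope * x1 + intercept
    return slope, intercept


slope, intercept = line_through((164, 164), (188, 182))
print(slope, intercept)  # prints: 0.75 41.0

# Predicted height and error for Person 11 (arm span 173 cm, height 185 cm).
predicted = slope * 173 + intercept
print(predicted, 185 - predicted)  # prints: 170.75 14.25
```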

Note that these two lines are not parallel since they have different slopes.

Here is the scatter plot of the 24 people and the graph of the lines YL = 0.75X + 41 and YL = X – 0.7:

Here is the table to find the SSE for the line YL = 0.75X + 41:

Notes

Note 3
Fathom Software, used by the participants in the video segments, is helpful in creating graphical representations of data. If you try the problems in Part D using Fathom, you will be able to test various slopes and intercepts. For more information on Fathom, go to the Key Curriculum Press Web site at www.keypress.com/fathom/.

Note 4
More advanced presentations of this topic use such ideas as the standard deviation around the regression line and the coefficient R-Squared. The data in this session has been structured so that using the sum of squares for comparison gives a reasonable result.

Solutions

Problem D1
Overall, there is an upward trend; that is, the points generally go up and to the right. This corresponds to the positive association between height and arm span.

Problem D2
a. The line does a reasonably good job. Some points are above the line, some are below it, and some are on the line, but all are generally pretty close.
b. It looks like it may be possible for another line to be, overall, “closer” to these points.

Problem D3
Answers will vary. The lines Height = Arm Span and Height = Arm Span – 1 each seem to do a good job of dividing the points fairly evenly above and below the line, and matching the overall trend of data. It is difficult to distinguish between them without a more mathematical test. Each is clearly better than Height = Arm Span + 1, which lies above a majority of the points.

Problem D4

Problem D5
Here is the completed table:

For the model YL = X – 1, the total vertical distance is 7 + 4 + … + 13 = 100. Surprisingly, according to this measure of fit, the two lines are equally good. This suggests that another measure of best fit may be useful.

Problem D6

Problem D7

The sum of squared errors (SSE) is 49 + 16 + … + 169 = 772. Since this is less than the sum of squared errors for the line Height = Arm Span (which was 784), the line Height = Arm Span – 1 is a slightly better fit.

Problem D8
a. The best model is YL = X – 0.7, because it has the smallest SSE. The worst model is YL = X + 1, because it has the largest SSE.
b. As all of these lines have the same slope, if we changed the slope, we might find ways to reduce the SSE.
c. No, we cannot reduce the SSE to zero unless all the data points lie on a straight line, which these 24 points clearly do not do.
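
The ranking in part (a) can be checked from the totals reported in this session, without the full table. For a line with slope 1, YL = X + b, each error is (Y – X) – b, so SSE(b) = Σ(Y – X)² – 2b·Σ(Y – X) + n·b². The sketch below is an illustration that assumes the reported totals are exact (SSE of 784 for b = 0, SSE of 772 for b = –1, and n = 24); the symbol b and the variable names are introduced here and do not appear in the course materials.

```python
n = 24                # number of people in the study
sse_at_0 = 784        # reported SSE for YL = X, equal to sum of (Y - X)^2
sse_at_minus_1 = 772  # reported SSE for YL = X - 1

# SSE(-1) = sse_at_0 + 2 * sum_d + n, so solve for sum_d = sum of (Y - X).
sum_d = (sse_at_minus_1 - sse_at_0 - n) / 2
print(sum_d)  # prints: -18.0


def sse_for_intercept(b):
    """SSE of the slope-1 line YL = X + b, from the expansion above."""
    return sse_at_0 - 2 * b * sum_d + n * b ** 2


for b in (1, 0, -1, -0.7):
    print(b, round(sse_for_intercept(b), 2))
# prints: 1 844.0, then 0 784.0, then -1 772.0, then -0.7 770.56

# The quadratic in b is smallest at b = sum_d / n, the mean of (Y - X).
best_b = sum_d / n
print(best_b, sse_for_intercept(best_b))  # prints: -0.75 770.5
```

Under these assumptions, YL = X – 0.7 has the smallest SSE of the four lines and YL = X + 1 the largest, which agrees with part (a). No slope-1 line can have an SSE below about 770.5, so any further improvement has to come from changing the slope, as part (b) suggests.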

Series Directory

Learning Math: Data Analysis, Statistics, and Probability

Credits

Produced by WGBH Educational Foundation. 2001.

Closed Captioning
ISBN: 1-57680-481-X

Sections

7.1 Part A: Scatter Plots (45 minutes)

7.2 Part B: Contingency Tables (20 minutes)

7.3 Part C: Modeling Linear Relationships (35 minutes)

7.4 Part D: Fitting Lines to Data (60 minutes)

7.5 Homework

Sessions

Session 1 Statistics As Problem Solving

Consider statistics as a problem-solving process and examine its four components: asking questions, collecting appropriate data, analyzing the data, and interpreting the results. This session investigates the nature of data and its potential sources of variation. Variables, bias, and random sampling are introduced.

Session 2 Data Organization and Representation

Explore different ways of representing, analyzing, and interpreting data, including line plots, frequency tables, cumulative and relative frequency tables, and bar graphs. Learn how to use intervals to describe variation in data. Learn how to determine and understand the median.

Session 3 Describing Distributions

Continue learning about organizing and grouping data in different graphs and tables. Learn how to analyze and interpret variation in data by using stem and leaf plots and histograms. Learn about relative and cumulative frequency.

Session 4 Min, Max and the Five-Number Summary

Investigate various approaches for summarizing variation in data, and learn how dividing data into groups can help provide other types of answers to statistical questions. Understand numerical and graphic representations of the minimum, the maximum, the median, and quartiles. Learn how to create a box plot.

Session 5 Variation About the Mean

Explore the concept of the mean and how variation in data can be described relative to the mean. Concepts include fair and unfair allocations, and how to measure variation about the mean.

Session 6 Designing Experiments

Examine how to collect and compare data from observational and experimental studies, and learn how to set up your own experimental studies.

Session 7 Bivariate Data and Analysis

Analyze bivariate data and understand the concepts of association and co-variation between two quantitative variables. Explore scatter plots, the least squares line, and modeling linear relationships.

Session 8 Probability

Investigate some basic concepts of probability and the relationship between statistics and probability. Learn about random events, games of chance, mathematical and experimental probability, tree diagrams, and the binomial probability model.

Session 9 Random Sampling and Estimation

Learn how to select a random sample and use it to estimate characteristics of an entire population. Learn how to describe variation in estimates, and the effect of sample size on an estimate's accuracy.

Session 10 Classroom Case Studies, Grades K-2

Explore how the concepts developed in this course can be applied through a case study of a K-2 teacher, Ellen Sabanosh, a former course participant who has adapted her new knowledge to her classroom.

Session 11 Classroom Case Studies, Grades 3-5

Explore how the concepts developed in this course can be applied through case studies of a grade 3-5 teacher, Suzanne L'Esperance, and a grade 6-8 teacher, Paul Snowden, both former course participants who have adapted their new knowledge to their classrooms.

Session 12 Classroom Case Studies, Grades 6-8
