
Learning Math: Data Analysis, Statistics, and Probability

Bivariate Data and Analysis Part D: Fitting Lines to Data (60 minutes)

In This Part: Trend Lines
In Parts A and B, you confirmed that there is a strong positive association between height and arm span — short people tend to have short arms, and tall people tend to have long arms. In Part C, you investigated the nature of the relationship between height and arm span by graphing the line Height = Arm Span on a scatter plot of collected data. In Part D, using the same data you’ve been working with, you will investigate the use of other lines as potential models for describing the relationship between height and arm span, and you will explore various criteria for selecting the best line.

Again, here is the scatter plot of the 24 people’s data:

Problem D1
Describe the trend in the data points — in other words, how would you describe the general positioning of the points in the scatter plot? What does this trend tell you about the relationship between height and arm span?


Now let’s take another look at the scatter plot with the line Height = Arm Span graphed:

Problem D2
a. Does this line generally provide an accurate description of the trend in the scatter plot?
b. Do you think there might be a better line for describing this trend?


Let’s consider two other lines for describing the relationship between Height and Arm Span:
Height = Arm Span + 1
Height = Arm Span – 1

The following scatter plot includes the graphs of all three lines:

Problem D3
Based on a visual inspection, which of these three lines does the best job of describing the trend in the data points? Explain why you chose this line.


In This Part: Error
You should have decided in Problem D3 that two of the three lines are better candidates for describing the trend in the data points. The line Height = Arm Span has nine points that are above the line, three that are on the line, and 12 that are below the line. The line Height = Arm Span – 1 has 12 points that are above the line, four that are on the line, and eight that are below the line.

So which of these lines is “better” at describing the relationship? While personal judgement is useful, statisticians prefer to use more objective methods. To develop criteria for identifying the “better” line, we’ll use a concept developed in Part C: the vertical distance from a point to a line.

Person 11, whose arm span is 173 cm and whose height is 185 cm, is represented by the point (173, 185) in the scatter plot. If you were to use the line Height = Arm Span to predict Person 11’s height from his or her arm span, the predicted value would be represented by the point (173, 173), which lies on that line. The scatter plot thus far looks like this:

The difference between the actual observed height (Y) and the corresponding hypothetical, predicted height (on the line) is called the error. If we use YL (Y on the line) to designate the Y coordinate that represents the predicted height, then we can calculate the error as follows:

Error = Y – YL

In other words, Error = Actual Observed Height – Predicted Height (on the line).

Finally, the vertical distance between an observed height and a predicted height can be expressed as:

Distance = |Y – YL| = |Error|
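As a quick check of these definitions, the short Python sketch below works through Person 11 using only the values given above (arm span 173 cm, height 185 cm) and the line Height = Arm Span. The function name predict_height is our own label for illustration, not part of the course materials.

```python
# Person 11 from the text: arm span 173 cm, observed height 185 cm.
arm_span = 173
observed_height = 185


def predict_height(x):
    """Predicted height on the line Height = Arm Span (YL = X)."""
    return x


predicted_height = predict_height(arm_span)   # YL
error = observed_height - predicted_height    # Error = Y - YL
distance = abs(error)                         # Distance = |Y - YL|

print(predicted_height, error, distance)      # prints: 173 12 12
```

The error is positive because Person 11's point lies above the line.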

Let’s see how this works for the line Height = Arm Span (i.e., YL = X).

The following table shows the arm span (X), the actual observed height (Y), the predicted height based on the line Height = Arm Span (i.e., YL = X), the error, and the vertical distance between the person’s observed height (Y) and predicted height (YL) for Persons 1 through 6 in our study:

 

 

 

 

 


Problem D4
Complete this table for the remaining 18 people.

Here are some observations about this table:

 A point above the line is indicated by a positive value of (Y – YL); this is called a positive error.
 A point below the line is indicated by a negative value of (Y – YL); this is called a negative error.
 A point is on the line when (Y – YL) equals 0, and there is no error.
 The vertical distance from a point to the line YL = X is the absolute value of the error. The smaller this distance is, the closer the actual data point is, vertically, to the line.
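The sign rules in these observations can be sketched in a few lines of Python. The four (arm span, height) pairs below are invented for illustration and are not the study's data; the helper name classify is likewise an assumption of ours.

```python
from collections import Counter


def classify(y, y_line):
    """Say where an observed point sits relative to the line."""
    error = y - y_line
    if error > 0:
        return "above"   # positive error
    if error < 0:
        return "below"   # negative error
    return "on"          # error is 0


# Hypothetical (arm span, height) pairs in cm, NOT the 24 people in the study.
sample = [(160, 163), (165, 165), (170, 168), (180, 176)]

# Compare each observed height with the prediction from YL = X.
counts = Counter(classify(y, x) for x, y in sample)
print(counts["above"], counts["on"], counts["below"])   # prints: 1 1 2
```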

One measure of how well a particular line describes the trend in bivariate data is the total of the vertical distances. When comparing two lines, the line with the smaller total of the vertical distances is the “better” line in terms of how well it describes the linear relationship between the two variables. For the line Height = Arm Span (i.e., YL = X), this is the sum of the sixth column in the above tables combined, which is 100.
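To see this measure in action without giving away the answers to the problems, here is a minimal Python sketch that totals the vertical distances for two lines. The five data pairs are hypothetical, made up for this example, and the function names are our own.

```python
# Hypothetical (arm span, height) pairs in cm, NOT the 24 people in the study.
sample = [(160, 163), (165, 165), (170, 168), (180, 176), (190, 187)]


def line_a(x):
    """Height = Arm Span (YL = X)."""
    return x


def line_b(x):
    """Height = Arm Span - 1 (YL = X - 1)."""
    return x - 1


def total_vertical_distance(points, predict):
    """Sum of |Y - YL| over all points for the given line."""
    return sum(abs(y - predict(x)) for x, y in points)


total_a = total_vertical_distance(sample, line_a)
total_b = total_vertical_distance(sample, line_b)
print(total_a, total_b)   # prints: 12 11
```

For this made-up sample, the line with the smaller total (here YL = X – 1) would be judged the "better" line under this criterion.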

But perhaps people aren’t really “square.” Might a better prediction be that height is one centimeter shorter than arm span? Let’s see how well the line Height = Arm Span – 1 (i.e., YL = X – 1) describes the trend.

The following table shows the arm span (X), the actual observed height (Y), the predicted height based on the line YL = X – 1, the error, and the vertical distance between the person’s observed height (Y) and predicted height (YL) for Persons 1 through 6 in our study:

 

 

 

 

 


Problem D5
Complete the table for the remaining 18 people. Then compute the total vertical distance for the line Height = Arm Span – 1, and compare the result to the total vertical distance for the line Height = Arm Span. Based on your calculations, which line provides the better fit?

In This Part: The SSE
Another way to see how close an individual’s data point is to a line is to square the error. This is similar to how you calculated the variance in Session 5, where you squared the distances from the mean. Like the absolute value, each squared error produces a positive number. Again, for each individual point, the smaller the squared error, the closer the actual data point is to the line. Here are the squared errors for Persons 1 through 12:

 

 

 

 

 

 

 

 


Problem D6
Complete the table to find the squared error for the remaining 12 people.

 

 

 

 

 

 

 

 

 


Another measure of how well a particular line describes the relationship in bivariate data is the total of the squared errors. When comparing two lines, the line with the smaller total of the squared errors is the “better” line in terms of how well it describes the linear relationship between the two variables. For the line Height = Arm Span, this is the sum of the sixth column in the above table, which is 784.

This quantity, the sum of squared errors (SSE), is what statisticians prefer to use when comparing different lines for potential fit. Among all possible lines, the one with the smallest SSE is called the least squares line; it may also be referred to as the line of best fit.
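The comparison described here can be sketched in Python as follows. The data pairs are hypothetical (they are not the study's 24 people), and the dictionary of candidate lines is simply one convenient way to organize the comparison.

```python
# Hypothetical (arm span, height) pairs in cm, NOT the 24 people in the study.
sample = [(160, 163), (165, 165), (170, 168), (180, 176), (190, 187)]

# Candidate lines, each mapping an arm span X to a predicted height YL.
candidates = {
    "Height = Arm Span": lambda x: x,
    "Height = Arm Span - 1": lambda x: x - 1,
    "Height = Arm Span + 1": lambda x: x + 1,
}


def sse(points, predict):
    """Sum of squared errors: the total of (Y - YL)^2 over all points."""
    return sum((y - predict(x)) ** 2 for x, y in points)


results = {name: sse(sample, predict) for name, predict in candidates.items()}
best = min(results, key=results.get)

for name, value in results.items():
    print(name, value)
print("Smallest SSE:", best)
```

For this invented sample the three totals are 38, 31, and 55, so Height = Arm Span - 1 has the smallest SSE among the three candidates. A different data set could rank the lines differently.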

Before we determine the SSE for the line Height = Arm Span – 1 (i.e., YL = X – 1), let’s take a look at Person 1 and the line YL = X – 1:

 

 

Person 1’s squared error can be represented on the graph as a square with a side whose length is |Y – YL|:

 

 

 

 

 

 

 

The following is the scatter plot for the data and a graph of the line YL = X – 1.

 

 

 

 

 

 

 

Note once again that a point above the line is indicated by a positive error; a point below the line is indicated by a negative error; and a point is on the line when the error is 0.

The following table shows the arm span (X), the observed height (Y), the predicted height based on the line Height = Arm Span – 1 (i.e., YL = X – 1), the error, and the vertical distance between the person’s observed height (Y) and predicted height (YL) for Persons 1 through 6 in our study:

 

 

 

 

 

 


Problem D7
Complete the table below for the remaining 18 people. Then compute the sum of the squared errors for the line Height = Arm Span – 1, and compare the result to the sum of squared errors for the line Height = Arm Span. Based on your calculations, which line provides the better fit?

Video Segment
In this video segment, Professor Kader introduces two rules: the sum of errors and the sum of squared errors. He explains that these are used to evaluate how well any given line fits a data set and how well each line can predict the value of one variable when the value of the other variable is known.

 


In This Part: More Lines
Can we do better? Recall that for the 24 people in this study, the mean arm span is 175.5 cm and the mean height is 174.8 cm. Note that the mean arm span is .7 cm longer than the mean height. This suggests that we might try the line Height = Arm Span – .7 to describe the trend in our bivariate data. Let’s see how this line compares with the previous models. Here is the scatter plot of the data and a graph of the line YL = X – .7:

 

Here is the table for the line YL = X – .7:

 

 

 

 

 

 

 

 

 

For this line, the sum of squared errors is 770.56, which makes it a slightly better model than the line YL = X – 1 (whose SSE was 772).
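The three SSE values reported so far can be cross-checked against one another. For any line of the form YL = X + c, each error equals (Y – X) – c, so SSE(c) = SSE(0) – 2cS + nc², where S is the sum of the errors for the line YL = X. The Python sketch below assumes only the totals stated in the text (784, 772, and n = 24) and uses them to recover S and then the SSE for c = –.7; the variable and function names are our own.

```python
n = 24               # number of people in the study
sse_0 = 784          # SSE reported for YL = X       (c = 0)
sse_minus_1 = 772    # SSE reported for YL = X - 1   (c = -1)

# SSE(c) = SSE(0) - 2*c*S + n*c**2, where S is the sum of (Y - X).
# Setting c = -1 gives sse_minus_1 = sse_0 + 2*S + n, so:
error_sum = (sse_minus_1 - sse_0 - n) / 2


def sse_for_shift(c):
    """SSE for the slope-1 line YL = X + c, computed from the totals above."""
    return sse_0 - 2 * c * error_sum + n * c ** 2


print(error_sum)                        # prints: -18.0
print(round(sse_for_shift(-0.7), 2))    # prints: 770.56
```

The result agrees with the 770.56 given above. Under the same formula, and assuming the reported totals are exact, the slope-1 line with the smallest SSE would have c = S / n = –0.75, which is very close to the –.7 suggested by the rounded means.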


Problem D8
Here are the three lines we’ve considered, plus two new ones:

 

 

 

 

 

 

a. Judging on the basis of the SSE, which is the best line? Which is the worst?
b. What other ways could we change the line equation in an attempt to further reduce the SSE?
c. Is it possible to reduce the SSE to 0? Why or why not?


We have examined several lines that have yielded different SSEs. The lines, however, had one thing in common: they all had a slope of 1, so they were all parallel. Keep in mind that the slope of a line is often described as the ratio of rise to run. The formula for slope is: slope = (change in Y) / (change in X). Now, let’s investigate a line with a different slope to describe the trend in the data.

One such line, with slope 0.75, passes through (164, 164) and (188, 182) and near many of the other data points; its equation is YL = 0.75X + 41. Let’s compare this line to the line YL = X – .7, which is the best fit we have found so far.
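The slope and intercept of this line follow directly from the two points named above. A short Python check, using only those two points and the rise-over-run formula, might look like this:

```python
# The two points the line passes through, as given in the text.
x1, y1 = 164, 164
x2, y2 = 188, 182

slope = (y2 - y1) / (x2 - x1)    # (change in Y) / (change in X)
intercept = y1 - slope * x1      # solve y1 = slope * x1 + intercept

print(slope, intercept)          # prints: 0.75 41.0
```

Both values match the equation YL = 0.75X + 41.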

Note that these two lines are not parallel since they have different slopes.

Here is the scatter plot of the 24 people and the graph of the lines YL = .75X + 41 and YL = X – .7:

 

 

 

 

 

 

 

Here is the table to find the SSE for the line YL = .75X + 41:

 

 

 

 

 

 

 

 

 

 

The SSE for the line YL = .75X + 41 is 616.8 (as compared to 770.56). So this new line, with its different slope, turns out to be a better fit for the data set. See Note 3 below.


In This Part: Summary
In this session, we saw how the SSE can be used as a criterion to determine which line best fits a set of data points. The best fit is the line with the smallest SSE. This line is referred to as the least squares line because, for a given set of data points, it is the line that minimizes the sum of the squared errors. In the following Interactive Activity, you will see how these squares can be represented graphically. The least squares line is the line that minimizes the total area of all the squares formed when the vertical distance from each data point to the line is used as the side length of its square. See Note 4 below.
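This session does not show how to compute the least squares line itself. For readers who want to experiment, the Python sketch below uses the standard closed-form formulas (slope = sum of (X – mean X)(Y – mean Y) divided by sum of (X – mean X)², intercept = mean Y – slope × mean X), which go beyond what the session covers. The data pairs are hypothetical, not the study's 24 people, and the function names are our own.

```python
# Hypothetical (arm span, height) pairs in cm, NOT the 24 people in the study.
sample = [(160, 163), (165, 165), (170, 168), (180, 176), (190, 187)]


def least_squares_line(points):
    """Return (slope, intercept) of the line that minimizes the SSE."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def sse(points, slope, intercept):
    """Sum of squared errors for the line YL = slope * X + intercept."""
    return sum((y - (slope * x + intercept)) ** 2 for x, y in points)


slope, intercept = least_squares_line(sample)
fitted_sse = sse(sample, slope, intercept)

print("least squares line:", round(slope, 3), round(intercept, 3))
print("SSE, least squares line:", round(fitted_sse, 2))
print("SSE, YL = X:", sse(sample, 1, 0))
print("SSE, YL = X - 1:", sse(sample, 1, -1))
```

On any data set, the SSE of the fitted line should be no larger than the SSE of any other line you try, which is a useful sanity check when testing slopes and intercepts by hand.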

Notes

Note 3
Fathom Software, used by the participants in the video segments, is helpful in creating graphical representations of data. If you try the problems in Part D using Fathom, you will be able to test various slopes and intercepts. For more information on Fathom, go to the Key Curriculum Press Web site at www.keypress.com/fathom/.

Note 4
More advanced presentations of this topic use such ideas as the standard deviation around the regression line and the coefficient R-Squared. The data in this session has been structured so that using the sum of squares for comparison gives a reasonable result.

Solutions

Problem D1
Overall, there is an upward trend; that is, the points generally go up and to the right. This corresponds to the positive association between height and arm span.

Problem D2
a. The line does a reasonably good job. Some points are above the line, some are below it, and some are on the line, but all are generally pretty close.
b. It looks like it may be possible for another line to be, overall, “closer” to these points.

Problem D3
Answers will vary. The lines Height = Arm Span and Height = Arm Span – 1 each seem to do a good job of dividing the points fairly evenly above and below the line, and matching the overall trend of data. It is difficult to distinguish between them without a more mathematical test. Each is clearly better than Height = Arm Span + 1, which lies above a majority of the points.

Problem D4

 

 

 

 

 

 

 

 

 

 

Problem D5
Here is the completed table:

 

 

 

 

 

 

 

 

 

 

For the model YL = X – 1, the total vertical distance is 7 + 4 + … + 13 = 100. Surprisingly, according to this measure of fit, the two lines are equally good. This suggests that another measure of best fit may be useful.

Problem D6

 

 

 

 

 

 

 

 


Problem D7

 

 

 

 

 

 

 

 

 

The sum of squared errors (SSE) is 49 + 16 + … + 169 = 772. Since this is less than the sum of squared errors for the line Height = Arm Span (which was 784), the line Height = Arm Span – 1 is a slightly better fit.

Problem D8
a. The best model is YL = X – .7, because it has the smallest SSE. The worst model is YL = X + 1, because it has the largest SSE.
b. As all of these lines have the same slope, if we changed the slope, we might find ways to reduce the SSE.
c. No, we cannot reduce the SSE to zero unless all the data points lie on a straight line, which these 24 points clearly do not do.


Credits

Produced by WGBH Educational Foundation. 2001.
  • Closed Captioning
  • ISBN: 1-57680-481-X
