curvefit.com. Guide to nonlinear regression. GraphPad Software. Data analysis and biostatistics resources.



curvefit.com was created by GraphPad Software, Inc. Send comments or questions to the author of these pages, Dr. Harvey Motulsky, president of GraphPad Software.

In April 2003, GraphPad released Prism 4 and published Fitting Models to Biological Data using Linear and Nonlinear Regression. This book includes all the information that comprises curvefit.com, and much more. You can read this book as a pdf file.



Linear regression

Introduction to linear regression

Linear regression analyzes the relationship between two variables, X and Y. For each subject (or experimental unit), you know both X and Y and you want to find the best straight line through the data. In some situations, the slope and/or intercept have a scientific meaning. In other cases, you use the linear regression line as a standard curve to find new values of X from Y, or Y from X.

The term "regression", like many statistical terms, is used in statistics quite differently than it is used in other contexts. The method was first used to examine the relationship between the heights of fathers and sons. The two were related, of course, but the slope was less than 1.0. A tall father tended to have sons shorter than himself; a short father tended to have sons taller than himself. The height of sons regressed to the mean. The term "regression" is now used for many sorts of curve fitting.

Prism determines and graphs the best-fit linear regression line, optionally including 95% confidence bands or 95% prediction bands. You may also force the line through a particular point (usually the origin), calculate residuals, calculate a runs test, or compare the slopes and intercepts of two or more regression lines.

In general, the goal of linear regression is to find the line that best predicts Y from X. Linear regression does this by finding the line that minimizes the sum of the squares of the vertical distances of the points from the line.

Note that linear regression does not test whether your data are linear (except via the runs test). It assumes that your data are linear, and finds the slope and intercept that make a straight line best fit your data.

How linear regression works

Minimizing sum-of-squares

The goal of linear regression is to adjust the values of slope and intercept to find the line that best predicts Y from X. More precisely, the goal of regression is to minimize the sum of the squares of the vertical distances of the points from the line. Why minimize the sum of the squares of the distances?  Why not simply minimize the sum of the actual distances?

If the random scatter follows a Gaussian distribution, it is far more likely to have two medium-sized deviations (say 5 units each) than to have one small deviation (1 unit) and one large (9 units). A procedure that minimized the sum of the absolute values of the distances would have no preference between a line that was 5 units away from two points and one that was 1 unit away from one point and 9 units from another. The sum of the distances (more precisely, the sum of the absolute values of the distances) is 10 units in each case. A procedure that minimizes the sum of the squares of the distances prefers to be 5 units away from two points (sum-of-squares = 50) rather than 1 unit away from one point and 9 units away from another (sum-of-squares = 82). If the scatter is Gaussian (or nearly so), the line determined by minimizing the sum-of-squares is most likely to be correct.
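The arithmetic in this comparison can be checked with a few lines of Python. This is only an illustrative sketch; the names line_a and line_b are hypothetical labels for the two candidate lines described above.

```python
# Vertical distances of two points from each of two candidate lines
# (the 5/5 and 1/9 example from the text; the names are hypothetical).
line_a = [5, 5]
line_b = [1, 9]


def sum_abs(distances):
    """Sum of the absolute values of the distances."""
    return sum(abs(d) for d in distances)


def sum_squares(distances):
    """Sum of the squares of the distances."""
    return sum(d * d for d in distances)


print(sum_abs(line_a), sum_abs(line_b))          # 10 10
print(sum_squares(line_a), sum_squares(line_b))  # 50 82
```

The absolute-value criterion cannot tell the two lines apart, while the sum-of-squares criterion prefers the first.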

The calculations are shown in every statistics book, and are entirely standard.
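For readers who want to see those standard calculations, here is a minimal sketch in Python. It uses the usual textbook least-squares formulas; the function name fit_line and the five data points are made up for illustration and are not taken from Prism.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept for the line Y = intercept + slope*X."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    # Sum of squared X deviations, and sum of cross-products of deviations.
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    return slope, intercept


# Hypothetical example data.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
slope, intercept = fit_line(xs, ys)
print(round(slope, 4), round(intercept, 4))  # 1.99 0.05
```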

Slope and intercept

Prism reports the best-fit values of the slope and intercept, along with their standard errors and confidence intervals.

The slope quantifies the steepness of the line. It equals the change in Y for each unit change in X. It is expressed in the units of the Y-axis divided by the units of the X-axis. If the slope is positive, Y increases as X increases. If the slope is negative, Y decreases as X increases.

The Y intercept is the Y value of the line when X equals zero. It defines the elevation of the line.

The standard error values of the slope and intercept can be hard to interpret, but their main purpose is to compute the 95% confidence intervals. If you accept the assumptions of linear regression, there is a 95% chance that the 95% confidence interval of the slope contains the true value of the slope,  and  that the 95% confidence interval for the intercept contains the true value of the intercept.
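As a rough sketch of how the standard errors lead to the confidence intervals, the Python below uses the standard textbook formulas for simple linear regression. Prism's internal code is not shown in this guide, so treat this as an assumption about the usual method. The function name, the data, and the constant T_CRIT_DF3 are hypothetical; the critical t value (3.182 for 3 degrees of freedom) must be looked up in a t table for your own N - 2.

```python
import math


def slope_intercept_with_ci(xs, ys, t_crit):
    """Return (value, standard error, CI low, CI high) for slope and intercept."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    # Sum of squared vertical distances of the points from the line.
    ss_resid = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    sy_x = math.sqrt(ss_resid / (n - 2))
    se_slope = sy_x / math.sqrt(sxx)
    se_intercept = sy_x * math.sqrt(1 / n + x_mean ** 2 / sxx)

    def with_ci(value, se):
        return value, se, value - t_crit * se, value + t_crit * se

    return with_ci(slope, se_slope), with_ci(intercept, se_intercept)


# Hypothetical example data; N = 5, so N - 2 = 3 degrees of freedom.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
T_CRIT_DF3 = 3.182  # two-tailed 95% critical t for 3 df, from a t table
slope_result, intercept_result = slope_intercept_with_ci(xs, ys, T_CRIT_DF3)
print("slope, SE, 95% CI:", slope_result)
print("intercept, SE, 95% CI:", intercept_result)
```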

r2, a measure of goodness-of-fit of linear regression

The value r2 is a fraction between 0.0 and 1.0, and has no units. An r2 value of  0.0 means that knowing X does not help you predict Y. There is no linear relationship between X and Y, and the best-fit line is a horizontal line going through the mean of all Y values.  When r2 equals 1.0, all points lie exactly on a straight line with no scatter. Knowing X lets you predict Y perfectly.

This figure demonstrates how Prism computes r2.



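[Figure: two panels. Left: the best-fit linear regression line with the vertical distances of the points from it. Right: a horizontal line through the mean of all Y values. The equation shown in the figure is:]

$$r^2 = 1 - \frac{SS_{reg}}{SS_{tot}}$$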

The left panel shows the best-fit linear regression line. This line minimizes the sum-of-squares of the vertical distances of the points from the line. Those vertical distances are also shown on the left panel of the figure. In this example, the sum of squares of those distances (SSreg) equals 0.86. Its units are the units of the Y-axis squared. To use this value as a measure of goodness-of-fit, you must compare it to something.

The right half of the figure shows the null hypothesis -- a horizontal line through the mean of all the Y values. Goodness-of-fit of this model (SStot) is also calculated as the sum of squares of the vertical distances of the points from the line, 4.907 in this example. The ratio of the two sum-of-squares values compares the regression model with the null hypothesis model. The equation to compute r2 is shown in the figure. In this example r2 is 0.8248. The regression model fits the data much better than the null hypothesis, so SSreg is much smaller than SStot, and r2 is near 1.0. If the regression model were not much better than the null hypothesis, r2 would be near zero.

You can think of r2 as the fraction of the total variance of Y that is "explained" by variation in X. The value of r2 (unlike the regression line itself) would be the same if X and Y were swapped. So r2 is also the fraction of the variance in X that is "explained" by variation in Y. In other words, r2 is the fraction of the variation that is shared between X and Y.

In this example, 82% of the total variance in Y is "explained" by the linear regression model. That leaves the rest of the variance (18% of the total) as variability of the data from the model (SSreg).
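The r2 calculation described in this section, r2 = 1 - SSreg/SStot, can be sketched in Python as follows. The function name and data are hypothetical. Following this guide's naming, ss_reg is the sum of squares of the points around the regression line and ss_tot is the sum of squares around the horizontal line at the mean of Y. The last lines illustrate that swapping X and Y leaves r2 unchanged.

```python
def r_squared(xs, ys):
    """r2 = 1 - SSreg/SStot for the least-squares line predicting ys from xs."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    # Scatter around the regression line (SSreg in this guide's notation).
    ss_reg = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    # Scatter around the null-hypothesis line, a horizontal line at mean Y.
    ss_tot = sum((y - y_mean) ** 2 for y in ys)
    return 1 - ss_reg / ss_tot


# Hypothetical example data.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
r2_y_from_x = r_squared(xs, ys)
r2_x_from_y = r_squared(ys, xs)  # X and Y swapped
print(round(r2_y_from_x, 4), round(r2_x_from_y, 4))
```

With the rounded values quoted in the text (SSreg = 0.86, SStot = 4.907), the same formula gives an r2 of about 0.825.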

Why Prism doesn't report r2 in constrained linear regression

Prism does not report r2 when you force the line through the origin (or any other point), because the calculations would be ambiguous. There are two ways to compute r2 when the regression line is constrained. As you saw in the previous section, r2 is computed by comparing the sum-of-squares from the regression line with the sum-of-squares from a model defined by the null hypothesis. With constrained regression, there are two possible null hypotheses. One is a horizontal line through the mean of all Y values. But this line doesn't follow the constraint -- it does not go through the origin. The other null hypothesis would be a horizontal line through the origin, far from most of the data.   

Because r2 is ambiguous in constrained linear regression, Prism doesn't report it. If you really want to know a value for r2, use nonlinear regression to fit your data to the equation Y=slope*X.  Prism will report r2 defined the first way (comparing regression sum-of-squares to the sum-of-squares from a horizontal line at the mean Y value).
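The ambiguity can be made concrete with a small Python sketch. It fits Y = slope*X by least squares and then computes r2 against each of the two candidate null hypotheses. The function name and data are hypothetical, and the slope formula (sum of X*Y divided by sum of X squared) is the standard least-squares result for a line forced through the origin.

```python
def fit_through_origin(xs, ys):
    """Fit Y = slope*X and return the slope plus r2 computed two ways."""
    slope = sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)
    # Scatter of the points around the constrained line.
    ss_reg = sum((y - slope * x) ** 2 for x, y in zip(xs, ys))
    y_mean = sum(ys) / len(ys)
    # Null hypothesis 1: horizontal line through the mean of all Y values.
    ss_mean_line = sum((y - y_mean) ** 2 for y in ys)
    # Null hypothesis 2: horizontal line through the origin (Y = 0).
    ss_zero_line = sum(y * y for y in ys)
    return slope, 1 - ss_reg / ss_mean_line, 1 - ss_reg / ss_zero_line


# Hypothetical example data.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
slope, r2_vs_mean, r2_vs_zero = fit_through_origin(xs, ys)
print(round(slope, 4), round(r2_vs_mean, 4), round(r2_vs_zero, 4))
```

The two r2 values differ, which is the ambiguity described above. The first one, computed against the horizontal line at the mean Y value, corresponds to the definition the text says Prism reports from nonlinear regression.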

The standard deviation of the residuals, sy.x

Prism doesn't actually report the sum-of-squares of the vertical distances of the points from the line (SSreg). Instead Prism reports the standard deviation of the residuals, sy.x.

The variable sy.x quantifies the average size of the residuals, expressed in the same units as Y.  Some books and programs refer to this value as se. It is calculated from SSreg and N (number of points) using this equation:

$$s_{y \cdot x} = \sqrt{\frac{SS_{reg}}{N - 2}}$$
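A minimal Python sketch of this calculation, assuming the standard definition sy.x = sqrt(SSreg / (N - 2)), where N - 2 is the number of degrees of freedom after fitting a slope and an intercept. The function name and the example numbers are hypothetical.

```python
import math


def sd_of_residuals(ss_reg, n):
    """Standard deviation of the residuals, sy.x, from SSreg and N."""
    if n <= 2:
        raise ValueError("need more than two points")
    return math.sqrt(ss_reg / (n - 2))


# Hypothetical example: SSreg = 0.107 from a fit to N = 5 points.
sy_x = sd_of_residuals(0.107, 5)
print(round(sy_x, 4))
```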

Is the slope significantly different than zero?

Prism reports the P value testing the null hypothesis that the overall slope is zero. The P value answers this question: If there were no linear relationship between X and Y overall, what is the probability that randomly selected points would result in a regression line as far from horizontal (or further) than you observed? The P value is calculated from an F test, and Prism also reports the value of F and its degrees of freedom.
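The text does not give the formula for F, so the Python sketch below assumes the standard F ratio for simple linear regression: the reduction in sum-of-squares achieved by the line (1 degree of freedom) divided by the residual mean square (N - 2 degrees of freedom). The function name and data are hypothetical. Converting F to a P value requires an F distribution table or a statistics library, which this sketch leaves out.

```python
def slope_f_ratio(xs, ys):
    """Return (F, df_numerator, df_denominator) for the test that slope = 0."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    # SSreg: scatter around the regression line (this guide's notation).
    ss_reg = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    # SStot: scatter around the horizontal line at the mean of Y.
    ss_tot = sum((y - y_mean) ** 2 for y in ys)
    f_ratio = (ss_tot - ss_reg) / (ss_reg / (n - 2))
    return f_ratio, 1, n - 2


# Hypothetical example data.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
f_ratio, df_num, df_den = slope_f_ratio(xs, ys)
print(round(f_ratio, 1), df_num, df_den)
```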

Additional calculations following linear regression

Confidence or prediction interval of a regression line

If you check the option box, Prism will calculate and graph either the 95% confidence interval or 95% prediction interval of the regression line.  Two curves surrounding the best-fit line define the confidence interval.

The dashed lines that demarcate the confidence interval are curved. This does not mean that the confidence interval includes the possibility of curves as well as straight lines. Rather, the curved lines are the boundaries of all possible straight lines. The figure below shows four possible linear regression lines (solid) that lie within the confidence interval (dashed).

Given the assumptions of linear regression, you can be 95% confident that the two curved confidence bands enclose the true best-fit linear regression line, leaving a 5% chance that the true line is outside those boundaries.

Many data points will be outside the 95% confidence interval boundary. The confidence interval is 95% sure to contain the true regression line. This is not the same as saying it will contain 95% of the data points.

Prism can also plot the 95% prediction interval. The prediction bands are further from the best-fit line than the confidence bands, a lot further if you have many data points. The 95% prediction interval is the area in which you expect 95% of all data points to fall. In contrast, the 95% confidence interval is the area that has a 95% chance of containing the true regression line. This graph shows both prediction and confidence intervals (the curves defining the prediction intervals are further from the regression line).
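To see why the bands are curved and why the prediction bands are wider, here is a sketch using the standard textbook formulas for the half-width of each band at a given X. These formulas are an assumption about the usual method rather than a description of Prism's code. The function name, the data, and the critical t value (3.182 for 3 degrees of freedom, from a t table) are hypothetical.

```python
import math


def band_half_widths(xs, ys, x0, t_crit):
    """Return (predicted Y, confidence half-width, prediction half-width) at x0."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    ss_resid = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    sy_x = math.sqrt(ss_resid / (n - 2))
    # Uncertainty in the position of the line grows with distance from mean X.
    line_term = 1 / n + (x0 - x_mean) ** 2 / sxx
    conf = t_crit * sy_x * math.sqrt(line_term)
    # The prediction band adds the scatter of individual points (the extra 1).
    pred = t_crit * sy_x * math.sqrt(1 + line_term)
    return intercept + slope * x0, conf, pred


# Hypothetical example data; N - 2 = 3 degrees of freedom.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
T_CRIT_DF3 = 3.182  # two-tailed 95% critical t for 3 df, from a t table
at_mean = band_half_widths(xs, ys, 3, T_CRIT_DF3)
at_edge = band_half_widths(xs, ys, 5, T_CRIT_DF3)
print(at_mean)
print(at_edge)
```

Both bands are narrowest at the mean of X and widen toward the ends, which is why they appear curved on the graph.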

Residuals from a linear regression line

Residuals are the vertical distances of each point from the regression line. The X values in the residual table are identical to the X values you entered. The Y values are the residuals. A residual with a positive value means that the point is above the line; a residual with a negative value means the point is below the line.

If you create a table of residuals, Prism automatically makes a new graph containing the residuals and nothing else. It is easier to interpret the graph than the table of numbers.

If the assumptions of linear regression have been met, the residuals will be randomly scattered above and below the line at Y=0. The scatter should not vary with X. You also should not see large clusters of adjacent points that are all above or all below the Y=0 line.
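A residual table like the one described here can be sketched in Python. The function name and data are hypothetical; each row pairs an original X value with the vertical distance of that point from the least-squares line.

```python
def residual_table(xs, ys):
    """Return a list of (x, residual) rows for the least-squares line."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    # Positive residual: point above the line. Negative: point below.
    return [(x, y - (intercept + slope * x)) for x, y in zip(xs, ys)]


# Hypothetical example data.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
table = residual_table(xs, ys)
for x, resid in table:
    print(x, round(resid, 2))
```

When the line includes a fitted intercept, the residuals always sum to zero (apart from rounding), so what matters is their pattern against X, not their total.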

Runs test following linear regression

The runs test determines whether your data differ significantly from a straight line. Prism can only calculate the runs test if you entered the X values in order.

A run is a series of consecutive points that are either all above or all below the regression line. In other words, a run is a consecutive series of points whose residuals are either all positive or all negative.

If the data points are randomly distributed above and below the regression line, it is possible to calculate the expected number of runs. If there are Na points above the curve and Nb points below the curve, the number of runs you expect to see equals [(2NaNb)/(Na+Nb)]+1. If you observe fewer runs than expected, it may be a coincidence of random sampling or it may mean that your data deviate systematically from a straight line. The P value from the runs test answers this question: If the data really follow a straight line, what is the chance that you would obtain as few (or fewer) runs as observed in this experiment?

The P values are always one-tail, asking about the probability of observing as few runs (or fewer) than observed. If you observe more runs than expected, the P value will be higher than 0.50.
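The runs calculations can be sketched in Python. Counting runs and computing the expected number follow the text directly. For the one-tail P value, this sketch assumes the exact distribution of the number of runs from the Wald-Wolfowitz runs test; the guide does not say which method Prism uses, so that part is an assumption. All function names and the residuals are hypothetical, and the residuals must be listed in order of increasing X.

```python
from math import comb


def count_runs(residuals):
    """Number of runs of consecutive same-sign residuals (zeros are skipped)."""
    signs = [r > 0 for r in residuals if r != 0]
    runs = 1 if signs else 0
    for previous, current in zip(signs, signs[1:]):
        if current != previous:
            runs += 1
    return runs


def expected_runs(na, nb):
    """Expected number of runs: [(2*Na*Nb)/(Na+Nb)] + 1."""
    return 2 * na * nb / (na + nb) + 1


def prob_runs(na, nb, r):
    """Exact probability of exactly r runs, given na >= 1 and nb >= 1."""
    k, odd = divmod(r, 2)
    if odd:
        ways = (comb(na - 1, k) * comb(nb - 1, k - 1)
                + comb(na - 1, k - 1) * comb(nb - 1, k))
    else:
        ways = 2 * comb(na - 1, k - 1) * comb(nb - 1, k - 1)
    return ways / comb(na + nb, na)


def runs_p_value(na, nb, observed):
    """One-tail P value: chance of seeing as few runs as observed, or fewer."""
    return sum(prob_runs(na, nb, r) for r in range(2, observed + 1))


# Hypothetical residuals from fitting a line to curved data, in X order.
residuals = [-0.4, -0.2, -0.1, 0.1, 0.3, 0.4, 0.3, 0.1, -0.1, -0.3, -0.5]
na = sum(1 for r in residuals if r > 0)  # points above the line
nb = sum(1 for r in residuals if r < 0)  # points below the line
observed = count_runs(residuals)
p_value = runs_p_value(na, nb, observed)
print(observed, round(expected_runs(na, nb), 2), round(p_value, 4))
```

Here only 3 runs are observed where about 6.5 are expected, and the small P value suggests the data deviate systematically from a straight line.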

If the runs test reports a low P value, conclude that the data don't really follow a straight line, and consider using nonlinear regression to fit a curve.

Comparing slopes and intercepts

Prism can test whether the slopes and intercepts of two or more data sets are significantly different. It compares linear regression lines using the method explained in Chapter 18 of J Zar, Biostatistical Analysis, 2nd edition, Prentice-Hall, 1984.

Prism compares slopes first. It calculates a P value (two-tailed) testing the null hypothesis that the slopes are all identical (the lines are parallel). The P value answers this question: If the slopes really were identical, what is the chance that randomly selected data points would have slopes as different (or more different) than you observed? If the P value is less than 0.05, Prism concludes that the lines are significantly different. In that case, there is no point in comparing the intercepts. The intersection point of two lines is:

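$$X = \frac{\mathrm{Intercept}_1 - \mathrm{Intercept}_2}{\mathrm{Slope}_2 - \mathrm{Slope}_1}$$

$$Y = \mathrm{Intercept}_1 + \mathrm{Slope}_1 \cdot X = \mathrm{Intercept}_2 + \mathrm{Slope}_2 \cdot X$$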
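The intersection point follows from setting the two line equations equal, which gives X = (Intercept1 - Intercept2) / (Slope2 - Slope1), with Y then read from either line. A small Python sketch, with a hypothetical function name and made-up lines:

```python
def intersection(slope1, intercept1, slope2, intercept2):
    """Return the (X, Y) point where two non-parallel lines cross."""
    if slope1 == slope2:
        raise ValueError("parallel lines do not intersect at a single point")
    x = (intercept1 - intercept2) / (slope2 - slope1)
    y = intercept1 + slope1 * x
    return x, y


# Hypothetical lines: Y = 1 + 2X and Y = 4 + 0.5X.
x_cross, y_cross = intersection(2.0, 1.0, 0.5, 4.0)
print(x_cross, y_cross)  # 2.0 5.0
```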

If the P value for comparing slopes is greater than 0.05, Prism concludes that the slopes are not significantly different and  calculates a single slope for all the lines. Now the question is whether the lines are parallel or identical. Prism calculates a second P value testing the null hypothesis that the lines are identical. If this P value is low, conclude that the lines are not identical (they are distinct but parallel). If this second P value is high, there is no compelling evidence that the lines are different.

This method is equivalent to an Analysis of Covariance (ANCOVA), although ANCOVA can be extended to more complicated situations.

Standard Curve

To read unknown values from a standard curve, you must enter unpaired X or Y values below the X and Y values for the standard curve.

Depending on which option(s) you selected in the Parameters dialog, Prism calculates Y values for all the unpaired X values and/or X values for all unpaired Y values and places these on new output views.
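Reading unknowns from a linear standard curve amounts to evaluating the fitted line in one direction or solving it in the other. The sketch below assumes the slope and intercept have already been fitted to the standards; the function names and numbers are hypothetical.

```python
def y_from_x(x, slope, intercept):
    """Predict Y for an unpaired X value using the standard curve."""
    return intercept + slope * x


def x_from_y(y, slope, intercept):
    """Read X for an unpaired Y value by solving the line for X."""
    if slope == 0:
        raise ValueError("a horizontal line cannot be used to find X from Y")
    return (y - intercept) / slope


# Hypothetical standard curve: slope 1.99, intercept 0.05.
slope, intercept = 1.99, 0.05
unknown_y_values = [4.03, 6.02, 8.01]
interpolated_x = [x_from_y(y, slope, intercept) for y in unknown_y_values]
print([round(x, 3) for x in interpolated_x])  # [2.0, 3.0, 4.0]
```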

How to think about the results of linear regression

Your approach to linear regression will depend on your goals.

If your goal is to analyze a standard curve, you won't be very interested in most of the results. Just make sure that r2 is high and that the line goes near the points. Then go straight to the standard curve results.

In many situations, you will be most interested in the best-fit values for slope and intercept. Don't just look at the best-fit values; also look at the 95% confidence intervals of the slope and intercept. If the intervals are too wide, repeat the experiment with more data.

If you forced the line through a particular point, look carefully at the graph of the data and best-fit line to make sure you picked an appropriate point.

Consider whether a linear model is appropriate for your data. Do the data seem linear? Is the P value for the runs test high? Are the residuals random? If you answered no to any of those questions, consider whether it makes sense to use nonlinear regression instead.

Checklist. Is linear regression the right analysis for these data?

To check that linear regression is an appropriate analysis for these data, ask yourself these questions.

Question: Can the relationship between X and Y be graphed as a straight line?
Discussion: In many experiments the relationship between X and Y is curved, making linear regression inappropriate. Either transform the data, or use a program (such as GraphPad Prism) that can perform nonlinear curve fitting.

Question: Is the scatter of data around the line Gaussian (at least approximately)?
Discussion: Linear regression analysis assumes that the scatter is Gaussian.

Question: Is the variability the same everywhere?
Discussion: Linear regression assumes that scatter of points around the best-fit line has the same standard deviation all along the curve. The assumption is violated if the points with high or low X values tend to be further from the best-fit line. The assumption that the standard deviation is the same everywhere is termed homoscedasticity.

Question: Do you know the X values precisely?
Discussion: The linear regression model assumes that X values are exactly correct, and that experimental error or biological variability only affects the Y values. This is rarely the case, but it is sufficient to assume that any imprecision in measuring X is very small compared to the variability in Y.

Question: Are the data points independent?
Discussion: Whether one point is above or below the line is a matter of chance, and does not influence whether another point is above or below the line.

Question: Are the X and Y values intertwined?
Discussion: If the value of X is used to calculate Y (or the value of Y is used to calculate X) then linear regression calculations are invalid. One example is a Scatchard plot, where the Y value (bound/free) is calculated from the X value. See "Avoid Scatchard, Lineweaver-Burk and similar transforms". Another example would be a graph of midterm exam scores (X) vs. total course grades (Y). Since the midterm exam score is a component of the total course grade, linear regression is not valid for these data.



All contents copyright © 1999 by GraphPad Software, Inc. All rights reserved.