Intro to Regression: Part 4: Distribution of prediction errors (residuals) and goodness of fit (R2)
Recall the simple linear regression model from the earlier parts, \(Y = \beta_0 + \beta_1 X + \epsilon\), where:
- Y is the response variable
- X is the predictor variable
- \(\beta_0\) is the Y-intercept of the regression line
- \(\beta_1\) is the slope of the regression line
- \(\epsilon\) is the error term; its estimates from the fitted model are the "residuals"
The residuals are highlighted in the chart below by the red lines (using the mtcars example from part 2):
library(ggplot2)
library(grid)   # for unit(), used to size the arrow heads

data(mtcars)

# Fit the simple linear regression of mpg on weight
model <- lm(mpg ~ wt, data=mtcars)
mpg.predicted <- predict(model)

# Scatter plot with the regression line, plus a red segment from each
# actual mpg value to its predicted value (the residual)
qplot(x=wt, y=mpg, data=mtcars) +
    ggtitle("mpg vs. weight") +
    xlab("weight (x 1000 lbs)") +
    stat_smooth(method="lm", se=FALSE, size=1) +
    geom_segment(aes(xend=wt, yend=mpg.predicted),
                 color="red",
                 linetype="dotted",
                 arrow=arrow(ends="both",
                             type="closed",
                             length=unit(3, "points")))
The table below tabulates the actual values of Y (mpg), the predicted values of Y, and the residuals:
mpg.actual <- mtcars$mpg
mpg.predicted <- predict(model)
mpg.residual <- resid(model)
data.frame(mpg.actual, mpg.predicted, mpg.residual)
##                   mpg.actual mpg.predicted mpg.residual
## Mazda RX4               21.0     23.282611   -2.2826106
## Mazda RX4 Wag           21.0     21.919770   -0.9197704
## Datsun 710              22.8     24.885952   -2.0859521
## Hornet 4 Drive          21.4     20.102650    1.2973499
## Hornet Sportabout       18.7     18.900144   -0.2001440
## Valiant                 18.1     18.793255   -0.6932545
## ...
Plotting the residuals
It's usually a good idea to do a cursory examination of the residuals. We expect the residuals to be roughly normally distributed around 0. Why? Because that means the prediction errors are balanced around the regression line. If the prediction errors are NOT balanced around the regression line, then the relationship between the predictor and response variables may not in fact be linear, and further investigation into the relationship is warranted.
Linear model assumption: One of the assumptions about linear models is that the residuals are normally distributed around 0. If that assumption does not hold, then the accuracy of the linear model can't be relied on.
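One quick way to sanity-check that assumption is with base R's normal Q-Q plot and the Shapiro-Wilk test (a minimal sketch, using the mpg.residual values computed above):
# Normal Q-Q plot of the residuals: points should fall roughly along the line
qqnorm(mpg.residual)
qqline(mpg.residual)

# Shapiro-Wilk normality test: a very small p-value would suggest the
# residuals are not normally distributed
shapiro.test(mpg.residual)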
Let's examine the residuals by plotting them.
qplot(x=mtcars$wt, y=mpg.residual) +
    ggtitle("mpg vs. weight - residuals") +
    xlab("weight (x 1000 lbs)") +
    ylab("mpg residual (actual - predicted)") +
    geom_hline(yintercept=0, color="blue", linetype="dashed", size=1)
The residuals appear to be somewhat balanced around 0. We can get a better feel for the distribution of residuals by using a histogram.
hist(mpg.residual,
     col="yellow",
     main="Distribution of mpg residuals",
     xlab="mpg residuals")
It's not a perfectly normal distribution by any means, but it's close enough. What we're really looking for are non-normal patterns in the residual distribution. Such patterns require an explanation before we can trust that the linear model is sound. Often these patterns suggest either that the correlation is not linear, or that a confounding variable is not accounted for.
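A convenient way to hunt for such patterns is R's built-in diagnostic plots for a fitted lm object (a minimal sketch, using the model fitted above):
# Residuals vs. fitted values: look for curvature, funnels, or other structure
plot(model, which = 1)

# Scale-location plot: checks whether the spread of the residuals changes
# with the fitted values (non-constant variance)
plot(model, which = 3)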
Goodness of fit - R2
The accuracy of the linear model, aka its "goodness of fit", can be gauged by looking at the variance of the residuals. A small residual variance means the discrepancies between predicted and actual values of the response variable are small, which suggests a good fit. A large residual variance reflects larger discrepancies, which suggests a not-so-good fit.
One measure of goodness of fit is known as R2. R2 is a ratio that relates the total variance of the response variable to the residual variance.
Total variance of the response variable is simply the variance of the response variable alone. It could be thought of as the variance of the response variable without the regression model.
Residual variance is the variance of the response variable with the model.
In other words, residual variance is the variance of the response variable with respect to the regression model, just as total variance is the variance of the response variable with respect to its mean.
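For the mtcars example, the two variances can be compared directly (a quick sketch using the model fitted above):
# Total variance of mpg: spread about its mean, ignoring the model
var(mtcars$mpg)

# Residual variance: spread of mpg about the regression line
var(resid(model))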
R2 tells us how much of the total variance of the response variable is "explained by the model". It is calculated by the equation below:
\[ R^2 = 1 - \frac{SS_{res}}{SS_{tot}} \]
where:
- \(SS_{res}\) is the sum-of-squares of the residuals
- \(SS_{tot}\) is the total sum-of-squares of the response variable (with respect to its mean)
Going back to the mtcars example:
ss.res <- sum(mpg.residual^2)
ss.tot <- with(mtcars, sum( (mpg - mean(mpg))^2 ) )
R2 <- 1 - ss.res / ss.tot
R2
## [1] 0.7528328
This means that approximately 75% of the total variance of mpg is "explained by the model", which is to say that 75% of the variation in mpg is accounted for by its linear relationship with the predictor variable (weight).
In a simple linear model (where "simple" means "only one predictor variable"), R2 is equal to the square of the correlation coefficient between the predictor and the response. The correlation coefficient is often denoted with a lowercase r, so its square is r2. This is where "R2" gets its name.
r2 <- cor(mtcars$mpg, mtcars$wt)^2
r2
## [1] 0.7528328
R conveniently provides goodness-of-fit information in the summary of the linear model:
summary(model)
##
## Call:
## lm(formula = mpg ~ wt, data = mtcars)
##
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.5432 -2.3647 -0.1252  1.4096  6.8727 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  37.2851     1.8776  19.858  < 2e-16 ***
## wt           -5.3445     0.5591  -9.559 1.29e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.046 on 30 degrees of freedom
## Multiple R-squared: 0.7528, Adjusted R-squared: 0.7446
## F-statistic: 91.38 on 1 and 30 DF, p-value: 1.294e-10
Note that the R2 from the summary ("Multiple R-squared") is equal to the R2 we calculated manually above.
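If you want these values programmatically rather than reading them off the printed output, they are also stored in the summary object:
# R-squared and adjusted R-squared from the fitted model's summary
summary(model)$r.squared
## [1] 0.7528328
summary(model)$adj.r.squared
## [1] 0.7445939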