Intro to Regression: Part 6: Regressing against a factor variable
Factor variables (aka "nominal" variables) are not numeric, so they require special handling. Let's walk through an example.
Going back to the mtcars dataset, let's regress mpg against transmission type (am). There are two transmission types: "automatic" and "manual", which are encoded as "0" and "1" in the dataset, respectively.
Just so it's explicitly clear that we're NOT dealing with a numeric variable, I'm going to rename the levels "automatic" and "manual".
data(mtcars)
# convert am to factor and rename levels
mtcars$am <- as.factor(mtcars$am)
levels(mtcars$am) <- c("automatic", "manual")
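A quick sanity check (assuming the conversion above) is to count the cars at each level:
# counts per transmission type: 19 automatic, 13 manual
table(mtcars$am)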
As per usual, let's first do a scatterplot of mpg vs. transmission type (am):
library(ggplot2)
library(grid)
qplot(x=am, y=mpg, data=mtcars, colour=am) +
ggtitle("mpg vs. transmission type (am)") +
xlab("transmission type (am)")
It appears that automatics have lower mpg than manuals, on average.
OK, let's go ahead and build a regression model with mpg as the response variable and transmission type (am) as the sole predictor variable.
model <- lm(mpg ~ am - 1, data=mtcars)
This corresponds to the regression model:
$$ mpg = \beta_1 \cdot [am == "automatic"] + \beta_2 \cdot [am == "manual"] + \epsilon $$
where:
- mpg is the response variable
- [am == "automatic"] is a "dummy variable"
- [am == "manual"] is another dummy variable
- \(\epsilon\) is the error term (the residuals)
Note: The "-1" in "mpg ~ am - 1" removes the Y-intercept term from the model. Notice that the regression equation above doesn't have a standalone Y-intercept coefficient. Ignore this detail for now, we'll come back to it later.
Dummy variables
Since a factor variable does not have numeric values, we need to create "dummy variables" for it. That's what [am == "automatic"] and [am == "manual"] represent - they're both dummy variables.
There's one dummy variable for each factor level. The dummy variable takes on the value 0 or 1, depending on the value of the factor variable. If the am factor variable for a particular car equals "automatic", then the [am == "automatic"] dummy variable is set to 1, while the [am == "manual"] dummy variable is set to 0.
Think of [am == "automatic"] and [am == "manual"] as boolean expressions (note the double "==", for the programmers out there) that evaluate to true (1) or false (0), depending on the actual value of the am variable.
Only one dummy variable in the regression function is set to 1 at any given time. The remaining dummy variables are all 0. A dummy variable is set to 1 when the factor variable's value takes on the corresponding factor level.
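If you'd like to see the dummy variables explicitly, model.matrix() shows the 0/1 design matrix that R builds from a formula. Here's a quick peek using the same formula we fit above:
# one 0/1 column per factor level ("amautomatic" and "ammanual")
head(model.matrix(mpg ~ am - 1, data=mtcars))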
So in this example we have two factor levels, "automatic" and "manual", and therefore two dummy variables. When am = "automatic", the [am == "automatic"] dummy variable is 1, while [am == "manual"] is 0, so it drops out of the equation:
$$ \begin{align*} mpg &= \beta_1 \cdot [1] + \beta_2 \cdot [0] \\[8pt] mpg &= \beta_1 \end{align*} $$
So \(\beta_1\) is the predicted mpg value for "automatic" cars. There is no slope to this equation; it's just a horizontal line at \(Y = \beta_1\), which happens to be the mean(mpg) for cars with am = "automatic".
When am = "manual", the [am == "manual"] variable is set to 1, while the [am == "automatic"] term drops out:
$$ \begin{align*} mpg &= \beta_1 \cdot [0] + \beta_2 \cdot [1] \\[8pt] mpg &= \beta_2 \end{align*} $$
Note that the regression model is not a single line but actually two lines (or points, really), one for each factor level. Let's chart them:
B1 <- coef(model)[1]
B2 <- coef(model)[2]
qplot(x=am, y=mpg, data=mtcars, colour=am) +
ggtitle("mpg vs. transmission type (am)") +
xlab("transmission type (am)") +
geom_hline(yintercept=B1, colour=2) +
geom_hline(yintercept=B2, colour=5)
The regression model is two lines:
- the lower line, at predicted mpg = \(\beta_1\) = mean(mpg) for am = "automatic"
- the upper line, at predicted mpg = \(\beta_2\) = mean(mpg) for am = "manual"
If the factor variable had 3 levels, the regression model would have 3 lines. 4 levels, 4 lines. And so on.
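For example, cyl takes three values (4, 6, and 8 cylinders) in mtcars, so treating it as a factor yields three coefficients - one per level. This is a separate, illustrative model, not part of our am analysis:
# three factor levels => three coefficients (the mean mpg within each cyl group)
coef(lm(mpg ~ factor(cyl) - 1, data=mtcars))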
Let's look at the model summary to get the coefficient values:
summary(model)
##
## Call:
## lm(formula = mpg ~ am - 1, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.3923 -3.0923 -0.2974 3.2439 9.5077
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## amautomatic 17.147 1.125 15.25 1.13e-15 ***
## ammanual 24.392 1.360 17.94 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.902 on 30 degrees of freedom
## Multiple R-squared: 0.9487, Adjusted R-squared: 0.9452
## F-statistic: 277.2 on 2 and 30 DF, p-value: < 2.2e-16
- \(\beta_1\) = 17.147, the Estimate for the [am == "automatic"] coefficient
- \(\beta_2\) = 24.392, the Estimate for the [am == "manual"] coefficient
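We can verify that these coefficients are just the per-group means of mpg:
# mean mpg within each transmission group - matches the coefficients above
with(mtcars, tapply(mpg, am, mean))
## automatic    manual
##  17.14737  24.39231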
Re-introducing the Y-intercept term into the model
Normally you'd include the Y-intercept term in the regression model. For that we need to select a level from the factor variable that represents the "base" level. The Y-intercept coefficient corresponds to the "base" level.
The coefficients for other levels are evaluated relative to the "base" level. This is because the "base" level coefficient (the Y-intercept term) is always present (non-zero) in the regression function.
This is made clear by looking at the regression equation. Without the Y-intercept term, the regression function (from above) is:
$$ mpg = \beta_1 \cdot [am == "automatic"] + \beta_2 \cdot [am == "manual"] + \epsilon $$
With the Y-intercept term, the regression function becomes:
$$ mpg = \beta_0 + \beta_1 \cdot [am == "manual"] + \epsilon $$
The difference being that the [am == "automatic"] dummy variable has been removed. Thus, [am == "automatic"] represents the "base" level.
When the factor variable equals "automatic", the [am == "manual"] dummy variable drops out, and the regression equation becomes (ignoring the error term):
$$ \begin{align*} mpg &= \beta_0 + \beta_1 \cdot [0] \\[8pt] mpg &= \beta_0 \end{align*} $$
So \(\beta_0\) is the predicted mpg when am = "automatic".
When am = "manual", the regression function is: $$ \begin{align*} mpg &= \beta_0 + \beta_1 \cdot [am == "manual"] \\[8pt] mpg &= \beta_0 + \beta_1 \cdot [1] \\[8pt] mpg &= \beta_0 + \beta_1 \end{align*} $$
The predicted mpg when am = "manual" is: \(mpg = \beta_0 + \beta_1\). Contrast this with the non-Y-intercept model above, where the predicted mpg for am = "manual" is just: \(mpg = \beta_2\). In the Y-intercept model, the Y-intercept coefficient must be added to the [am == "manual"] coefficient to get the predicted mpg.
This is what I mean when I say the coefficients for other factor levels are evaluated relative to the "base" level. The "base" level coefficient, \(\beta_0\), is always in play. It's never "turned off" for other factor levels.
This affects the coefficients of other levels. When a factor level dummy variable is set to 1, its coefficient is added to \(\beta_0\) to get the predicted mpg. The factor level coefficient represents the difference (in the response variable) between that level and the "base" level.
The linear model including the Y-intercept term
model.y <- lm(mpg ~ am, data=mtcars)
summary(model.y)
##
## Call:
## lm(formula = mpg ~ am, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.3923 -3.0923 -0.2974 3.2439 9.5077
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 17.147 1.125 15.247 1.13e-15 ***
## ammanual 7.245 1.764 4.106 0.000285 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.902 on 30 degrees of freedom
## Multiple R-squared: 0.3598, Adjusted R-squared: 0.3385
## F-statistic: 16.86 on 1 and 30 DF, p-value: 0.000285
By default, lm uses the first level of the factor variable as the "base" level (in this example, "automatic"). The coefficients for the remaining factor levels represent the difference between that level and the base level. The coefficient values are:
- \(\beta_0\) = 17.147, the Estimate for the (Intercept) coefficient
- \(\beta_1\) = 7.245, the Estimate for the [am == "manual"] coefficient
Note that \(\beta_0\) from the Y-intercept model equals \(\beta_1\) from the non-Y-intercept model, which makes sense since they both correspond to [am == "automatic"].
Also note that \(\beta_0 + \beta_1\) from the Y-intercept model equals \(\beta_2\) from the non-Y-intercept model. This makes sense because the combined coefficients from the Y-intercept model give you the predicted mpg for [am == "manual"], just as \(\beta_2\) does in the non-Y-intercept model.
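Here's a quick numeric check of both relationships, using the two models we've already fit:
# beta_0 (Y-intercept model) equals beta_1 (non-Y-intercept model)
coef(model.y)[1]                     # 17.147
coef(model)[1]                       # 17.147
# beta_0 + beta_1 (Y-intercept model) equals beta_2 (non-Y-intercept model)
coef(model.y)[1] + coef(model.y)[2]  # 24.392
coef(model)[2]                       # 24.392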
The regression lines from the Y-intercept model are identical to those from the non-Y-intercept model:
B0 <- coef(model.y)[1]
B1 <- coef(model.y)[2]
qplot(x=am, y=mpg, data=mtcars, colour=am) +
ggtitle("mpg vs. transmission type (am)") +
xlab("transmission type (am)") +
geom_hline(yintercept=B0, colour=2) +
geom_hline(yintercept=B0+B1, colour=5)
Again, the regression model is two lines:
- the lower line, at predicted mpg = \(\beta_0\) = mean(mpg) for am = "automatic"
- the upper line, at predicted mpg = \(\beta_0 + \beta_1\) = mean(mpg) for am = "manual"
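One convenient consequence: predict() only ever returns one of two values for this model, one per transmission type. A quick illustration using model.y (the newdata data frame is just for demonstration):
# predicted mpg for each transmission type
newdata <- data.frame(am=factor(c("automatic", "manual"), levels=levels(mtcars$am)))
predict(model.y, newdata=newdata)    # 17.14737 and 24.39231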
The residuals
Let's take a look at the residuals for the Y-intercept model:
mpg.actual <- mtcars$mpg
mpg.predicted <- predict(model.y)
mpg.residual <- resid(model.y)
data.frame(mpg.actual, mtcars$am, mpg.predicted, mpg.residual)
hist(mpg.residual,
col="yellow",
main="Distribution of mpg residuals",
xlab="mpg residuals")
## mpg.actual mtcars.am mpg.predicted mpg.residual
## Mazda RX4 21.0 manual 24.39231 -3.3923077
## Mazda RX4 Wag 21.0 manual 24.39231 -3.3923077
## Datsun 710 22.8 manual 24.39231 -1.5923077
## Hornet 4 Drive 21.4 automatic 17.14737 4.2526316
## Hornet Sportabout 18.7 automatic 17.14737 1.5526316
## Valiant 18.1 automatic 17.14737 0.9526316
## ...
The residuals appear to be distributed somewhat normally. This tells us the prediction errors are somewhat balanced around the regression line(s), which is what we want.
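If you want a closer look than the histogram provides, a normal Q-Q plot of the residuals is an easy (optional) follow-up check:
# Q-Q plot: points falling near the line suggest roughly normal residuals
qqnorm(mpg.residual, main="Normal Q-Q plot of mpg residuals")
qqline(mpg.residual)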
The residuals for the Y-intercept model are exactly the same as the residuals for the non-Y-intercept model:
mpg.actual <- mtcars$mpg
mpg.predicted <- predict(model)
mpg.residual <- resid(model)
data.frame(mpg.actual, mtcars$am, mpg.predicted, mpg.residual)
hist(mpg.residual,
col="yellow",
main="Distribution of mpg residuals",
xlab="mpg residuals")
## mpg.actual mtcars.am mpg.predicted mpg.residual
## Mazda RX4 21.0 manual 24.39231 -3.3923077
## Mazda RX4 Wag 21.0 manual 24.39231 -3.3923077
## Datsun 710 22.8 manual 24.39231 -1.5923077
## Hornet 4 Drive 21.4 automatic 17.14737 4.2526316
## Hornet Sportabout 18.7 automatic 17.14737 1.5526316
## Valiant 18.1 automatic 17.14737 0.9526316
## ...
Goodness of fit - \(R^2\)
Finally, let's calculate \(R^2\):
ss.res <- sum(mpg.residual^2)
ss.tot <- with(mtcars, sum( (mpg - mean(mpg))^2 ) )
R2 <- 1 - ss.res / ss.tot
R2
## 0.3597989
Note this is the same \(R^2\) reported by the Y-intercept model summary (the "Multiple R-squared" field). It's not that great - only 36% of the variance is explained by the model. But that's not much of a surprise, since our model is very simplistic.
Note, however, that it is not the same \(R^2\) reported by the non-Y-intercept model above. For an explanation of this apparent inconsistency, check out this thread: http://stats.stackexchange.com/questions/26176/removal-of-statistically-significant-intercept-term-boosts-r2-in-linear-model
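In short, when the intercept is removed, R computes \(R^2\) against an uncentered total sum of squares (deviations from zero rather than from mean(mpg)), which inflates the number. A quick sketch that reproduces the 0.9487 figure from the non-Y-intercept summary, assuming that behavior:
# uncentered total sum of squares, used when the model has no intercept
ss.tot.uncentered <- sum(mtcars$mpg^2)
1 - sum(resid(model)^2) / ss.tot.uncentered   # ~0.9487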