Chapter 19 Evaluating linear models

We have now covered a range of linear models that are all fitted using the same tool in R (lm): t-tests, 1-way ANOVA, 2+ way ANOVA, ordinary linear regression, and multiple regression.

The models are all fitted in the same way, and have the same assumptions. We have already covered the four key diagnostic plots (see the ANOVA and linear regression sections, and the GSWR textbook), how to evaluate the significance of parameters, and the meaning of the coefficients.

There are some additional useful points to consider: proportion of variance explained (R-squared value), proportion of variance explained by different variables in the model, Akaike’s Information Criterion (and likelihood).

During the 2+ way ANOVA (multiple regression) section you may have realised that there can be multiple ways to fit a model. For example, you may have a choice of explanatory variables: should a particular variable be included or not? Which ones should you include? Would a log-transformed explanatory variable be better?

We will use the class data to look at these topics (download the latest version please!). In this example, I am filtering the data to only 2019, but you can choose to use all the data (i.e. no filtering), or this year’s data.

library(dplyr) # needed for filter(), mutate(), arrange() and the %>% pipe

classData <- read.csv("CourseData/classData.csv") %>%
  filter(Year == 2019) %>% # you can edit this to look at particular years
  filter(Gender %in% c("Male", "Female")) %>%
  filter(!is.na(HandWidth)) # filter out rows with missing hand width

19.1 R-squared value

Let’s fit a simple model: Height ~ HandWidth + Gender

mod1 <- lm(Height ~ HandWidth + Gender, data = classData)
summary(mod1)
## 
## Call:
## lm(formula = Height ~ HandWidth + Gender, data = classData)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -10.887  -4.041   1.217   3.697   9.412 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 158.0822     7.1746  22.034  < 2e-16 ***
## HandWidth     1.4026     0.8777   1.598 0.120500    
## GenderMale   10.0774     2.3460   4.296 0.000169 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.053 on 30 degrees of freedom
## Multiple R-squared:  0.6385, Adjusted R-squared:  0.6144 
## F-statistic: 26.49 on 2 and 30 DF,  p-value: 2.357e-07

The model summary here shows us the R-squared value, which is a measure of the proportion of variation in the response variable that is explained by variation in the explanatory variable(s). In a linear regression, an R-squared value of 1 (100%) would mean that all data points fall exactly on the line. As the R-squared value declines, there is more noise in the relationship (i.e. the points become more spread out around the line).

There are two types of R-squared value shown in this summary: Multiple R-squared (0.6385) and Adjusted R-squared (0.6144).

We’ll look at multiple R-squared first. This value is calculated as the amount of explained variation divided by the total amount of variation. Take a look at the anova summary table:

anova(mod1)
## Analysis of Variance Table
## 
## Response: Height
##           Df Sum Sq Mean Sq F value    Pr(>F)    
## HandWidth  1 881.60  881.60  34.527 1.973e-06 ***
## Gender     1 471.13  471.13  18.451 0.0001685 ***
## Residuals 30 766.00   25.53                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Here, the column labelled Sum Sq tells us how much variation is EXPLAINED by each of the terms. The final entry, for Residuals, is the amount not explained (hence “residual”). Therefore we can calculate R-squared from this as (881.6 + 471.1) / (881.6 + 471.1 + 766) = 0.6384575. You can check that this matches the figure reported by summary(mod1).
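
Rather than typing the numbers in by hand, the same calculation can be done directly from the ANOVA table. Here is a minimal sketch, assuming mod1 from above:

ss <- anova(mod1)$"Sum Sq"        # sums of squares for each term, plus the residuals
explained <- sum(ss[-length(ss)]) # everything except the final Residuals entry
explained / sum(ss)               # should match summary(mod1)$r.squared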

But what is the Adjusted R-squared?

Multiple R-squared is the version of the R-squared value for models that can have multiple predictor variables. It measures the amount of variation in the response variable that is explained by the predictor variables together. When additional terms are added to the model, the multiple R-squared will always increase, because every term will explain some portion of the variance, even if it is very small. This behaviour can be a bit annoying, and so adjusted R-squared controls against this increase by adding a penalty for the number of predictors in the model. When reporting R-squared values for models with >1 term you should report the adjusted R-squared value.
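
If you want to see the adjustment explicitly, the standard formula is adjusted R-squared = 1 - (1 - R-squared) * (n - 1) / (n - p - 1), where n is the number of observations and p is the number of predictors. Here is a minimal sketch for mod1 (p = 2 here: HandWidth plus the Gender dummy variable):

r2 <- summary(mod1)$r.squared
n <- nobs(mod1)                      # number of observations used in the fit
p <- 2                               # HandWidth plus the Gender dummy variable
1 - (1 - r2) * (n - 1) / (n - p - 1) # should match the Adjusted R-squared above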

You can test this by adding terms to the model. Let’s start with something silly - we’ll add a variable that is simply a vector of random numbers to the model. By definition this cannot have any meaningful explanatory power, but what will it do to the multiple R-squared value?

classData <- classData %>%
  mutate(randomVariable = rnorm(nrow(classData))) # random numbers, so your exact results will differ

mod2 <- lm(Height ~ HandWidth + Gender + randomVariable, data = classData)
summary(mod2)
## 
## Call:
## lm(formula = Height ~ HandWidth + Gender + randomVariable, data = classData)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -10.638  -3.904   1.224   3.602   9.239 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    158.0397     7.2848  21.695  < 2e-16 ***
## HandWidth        1.4032     0.8910   1.575 0.126136    
## GenderMale      10.1993     2.4103   4.232 0.000213 ***
## randomVariable  -0.2626     0.7974  -0.329 0.744267    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.13 on 29 degrees of freedom
## Multiple R-squared:  0.6398, Adjusted R-squared:  0.6025 
## F-statistic: 17.17 on 3 and 29 DF,  p-value: 1.332e-06

We can ask for the more precise multiple R-squared values like this:

summary(mod1)$r.squared
## [1] 0.638462
summary(mod2)$r.squared
## [1] 0.6398092

If you subtract one from the other, you will see that the Multiple R-squared has “improved” by 0.0013472. In contrast, the Adjusted R-squared has decreased slightly (-0.0119), from 0.6144 to 0.6025. This, hopefully, is enough evidence to make you favour reporting adjusted rather than multiple R-squared values.
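
You can make the same comparison directly in R by extracting both versions from the model summaries:

summary(mod2)$r.squared - summary(mod1)$r.squared         # multiple R-squared: small increase
summary(mod2)$adj.r.squared - summary(mod1)$adj.r.squared # adjusted R-squared: small decrease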

The adjusted R-squared value can be used as a one-number summary of model explanatory power.

19.2 Akaike Information Criterion (AIC)

The Akaike information criterion (AIC) is an estimator of prediction error in a statistical model, developed in the early 1970s by the Japanese statistician Hirotugu Akaike. In a nutshell, it is a measure of the relative quality of statistical models for a given set of data. This last part is important: AIC is only comparable among statistical models that use the same data (and which have the same response variable). In other words, given a collection of plausible models that use the same data set, AIC estimates the quality of each model relative to each of the other models. If you are interested in the mysterious details of what this quantity is you can read further on Wikipedia, or in a more advanced statistics book; otherwise you can simply trust that AIC estimates the relative quality of a model, with lower values being better.

You can get R to tell you the AIC value for a model using the function AIC() e.g. AIC(mod1).

Here’s a simple example of use in practice:

mod1 <- lm(Height ~ HandWidth + Gender, data = classData)         # additive model
mod2 <- lm(Height ~ HandWidth * Gender, data = classData)         # model with an interaction
mod3 <- lm(Height ~ HandWidth, data = classData)                  # single-predictor models
mod4 <- lm(Height ~ Gender, data = classData)
mod5 <- lm(Height ~ 1, data = classData)                          # intercept-only ("null") model
mod6 <- lm(Height ~ HandWidth + randomVariable, data = classData) # includes the random variable


(AICtable <- AIC(mod1, mod2, mod3, mod4, mod5, mod6) %>%
  arrange(AIC))
##      df      AIC
## mod1  4 205.4242
## mod4  3 206.1204
## mod2  5 207.2667
## mod3  3 219.2433
## mod6  4 221.1694
## mod5  2 234.9980

In the AIC results table, the models are now ordered from best (lowest AIC) to worst (highest AIC).
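
It is often easier to interpret the differences in AIC from the best model than the raw values themselves; a common rule of thumb is that models within about 2 AIC units of the best model have similar support. A minimal sketch, using the AICtable created above:

AICtable$deltaAIC <- AICtable$AIC - min(AICtable$AIC) # difference from the best (lowest-AIC) model
AICtable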

19.3 Variance partitioning

When you have a model with numerous terms (e.g. a multiple regression model or a 2-way ANOVA) it is often useful to express the results in terms of variance explained.

We can do this using variance partitioning.

Consider our earlier model, lm(Height ~ HandWidth + Gender). What proportion of the variance in height is explained by hand width? And what proportion by Gender? (and so on, for more complicated models…)

This is done by examining the anova summary (e.g. anova(mod1)), using the Sum Sq column.

anova(mod1)
## Analysis of Variance Table
## 
## Response: Height
##           Df Sum Sq Mean Sq F value    Pr(>F)    
## HandWidth  1 881.60  881.60  34.527 1.973e-06 ***
## Gender     1 471.13  471.13  18.451 0.0001685 ***
## Residuals 30 766.00   25.53                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

So here we already know that the model explains 63.85% of the variation in Height. We can partition this explained variation among the terms using the Sum Sq values. The proportion of the explained variation attributable to HandWidth is 881.6/(881.6 + 471.1) = 0.6517336.

Similarly, the proportion of the explained variation attributable to Gender is 471.1/(881.6 + 471.1) = 0.3482664.
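
These proportions can also be computed directly from the ANOVA table rather than typing the numbers in by hand. A minimal sketch, again assuming mod1 from above:

aovTab <- anova(mod1)
ss <- setNames(aovTab$"Sum Sq", rownames(aovTab)) # named sums of squares
explained <- ss[names(ss) != "Residuals"]         # drop the residual (unexplained) part
round(explained / sum(explained), 3)              # share of explained variation per term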

In more complicated multiple regressions one could use this approach to go further and group variables into types, i.e. lump together different kinds of explanatory variables. Imagine you had data on human cholesterol level, for example. You might have explanatory variables including various genotypes, morphology (height/weight), various dietary factors and so on. After partitioning variance among the many variables, it could then be useful to group these variables into a smaller number of “types”, such as “genetic”, “morphological” and “diet”. Thus, variance partitioning can help make sense of complex data and can improve how such results are communicated.

The logical process is the same for Generalised Linear Models (GLM), which we will cover soon, except we use an analogous quantity called Deviance rather than Sum of Squares.

19.4 Conclusion

In conclusion, you now have some tools to understand your models in more detail. R-squared gives a handy summary of how much variation is explained: a high R-squared value indicates a model with good explanatory power, and it can be used to compare models that use different data. AIC is another measure of model “quality”, but it can only compare models fitted to the same data set; low AIC values are better than high ones. Variance partitioning can be used as a handy way to sum up your model (in addition to significance and coefficient values).