5  Multiple Linear Regression

In Section 3.1 we saw how to model the relationship between two variables using simple linear regression (SLR). In ecosystems, however, the relationship between the response variable and the explanatory variables is often more complex and in many cases cannot be adequately captured by a single driver (i.e. influential or predictor variable). In such cases, multiple linear regression (MLR) can be used to model the relationship between the response variable and multiple explanatory variables.

5.1 Multiple Linear Regression

Multiple linear regression helps us answer questions such as:

  • How do various environmental factors influence the population size of a species? Factors like average temperature, precipitation levels, and habitat area can be used to predict the population size of a species in a given region. Which of these factors are most important in determining the population size?
  • What are the determinants of plant growth in different ecosystems? Variables such as soil nutrient content, water availability, and light exposure can help predict the growth rate of plants in various ecosystems. How do these factors interact to influence plant growth?
  • How do genetic and environmental factors affect the spread of a disease in a population? The incidence of a disease might depend on factors like genetic susceptibility, exposure to pathogens, and environmental conditions (e.g., humidity and temperature). What is the relative importance of these factors in determining the spread of the disease?

Multiple linear regression extends the simple linear regression model to include several independent variables. The model is expressed as: Y_i = \alpha + \beta_1 X_{i1} + \beta_2 X_{i2} + \ldots + \beta_k X_{ik} + \epsilon_i \tag{5.1} Where:

  • Y_i is the response variable for the i-th observation,
  • X_{i1}, X_{i2}, \ldots, X_{ik} are the k predictor variables for the i-th observation,
  • \alpha is the intercept,
  • \beta_1, \beta_2, \ldots, \beta_k are the coefficients for the k predictor variables, and
  • \epsilon_i is the error term for the i-th observation (the residuals).

When including a categorical variable in a multiple linear regression model, dummy (indicator) variables are used to represent the different levels of the categorical variable. Let’s assume we have a categorical variable C with three levels: C_1, C_2, and C_3. We can represent this categorical variable using two dummy variables:

  • D_1: Equals 1 if C = C_2, 0 otherwise.
  • D_2: Equals 1 if C = C_3, 0 otherwise.

C_1 is considered the reference category and does not get a dummy variable. This way, we avoid multicollinearity (see Section 5.6.4). R’s lm() function will automatically convert the categorical variables to dummy variables (sometimes called treatment coding). The first level of the alphabetically sorted categorical variable is taken as the reference level. See Section 8.5 for more information about how to include categorical variables in a multiple linear regression model. At the end of the chapter you’ll find alternative ways to assess categorical variables in a multiple linear regression model (Section 5.9).
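
To see what this treatment coding looks like in practice, here is a minimal sketch using a small, made-up factor (the variable C below is hypothetical and is not part of any dataset used in this chapter):

# A hypothetical three-level factor
C <- factor(c("C1", "C2", "C3", "C2", "C1"))

# model.matrix() shows the dummy variables lm() would construct;
# C1 (the first level) is absorbed into the intercept as the reference
model.matrix(~ C)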

Assume we also have k continuous predictors X_{1}, X_{2}, \ldots, X_{k}. The multiple linear regression model with these predictors and the categorical variable can be expressed as: Y_i = \alpha + \beta_1 X_{i1} + \beta_2 X_{i2} + \ldots + \beta_k X_{ik} + \gamma_1 D_{i1} + \gamma_2 D_{i2} + \epsilon_i \tag{5.2} Where:

  • Y_i is the dependent variable for observation i.
  • \alpha is the intercept term.
  • \beta_1, \beta_2, \ldots, \beta_k are the coefficients for the continuous independent variables X_{i1}, X_{i2}, \ldots, X_{ik}.
  • D_{i1} and D_{i2} are the dummy variables for the categorical predictor C.
  • \gamma_1 and \gamma_2 are the coefficients for the dummy variables, representing the effect of levels C_2 and C_3 relative to the reference level C_1.
  • \epsilon_i is the error term for observation i.

5.2 Nature of the Data

You are referred to the discussion in simple linear regression (Section 3.1). The only added consideration is that the data should be multivariate, i.e., it should contain more than one predictor variable. The predictor variables are generally continuous, but there may also be categorical variables.

5.3 Assumptions

Basically, this is as already discussed in simple linear regression (Section 3.1)—in multiple linear regression, the same assumptions apply to the response relative to each of the predictor variables. In Section 5.6.7 I will assess the assumptions in an example dataset. An additional consideration is that the predictors must not be highly correlated with each other (multicollinearity) (see Section 5.6.4).

5.4 Outliers

Again, this is as discussed in simple linear regression (Section 3.1). In multiple linear regression, the same considerations apply to the response relative to each of the predictor variables.

5.5 R Function

The lm() function in R is used to fit a multiple linear regression model. The syntax is similar to that of the lm() function used for simple linear regression, but with multiple predictor variables. The function takes the basic form:

lm(formula, data)

For a multiple linear regression with only continuous predictor variables (as in Equation 5.1), the formula is:

lm(response ~ predictor1 + predictor2 + ... + predictorN,
   data = dataset)

Interaction effects are specified with the formula operators * and :. The * operator expands to the main effects plus their interaction, whereas : gives the interaction term alone. For example, to include the main effects of predictor1 and predictor2 together with their interaction, we can use:

lm(response ~ predictor1 * predictor2, data = dataset)

When we have both continuous and categorical predictor variables (Equation 5.2), the formula is:

lm(response ~ continuous_predictor1 + continuous_predictor2 + ...
   + continuous_predictorN + factor(categorical_predictor1) +
     factor(categorical_predictor2) + ...
   + factor(categorical_predictorM),
   data = dataset)

5.6 Example 1: The Seaweed Dataset

Load some data produced in the analysis by Smit et al. (2017). Please refer to the chapter Deep Dive into Gradients on Tangled Bank for the data description.

This dataset is suitable for a multiple linear regression because it has continuous response variables (\beta_\text{sør}, \beta_\text{sim}, and \beta_\text{sne}, the Sørensen dissimilarity, the turnover component of \beta-diversity, and the nestedness-resultant component of \beta-diversity, respectively), continuous predictor variables (the mean climatological temperature for August, the mean climatological temperature for the year, the February temperature range, and the SDs of the February and August temperatures), and a categorical variable (the bioregional classification of the samples).

sw <- read.csv("data/spp_df2.csv")
rbind(head(sw, 3), tail(sw, 3))[,-1]
>        dist  bio    augMean   febRange      febSD     augSD    annMean
> 1     0.000  BMP 0.00000000 0.00000000 0.00000000 0.0000000 0.00000000
> 2    51.138  BMP 0.05741369 0.09884404 0.16295271 0.3132800 0.01501846
> 3   104.443  BMP 0.15043904 0.34887754 0.09934163 0.4188239 0.02602247
> 968 102.649 ECTZ 0.41496099 0.11330069 0.24304493 0.7538546 0.52278161
> 969  49.912 ECTZ 0.17194242 0.05756093 0.18196664 0.3604341 0.24445006
> 970   0.000 ECTZ 0.00000000 0.00000000 0.00000000 0.0000000 0.00000000
>               Y        Y1          Y2
> 1   0.000000000 0.0000000 0.000000000
> 2   0.003610108 0.0000000 0.003610108
> 3   0.003610108 0.0000000 0.003610108
> 968 0.198728140 0.1948882 0.003839961
> 969 0.069337442 0.0443038 0.025033645
> 970 0.000000000 0.0000000 0.000000000

We will do a multiple linear regression analysis to understand the relationship between some of the environmental variables and the seaweed species. Specifically, we will consider only the variables augMean, febRange, febSD, augSD, and annMean as predictors of the species composition as measured by \beta_\text{sør} (Y in the data file).

The model, which we will call full_mod1 below, can be stated formally as Equation 5.3: Y = \alpha + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + \beta_4 X_4 + \beta_5 X_5 + \epsilon \tag{5.3} Where:

  • Y is the response variable, the mean Sørensen dissimilarity,
  • the predictors X_1, X_2, X_3, X_4, and X_5 correspond to augMean, febRange, febSD, augSD, and annMean, respectively, and
  • \epsilon is the error term.

But before we jump into multiple linear regression, let’s warm up by first fitting some simple linear regressions.

Simple Linear Models

For interest’s sake, let’s fit simple linear models for each of the predictors against the response variable. Let’s look at relationships between the continuous predictors and the response in the East Coast Transition Zone (ECTZ), ignoring the other bioregions for now. We will first fit the simple linear models and then create scatter plots of the response variable \beta_\text{sør} against each of the predictor variables. To these plots, we will add best-fit (regression) lines.

sw_ectz <- sw |> filter(bio == "ECTZ")

predictors <- c("augMean", "febRange", "febSD", "augSD", "annMean")

# Fit models using purrr::map and store in a list
models <- map(predictors, ~ lm(as.formula(paste("Y ~", .x)),
                               data = sw_ectz))

names(models) <- predictors

model_summaries <- map(models, summary)
model_summaries
> $augMean
> 
> Call:
> lm(formula = as.formula(paste("Y ~", .x)), data = sw_ectz)
> 
> Residuals:
>       Min        1Q    Median        3Q       Max 
> -0.180961 -0.059317 -0.008346  0.045695  0.192444 
> 
> Coefficients:
>             Estimate Std. Error t value Pr(>|t|)    
> (Intercept) 0.060104   0.007359   8.168 1.01e-14 ***
> augMean     0.346011   0.010899  31.748  < 2e-16 ***
> ---
> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
> 
> Residual standard error: 0.07721 on 287 degrees of freedom
> Multiple R-squared:  0.7784,  Adjusted R-squared:  0.7776 
> F-statistic:  1008 on 1 and 287 DF,  p-value: < 2.2e-16
> 
> 
> $febRange
> 
> Call:
> lm(formula = as.formula(paste("Y ~", .x)), data = sw_ectz)
> 
> Residuals:
>      Min       1Q   Median       3Q      Max 
> -0.21744 -0.08311 -0.01543  0.07536  0.25699 
> 
> Coefficients:
>             Estimate Std. Error t value Pr(>|t|)    
> (Intercept) 0.092722   0.009638   9.621   <2e-16 ***
> febRange    0.181546   0.008897  20.405   <2e-16 ***
> ---
> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
> 
> Residual standard error: 0.1048 on 287 degrees of freedom
> Multiple R-squared:  0.592,   Adjusted R-squared:  0.5905 
> F-statistic: 416.4 on 1 and 287 DF,  p-value: < 2.2e-16
> 
> 
> $febSD
> 
> Call:
> lm(formula = as.formula(paste("Y ~", .x)), data = sw_ectz)
> 
> Residuals:
>      Min       1Q   Median       3Q      Max 
> -0.24267 -0.10709 -0.02587  0.08888  0.39171 
> 
> Coefficients:
>             Estimate Std. Error t value Pr(>|t|)    
> (Intercept)  0.12018    0.01168   10.29   <2e-16 ***
> febSD        0.17166    0.01245   13.79   <2e-16 ***
> ---
> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
> 
> Residual standard error: 0.1272 on 287 degrees of freedom
> Multiple R-squared:  0.3985,  Adjusted R-squared:  0.3964 
> F-statistic: 190.1 on 1 and 287 DF,  p-value: < 2.2e-16
> 
> 
> $augSD
> 
> Call:
> lm(formula = as.formula(paste("Y ~", .x)), data = sw_ectz)
> 
> Residuals:
>       Min        1Q    Median        3Q       Max 
> -0.307683 -0.111051 -0.003922  0.086322  0.308041 
> 
> Coefficients:
>             Estimate Std. Error t value Pr(>|t|)    
> (Intercept)  0.12781    0.01231   10.38   <2e-16 ***
> augSD        0.08793    0.00720   12.21   <2e-16 ***
> ---
> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
> 
> Residual standard error: 0.133 on 287 degrees of freedom
> Multiple R-squared:  0.3419,  Adjusted R-squared:  0.3396 
> F-statistic: 149.1 on 1 and 287 DF,  p-value: < 2.2e-16
> 
> 
> $annMean
> 
> Call:
> lm(formula = as.formula(paste("Y ~", .x)), data = sw_ectz)
> 
> Residuals:
>       Min        1Q    Median        3Q       Max 
> -0.144251 -0.051607 -0.005023  0.045095  0.145173 
> 
> Coefficients:
>             Estimate Std. Error t value Pr(>|t|)    
> (Intercept) 0.053883   0.006309   8.541 7.94e-16 ***
> annMean     0.332150   0.008667  38.325  < 2e-16 ***
> ---
> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
> 
> Residual standard error: 0.0663 on 287 degrees of freedom
> Multiple R-squared:  0.8365,  Adjusted R-squared:  0.836 
> F-statistic:  1469 on 1 and 287 DF,  p-value: < 2.2e-16

The individual models show that, for each predictor, the estimated slope coefficient and the test for the overall hypothesis are both significant (p < 0.05 in all cases; refer to the model output). Taken one at a time, all the predictor variables are therefore good predictors of the structure of seaweed species composition along the coast.

# Create individual plots for each predictor
plts1 <- map(predictors, function(predictor) {
  ggplot(sw_ectz, aes(x = .data[[predictor]], y = Y)) +
    geom_point(shape = 1, colour = "dodgerblue4") +
    geom_smooth(method = "lm", col = "magenta", fill = "pink") +
    labs(title = paste("Y vs", predictor),
         x = predictor,
         y = "Y") +
    theme_bw()
})

# Name the list elements for easy reference
names(plts1) <- predictors

ggpubr::ggarrange(plotlist = plts1, ncol = 2,
                  nrow = 3, labels = "AUTO")
Figure 5.1: Individual simple linear regressions fitted to the variables augMean, febRange, febSD, augSD, and annMean as predictors of the seaweed species composition as measured by the Sørensen dissimilarity, Y.

Figure 5.1 is a series of scatter plots showing the relationship between the response variable \beta_\text{sør} and each of the predictor variables. The magenta line represents the linear regression fitted to the data. We see that the relationship between the response variable and each of the predictors is positive and linear. Each of the models is significant, as indicated by the p-values in the model summaries. These simple models do not, however, tell us how some predictors might act together to influence the response variable.

To consider combined effects and interactions between predictor variables, we must explore multiple linear regression models that include all the predictors. Multiple regression will give us a more integrated understanding of how various environmental variables jointly influence species composition along the coast. In doing so, we can control for confounding variables, improve model fit, deal with multicollinearity, test for interaction effects, and enhance predictive power.

We will fit this multiple regression model next.

State the Hypotheses for a Multiple Linear Regression

As with all inferential statistics, we need to consider the hypotheses when performing multiple linear regression.

The null hypothesis (H_0) states that there is no significant relationship between the Sørensen diversity index and any of the climatological variables entered into the model, implying that the coefficients for all predictors are equal to zero. The alternative hypothesis (H_A), on the other hand, states that there is a significant relationship between the Sørensen diversity index and the climatological variables, positing that at least one of the coefficients is not equal to zero.

The hypotheses can be divided into two kinds: those dealing with the main effects and the one assessing the overall model stated in Equation 5.3.

Main effects hypotheses

The main effects hypotheses test, for each predictor, X_i, if the predictor has a significant effect on the response variable Y.

H_0: There is no linear relationship between the environmental variables (augMean, febRange, febSD, augSD, and annMean) and the community composition as measured by \beta_\text{sør} (in Y). Formally, for each predictor variable X_i:

  • H_0: \beta_i = 0 \text{ for } i = 1, 2, 3, 4, 5

Where \beta_i are the coefficients of the predictors in the multiple linear regression model.

H_A: There is a linear relationship between the environmental variables (augMean, febRange, febSD, augSD, and annMean) and the species composition as measured by \beta_\text{sør}:

  • H_A: \beta_i \neq 0 \text{ for } i = 1, 2, 3, 4, 5

Overall hypothesis

In addition to testing the individual predictors, X_i, we can also test a hypothesis about the overall significance of the model (F-test), which examines whether the model as a whole explains a significant amount of variance in the response variable Y. A significant F-test would suggest that at least one predictor (excluding the intercept) in the model is likely to be significantly related to the response, but it requires further investigation of individual predictors and potential multicollinearity to fully understand the relationships. For the overall model hypothesis:

Null Hypothesis (H_0):

  • H_0: \beta_1 = \beta_2 = \beta_3 = \beta_4 = \beta_5 = 0

Alternative Hypothesis (H_A):

  • H_A: \exists \, \beta_i \neq 0 \text{ for at least one } i

Fit the Model

We fit two models:

  • a full model that includes an intercept term and the five environmental variables, and
  • a null model that includes only an intercept term.

The reason the null model is included is to compare the full model with a model that has no predictors. This comparison will help us determine which of the predictors are useful in explaining the response variable—we will see this in action in the forward model selection process later on (Section 5.6.5).

# Select only the variables that will be used in model building
sw_sub1 <- sw_ectz[, c("Y", "augMean", "febRange",
                      "febSD", "augSD", "annMean")]

# Fit the full and null models
full_mod1 <- lm(Y ~ augMean + febRange + febSD +
                 augSD + annMean, data = sw_sub1)
null_mod1 <- lm(Y ~ 1, data = sw_sub1)

# Add fitted values from the full model to the dataframe
sw_ectz$.fitted <- fitted(full_mod1)

Dealing With Multicollinearity

Some of the predictor variables may be correlated with each other and this can lead to multicollinearity. When predictor variables are highly correlated, the model may not be able to distinguish the individual effects of each predictor. Consequently, the model becomes less precise and harder to interpret due to the coefficients’ inflated standard errors (Graham 2003). One can create a plot of pairwise correlations to visually inspect the correlation structure of the predictors. I’ll not do this here, but you can try it on your own.

A formal way to detect multicollinearity is to calculate the variance inflation factor (VIF) for each predictor variable. The VIF measures how much the variance of the estimated regression coefficients is increased due to multicollinearity. A VIF value greater than 5 or 10 indicates a problematic amount of multicollinearity.
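
To demystify the VIF a little, here is a small sketch (r2_augMean is just a temporary helper name; the sketch assumes the car package, which provides vif(), is available): the VIF of a predictor is 1/(1 - R^2), where R^2 comes from regressing that predictor on all the other predictors.

library(car)

# VIF of augMean computed by hand from the regression of augMean
# on the remaining predictors
r2_augMean <- summary(lm(augMean ~ febRange + febSD + augSD + annMean,
                         data = sw_sub1))$r.squared
1 / (1 - r2_augMean) # should match vif(full_mod1)["augMean"]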

threshold <- 10 # Define a threshold for VIF values

# Iteratively remove collinear variables
while (TRUE) {
  # Calculate VIF values
  vif_values <- vif(full_mod1)
  print(vif_values) # Print VIF values for debugging
  max_vif <- max(vif_values)
  
  # Check if the maximum VIF is above the threshold
  if (max_vif > threshold) {
    # Find the variable with the highest VIF
    high_vif_var <- names(which.max(vif_values))
    cat("Removing variable:",
        high_vif_var,
        "with VIF:",
        max_vif,
        "\n")
    
    # Update the formula to exclude the high VIF variable
    updated_formula <- as.formula(paste("Y ~ . -", high_vif_var))
    
    # Refit the model without the high VIF variable
    full_mod1 <- lm(updated_formula, data = sw_sub1)
    
    # Update the environment data frame to reflect the removal
    sw_sub1 <- sw_sub1[, !(names(sw_sub1) %in% high_vif_var)]
  } else {
    break
  }
}
>   augMean  febRange     febSD     augSD   annMean 
> 27.947767 10.806635  8.765732  2.497739 31.061900 
> Removing variable: annMean with VIF: 31.0619 
>   augMean  febRange     febSD     augSD 
>  2.290171 10.648752  8.637679  1.616390 
> Removing variable: febRange with VIF: 10.64875 
>  augMean    febSD    augSD 
> 1.423601 1.674397 1.585055

Regularisation techniques such as ridge regression, lasso regression, or elastic net can also be used to deal with multicollinearity. These advanced techniques add a penalty term to the regression model that shrinks the coefficients towards zero, which can help to reduce the impact of multicollinearity. However, these techniques are not covered in this guide. Please refer to Chapter 8 for more information on regularisation techniques.

Perform Forward Selection

It might be that not all of the variables included in the full model are necessary to explain the response variable. We can use stepwise regression to select the subset of predictors that best explains the response variable. To do this, we will use the stepAIC() function from the MASS package.

stepAIC() works by starting with the null model and then adding predictors one by one, selecting the one that improves the model the most as seen in the reduction of the AIC values along the way. This process continues until no more predictors can be added to improve the model (i.e. to further reduce the AIC). Progress is tracked as the function runs.
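
If you want to verify the AIC values reported in the selection trace below, a quick sketch is to call extractAIC() (which stepAIC() uses to compute the AIC values it reports) on candidate models yourself:

# AIC of the null model and of the model after the first forward step;
# extractAIC() returns the equivalent degrees of freedom and the AIC
extractAIC(lm(Y ~ 1, data = sw_sub1))       # the null model
extractAIC(lm(Y ~ augMean, data = sw_sub1)) # after adding augMean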

# Perform forward selection
mod1 <- stepAIC(null_mod1,
                scope = list(lower = null_mod1, upper = full_mod1),
                direction = "forward")
> Start:  AIC=-1044.97
> Y ~ 1
> 
>           Df Sum of Sq    RSS     AIC
> + augMean  1    6.0084 1.7108 -1478.4
> + febSD    1    3.0759 4.6433 -1189.9
> + augSD    1    2.6394 5.0797 -1163.9
> <none>                 7.7192 -1045.0
> 
> Step:  AIC=-1478.41
> Y ~ augMean
> 
>         Df Sum of Sq    RSS     AIC
> + febSD  1   0.36036 1.3504 -1544.8
> + augSD  1   0.31243 1.3984 -1534.7
> <none>               1.7108 -1478.4
> 
> Step:  AIC=-1544.77
> Y ~ augMean + febSD
> 
>         Df Sum of Sq    RSS     AIC
> + augSD  1   0.10568 1.2448 -1566.3
> <none>               1.3504 -1544.8
> 
> Step:  AIC=-1566.32
> Y ~ augMean + febSD + augSD

The model selection process shows that as we add more variables to the model, the AIC value decreases. We can infer from this that the multiple regression model provides a better fit than simple linear models that use the variables in isolation.

We also see that stepAIC() has not excluded any of the variables offered to it in the full model. One likely reason is that the VIF procedure had already removed the redundant predictors when dealing with multicollinearity. This means that all the variables retained in mod1 are important in explaining the response variable.

Added-Variable Plots (Partial Regression Plots)

Before looking at the output in more detail, I’ll introduce partial regression plots as a means to examine the relationship between the response variable and each predictor variable. Although they can be calculated by hand, the car package provides a convenient function, avPlots(), to create these plots.

Added variable plots are also sometimes called ‘partial regression plots’ or ‘individual coefficient plots.’ They are used to display the relationship between a response variable and an individual predictor variable while accounting for the effect of other predictor variables in a multiple regression model (the marginal effect).
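
As an aside, an added-variable plot can be constructed by hand: regress the response on all the other predictors, regress the predictor of interest on those same predictors, and plot the two sets of residuals against each other. A minimal sketch for augMean (res_y and res_x are just temporary names) is:

# Residuals of Y and of augMean after removing the other predictors
res_y <- resid(lm(Y ~ febSD + augSD, data = sw_sub1))
res_x <- resid(lm(augMean ~ febSD + augSD, data = sw_sub1))

plot(res_x, res_y)         # the added-variable plot for augMean
coef(lm(res_y ~ res_x))[2] # equals coef(mod1)["augMean"]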

# Create partial regression plots
avPlots(mod1, col = "dodgerblue4", col.lines = "magenta")
Figure 5.2: Partial regression plots for mod1 with the selected variables augMean, febSD, and augSD.

What insights can we draw from the added-variable plots? Although there are better ways to assess the model fit, we can already make some observations about the linearity of the model or the presence of outliers. The slope of the line in an added variable plot corresponds to the regression coefficient for that predictor in the full multiple regression model. Seen in this way, it visually indicates the magnitude and direction of each predictor’s effect. In Figure 5.2, the added-variable plot for augMean shows a tighter clustering of points around the regression line and a strong linear relationship (steep slope) with the response variable; the plots for febSD and augSD, on the other hand, show a weaker response and more scatter about the regression line. Importantly, this suggests that augMean has a stronger and more unique contribution to the multiple-variable model than the other two variables.

There are also insights to be made about possible multicollinearity using added-variable plots. These plots are not a definitive test for multicollinearity, but they can provide some clues. Notably, if a predictor shows a strong relationship with the response variable in a simple correlation but appears to have little relationship in the added-variable plot, it might indicate collinearity with other predictors. This discrepancy suggests that the predictor’s effect on the response is being masked by the presence of other correlated predictors.

Model Diagnostics

We are back in the territory of parametric statistics, so we need to check the assumptions of the multiple linear regression model (similar to those of simple linear regression). We can do this by making various diagnostic plots. All of them consider various aspects of the residuals, which are simply the differences between the observed and predicted values.

Diagnostic plots of final model

You have been introduced to diagnostic plots in the context of simple linear regression (Section 3.1). They are also useful in multiple linear regression. Although plot.lm() can easily do this, here I use autoplot() from the ggfortify package. When applied to the final model, mod1, the plot will in its default setting show four diagnostic plots: residuals vs. fitted values, normal Q-Q plot, scale-location plot, and residuals vs. leverage plot. Note, this is for the full model inclusive of the combined contributions of all the predictors, so we will not see separate plots for each predictor as we have seen in the added-variable plots or component plus residual plots.

# Generate diagnostic plots 
autoplot(mod1, shape = 21, colour = "dodgerblue4",
         smooth.colour = "magenta") +
  theme_bw()
Figure 5.3: Diagnostic plots to assess the fit of the final multiple linear regression model, mod1.

Residuals vs. Fitted Values: In this plot we can assess linearity and homoscedasticity of the residuals. If the seaweed gods were with us, we’d expect the points to be randomly scattered about a horizontal line situated at zero. This would indicate that the relationship between the predictors selected by the forward selection process (augMean, febSD, and augSD) and the response variable (Y) is linear, and the variance of the residuals is constant across the range of fitted values. In this plot, there’s a very slight curvature which might suggest a potential issue with the linearity assumption—it is minute and I’d suggest not worrying about it. The variance of the residuals seems to decrease slightly at higher fitted values, indicating a mild case of heteroscedasticity.

Q-Q Plot (Quantile-Quantile Plot): This plot is used to check the normality of the residuals. The points should fall approximately along a straight diagonal line if the residuals are normally distributed. Here we see that the points generally follow the line, although some deviations may be seen at the tails. These deviations are not that extreme and, again, I don’t think this is a big concern.

Scale-Location Plot: This plot should reveal potential issues with homoscedasticity. The square root of the standardised residuals is used here to make it easier to spot patterns, so we would like the points to be randomly scattered around the horizontal red line. Here, the line slopes slightly downward and this indicates that the variance of the residuals might decrease as the fitted values increase. We can also see evidence of this in a plot of the observed values vs. the predictors in Figure 5.3.

Residuals vs. Leverage: This diagnostic highlights influential points (outliers). Points with high leverage (far from the mean of the predictors) can be expected to exert a strong influence on the regression line, tilting it in some direction. Cook’s distance (indicated by the yellow line) helps identify such outliers. In our seaweed data a few points could have a high leverage, but since they don’t seem to cross the Cook’s distance thresholds, I doubt they are overly worrisome.

Considering that no glaring red flags were raised by the diagnostic plots, I doubt that they are severe enough to invalidate the model. However, if you cannot stand these small issues, you could i) consider transforming the predictor or response variables to address your concerns about heteroscedasticity, ii) investigate the outliers (high leverage points) to confirm if they are valid data points or errors, or iii) try robust regression methods that are less sensitive to outliers and heteroscedasticity.
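
For illustration only, here is a minimal sketch of option (iii), using rlm() from the MASS package; mod1_rob is just a throwaway name and this is not a recommendation to replace mod1:

# A robust re-fit of the selected model; rlm() down-weights
# observations with large residuals
mod1_rob <- MASS::rlm(Y ~ augMean + febSD + augSD, data = sw_sub1)
summary(mod1_rob)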

Component plus residual plots

Component plus residual plots offer another way to assess the fit of the model in multiple regression models. Unlike simple linear regression where we only had one predictor variable, here we have several. So, we need to assure ourselves that there is a linear relationship between each predictor variable and the response variable (we could already see this in the added-variable plots in Section 5.6.6). We can make component plus residual plots using the crPlots() function in the car package. It displays the relationship between the response variable and each predictor variable. If the relationship is linear, the points should be randomly scattered about a best fit line and the spline (in pink in Figure 5.4) should plot nearly on top of the linear regression line.

# Generate component plus residual plots
crPlots(mod1, col = "dodgerblue4", col.lines = "magenta")
Figure 5.4: Component plus residual diagnostic plots to assess the fit of the final multiple linear regression model, mod1.

Understanding the Model Fit

The above model selection process has led us to the mod1 model, which can be stated formally as: Y = \alpha + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + \epsilon \tag{5.4} Where:

  • Y: The response variable, the mean Sørensen dissimilarity.
  • X_1, X_2, and X_3: The predictors corresponding to augMean, febSD, and augSD, respectively.
  • \epsilon: The error term.

We have convinced ourselves that the model is a good fit for the data, and we can proceed to examine the model’s output. The fitted model can be explored in two ways: by applying the summary() function or by using the anova() function. The summary() function provides a detailed output of the model, while the anova() function provides a sequential analysis of variance table that can also be used to compare models.

The model summary

# Summary of the selected model
summary(mod1)
> 
> Call:
> lm(formula = Y ~ augMean + febSD + augSD, data = sw_sub1)
> 
> Residuals:
>       Min        1Q    Median        3Q       Max 
> -0.153994 -0.049229 -0.006086  0.045947  0.148579 
> 
> Coefficients:
>             Estimate Std. Error t value Pr(>|t|)    
> (Intercept) 0.028365   0.007020   4.040 6.87e-05 ***
> augMean     0.283335   0.011131  25.455  < 2e-16 ***
> febSD       0.049639   0.008370   5.930 8.73e-09 ***
> augSD       0.022150   0.004503   4.919 1.47e-06 ***
> ---
> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
> 
> Residual standard error: 0.06609 on 285 degrees of freedom
> Multiple R-squared:  0.8387,  Adjusted R-squared:  0.837 
> F-statistic: 494.1 on 3 and 285 DF,  p-value: < 2.2e-16

The key part of the summary() function’s output is the Coefficients section. This is where the main effects hypotheses are tested (this model does not have interactions—if there were, they’d appear here, too). The important components of the coefficients part of the model summary are:

  • (Intercept): This row provides information about where the regression line intersects the y-axis.
  • Main Effects:
    • augMean, febSD, and augSD: These rows give the model coefficients associated with the slopes of the regression lines fit to those predictor variables. They indicate the rate of change in the response variable for a one-unit change in the predictor variable.
    • Estimate, Std. Error, t value, and Pr(>|t|): These columns contain the statistics used to interpret the hypotheses about the main effects. In the Estimate column are the coefficients for the y-intercept and the main effects’ slopes, and Std. Error indicates the variability of the estimate. The t value is obtained by dividing the coefficient by its standard error. The p-value tests the null hypothesis that the coefficient is equal to zero and significance codes are provided as a quick visual reference (their use is sometimes frowned upon by statistics purists). Using this information, we can quickly see that, for example, augMean has a coefficient of 0.2833 \pm 0.0111 and the slope of the line is highly significant, i.e. there is a significant effect on Y due to the temperature gradient set up by augMean.

The intercept and slope coefficients

The interpretation of the coefficients is a bit more complicated in multiple linear regression compared to what we are accustomed to in simple linear regression. Let us look in greater detail at the intercept and the slope coefficients:

Intercept (\alpha): The intercept is the expected value of the response variable, Y, when all predictor variables are zero. It is not always meaningful, but it can be useful in some cases.

Slope Coefficients (\beta_1, \beta_2, \ldots, \beta_k): Each slope coefficient, \beta_j, represents the expected change in the response variable, Y, for a one-unit increase in the predictor variable, X_j, holding all other predictor variables constant. This partial effect interpretation implies that \beta_j accounts for the direct contribution of X_j to Y while removing the confounding effects of other predictors in the model. Figure 5.2 provides a visual representation of this concept and isolates the effect of each predictor variable on the response variable.

Therefore, in the context of our model (Equation 5.4) for this analysis, the partial interpretation is as follows:

  • \beta_1: Represents the change in Y for a one-unit increase in X_1, holding X_2 and X_3 constant.
  • \beta_2: Represents the change in Y for a one-unit increase in X_2, holding X_1 and X_3 constant.
  • \beta_3: Represents the change in Y for a one-unit increase in X_3, holding X_1 and X_2 constant.
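
To make the ‘holding all other predictor variables constant’ idea concrete, here is a small sketch using predict() on two hypothetical observations that differ only in augMean (new_dat is a made-up data frame and the values are arbitrary):

# Two rows identical except for a one-unit difference in augMean
new_dat <- data.frame(augMean = c(1, 2),
                      febSD = mean(sw_sub1$febSD),
                      augSD = mean(sw_sub1$augSD))

# The difference between the two predictions equals the augMean coefficient
diff(predict(mod1, newdata = new_dat))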

There are also several overall model fit statistics—it is here where you’ll find the information you need to assess the hypothesis about the overall significance of the model. Residual standard error indicates the average distance between observed and fitted values. Multiple R-squared and Adjusted R-squared values tell us something about the model’s goodness of fit. The latter adjusts for the number of predictors in the model, and is the one you must use and report in multiple linear regressions. As you also know, higher values approaching 1 are better, with 1 suggesting that the model perfectly captures all of the variability in the data. The F-statistic and its associated p-value test the overall significance of the model, examining whether all regression coefficients (excluding the intercept) are simultaneously equal to zero. You can also use the brief overview of the residuals, but I don’t find this particularly helpful—best examine the residuals in a histogram.
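
Following the suggestion above, a quick histogram of the residuals of mod1 can be drawn with:

# Histogram of the residuals of the selected model
hist(resid(mod1), breaks = 20,
     main = "", xlab = "Residuals of mod1")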

The ANOVA tables

anova(mod1)
> Analysis of Variance Table
> 
> Response: Y
>            Df Sum Sq Mean Sq  F value    Pr(>F)    
> augMean     1 6.0084  6.0084 1375.660 < 2.2e-16 ***
> febSD       1 0.3604  0.3604   82.507 < 2.2e-16 ***
> augSD       1 0.1057  0.1057   24.196 1.473e-06 ***
> Residuals 285 1.2448  0.0044                       
> ---
> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

This function provides a sequential analysis of variance (Type I ANOVA) table for the regression model (see more about Type I ANOVA, below). As such, this function can also be used to compare nested models. Used on a single model, it gives a more interpretable breakdown of the variability in the response variable Y and assesses the contribution of each predictor variable in explaining this variability.

The ANOVA table firstly shows the degrees of freedom (Df) for each predictor variable added sequentially to the model, as well as the residuals. For each predictor, the degrees of freedom is typically 1. For the residuals, however, it represents the total number of observations minus the number of estimated parameters. The Sum of Squares (Sum Sq) indicates the variability in Y attributable to each predictor, and the mean sum of squares (Mean Sq) is the sum of squares divided by the degrees of freedom.

The F value is calculated as the ratio of the predictor’s mean square to the residual mean square. It is used to test the null hypothesis that the predictor has no effect on Y. Whether or not we accept the alternative hypothesis (reject the null) is given by the p-value (Pr(>F)) that goes with each F-statistic. You know how that works.

Because this is a sequential ANOVA, the amount of variance in Y explained by each predictor (or group of predictors) is calculated by adding the predictors to the model in sequence (as specified in the model formula). For example, the Sum of Squares for augMean (6.0084) represents the amount of variance explained by adding augMean to a model that doesn’t include any predictors yet. The Sum of Squares for febSD (0.3604) represents the amount of variance explained by adding febSD to a model that already includes augMean—this improvement indicates that febSD explains some of the variance in Y that augMean doesn’t.
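
If you want to see where these sequential sums of squares come from, a short sketch is to compute the drop in the residual sum of squares as each predictor is added in turn (rss below is a small helper function defined only for this sketch):

# Drop in RSS when augMean is added to the intercept-only model (~6.0084),
# and when febSD is added to the model already containing augMean (~0.3604)
rss <- function(m) sum(resid(m)^2)
rss(lm(Y ~ 1, data = sw_sub1)) -
  rss(lm(Y ~ augMean, data = sw_sub1))
rss(lm(Y ~ augMean, data = sw_sub1)) -
  rss(lm(Y ~ augMean + febSD, data = sw_sub1))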

Order in which predictors are assessed in multiple linear regression

The interpretation of sequential ANOVA (Type I) is inherently dependent on the order in which predictors are entered. In mod1 the order is first augMean, then febSD, and last comes augSD. This order might not be the most meaningful for interpreting the sequential sums of squares and their significance in the ANOVA table. How, then, does one decide on the order of predictors in the model?

  • If you have a strong theoretical or causal basis for thinking that certain predictors influence others, you can enter them in that order.
  • If you have a hierarchy of predictors based on their importance or general vs. specific nature, you can enter them hierarchically.
  • You can manually fit models with different predictor orders and compare the ANOVA tables to see how the results change. This can be time-consuming but might offer insights into the sensitivity of your conclusions to the order of entry.
  • You can use automated model selection procedures, such as stepwise regression, to determine the best order of predictors. This is a more objective approach but can be criticised for being data-driven and not theory-driven.
  • Use Type II or Type III ANOVAs, which are not order-dependent and can be used to assess the significance of predictors after accounting for all other predictors in the model. However, they have their own limitations and assumptions that need to be considered.

My advice would be to have sound theoretical reasons for the order of predictors in the model.

Both ways of looking at the model fit of mod1 (summary() and anova()) show that forward selection retained the variables augMean, febSD, and augSD. These three predictors should be used together to explain the response, Y.

Let’s make a plot of the full model with all the initial predictors and the selected model with the predictors chosen by the forward selection process.

# Add fitted values from the selected model to the dataframe
sw_ectz$.fitted_selected <- fitted(mod1)

# Create the plot of observed vs fitted values for the selected model
ggplot(sw_ectz, aes(x = .fitted_selected, y = Y)) +
  geom_point(shape = 1, colour = "black", alpha = 1.0) +
  geom_point(aes(x = .fitted), colour = "red",
             shape = 1, alpha = 0.4) +
  geom_abline(intercept = 0, slope = 1,
              color = "blue", linetype = "dashed") +
  labs(x = "Fitted Values",
       y = "Observed Values") +
  theme_bw()
Figure 5.5: Plot of observed vs. predicted values obtained from the final multiple linear regression model (mod1) with the selected variables augMean, febSD, and augSD as predictors (black points), and from the initial full model, which also included annMean and febRange (red points).

Reporting

A Results section should be written in a format suitable for inclusion in your report or publication. Present the results in a clear and concise manner, with tables and figures used to help substantiate your findings. The results should be interpreted in the context of the research question and the study design. The limitations of the analysis should also be discussed, along with any potential sources of bias or confounding. Here is an example.

Results

The model demonstrates a strong overall fit, as indicated by the high R^2 value of 0.839 and an adjusted R^2 of 0.837, suggesting that approximately 83.7% of the variance in the mean Sørensen dissimilarity is explained by the predictors augMean, febSD, and augSD. All predictors in the model are statistically significant, with augMean showing the strongest effect (\beta_1 = 0.283, p < 0.0001) (Figure 5.2). The predictors febSD and augSD also have significant positive relationships with the response variable (\beta_2 = 0.050, p < 0.0001; \beta_3 = 0.022, p < 0.0001). A sequential ANOVA further confirms the significance of each predictor variable in the model, with all F-values indicating that the inclusion of each predictor significantly improves the model fit (p < 0.0001 in all cases). Our model therefore provides clear support for the mean temperature in August, the standard deviation of temperatures in February, and the standard deviation of temperatures in August as strong predictors of the mean Sørensen dissimilarity, with each contributing uniquely to the explanation of variability in the response variable.

5.7 Example 2: Interaction of Distance and Bioregion

Our seaweed dataset includes two additional variables that we have not yet considered. These are the continuous variable dist which represents the geographic distance between the seaweed samples taken along the coast of South Africa, and the categorical variable bio which is the bioregional classification of the seaweed samples.

These two new variables lend themselves to a few interesting questions. For example:

  1. Is the geographic distance between samples related to the Sørensen dissimilarity of the seaweed flora?
  2. Does the average Sørensen dissimilarity vary among the bioregions to which the samples belong?
  3. Is the effect of geographic distance on the Sørensen dissimilarity different for each bioregion?

The most complex model is (3), the one that answers the question about whether the effect of dist on the response variable Y is different for each bioregion. Questions (1) and (2) are subsets of this more inclusive question. To fully answer these questions, let’s first consider the full model, which includes an interaction term between the continuous predictor dist and the categorical predictor bio. When we finally test our model, we will also have to consider the simpler models that do not include the interaction term.

‘Interaction’ means that the effect of one predictor on the response variable is contingent on the value of another predictor. For example, we might have reason to suspect that the relationship of the Sørensen dissimilarity with the geographic distance between samples is different on the west coast compared to, say, the east coast. This is indeed a plausible expectation, but we will test it formally below.

The full multiple linear regression model with the interaction terms can be formally expressed as:

\begin{align} Y &= \alpha + \beta_1 \text{dist} + \beta_2 \text{bio}_{\text{B-ATZ}} + \beta_3 \text{bio}_{\text{BMP}} \nonumber \\ &\quad + \beta_4 \text{bio}_{\text{ECTZ}} + \beta_5 (\text{dist} \times \text{bio}_{\text{B-ATZ}}) \nonumber \\ &\quad + \beta_6 (\text{dist} \times \text{bio}_{\text{BMP}}) + \beta_7 (\text{dist} \times \text{bio}_{\text{ECTZ}}) + \epsilon \label{mod2} \end{align}

Where:

  • Y: The response variable, the mean Sørensen dissimilarity.
  • \alpha: The intercept term.
  • \text{dist}: The continuous predictor variable representing distance.
  • \text{bio}: The categorical predictor variable representing bioregional classification with four levels: AMP (reference category), B-ATZ, BMP, and ECTZ.
  • \text{bio}_\text{B-ATZ}, \text{bio}_\text{BMP}, \text{bio}_\text{ECTZ}: Dummy variables for the bioregional classification, where:
    • \text{bio}_\text{B-ATZ} = 1 if bio = B-ATZ, and 0 otherwise,
    • \text{bio}_\text{BMP} = 1 if bio = BMP, and 0 otherwise, and
    • \text{bio}_\text{ECTZ} = 1 if bio = ECTZ, and 0 otherwise.
  • \text{dist} \times \text{bio}_\text{B-ATZ}, \text{dist} \times \text{bio}_\text{BMP}, \text{dist} \times \text{bio}_\text{ECTZ}: Interaction terms between distance and the bioregional classification dummy variables.
  • \beta_1, \beta_2, \beta_3, \beta_4, \beta_5, \beta_6, \beta_7: The coefficients to be estimated for the main effects and interactions.
  • \epsilon: The error term.

If this seems tricky, it is because of the dummy variable coding used to represent interactions in multiple linear regression. The bio variable is a categorical variable with four levels, so we need to create three dummy variables to represent the bioregional classification. The dist variable is then interacted with each of these dummy variables to create the interaction terms. The lm() function in R takes care of this for us in a far less complicated model statement. I’ll explain the details around the interpretation of dummy variable coding when we look at the output of the model with the summary() function.
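
One way to see exactly which dummy and interaction columns lm() will construct is to inspect the design (model) matrix for the formula we are about to fit:

# The first few rows of the design matrix for dist * bio;
# AMP is absorbed into the intercept as the reference level
head(model.matrix(~ dist * bio, data = sw), 3)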

State the Hypotheses for a Multiple Linear Regression with Interaction Terms

The equation above expands into the following series of hypotheses that concern the main effects, the interactions between the main effects, and the overall hypothesis:

Main effects hypotheses

In the main effects hypotheses we are concerned with the effect of each predictor variable on the response variable. For the main effect of distance we have the null:

  • H_0: \beta_1 = 0

vs. the alternative:

  • H_A: \beta_1 \neq 0

For the main effect of bioregional classification, the nulls are:

  • H_0: \beta_2 = 0 \quad (\text{bio}_{\text{B-ATZ}})
  • H_0: \beta_3 = 0 \quad (\text{bio}_{\text{BMP}})
  • H_0: \beta_4 = 0 \quad (\text{bio}_{\text{ECTZ}})

vs. the alternatives:

  • H_A: \beta_2 \neq 0 \quad (\text{bio}_{\text{B-ATZ}})
  • H_A: \beta_3 \neq 0 \quad (\text{bio}_{\text{BMP}})
  • H_A: \beta_4 \neq 0 \quad (\text{bio}_{\text{ECTZ}})

Hypotheses about interactions

This is where the hypothesis tests whether the effect of distance on the response variable is different for each bioregional classification. The null hypotheses are:

  • H_0: \beta_5 = 0 \quad (\text{dist} \times \text{bio}_{\text{B-ATZ}})
  • H_0: \beta_6 = 0 \quad (\text{dist} \times \text{bio}_{\text{BMP}})
  • H_0: \beta_7 = 0 \quad (\text{dist} \times \text{bio}_{\text{ECTZ}})

vs. the alternatives:

  • H_A: \beta_5 \neq 0 \quad (\text{dist} \times \text{bio}_{\text{B-ATZ}})
  • H_A: \beta_6 \neq 0 \quad (\text{dist} \times \text{bio}_{\text{BMP}})
  • H_A: \beta_7 \neq 0 \quad (\text{dist} \times \text{bio}_{\text{ECTZ}})

Overall hypothesis

The overall hypothesis states that all coefficients associated with the predictors (distance, bioregional categories, and their interactions) are equal to zero, therefore indicating no relationship between these predictors and the response variable, the Sørensen index. The null hypothesis is:

  • H_0: \beta_1 = \beta_2 = \beta_3 = \beta_4 = \beta_5 = \beta_6 = \beta_7 = 0

vs. the alternative:

  • H_A: \exists \, \beta_i \neq 0 \text{ for at least one } i

Figure 5.6: Plot of main effects of A) distance along the coast and B) bioregional classification on the Sørensen dissimilarity index.

Visualise the Main Effects

To facilitate the interpretation of the main effects hypotheses and make an argument for why an interaction term might be necessary, I’ve visualised the main effects (Figure 5.6). I see this as part of my exploratory data analysis ensemble of tests. We see that fitting a straight line to the Y vs. distance relationship seems unsatisfactory as there is too much scatter around that single line to adequately capture all the structure in the variability of the points. Colouring the points by bioregion reveals the hidden structure. The model could benefit from including an additional level of complexity: see how points in the same bioregion show less scatter compared to points in different bioregions.

Now look at the boxplots of the Sørensen dissimilarity index for each bioregional classification. They show that the median values of the Sørensen dissimilarity index differ among bioregions. Taken together, panels A and B of Figure 5.6 provide a good indication that the bioregional classification might be an important additional predictor of the Sørensen dissimilarity index as a function of distance between pairs of sites along the coast.
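
The plotting code for Figure 5.6 is not shown above; a minimal sketch of how such panels could be assembled (p_a and p_b are throwaway names, and the aesthetics of the published figure may differ) is:

# A: Y vs. distance, coloured by bioregion; B: Y by bioregion
p_a <- ggplot(sw, aes(x = dist, y = Y, colour = bio)) +
  geom_point(shape = 1) +
  theme_bw()
p_b <- ggplot(sw, aes(x = bio, y = Y)) +
  geom_boxplot() +
  theme_bw()
ggpubr::ggarrange(p_a, p_b, ncol = 2, labels = "AUTO")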

Next, we will move ahead and fit the model inclusive of the distance along the coast and bioregion, as per the equation above.

Fit and Assess Nested Models

I have a suspicion that the full model (mod2; see below) with the interaction terms will be a better fit than a reduced model with only the effect due to distance. How can we have greater certainty that we should indeed favour the slightly more complex model (with both predictors and their interaction) over the simpler one (distance only)?

One way to do this is to use a nested model comparison. We will fit a reduced model (one slope for all bioregions) and compare this model to the full model (slopes are allowed to vary among bioregions).

# Fit the linear regression model with only distance
mod2a <- lm(Y ~ dist, data = sw)

# Fit the multiple linear regression model with interaction terms
mod2 <- lm(Y ~ dist * bio, data = sw)

This is a nested model where mod2a is nested within mod2. ‘Nested’ means that the reduced model is a subset of the full model. Nested models can be used to test hypotheses about the significance of the predictors in the full model—does adding more predictors to the model improve the fit? Comparing a nested model with a full model can be done with a sequential ANOVA, which is what the anova() function also does (in addition to its use in Section 5.6.8).

So, comparing mod2a to mod2 with an F-test tests the significance of adding bio and using it together with dist. The interaction is built into mod2, but we are not yet testing the significance of the interaction terms. We will do that later.

anova(mod2a, mod2, test = "F")
> Analysis of Variance Table
> 
> Model 1: Y ~ dist
> Model 2: Y ~ dist * bio
>   Res.Df    RSS Df Sum of Sq      F    Pr(>F)    
> 1    968 7.7388                                  
> 2    962 2.2507  6    5.4881 390.95 < 2.2e-16 ***
> ---
> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The sequential ANOVA shows that there is significant merit in the richer model that includes bio and its interaction with dist. This model allows a separate slope for the Sørensen index as a function of distance in each bioregion. The residual sum of squares (RSS) decreases from 7.7388 in Model 1 to 2.2507 in Model 2, which indicates that Model 2 explains a significantly larger proportion of the variance in the response variable. The F-test for comparing the two models yields an F-value of 390.95 with a highly significant p-value (< 0.0001). The improvement in model fit due to the inclusion of bio and the interaction term is therefore statistically significant.

The above analyses skirted around the questions stated in the beginning of Section 5.7. I’ve provided statistical evidence that the full model is a better fit than the reduced model (the sequential F-test tested this), so we should use both dist and bio in the model. I have not looked explicitly at the main effects of the predictors. However, we can easily address questions (1) and (2):

  • Question 1: looking at the summary of mod2a tells us that the main effect of dist is a significant (p < 0.0001) predictor of the Sørensen dissimilarity index.
  • Question 2: the main effect of bio is also significant (p < 0.0001), which is what we’d see if we fit the model mod2b <- lm(Y ~ bio, data = sw).

Question 3 warrants deeper investigation. Next, we will look at the interaction terms in the full model mod2 to see if the effect of dist on Y is different for each level of bio.

Interpret the Full Model

The model summary

# Summary of the model
summary(mod2)
> 
> Call:
> lm(formula = Y ~ dist * bio, data = sw)
> 
> Residuals:
>       Min        1Q    Median        3Q       Max 
> -0.112117 -0.030176 -0.004195  0.023698  0.233520 
> 
> Coefficients:
>                 Estimate Std. Error t value Pr(>|t|)    
> (Intercept)    5.341e-03  4.177e-03   1.279   0.2013    
> dist           3.530e-04  1.140e-05  30.958  < 2e-16 ***
> bioB-ATZ      -6.140e-03  1.659e-02  -0.370   0.7114    
> bioBMP         3.820e-02  6.659e-03   5.737 1.29e-08 ***
> bioECTZ        1.629e-02  6.447e-03   2.527   0.0117 *  
> dist:bioB-ATZ  7.976e-04  1.875e-04   4.255 2.30e-05 ***
> dist:bioBMP   -1.285e-04  2.065e-05  -6.222 7.31e-10 ***
> dist:bioECTZ   4.213e-04  1.801e-05  23.392  < 2e-16 ***
> ---
> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
> 
> Residual standard error: 0.04837 on 962 degrees of freedom
> Multiple R-squared:  0.8607,  Adjusted R-squared:  0.8597 
> F-statistic: 849.2 on 7 and 962 DF,  p-value: < 2.2e-16

In the output returned by summary(mod2), we need to pay special attention to the use of dummy variable encoding for the categorical predictor. The Coefficients section is similar to that of mod1 (see Section 5.6.8), but now it includes the categorical predictor bio* and the interaction terms dist:bio* (* indicating the levels of the categorical variable). The bio variable has four levels, AMP, B-ATZ, BMP, and ECTZ, and AMP is selected as the reference level. The choice of AMP as the reference is entirely arbitrary, and alphabetical sorting offers a convenient approach to selecting it. The coefficients for the other levels of bio are interpreted as differences relative to the reference level.

The following are the key coefficients in the model summary:

  • (Intercept): This is the estimated average value of Y when dist is zero and bio is the reference category (AMP). Its p-value (> 0.05) suggests it’s not significantly different from zero.
  • Main Effects:
    • dist: This represents the estimated change in Y for a one-unit increase in dist when the bioregion is the reference category, AMP. The highly significant p-value (< 0.0001) indicates a strong effect of distance in the AMP.
    • bioB-ATZ, bioBMP, bioECTZ: These are dummy variables representing different bioregions. Their coefficients indicate the difference in the average value of Y between each of these bioregions and the reference bioregion when dist is zero. Only bioBMP and bioECTZ are significantly different from the reference bioregion, AMP.
  • Interaction Effects:
    • dist:bioB-ATZ, dist:bioBMP, dist:bioECTZ: These interaction terms capture how the effect of dist on Y varies across different bioregions. For instance, dist:bioB-ATZ indicates the additional change in the effect of dist in the B-ATZ bioregion compared to the reference bioregion, AMP. All interaction terms are highly significant, suggesting the effect of distance is different across bioregions.

Given this explanation, we can now interpret the coefficients of, for example, the bioB-ATZ main effect and the dist:bioB-ATZ interaction. Since AMP is the reference bioregion, its effect is absorbed into the intercept term. Therefore, the coefficient for bioB-ATZ directly reflects the difference we are interested in. The coefficient for bioB-ATZ is -0.0061 \pm 0.0166, i.e. the average value of Y in the B-ATZ bioregion (at dist = 0) is estimated to be 0.0061 lower than in the reference bioregion, AMP, but the associated p-value (> 0.05) indicates that this difference is not significant.

If we’d want to report the actual coefficient for B-ATZ, we’d calculate the sum of the coefficients for (Intercept) and bioB-ATZ. This would give us the estimated average value of Y in the B-ATZ bioregion when dist is zero. The associated SE is calculated as the square root of the sum of the squared SEs of the two coefficients. Therefore, the coefficient for B-ATZ is -8 \times 10^{-4} \pm 0.0171.

The coefficient of 8 \times 10^{-4} for dist:bioB-ATZ indicates that the effect of distance on Y is 8 \times 10^{-4} units greater in the B-ATZ bioregion compared to the AMP bioregion. The SE of 2 \times 10^{-4} suggests a high level of precision in this estimate, and the p-value (< 0.0001) indicates that this difference is statistically significant.

As before, to calculate the actual coefficient for dist in the B-ATZ bioregion, we’d sum the coefficients for dist and dist:bioB-ATZ. The associated SE of this sum is calculated as the square root of the sum of the squared SEs of the two coefficients. Therefore, the coefficient for dist in the B-ATZ bioregion is 0.0012 \pm 2 \times 10^{-4}.
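The same quantities can be computed directly from the fitted model object. Below is a minimal sketch (assuming mod2 is the fitted model shown above); it uses vcov() so that the SE also accounts for the covariance between the two estimates, which the simple square-root-of-summed-squares formula ignores.

# Sketch: combined slope for dist in the B-ATZ bioregion, from mod2
b <- coef(mod2)
V <- vcov(mod2)
est <- b["dist"] + b["dist:bioB-ATZ"]
se  <- sqrt(V["dist", "dist"] +
            V["dist:bioB-ATZ", "dist:bioB-ATZ"] +
            2 * V["dist", "dist:bioB-ATZ"])
c(estimate = unname(est), SE = unname(se))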

Concerning the overall hypothesis, the Adjusted R-squared value of 0.8597 indicates that the model explains 85.97% of the variance in the response variable Y. The F-statistic and associated p-value (< 0.0001) indicate that the model as a whole is highly significant, meaning at least one of the predictors (including interactions) has a significant effect on Y.

The ANOVA table

# The ANOVA table
anova(mod2)
> Analysis of Variance Table
> 
> Response: Y
>            Df Sum Sq Mean Sq F value    Pr(>F)    
> dist        1 8.4199  8.4199 3598.79 < 2.2e-16 ***
> bio         3 3.6232  1.2077  516.21 < 2.2e-16 ***
> dist:bio    3 1.8648  0.6216  265.69 < 2.2e-16 ***
> Residuals 962 2.2507  0.0023                      
> ---
> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The ANOVA table’s interpretation is intuitive and simple: the Pr(>F) column shows the p-value for each predictor in the model. The dist predictor has a highly significant effect on Y (< 0.0001), as do all the bioregions and their interactions with dist. This confirms the results we obtained from the coefficients. We don’t need to overthink this result.

5.8 Example 3: The Final Model

I’ll now expand mod1 to include bio as a predictor alongside augMean, febSD, and augSD (mod1 was applied only to data pertaining to ECTZ, one of the four levels in bio).

\begin{align}
Y &= \alpha + \beta_1 \text{augMean} + \beta_2 \text{febSD} + \beta_3 \text{augSD} \nonumber \\
  &\quad + \beta_4 \text{bio}_{\text{B-ATZ}} + \beta_5 \text{bio}_{\text{BMP}} + \beta_6 \text{bio}_{\text{ECTZ}} \nonumber \\
  &\quad + \beta_7 (\text{augMean} \times \text{bio}_{\text{B-ATZ}}) + \beta_8 (\text{augMean} \times \text{bio}_{\text{BMP}}) \nonumber \\
  &\quad + \beta_9 (\text{augMean} \times \text{bio}_{\text{ECTZ}}) + \beta_{10} (\text{febSD} \times \text{bio}_{\text{B-ATZ}}) \nonumber \\
  &\quad + \beta_{11} (\text{febSD} \times \text{bio}_{\text{BMP}}) + \beta_{12} (\text{febSD} \times \text{bio}_{\text{ECTZ}}) \nonumber \\
  &\quad + \beta_{13} (\text{augSD} \times \text{bio}_{\text{B-ATZ}}) + \beta_{14} (\text{augSD} \times \text{bio}_{\text{BMP}}) \nonumber \\
  &\quad + \beta_{15} (\text{augSD} \times \text{bio}_{\text{ECTZ}}) + \epsilon \label{mod3}
\end{align}

Where:

  • Y: The response variable (mean Sørensen dissimilarity).
  • \alpha: The intercept term, representing the expected value of Y when all predictors are zero and bio is at the reference level (AMP).
  • \beta_1: The coefficient for the main effect of augMean.
  • \beta_2: The coefficient for the main effect of febSD.
  • \beta_3: The coefficient for the main effect of augSD.
  • \beta_4, \beta_5, \beta_6: The coefficients for the main effects of the categorical predictor bio (for levels B-ATZ, BMP, and ECTZ respectively, with AMP as the reference category).
  • \beta_7, \beta_8, \beta_9: The coefficients for the interaction effects between augMean and bio (for levels B-ATZ, BMP, and ECTZ respectively).
  • \beta_{10}, \beta_{11}, \beta_{12}: The coefficients for the interaction effects between febSD and bio (for levels B-ATZ, BMP, and ECTZ respectively).
  • \beta_{13}, \beta_{14}, \beta_{15}: The coefficients for the interaction effects between augSD and bio (for levels B-ATZ, BMP, and ECTZ respectively).
  • \epsilon: The error term, representing the unexplained variability in the response variable.

In this multiple regression model, we aim to understand the complex and interacting relationships between the response variable and the set of predictors. It allows us to investigate not only the individual effects of the continuous predictors on Y, but also how these effects might vary across the different bioregions.

The model therefore incorporates interaction terms between each continuous predictor (augMean, febSD, and augSD) and the categorical variable bio. This allows us to assess whether the relationships between augMean, febSD, or augSD and Y change depending on the specific bioregion. Essentially, we are testing whether the slopes of these relationships are different in different bioregions.

Additionally, the model examines the main effects of the bioregions themselves on Y. This means we’re testing whether the average value of Y differs significantly across bioregions, after accounting for the influence of the continuous predictors.

This is how these different insights pertain to the model components:

  • Main Effects: The coefficients for the main effects of augMean, febSD, and augSD represent the effect of each predictor when bio is at its reference level.
  • Coefficients for bio: The coefficients for bio (e.g., \beta_4 \text{bio}_{\text{B-ATZ}}) represent the difference in the intercept for the corresponding level of bio compared to the reference level.
  • Interaction Terms: The interaction terms allow the slopes of augMean, febSD, and augSD to vary across the different levels of bio. For example, \beta_7 (\text{augMean} \times \text{bio}_{\text{B-ATZ}}) represents how the effect of augMean on Y changes when bio is B-ATZ compared to AMP.

State the Hypotheses

Overall hypothesis

I’ll only state the overall hypothesis for this model, as expanding the individual hypotheses for each predictor and interaction (all the \beta-coefficients in the model equation above) would be quite voluminous.

The null is that there is no relationship between the response variable Y and the predictors (including their interactions):

  • H_0: \beta_1 = \beta_2 = \beta_3 = \beta_4 = \beta_5 = \beta_6 = \beta_7 = \beta_8 = \beta_9 = \beta_{10} = \beta_{11} = \beta_{12} = \beta_{13} = \beta_{14} = \beta_{15} = 0

The alternative is that at least one predictor or interaction term has a significant relationship with the response variable Y:

  • H_A: \text{At least one } \beta_i \neq 0 \text{ for } i \in \{1, 2, ..., 15\}

Fit the Model

In Section 5.6 I included only the ECTZ seaweed flora in my analysis, but here I expand it to the full dataset. To assure myself that there is not a high degree of multicollinearity between the predictors, I calculated the variance inflation factors (VIFs) for the full model (not shown). This allowed me to retain the same three predictors used in mod1, i.e. augMean, febSD, and augSD. This is the point of departure for mod3.
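The VIF check itself is quick to do. A minimal sketch, assuming the car package is available (the full predictor set is not listed here, so the three retained predictors serve as the illustration):

# Sketch: variance inflation factors among the retained continuous predictors
library(car)
vif(lm(Y ~ augMean + febSD + augSD, data = sw))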

Now I fit the model with those three continuous predictors and their interactions with the categorical variable bio.

# Make a dataframe with only the relevant columns
sw_sub2 <- sw |>
  dplyr::select(Y, augMean, febSD, augSD, bio)

# Fit the multiple linear regression model with interaction terms
full_mod3 <- lm(Y ~ (augMean + febSD + augSD) * bio, data = sw_sub2)
full_mod3a <- lm(Y ~ augMean + febSD + augSD, data = sw_sub2)
null_mod3 <- lm(Y ~ 1, data = sw_sub2)

Model full_mod3a is similar to full_mod3 but without the interaction terms. This will allow me to compare the two models and assess the importance of the interactions.

# Compare the models
anova(full_mod3, full_mod3a)
> Analysis of Variance Table
> 
> Model 1: Y ~ (augMean + febSD + augSD) * bio
> Model 2: Y ~ augMean + febSD + augSD
>   Res.Df    RSS  Df Sum of Sq      F    Pr(>F)    
> 1    954 3.5603                                   
> 2    966 5.6890 -12   -2.1288 47.535 < 2.2e-16 ***
> ---
> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
AIC(full_mod3, full_mod3a)
>            df       AIC
> full_mod3  17 -2652.498
> full_mod3a  5 -2221.852

The AIC value for full_mod3 is lower than that of full_mod3a, indicating that the interactions with bio improve the model. The ANOVA comparison likewise shows that the full model (with the lower residual sum of squares) is significantly better than the reduced model.

I therefore use full_mod3 going forward. This is a complex model, so I used the stepwise selection function, stepAIC(), to identify the most important predictors and interactions (code and output not shown). I had hoped this might simplify the model somewhat, but no terms were dropped.
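For completeness, the stepwise selection could be run along the following lines; this is a sketch only, as the exact call is not shown above, and it assumes the MASS package.

# Sketch: stepwise selection by AIC, searching between the null and full models
library(MASS)
mod3 <- stepAIC(full_mod3,
                scope = list(lower = null_mod3, upper = full_mod3),
                direction = "both", trace = FALSE)
formula(mod3)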

Interpret the Model

The model summary

The model summary provides a detailed look at the individual predictors and their interactions in the model.

# Summary of the model
summary(mod3) # full_mod3 renamed to mod3 during stepAIC()
> 
> Call:
> lm(formula = Y ~ augMean + bio + augSD + febSD + augMean:bio + 
>     bio:augSD + bio:febSD, data = sw_sub2)
> 
> Residuals:
>      Min       1Q   Median       3Q      Max 
> -0.15399 -0.03841 -0.01475  0.03464  0.24051 
> 
> Coefficients:
>                    Estimate Std. Error t value Pr(>|t|)    
> (Intercept)       0.0299094  0.0062756   4.766 2.17e-06 ***
> augMean           0.3441099  0.0158575  21.700  < 2e-16 ***
> bioB-ATZ         -0.0459611  0.0242519  -1.895 0.058374 .  
> bioBMP            0.0160756  0.0100749   1.596 0.110906    
> bioECTZ          -0.0015444  0.0090275  -0.171 0.864197    
> augSD            -0.0059012  0.0034011  -1.735 0.083044 .  
> febSD            -0.0006481  0.0027954  -0.232 0.816706    
> augMean:bioB-ATZ -0.0461775  0.0874044  -0.528 0.597400    
> augMean:bioBMP   -0.2406297  0.0211404 -11.382  < 2e-16 ***
> augMean:bioECTZ  -0.0607745  0.0189030  -3.215 0.001348 ** 
> bioB-ATZ:augSD    0.0655983  0.0371033   1.768 0.077382 .  
> bioBMP:augSD      0.0410220  0.0114706   3.576 0.000366 ***
> bioECTZ:augSD     0.0280513  0.0053752   5.219 2.21e-07 ***
> bioB-ATZ:febSD    0.0409425  0.0818927   0.500 0.617223    
> bioBMP:febSD      0.0056433  0.0150126   0.376 0.707070    
> bioECTZ:febSD     0.0502867  0.0082266   6.113 1.43e-09 ***
> ---
> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
> 
> Residual standard error: 0.06109 on 954 degrees of freedom
> Multiple R-squared:  0.7797,  Adjusted R-squared:  0.7762 
> F-statistic: 225.1 on 15 and 954 DF,  p-value: < 2.2e-16

The first thing to notice is that the model formula has been rewritten during the stepwise selection process (although none of the terms were deemed unimportant and removed):

  • Initial specification: Y ~ (augMean + febSD + augSD) * bio
  • Specification after stepAIC(): Y ~ augMean + bio + augSD + febSD + augMean:bio + bio:augSD + bio:febSD

Functionally, these two specifications are identical; only the order in which the terms are listed differs. Although this affects the order in which the coefficients appear in the summary output, the coefficients themselves are unchanged. The coefficients are:

  • (Intercept): This is the estimated average value of Y when all predictor variables are zero and the observation is in the reference bioregion (AMP).
  • Main Effects:
    • augMean: For every one-unit increase in augMean, Y increases by 0.3441, on average, assuming all other predictors are held constant. This effect is highly significant.
    • augSD and febSD: The main effects of these variables are not statistically significant, suggesting they might not have a direct impact on Y when averaged across all bioregions.
    • bioB-ATZ, bioBMP, bioECTZ: These coefficients represent the average difference in Y between each of these bioregions and the reference bioregion, when the continuous predictors are held at zero.
  • Interaction Effects:
    • augMean interactions: The significant interactions of augMean with bioregion indicate that the effect of augMean on Y varies across bioregions. Notably, the interaction with bioBMP has a strong, significant negative effect, suggesting that the positive effect of augMean is much weaker in this bioregion compared to the reference.
    • augSD and febSD interactions: Several of these interactions with bioregion are significant, supporting the alternative hypothesis that the effects of augSD and febSD on Y depend on the specific bioregion.

Since dummy coding returns differences with respect to reference levels, how would we calculate the actual coefficients for, say, augMean? Since there are significant interaction effects, we must consider the main effect of augMean in conjunction with bioregion.

For bio = B-ATZ:

  • \beta_{\text{augMean}} + \beta_{\text{augMean:bioB-ATZ}} = 0.3441099 + (-0.0461775) = 0.2979324

For bio = BMP:

  • \beta_{\text{augMean}} + \beta_{\text{augMean:bioBMP}} = 0.3441099 + (-0.2406297) = 0.1034802

For bio = ECTZ:

  • \beta_{\text{augMean}} + \beta_{\text{augMean:bioECTZ}} = 0.3441099 + (-0.0607745) = 0.2833354

The respective SEs for these combined coefficients can be approximated using the formula for the standard error of a sum (note that this simple form ignores the covariance between the two estimates). For example:

  • SE_{\text{augMean, B-ATZ}} = \sqrt{SE_{\text{augMean}}^2 + SE_{\text{augMean:bioB-ATZ}}^2}
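Rather than assembling these sums and SEs by hand, the per-bioregion slopes can also be extracted directly from the fitted model. A sketch, assuming the emmeans package:

# Sketch: slope of augMean within each bioregion, with SEs that account for
# the covariance between the main-effect and interaction estimates
library(emmeans)
emtrends(mod3, ~ bio, var = "augMean")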

The ANOVA table

The ANOVA table assesses the overall significance of groups of predictors or the sequential addition of predictors to the model.

anova(mod3)
> Analysis of Variance Table
> 
> Response: Y
>              Df Sum Sq Mean Sq  F value    Pr(>F)    
> augMean       1 9.9900  9.9900 2676.902 < 2.2e-16 ***
> bio           3 1.1901  0.3967  106.296 < 2.2e-16 ***
> augSD         1 0.1393  0.1393   37.331 1.451e-09 ***
> febSD         1 0.0053  0.0053    1.422    0.2334    
> augMean:bio   3 0.7910  0.2637   70.647 < 2.2e-16 ***
> bio:augSD     3 0.3426  0.1142   30.602 < 2.2e-16 ***
> bio:febSD     3 0.1401  0.0467   12.517 4.953e-08 ***
> Residuals   954 3.5603  0.0037                       
> ---
> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The ANOVA table shows that every term except the main effect of febSD (p = 0.2334) is highly significant (p < 0.0001). Taken together, this indicates that the model as a whole is a good fit for the data.

Reporting

Here is what the reporting of the findings could look like in the Results section in your favourite journal.

Results

A multiple linear regression model examining the effects of the August climatological mean temperature (augMean), the August and February climatological SD of temperature (augSD and febSD, respectively), and the bioregion classification (bio) on the response variable, the Sørensen dissimilarity (Y), including their interaction terms, revealed several significant findings (Table 5.1). This model allows a separate regression slope for each predictor within the bioregions (Figure 5.7). The model explains a substantial portion of the variance in Y (R^2 = 0.780, adjusted R^2 = 0.776), and the overall model fit is highly significant (F(15, 954) = 225.1, p < 0.0001).

Coefficient          Estimate   Std. Error   t value    P-value
(Intercept)            0.0299       0.0063     4.766    < 0.0001 ***
augMean                0.3441       0.0159    21.700    < 0.0001 ***
bioB-ATZ              -0.0460       0.0243    -1.895    > 0.05
bioBMP                 0.0161       0.0101     1.596    > 0.05
bioECTZ               -0.0015       0.0090    -0.171    > 0.05
augSD                 -0.0059       0.0034    -1.735    > 0.05
febSD                 -0.0006       0.0028    -0.232    > 0.05
augMean:bioB-ATZ      -0.0462       0.0874    -0.528    > 0.05
augMean:bioBMP        -0.2406       0.0211   -11.382    < 0.0005 ***
augMean:bioECTZ       -0.0608       0.0189    -3.215    < 0.005 **
bioB-ATZ:augSD         0.0656       0.0371     1.768    > 0.05
bioBMP:augSD           0.0410       0.0115     3.576    < 0.0005 ***
bioECTZ:augSD          0.0281       0.0054     5.219    < 0.0005 ***
bioB-ATZ:febSD         0.0409       0.0819     0.500    > 0.05
bioBMP:febSD           0.0056       0.0150     0.376    > 0.05
bioECTZ:febSD          0.0503       0.0082     6.113    < 0.0005 ***
Table 5.1: Summary of the multiple linear regression model examining the effects of augMean, augSD, febSD, and bio on Y.

The main effect of augMean was highly significant (Estimate = 0.3441, p < 0.0001), indicating a strong positive relationship with Y. The interaction terms augMean:bioBMP (Estimate = -0.2406, p < 0.0001) and augMean:bioECTZ (Estimate = -0.0608, p < 0.005) were also significant, suggesting that the effect of augMean on Y differs in the BMP and ECTZ bioregions compared to the reference category (AMP). The bioBMP (Estimate = 0.0161, p > 0.05) and bioECTZ (Estimate = -0.0015, p > 0.05) terms were not significant, indicating no significant difference in the average value of Y relative to AMP.

For augSD, the main effect was not significant (Estimate = -0.0059, p > 0.05). Significant interaction terms for bioBMP:augSD (Estimate = 0.0410, p < 0.001) and bioECTZ:augSD (Estimate = 0.0281, p < 0.0001) indicate that the effect of augSD on Y varies by bioregion.

The main effect of febSD was not significant (Estimate = -0.0006, p > 0.05), suggesting no direct relationship with Y. However, the interaction term bioECTZ:febSD (Estimate = 0.0503, p < 0.0001) was significant, indicating that the effect of febSD on Y differs for the ECTZ bioregion.

The ANOVA further highlights the overall significance of each predictor. augMean had a highly significant contribution to the model (F = 2676.902, p < 0.0001), as did bio (F = 106.296, p < 0.0001), and their interactions (augMean:bio, F = 70.647, p < 0.0001; bio:augSD, F = 30.602, p < 0.0001; bio:febSD, F = 12.517, p = 4.953 \times 10^{-8}). The main effect of augSD was also significant (F = 37.331, p = 1.451 \times 10^{-9}), while febSD did not significantly contribute to the model on its own (F = 1.422, p = 0.2334).

These findings suggest that the effects of augMean, augSD, and febSD on Y are influenced by the bioregional classification, with significant variations in the relationships depending on the specific bioregion.

Figure 5.7: Individual linear regressions fitted to augMean, febSD, and augSD within each bioregion as predictors of the seaweed species composition.

5.9 Alternative Categorical Variable Coding Schemes (Contrasts)

Throughout the book, we have used dummy variable coding to specify the categorical variables in the multiple linear regression models. But should dummy variable coding not be to your liking, there are other coding schemes that can be used to represent the categorical variables. These alternative coding schemes are known as contrasts. The choice of contrast coding can affect the interpretation of the regression coefficients.

I’ll provide some synthetic data to illustrate a few different contrasts. The data consist of a continuous variable x, a categorical variable cat_var with four levels, and a response variable y that has some relationship with x and cat_var. I’ll use dummy variable coding as the reference (haha!).
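The full simulation code is not shown; the sketch below illustrates one way data with this structure could be generated. The seed, the group effects, and the slope are illustrative assumptions, not necessarily the values actually used.

# Sketch: one continuous predictor, one 4-level categorical predictor, and a
# response built from assumed (illustrative) coefficients
set.seed(123)  # assumed seed
n <- 100
x <- rnorm(n)
cat_var <- factor(sample(LETTERS[1:4], n, replace = TRUE))
group_eff <- c(A = 0, B = -1.7, C = -3.9, D = -5.5)  # assumed level effects
y <- 2.8 + 1.8 * x + unname(group_eff[as.character(cat_var)]) + rnorm(n)
data <- data.frame(y, x, cat_var)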

head(data)
>            y           x cat_var
> 1  0.6667876 -0.56047565       B
> 2  1.3086873 -0.23017749       B
> 3  0.4496192  1.55870831       D
> 4  2.1326402  0.07050839       A
> 5 -2.8608771  0.12928774       D
> 6  0.1497346  1.71506499       D

Categorical variable coding (whatever the scheme) affects only the interpretation of the categorical main effects and their interactions, so I’ll not discuss the coefficient associated with the continuous variable x (the slope) in the explanations offered below.

Dummy Variable Coding (Treatment Contrasts)

This is the most commonly used coding scheme, and lm()’s default. One level is the reference category (A) and the other levels are compared against it. Contrast matrices can be assigned and/or inspected using the contrasts() function. For the dummy coding, the reference level A will remain 0 and the other levels will be independently coded as 1 in three columns. You’ll now understand why, when we have four levels within a categorical variable, we only need three dummy variables to represent them.

# Dummy coding (treatment coding) ... default
contrasts(data$cat_var)
>   B C D
> A 0 0 0
> B 1 0 0
> C 0 1 0
> D 0 0 1

When we have four levels in a categorical variable, there are three dummy variable columns in the contrast matrix. The first row, consisting of all zeros (0, 0, 0), represents the reference level, which in this case is A. The other rows represent the different levels of the categorical variable, with a 1 in the respective column indicating that level. For example, level A is represented by (0, 0, 0), B by (1, 0, 0), C by (0, 1, 0), and D by (0, 0, 1). In the regression model, these contrasts are used to estimate the differences between each level and the reference level. Specifically, the first contrast column indicates that the coefficient for this column will represent the difference between the mean of the response variable for level B and the mean for the reference level A, holding all other variables constant. Similarly, the second and third columns represent the differences between levels C and A, and D and A, respectively. This coding allows for a straightforward interpretation of how each level of the categorical variable affects the response variable relative to the reference level.

model_dummy <- lm(y ~ x + cat_var, data = data)
summary(model_dummy)
> 
> Call:
> lm(formula = y ~ x + cat_var, data = data)
> 
> Residuals:
>     Min      1Q  Median      3Q     Max 
> -1.6615 -0.6297 -0.1494  0.4978  2.9305 
> 
> Coefficients:
>             Estimate Std. Error t value Pr(>|t|)    
> (Intercept)   2.8176     0.1635  17.232  < 2e-16 ***
> x             1.8274     0.1040  17.572  < 2e-16 ***
> cat_varB     -1.7201     0.2499  -6.883 6.24e-10 ***
> cat_varC     -3.9056     0.2678 -14.586  < 2e-16 ***
> cat_varD     -5.4880     0.2512 -21.850  < 2e-16 ***
> ---
> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
> 
> Residual standard error: 0.9246 on 95 degrees of freedom
> Multiple R-squared:  0.887,   Adjusted R-squared:  0.8822 
> F-statistic: 186.4 on 4 and 95 DF,  p-value: < 2.2e-16

The model summary shows that the coefficients for cat_varB, cat_varC, and cat_varD represent the differences in the mean of the response variable y between the reference category A and categories B, C, and D, respectively, while controlling for the effect of the continuous variable x.

Interpretation:

  • (Intercept) (2.8176): The intercept represents the estimated mean value of the response (y) when x is zero and the categorical variable is at the reference level A. This is the baseline from which other categories are compared.
  • x (1.8274): For each one-unit increase in x, y is expected to increase by 1.8274 units, holding the categorical variable constant. This effect is consistent across all levels of the categorical variable because the model does not have an interaction effect present.
  • cat_varB (-1.7201): On average, the value of y for level B is 1.7201 units lower than that for the reference level A, when x is held constant. This corresponds to the (1, 0, 0) row in the contrast matrix.
  • cat_varC (-3.9056): Similarly, on average, the value of y for level C is 3.9056 units lower than that for the reference level, when x is held constant. This corresponds to the (0, 1, 0) row in the contrast matrix.
  • cat_varD (-5.4880): Lastly, on average, the value of y for level D is 5.4880 units lower than that for the reference level, when x is held constant. This corresponds to the (0, 0, 1) row in the contrast matrix.

All these coefficients are highly significant (p < 0.0001), indicating strong evidence for differences between each category and the reference category A.

The model explains a large proportion of the variance in y (Adjusted R-squared: 0.8822), suggesting a good fit. The F-statistic (186.4) with a very low p-value (< 0.0001) indicates that the model as a whole is statistically significant.

If you want to change the reference level, you can use the relevel() function. For example, to change the reference level of the cat_var variable to C, you can use:

# Set "C" as the reference level for cat_var
data$cat_var <- relevel(data$cat_var, ref = "C")
contrasts(data$cat_var)
>   A B D
> C 0 0 0
> A 1 0 0
> B 0 1 0
> D 0 0 1

This may be useful when you want to compare the other levels to a different reference level.

Effect Coding (Sum Contrasts)

This coding method compares the levels of a categorical variable to the overall mean of the dependent variable. The coefficients represent the difference between each level and the grand mean. Instead of using 0 and 1 as we did with dummy variable coding, effect coding uses -1, 0, and 1 to represent the different levels of the categorical variable.

# Rebuild the data frame from the original vectors to reset the reference level to "A"
data <- data.frame(y, x, cat_var)

# Effect coding
contrasts(data$cat_var) <- contr.sum(4)
contrasts(data$cat_var)
>   [,1] [,2] [,3]
> A    1    0    0
> B    0    1    0
> C    0    0    1
> D   -1   -1   -1

In effect coding (sum contrasts), each level of the categorical variable is compared to the overall mean rather than a specific reference category. This contrast matrix with four levels (A, B, C, D) and three columns can be interpreted as follows:

  • Level A (1, 0, 0): The first row indicates that level A is included in the first contrast (cat_var1), which means the mean of level A is being compared to the overall mean. Since the other columns are zero, level A does not contribute to the other contrasts.
  • Level B (0, 1, 0): The second row indicates that level B is included in the second contrast (cat_var2). The mean of level B is being compared to the overall mean, and it does not contribute to the other contrasts.
  • Level C (0, 0, 1): The third row indicates that level C is included in the third contrast (cat_var3). The mean of level C is being compared to the overall mean, and it does not contribute to the other contrasts.
  • Level D (-1, -1, -1): The fourth row is a balancing row, ensuring that the sum of the contrasts for each level equals zero. This indicates that level D is being compared to the overall mean indirectly by balancing the contributions of levels A, B, and C.

model_effect <- lm(y ~ x + cat_var, data = data)
summary(model_effect)
> 
> Call:
> lm(formula = y ~ x + cat_var, data = data)
> 
> Residuals:
>     Min      1Q  Median      3Q     Max 
> -1.6615 -0.6297 -0.1494  0.4978  2.9305 
> 
> Coefficients:
>             Estimate Std. Error t value Pr(>|t|)    
> (Intercept)  0.03921    0.09452   0.415    0.679    
> x            1.82741    0.10400  17.572  < 2e-16 ***
> cat_var1     2.77844    0.14968  18.563  < 2e-16 ***
> cat_var2     1.05832    0.16329   6.481 4.04e-09 ***
> cat_var3    -1.12720    0.17765  -6.345 7.53e-09 ***
> ---
> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
> 
> Residual standard error: 0.9246 on 95 degrees of freedom
> Multiple R-squared:  0.887,   Adjusted R-squared:  0.8822 
> F-statistic: 186.4 on 4 and 95 DF,  p-value: < 2.2e-16

Interpretation:

  • (Intercept) (0.03921): The intercept represents the grand mean of the response variable (y) when x is zero (more precisely, the unweighted mean of the four level means). Since it is not statistically significant (p > 0.05), the overall mean is not significantly different from zero.
  • x (1.82741): For each one-unit increase in (x), the response (y) increases by approximately 1.82741 units. This effect is highly significant (p < 0.0001).
  • cat_var1 (2.77844): Level A has a mean (y) that is 2.77844 units higher than the grand mean. This effect is highly significant (p < 0.0001).
  • cat_var2 (1.05832): Level B has a mean (y) that is 1.05832 units higher than the grand mean. This effect is also highly significant (p < 0.0001).
  • cat_var3 (-1.12720): Level C has a mean (y) that is 1.12720 units lower than the grand mean. This effect is highly significant (p < 0.0001).
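Level D does not get its own coefficient, but because the deviations under sum contrasts add to zero, its deviation from the grand mean can be recovered as the negative sum of the three printed coefficients. A minimal sketch:

# Sketch: recover level D's deviation from the grand mean under sum contrasts
b <- coef(model_effect)
-(b["cat_var1"] + b["cat_var2"] + b["cat_var3"])  # approx. -2.71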

All these coefficients are highly significant (p < 0.0001), indicating strong evidence for differences between each category and the overall mean of all levels.

The model explains a large proportion of the variance in y (Adjusted R-squared: 0.8822), suggesting a good fit. The F-statistic (186.4) with a very low p-value (< 0.0001) indicates that the model as a whole is statistically significant.

Helmert Coding

Helmert coding compares levels of a categorical variable with the mean of other levels in sequence. Note that R’s built-in contr.helmert() compares each level (from the second onwards) with the mean of the levels that precede it; some texts call this reverse Helmert or difference coding. It is useful for testing ordered differences.

# Helmert coding
contrasts(data$cat_var) <- contr.helmert(4)
contrasts(data$cat_var)
>   [,1] [,2] [,3]
> A   -1   -1   -1
> B    1   -1   -1
> C    0    2   -1
> D    0    0    3

The contrast matrix for a categorical variable with four levels (A, B, C, D) has three columns, and each column defines one comparison:

  • Column 1 (cat_var1): contrasts level B against level A (A = -1, B = 1, C and D = 0).
  • Column 2 (cat_var2): contrasts level C against the mean of levels A and B (A = B = -1, C = 2, D = 0).
  • Column 3 (cat_var3): contrasts level D against the mean of levels A, B, and C (A = B = C = -1, D = 3).

Because the columns are not scaled to unit differences, each regression coefficient equals the corresponding difference divided by the number of levels involved in that comparison (2, 3, and 4, respectively).

model_helmert <- lm(y ~ x + cat_var, data = data)
summary(model_helmert)
> 
> Call:
> lm(formula = y ~ x + cat_var, data = data)
> 
> Residuals:
>     Min      1Q  Median      3Q     Max 
> -1.6615 -0.6297 -0.1494  0.4978  2.9305 
> 
> Coefficients:
>             Estimate Std. Error t value Pr(>|t|)    
> (Intercept)  0.03921    0.09452   0.415    0.679    
> x            1.82741    0.10400  17.572  < 2e-16 ***
> cat_var1    -0.86006    0.12495  -6.883 6.24e-10 ***
> cat_var2    -1.01519    0.08206 -12.371  < 2e-16 ***
> cat_var3    -0.90319    0.05477 -16.491  < 2e-16 ***
> ---
> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
> 
> Residual standard error: 0.9246 on 95 degrees of freedom
> Multiple R-squared:  0.887,   Adjusted R-squared:  0.8822 
> F-statistic: 186.4 on 4 and 95 DF,  p-value: < 2.2e-16

Interpretation:

  • (Intercept) (0.03921): The grand mean of y when x is zero.
  • x (1.82741): For each unit increase in x, y increases by 1.82741 units.
  • cat_var1 (-0.86006): Half the difference between the means of levels B and A; level B’s mean is therefore about 1.72 units lower than level A’s (matching the cat_varB coefficient under dummy coding).
  • cat_var2 (-1.01519): One third of the difference between the mean of level C and the combined mean of levels A and B; level C is about 3.05 units lower than that combined mean.
  • cat_var3 (-0.90319): One quarter of the difference between the mean of level D and the combined mean of levels A, B, and C; level D is about 3.61 units lower than that combined mean.
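These interpretations can be checked numerically against the level deviations estimated under the sum contrasts above. A minimal sketch, using the rounded coefficients from model_effect:

# Sketch: reconstruct the Helmert coefficients from the adjusted level effects
A <- 2.77844; B <- 1.05832; C <- -1.12720; D <- -(A + B + C)
(B - A) / 2                # ~ -0.860 (cat_var1)
(C - (A + B) / 2) / 3      # ~ -1.015 (cat_var2)
(D - (A + B + C) / 3) / 4  # ~ -0.903 (cat_var3)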

The interpretation of the overall model remains much the same as before:

All these coefficients are highly significant (p < 0.0001), indicating strong evidence for differences between each level and the mean of the levels that precede it.

The model explains a large proportion of the variance in y (Adjusted R-squared: 0.8822), suggesting a good fit. The F-statistic (186.4) with a very low p-value (< 0.0001) indicates that the model as a whole is statistically significant.
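Finally, recall the earlier point that the contrast scheme affects only the categorical terms. A short check (a sketch) confirming that the slope for x is identical across the three fitted models:

# Sketch: the coefficient for x is unchanged by the choice of contrasts
c(dummy   = unname(coef(model_dummy)["x"]),
  effect  = unname(coef(model_effect)["x"]),
  helmert = unname(coef(model_helmert)["x"]))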

5.10 Exercises

Task G

Use the data loaded at the start of this chapter for this task.

In this task you will develop data analysis, undertake model building, and provide an interpretation of the findings. Your goal is to explore the species composition and assembly processes of the seaweed flora around the coast of South Africa. See Smit et al. (2017) for more information about the data and the analysis.

  1. Analysis: Please develop multiple linear regression models for the seaweed species composition (\beta_\text{sim} and \beta_\text{sne}, i.e. the columns called Y1 and Y2, respectively) using all the predictors in this dataset. At the end, the final model(s) that best describe(s) the species assembly processes operating along the South African coast should be presented. The final model may or may not contain all the predictors in the dataset, and it is your goal to justify the variable and model selection.

    • Accomplishing task 1 will require that you work through the whole model-building process as outlined in the chapter. This includes the following steps:
      • Data exploration and visualisation (EDA)
      • Model building (providing hypothesis statements, variable selection using VIF and forward selection, comparisons of nested models, justifications for model selection)
      • Model diagnostics
      • Explanation of summary() and anova() outputs
      • Producing the Results section
      • [60%]
  2. Interpretation: Once you have arrived at the best model, discuss your findings in light of the appropriate ecological hypotheses that explain the relationships between the predictors and the seaweed species composition. Include insights drawn from the analysis of \beta_\text{sør} that I developed in this chapter, and also draw on the theory developed for the lecture material presented by the class in Task A2.

    • Accomplishing task 2 is thus all about model interpretation and discussing the ecological relevance of the results.
    • [40%]

The format of this task is a Quarto file that will be converted to an HTML file. The HTML file will contain the graphs, all calculations, and the text sections. The task should be written up as a publication (i.e. use appropriate headings) using a journal style of your choice. Aside from this, there are no limitations.