Intro

We use models to explain relationships between variables and to make predictions. For now, we will focus on linear models (but remember that there are other types of models too!).

Today, we will focus on fitting and interpreting models with a continuous outcome variable and multiple predictors. This type of model is often called a multivariable (not multivariate) model. We will again use functions from the broom package; in particular, the augment function to get model predictions and check assumptions for our linear regression model.

Today’s data

Today’s data come from a series of atmospheric measurements taken at the Agriculture Exhibition Hall in Beijing from 2013 to 2017. The following variables are in the dataset:

  • year, month, day, and hour: when the observation was taken
  • PM2.5 and PM10: levels of fine particulate matter in the atmosphere, in \(\mu\)g/m\(^3\)
  • SO2, NO2, CO, O3: sulfur dioxide, nitrogen dioxide, carbon monoxide, and ozone concentrations, in \(\mu\)g/m\(^3\), respectively
  • TEMP: temperature, in degrees C
  • PRES: barometric pressure, in hPa
  • DEWP: dew point temperature, in degrees C
  • RAIN: precipitation levels, in mm
  • wd: wind direction (eight compass directions)
  • WSPM: wind speed, in m/s

These data are adapted from a dataset released by Song Xi Chen, as referenced in Zhang, S., et al. (2017), Proc. Royal Soc. A. Let’s explore some of these data.

Last time, we created a linear model using ordinary least squares that predicts the dew point given the current barometric pressure, and interpreted the slope and intercept parameter estimates as follows:

library(tidyverse)  # for the pipe, dplyr verbs, and ggplot2 (if not already loaded)
library(broom)      # for tidy(), glance(), and augment()

mod1 <- lm(DEWP ~ PRES, data = beijing)
tidy(mod1) %>%
  mutate_if(is.numeric, round, 3)
  • Intercept: if the barometric pressure is 0, we would expect the dew point to be 1071 degrees Celsius, on average.

  • Slope: for each additional hectopascal of barometric pressure, we would expect the dew point to decrease by 1.08 degrees Celsius.

Adding more predictors

We have a pretty good idea of what the dew point would be, conditional on the value of the barometric pressure. But if we also knew the temperature and precipitation, we might do even better! Let’s add some more predictors to our model:

mod2 <- lm(DEWP ~ PRES + TEMP + RAIN, data = beijing)

In general, the underlying linear model with multiple predictors is given by

\[y_i = \beta_0 + \beta_1 x_{1i} + \beta_2x_{2i} + \cdots + \beta_px_{pi} + \epsilon_i\]

  • \(p\) is the total number of predictor or explanatory variables
  • \(y_i\) is the outcome variable for individual \(i\)
  • \(\beta_0\) is the intercept parameter
  • \(\beta_1, \beta_2, \cdots, \beta_p\) are the slope parameters
  • \(x_{1i}, \cdots, x_{pi}\) are the predictor variables
  • \(\epsilon_i\) is the unobserved error term

Interpretations for slope estimates are made conditionally on other predictors in the model. In this way, we can adjust for potential confounders in our model:

\[\widehat{DEWP} = 361 - 0.364 PRES + 0.723 TEMP + 1.15 RAIN\]

  • If the barometric pressure, temperature, and precipitation are all 0, then we would expect a dew point of 361 degrees Celsius (hmm…).

  • For each additional hPa increase in barometric pressure, we would expect a 0.364 C decrease in dew point (on average), holding temperature and precipitation constant.

  • For each additional degree C increase in temperature, we would expect a 0.723 C increase in dew point (on average), holding barometric pressure and precipitation constant.

  • For each additional mm increase in precipitation, we would expect a 1.15 C increase in dew point (on average), holding barometric pressure and temperature constant.

Inference with multiple predictors

We may again test hypotheses that \(\beta\) coefficients are equal to 0 (vs. the alternative that they are not equal to 0). Be sure that any interpretations for slopes are made conditionally on the other predictors in the model (that is, holding them constant/adjusting for them, etc.). Under the null hypothesis, each slope coefficient should follow a \(t\) distribution with \(n - p - 1\) degrees of freedom (where \(n\) is the total sample size and \(p\) is the number of predictors in your model).

Let’s check the tidy model output from before:
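tidy(mod2) %>%
  mutate_if(is.numeric, round, 3)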

Is there sufficient evidence that barometric pressure has a linear association with dew point, while adjusting for temperature and precipitation?

Yes. At the \(\alpha\) = 0.05 level, we reject the null hypothesis. There is sufficient evidence of such an association (i.e., that the \(\beta\) coefficient is not equal to 0).
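As a quick sketch of where these p-values come from, we can reproduce them by hand from the \(t\) statistics in the tidy output, using the model’s residual degrees of freedom:

tidy(mod2) %>%
  mutate(p_manual = 2 * pt(abs(statistic), df = df.residual(mod2),
                           lower.tail = FALSE))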

\(R^2\) and \(R^2_{adj}\)

We can once again glance at the model output to obtain model-specific summaries. Let’s take a look at the \(R^2\) for our model:
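glance(mod2) %>%
  select(r.squared, adj.r.squared)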

We see that approximately 70% of the variability in dew point is explained by its linear relationship with barometric pressure, temperature, and precipitation, which is higher than the 60% explained by barometric pressure alone.

However, \(R^2\) can never decrease when variables are added to a model, even if they are useless for prediction. To account for this, we adjust for the number of predictors in our model to obtain the adjusted \(R^2\). \(R^2_{adj}\) incorporates a penalty for each additional predictor, so it will go down if a new variable does not meaningfully improve prediction. However, note that \(R^2_{adj}\) cannot be interpreted as explained variability.
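Concretely, with \(n\) observations and \(p\) predictors, the standard adjustment is

\[R^2_{adj} = 1 - (1 - R^2)\frac{n - 1}{n - p - 1}\]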

Predicting new values

Our fitted model is given by

\[\widehat{DEWP} = 360.7 - 0.36 PRES + 0.72 TEMP + 1.15 RAIN,\]

and can be used to predict dew points given the barometric pressure, temperature, and precipitation (simply plug into the model).

Exercise 1

Suppose the barometric pressure, temperature, and precipitation are equal to the mean values in our dataset. What would the predicted dew point be? You may use the following space to code your answer.

mod2 <- lm(DEWP ~ PRES + TEMP + RAIN, data = beijing)

...

Is this prediction close to the mean dew point in our dataset?

Yes, our estimate of 2.37 degrees C is quite close to the mean dew point in the dataset, which is equal to 2.38.
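We can verify the mean dew point directly:

beijing %>% 
  summarize(mean_DEWP = mean(DEWP, na.rm = TRUE))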

Augmenting model results

We can actually obtain the predicted values using code, without having to deal with messy algebra. We will use the augment function. Let’s try it out on our model object:
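augment(mod2)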

Notice that the augment function provides fitted values and residuals (and more) for every observation in our dataset.

The mean pressure, temperature, and precipitation in our dataset are 1012.6 hPa, 13.55 C, and 0.074 mm, respectively (did you obtain these results using the summarize function in a pipeline?). Let’s create a new tibble and “feed” it into our mod2 model object, which contains our multiple linear regression model:

mean_vals <- beijing %>% 
  summarize(PRES = mean(PRES, na.rm = TRUE),
            TEMP = mean(TEMP, na.rm = TRUE),
            RAIN = mean(RAIN, na.rm = TRUE))

# se_fit = TRUE returns the standard error of each prediction (.se.fit)
augment(mod2, newdata = mean_vals, se_fit = TRUE)

Note that the new dataset has the same variable names as those found in our model. Our model predicts a dew point of 2.37 C, with a standard error of 0.195 C for this prediction.

How might we use these results to create interval estimates?

We can construct a confidence interval by taking our point estimate .fitted, and adding/subtracting the margin of error given by the confidence multiplier times the standard error of our estimate, .se.fit.
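For example, here is a sketch of a 95% confidence interval for the mean dew point at these predictor values, using the \(t\) multiplier with the model’s residual degrees of freedom:

augment(mod2, newdata = mean_vals, se_fit = TRUE) %>% 
  mutate(lower = .fitted - qt(0.975, df.residual(mod2)) * .se.fit,
         upper = .fitted + qt(0.975, df.residual(mod2)) * .se.fit)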

Model assumptions

So far, we have blindly fit linear models without checking any assumptions. However, like all statistical techniques, the linear model relies on a few main assumptions:

  • The outcomes \(y_i\) are independent. This means that observations do not “influence” each other in any way (e.g., repeated outcome measures made on the same individual would violate this)

  • A linear relationship holds between the outcome and the predictors (though we can relax this to some extent)

  • The error terms are normally distributed, with mean zero

  • The error terms have constant variance across all predicted/fitted values

Many of the assumptions deal with the error term and fitted values, which we might estimate using the residuals and predictions from our model. Luckily, we can use the augment function from the broom package to calculate these. Let’s consider the multiple regression model from earlier, focusing on the .fitted and .resid variables in this data frame:

augment(mod2) %>% 
  select(.fitted, .resid)
  • .fitted provides the predicted value of the response variable (for the given observation)

  • .resid provides the residual for the observation

Let’s evaluate these four assumptions.

Independence

Unfortunately, in many cases we must simply think about how the data are collected. In this case, we have atmospheric measurements through time at a single station. Independence might not be satisfied – we might imagine that if the dew point is high today, it’ll probably still be high tomorrow. Note that the data are already sorted in chronological order. Let’s plot the dew point through time, examining the first three hundred data points:

beijing %>% 
  slice(1:300) %>% 
  ggplot(data = ., mapping = aes(x = 1:300, y = DEWP)) + 
  geom_line() + 
  labs(x = "Measurement number", y = "Dew point (C)")

We see that independence is violated. Although we used a graph to help us determine this visually, this was only because our data happened to be time-series data. In general, there is no plot that we will always be able to use to diagnose independence.

Normality of errors

We can simply plot a histogram of the residuals and assess visually. There are more formal methods, but for our purposes this will suffice. Note our use of the .resid variable from the augmented model results:

augment(mod2) %>% 
  ggplot(data = ., mapping = aes(x = .resid)) + 
  geom_histogram() +
  labs(x = "Residual", y = "Count",
       title = "Residuals are left-skewed")

Do the residuals appear to be normally distributed?

No - they are left-skewed!

The residual plot

One of the most useful diagnostic plots is the residual plot, which plots the fitted values against the residuals. Let’s take a look:

augment(mod2) %>% 
  ggplot(data = ., mapping = aes(x = .fitted, y = .resid)) + 
  geom_point() + 
  geom_hline(yintercept = 0, color = "red") + 
  labs(x = "Fitted value", y = "Residual")

Ideally, the residual plot should be a “shapeless blob” with no real pattern. A pattern would be indicative of potential violation of our model assumptions.

  • If the residuals are not symmetric about the x-axis, then we might have a violation of linearity. This appears to be the case in our residual plot.

  • If the residuals do not have similar vertical spread for all fitted values, then we might have a violation of constant error variance (homoscedasticity of errors). This might result in a “fanning” pattern. Our residual plot doesn’t suggest that this assumption is violated.

Do linearity and homogeneity of error variance appear satisfied in our plot? Why or why not?

Linearity does not appear to be satisfied; we see evidence of asymmetry around the x-axis (examine the ranges of the residuals above and below this line, for instance). However, constant variance does look to be an appropriate assumption: at every vertical “slice” of the data, the spread of the residuals appears to be relatively similar.

Interaction effects

Sometimes, the effect of one variable depends on the value of another. For example, the effect of exercise on obesity may be different for smokers vs. non-smokers. To model such a relationship, we create an interaction term.

We fit interaction terms in the model by multiplying predictors together. Note the syntax used in the code below for the interaction between TEMP and RAIN:

mod3 <- lm(DEWP ~ PRES + TEMP + RAIN + TEMP*RAIN, data = beijing)

For a given PRES, TEMP, and RAIN, our predicted DEWP is given by \[\begin{align*} \widehat{DEWP} &= \hat{\beta}_0 + \hat{\beta}_1PRES + \hat{\beta}_2TEMP +\\ &\mathrel{\phantom{=}} \hat{\beta}_3RAIN + \hat{\beta}_4TEMP \times RAIN \end{align*}\]

For a day with the same PRES and TEMP but one higher unit of RAIN, the predicted DEWP is given by \[\begin{align*} \widehat{DEWP}^\star &= \hat{\beta}_0 + \hat{\beta}_1PRES + \hat{\beta}_2TEMP +\\ &\mathrel{\phantom{=}} \hat{\beta}_3(RAIN + 1) + \hat{\beta}_4TEMP \times (RAIN + 1) \end{align*}\]

Subtracting the first equation from the second, we have \[\widehat{DEWP}^\star - \widehat{DEWP} = \hat{\beta}_3 + \hat{\beta}_4 TEMP\] That is, the expected change in dew point for a one mm increase in precipitation, holding pressure constant, is \(\hat{\beta}_3 + \hat{\beta}_4 \times TEMP\).

The effect of precipitation on the dew point, adjusting for pressure, depends on temperature!

Let’s take a look at the tidy output:

tidy(mod3)

What is the estimated expected change in dew point given a one mm increase in precipitation, if we hold barometric pressure constant? (Hint: as stated before, it depends on the temperature!)

The answer is the main effect of precipitation plus the interaction coefficient multiplied by the observed temperature: \(\hat{\beta}_3 + \hat{\beta}_4 \times TEMP\) = 9.01 - 0.35 \(\times\) TEMP.
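One way to compute this directly from the fitted model object (a quick sketch; the coefficient names follow R’s default term labels):

# estimated change in DEWP for a +1 mm increase in RAIN, at selected temperatures
b <- coef(mod3)
b["RAIN"] + b["TEMP:RAIN"] * c(-10, 0, 10, 20)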

On your own!

Exercise 2

Recently, atmospheric pollution in terms of fine particulate matter (PM2.5) has been of concern. Fit a model that predicts PM2.5 levels based on current weather patterns, and assess any assumptions needed for your linear model. Is there evidence that any weather-related variables are linearly associated with PM2.5 levels, conditional on levels of the other predictors?

You may use the following space to code your answer:

___ <- lm(___ ~ ___ + ___ + ..., data = ___)
  
tidy(___)

# Histogram of residuals

augment(___) %>% 
  ggplot(data = ., mapping = aes(x = ___)) + 
  geom_histogram() +
  labs(x = "Residual", y = "Count",
       title = "___")

# Residual plot

augment(___) %>% 
  ggplot(data = ., mapping = aes(x = ___, y = ___)) + 
  geom_point() + 
  geom_hline(yintercept = 0, color = "red") + 
  labs(x = "Fitted value", y = "Residual",
       title = "___")