Why model data?

We use models to explain the relationship between variables and to make predictions. For now, we will focus on linear models (but remember that there are other types of models too!).

Today, we will focus on fitting and interpreting models with a continuous outcome variable and a single predictor. To do so, we will use the broom package, which takes the messy output of built-in functions in R and turns it into tidy data frames. As expected, it integrates well with the tidyverse, especially with data wrangling using dplyr.

Today’s data

Today’s data come from a series of atmospheric measurements taken at the Agriculture Exhibition Hall in Beijing from 2013 to 2017. The following variables are in the dataset:

  • year, month, day, and hour: when the observation was taken
  • PM2.5 and PM10: levels of particulate matter smaller than 2.5 and 10 \(\mu\)m in diameter, respectively, in \(\mu\)g/m\(^3\)
  • SO2, NO2, CO, O3: sulfur dioxide, nitrogen dioxide, carbon monoxide, and ozone concentrations, in \(\mu\)g/m\(^3\), respectively
  • TEMP: temperature, in degrees C
  • PRES: barometric pressure, in hPa
  • DEWP: dew point temperature, in degrees C
  • RAIN: precipitation levels, in mm
  • wd: wind direction (eight compass directions)
  • WSPM: wind speed, in m/s

These data are adapted from a dataset released by Song Xi Chen as referenced in Zhang S., et al (2017), Proc. Royal Soc. A. Let’s explore some of these data.

We can load in the dataset using the following code:

library(tidyverse)
beijing <- read_csv("data/beijing.csv")

Note that the read_csv function is part of the tidyverse, which must first be loaded, as shown above. As well, the path to the data may differ depending on how your directories are set up.
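If you’d like a quick look at the variables before plotting, glimpse() (loaded with the tidyverse) prints each column along with its type:

glimpse(beijing)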

Brief initial EDA

Let’s do some initial EDA to examine variables in our dataset. Code has been provided for the graphs to help reinforce your ggplot2 skills!

Describe the distribution of dew point as measured at this station (in degrees C):

ggplot(data = beijing, aes(x = DEWP)) +
  geom_histogram() +
  labs(y = "Count")

Describe the distribution of barometric pressure (in hectopascals):

ggplot(data = beijing, aes(x = PRES)) +
  geom_histogram() +
  labs(y = "Count")
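
As a numeric complement to the histograms, here is a quick sketch of summary statistics for both variables (this assumes the column names listed above; na.rm = TRUE guards against missing measurements):

beijing %>%
  summarize(
    mean_dewp = mean(DEWP, na.rm = TRUE),
    sd_dewp   = sd(DEWP, na.rm = TRUE),
    mean_pres = mean(PRES, na.rm = TRUE),
    sd_pres   = sd(PRES, na.rm = TRUE)
  )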

Models as functions

We can represent relationships between variables using functions. A function, in the mathematical sense, is a relationship between one or more inputs and an output created from those inputs. Essentially, we can “plug in” the inputs and receive back the output.

For instance, the formula \(y = 3x + 7\) is a function with input \(x\) and output \(y\). When \(x\) is \(5\), the output \(y\) is \(22\):

x <- 5
y <- 3*x + 7

y
## [1] 22

In statistical models, the response variable (dependent variable, or outcome variable) is the variable whose behavior or variation we are trying to understand, and is generally denoted by \(y\). The explanatory variable (independent variable, or predictor variable) is the variable we want to use to explain the variation in the response, generally denoted by \(x\).

Examine the following plot, which depicts the relationship between dew point and barometric pressure. Describe the relationship between these two variables.

ggplot(data = beijing, aes(x = PRES, y = DEWP)) +
  geom_point() +
  ylim(c(-40, 30))

Now let’s overlay a line on this relationship.

What are the response and explanatory variables? Does the line describe the relationship between these two variables “well”?

ggplot(data = beijing, aes(x = PRES, y = DEWP)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  ylim(c(-40, 30))

The response variable is dew point, and the explanatory variable is barometric pressure. The line does seem to describe this relationship well; we will talk more about what this means later on in this tutorial and the tutorial for multiple regression, diagnostics, and predictions.

The blue line is a function that we can use to obtain predicted values of our response given values of our predictor. These predicted values are the output of the model function; that is, the expected value of the response given a certain value of the explanatory variable. However, we know that there is variability around our predictions (compare the line to the actual values). The residual is the difference between the observed value of the response variable in the dataset and the value predicted by the model. It is essentially the vertical distance between each point and the line.

What does a negative residual mean? Which points on the plot above have them?

A negative residual means that our predicted value is greater than the observed value of the response variable. Points on the plot below the blue line have negative residuals.
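
As a concrete sketch (the numbers here are made up for illustration, not taken from the data):

observed  <- 2   # hypothetical observed dew point, in degrees C
predicted <- 5   # hypothetical value predicted by the line
observed - predicted  # residual of -3: the point lies below the line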

Interpreting linear models

As previously mentioned, we will be using the broom package. Some of the most commonly used functions are as follows:

  • tidy: Constructs a tidy data frame summarizing a model’s statistical findings
  • glance: Constructs a concise one-row summary of the model
  • augment: Adds columns (e.g. predictions, residuals) to the original data that was modeled

Today, we will focus on making tidy regression output for linear models and glancing at a model summary.

Let’s create a linear model that predicts the dew point given the current barometric pressure. We first use the lm function to create a linear model object. Let’s create and save a model, mod1, for use in later downstream analyses:

library(broom)
mod1 <- lm(DEWP ~ PRES, data = beijing)

Now let’s take a look at the tidy output for our object as provided by the broom package:

tidy(mod1)

This output is a tidy data frame! Each row corresponds to a parameter estimated in the model (here, the intercept and slope), and each column is a variable corresponding to a statistical measure (the estimate, standard error, p-value, etc.).
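
Because tidy() returns an ordinary data frame, we can manipulate it with dplyr like any other tibble. For instance, a sketch that pulls out just the slope estimate:

tidy(mod1) %>%
  filter(term == "PRES") %>%
  pull(estimate)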

We write the fitted model as follows:

\[\widehat{DEWP} = 1071 - 1.08 \times PRES\]

That is, the predicted dew point in degrees Celsius is given by 1071 minus 1.08 times the barometric pressure. We can interpret the intercept and slope coefficients as follows:

  • Intercept: if the barometric pressure is 0, we would expect the dew point to be 1071 degrees Celsius, on average (is this a meaningful quantity to estimate…?)

  • Slope: for each additional hectopascal of barometric pressure, we would expect the dew point to decrease by 1.08 degrees Celsius

Note that we “expect” these relationships to hold, but there will be some variability!
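
To see the fitted model in action, we can compute a prediction at a specific pressure. The input of 1013 hPa below is just an illustrative value (roughly standard sea-level pressure), nothing special in the data:

predict(mod1, newdata = data.frame(PRES = 1013))
# with the rounded coefficients, this is 1071 - 1.08 * 1013, about -23 degrees C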

In general, we assume the following underlying simple linear model for a single predictor and a single continuous outcome:

\[y_i = \beta_0 + \beta_1x_i + \epsilon_i\]

\(\beta_0\) and \(\beta_1\) are parameters which aren’t observed (neither is the error, \(\epsilon_i\)). When we estimate these parameters, we denote the estimates with hats: \(\hat{\beta}_0\) and \(\hat{\beta}_1\). We often use ordinary least squares (OLS) to obtain these estimates, which give the fitted model

\[\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1x_i\]

What is different between the underlying linear model and the fitted model?

The underlying linear model is that which is assumed to hold; it contains population parameters and specifies a form for the error term. The fitted model is one that is fit on the observed data (according to some specified algorithm). When specifying the fitted model, we specify model estimates for the predicted response variable. There is no error term in the fitted model.
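
If you want to peek at the fitted values \(\hat{y}_i\) themselves, base R’s fitted function returns them without needing broom (a quick sketch):

head(fitted(mod1))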

Remember that residuals are the differences between the observed response and the predicted response values from the model:

\[\hat{\epsilon}_i = y_i - (\hat{\beta}_0 + \hat{\beta}_1x_i) = y_i - \hat{y}_i\]

OLS selects \(\hat{\beta}_0\) and \(\hat{\beta}_1\) such that the sum of squared residuals is minimized. Graphically, each residual is the vertical distance between an observed point and the fitted line.
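
To make “minimized” concrete, here is a sketch that compares the sum of squared residuals at the OLS estimates against a slightly perturbed slope; the OLS value should be the smaller of the two (the helper ssr is ours for illustration, not part of any package):

ssr <- function(b0, b1) {
  sum((beijing$DEWP - (b0 + b1 * beijing$PRES))^2, na.rm = TRUE)
}

ssr(coef(mod1)[1], coef(mod1)[2])        # SSR at the OLS estimates
ssr(coef(mod1)[1], coef(mod1)[2] + 0.1)  # SSR with a perturbed slope: larger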

You will learn how to calculate residuals by using the augment function in the tutorial for multiple linear regression and model diagnostics.

Statistical inference

You might have noticed a p-value corresponding to each of the parameter estimates \(\hat{\beta}_0\) and \(\hat{\beta}_1\). In regression models, p-values generally correspond to testing the null hypothesis \(H_0: \beta = 0\) against the alternative hypothesis \(H_1: \beta \neq 0\).

At the \(\alpha = 0.05\) level, is there sufficient evidence to suggest that \(\beta_0 \neq 0\)? How about \(\beta_1\)? (look at the p-values to tell!)
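
The p-values live in the p.value column of the tidy output, so a quick dplyr sketch lines them up with their terms:

tidy(mod1) %>%
  select(term, estimate, p.value)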

We can also get model-level summaries by using the glance function on our model object:

glance(mod1)

\(R^2\) (r.squared in the glance output) tells us the percentage of variability in the response variable that is explained by our predictors. In our example, we see that approximately 60% of the variability in dew point can be explained by its linear relationship with barometric pressure.
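
Like tidy(), glance() returns a data frame, so we can extract \(R^2\) directly (a sketch using dplyr’s pull):

glance(mod1) %>%
  pull(r.squared)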

On your own!

Create a linear model that predicts one of the atmospheric pollutants (either PM2.5, PM10, SO2, NO2, CO, or O3) based on one of the meteorological measurements (i.e., anything except the variables recording when the observation was taken). Interpret the estimated intercept and slope from your model. What do you notice?

___ <- lm(___ ~ ___, data = ___)
  
tidy(___)