Skip to content

Scatter plots and linear regressions

cathmmitchell edited this page Mar 28, 2021 · 1 revision

Scatter plots and linear regressions

Often we visualize relationships between variables with scatter plots and linear regressions. This tutorial provides an overview on using ggplot to create scatter plots for oceanographic data, but with a focus on linear regressions and including regression lines on scatter plots.

Data

We’ll use the Damariscotta River Data: DamariscottaRiverData.csv. See the main tutorial for more details on this data set.

The file is in the main GitHub repository - make sure it’s downloaded and saved in your working directory.

First, we need to import the tidyverse and our data.

library(tidyverse)
## -- Attaching packages ------------------------------------------------------------------------------------------------------------------------- tidyverse 1.3.0 --

## v ggplot2 3.3.2     v purrr   0.3.4
## v tibble  3.0.3     v dplyr   1.0.2
## v tidyr   1.1.2     v stringr 1.4.0
## v readr   1.3.1     v forcats 0.5.0

## -- Conflicts ---------------------------------------------------------------------------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
fieldData <- read_csv('DamariscottaRiverData.csv')
## Parsed with column specification:
## cols(
##   date = col_double(),
##   station = col_double(),
##   depthbin = col_double(),
##   year = col_double(),
##   month = col_double(),
##   day = col_double(),
##   depth_m = col_double(),
##   temperature_degC = col_double(),
##   salinity_psu = col_double(),
##   density_kg_m3 = col_double(),
##   PAR = col_double(),
##   fluorescence_mg_m3 = col_double(),
##   oxygenConc_umol_kg = col_double(),
##   oxygenSaturation_percent = col_double(),
##   latitude = col_double()
## )

Scatter plots

Let’s look at how chlorophyll fluorescence is related to temperature. Note, we’re adding our own axes labels and setting the limits of the x and y axes.

ggplot(data = fieldData, mapping = aes(x = temperature_degC, y = fluorescence_mg_m3)) + 
  geom_point() +
  xlab('Temperature (deg C)') +
  ylab('Fluorescence (mg/m^3)') +
  xlim(7,18) + ylim(0,12)
## Warning: Removed 40 rows containing missing values (geom_point).

In the above figure, we have a lot of fluorescence data below 2.5 mg m − 3 - in fact, the majority of our data are below that value. In this case (and as often happens with geophysical data), it is more useful to plot (& perform statistics on) the log-transformed data, so we can better understand the relationship between the parameters.

Let’s plot the relationship between temperature and log-transformed fluorescence:

ggplot(data = fieldData, mapping = aes(x = temperature_degC, y = fluorescence_mg_m3)) + 
  geom_point() +
  xlab('Temperature (deg C)') +
  ylab('Fluorescence (mg/m^3)') +
  xlim(7,18) +  scale_y_log10(limits=c(0.07,12)) 
## Warning: Removed 41 rows containing missing values (geom_point).

OK great - so we can visualize the relationship between fluorescence and temperature. But how do we include a linear regression?

Linear regressions

To do a linear regression, we need to fit a linear model to our data.

Note: we are not going into the statistics here, or details of whether it’s appropriate to do a linear regression on these data - we are just showing you how you would do a linear regression with your own data if you so wished.

Let’s start with looking at fluorescence and temperature. To visualize the linear model, we can add in geom_smooth(model=lm) to our plot:

ggplot(data = fieldData, mapping = aes(x = temperature_degC, y = fluorescence_mg_m3)) + 
  geom_point() +
  geom_smooth(method='lm') +
  xlab('Temperature (deg C)') +
  ylab('Fluorescence (mg/m^3)') +
  xlim(7,18) + ylim(0,12)
## `geom_smooth()` using formula 'y ~ x'

## Warning: Removed 40 rows containing non-finite values (stat_smooth).

## Warning: Removed 40 rows containing missing values (geom_point).

## Warning: Removed 11 rows containing missing values (geom_smooth).

But how do we get the equation and goodness of fit statistics?

# fitting a linear model between the two variables
model1 <- lm(fieldData$fluorescence_mg_m3 ~ fieldData$temperature_degC)

#display results of linear model
summary(model1)
## 
## Call:
## lm(formula = fieldData$fluorescence_mg_m3 ~ fieldData$temperature_degC)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.4802 -0.5760 -0.1582  0.2458  8.3661 
## 
## Coefficients:
##                             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                -2.731450   0.091313  -29.91   <2e-16 ***
## fieldData$temperature_degC  0.331343   0.007558   43.84   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1 on 3046 degrees of freedom
## Multiple R-squared:  0.3869, Adjusted R-squared:  0.3867 
## F-statistic:  1922 on 1 and 3046 DF,  p-value: < 2.2e-16

How do we access these different values and assign them to objects?

# assigning the slope and intercept fromt the model to variables
intercept <- model1$coefficients[1]
slope <- model1$coefficients[2]

sum_model1 <- summary(model1)

rsquare <- sum_model1$r.squared
adjrsquare <- sum_model1$adj.r.squared
std_error <- sum_model1$sigma

Once we have our linear model (i.e. the equation of our line), there are other ways to add the line to the figure: geom_abline and stat_function. Let’s look at stat_function. To use stat_function, we need to provide it with a function that describes the equation of our model. In this example, we used a linear model to find a straight line, so we need to give stat_function a function that describes the equation of a line (y = mx + b):

ggplot(data = fieldData, mapping = aes(x = temperature_degC, y = fluorescence_mg_m3)) + 
  geom_point() +
  stat_function(fun = function(x) intercept + slope * x, size = 2, color = 'cyan') +
  xlab('Temperature (deg C)') +
  ylab('Fluorescence (mg/m^3)') +
  xlim(7,18) + ylim(0,12)
## Warning: Removed 40 rows containing missing values (geom_point).

## Warning: Removed 12 row(s) containing missing values (geom_path).

Why is this useful? Why don’t we just always use geom_smooth? One of the benefits of stat_function is that we can use it to plot any kind of line - it doesn’t need to be a straight line. For example, say we wanted to fit a 2nd order polynomial (a quadratic) to our data we could do the following:

# fitting a linear model between the two variables
model2 <- lm(fieldData$fluorescence_mg_m3 ~ fieldData$temperature_degC + I(fieldData$temperature_degC^2))

# assigning the coefficients from the models to the variables
p2 <- model2$coefficients[3]
p1 <- model2$coefficients[2]
p0 <- model2$coefficients[1]

ggplot(data = fieldData, mapping = aes(x = temperature_degC, y = fluorescence_mg_m3)) + 
  geom_point() +
  stat_function(fun = function(x) intercept + slope * x, size = 2, aes(color='1st order')) +
  stat_function(fun = function(x) p0 + p1 * x + p2 * (x^2), size = 2, aes(color='2nd order')) +
  xlab('Temperature (deg C)') +
  ylab('Fluorescence (mg/m^3)') +
  xlim(7,18) + ylim(0,12)
## Warning: Removed 40 rows containing missing values (geom_point).

## Warning: Removed 12 row(s) containing missing values (geom_path).

Or something even more complicated, such as a model between the temperature and log-transformed fluorescence:

modelLog <- lm(log10(fieldData$fluorescence_mg_m3) ~ fieldData$temperature_degC)
interceptLog <- modelLog$coefficients[1]
slopeLog <- modelLog$coefficients[2]

ggplot(data = fieldData, mapping = aes(x = temperature_degC, y = fluorescence_mg_m3)) + 
  geom_point() +
  stat_function(fun = function(x) intercept + slope * x, size = 2, aes(color='1st order')) +
  stat_function(fun = function(x) p0 + p1 * x + p2 * (x^2), size = 2, aes(color='2nd order')) +
  stat_function(fun = function(x) 10**interceptLog * (10**(slopeLog * x)), size = 2, aes(color = 'power')) +
  xlab('Temperature (deg C)') +
  ylab('Fluorescence (mg/m^3)') +
  xlim(7,18) + ylim(0,12)
## Warning: Removed 40 rows containing missing values (geom_point).

## Warning: Removed 12 row(s) containing missing values (geom_path).

Clone this wiki locally