-
Notifications
You must be signed in to change notification settings - Fork 0
Scatter plots and linear regressions
Often we visualize relationships between variables with scatter plots
and linear regressions. This tutorial provides an overview on using
ggplot to create scatter plots for oceanographic data, but with a
focus on linear regressions and including regression lines on scatter
plots.
We’ll use the Damariscotta River Data: DamariscottaRiverData.csv. See
the main tutorial for more details on this data set.
The file is in the main GitHub repository - make sure it’s downloaded and saved in your working directory.
First, we need to import the tidyverse and our data.
library(tidyverse)## -- Attaching packages ------------------------------------------------------------------------------------------------------------------------- tidyverse 1.3.0 --
## v ggplot2 3.3.2 v purrr 0.3.4
## v tibble 3.0.3 v dplyr 1.0.2
## v tidyr 1.1.2 v stringr 1.4.0
## v readr 1.3.1 v forcats 0.5.0
## -- Conflicts ---------------------------------------------------------------------------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
fieldData <- read_csv('DamariscottaRiverData.csv')## Parsed with column specification:
## cols(
## date = col_double(),
## station = col_double(),
## depthbin = col_double(),
## year = col_double(),
## month = col_double(),
## day = col_double(),
## depth_m = col_double(),
## temperature_degC = col_double(),
## salinity_psu = col_double(),
## density_kg_m3 = col_double(),
## PAR = col_double(),
## fluorescence_mg_m3 = col_double(),
## oxygenConc_umol_kg = col_double(),
## oxygenSaturation_percent = col_double(),
## latitude = col_double()
## )
Let’s look at how chlorophyll fluorescence is related to temperature. Note, we’re adding our own axes labels and setting the limits of the x and y axes.
ggplot(data = fieldData, mapping = aes(x = temperature_degC, y = fluorescence_mg_m3)) +
geom_point() +
xlab('Temperature (deg C)') +
ylab('Fluorescence (mg/m^3)') +
xlim(7,18) + ylim(0,12)## Warning: Removed 40 rows containing missing values (geom_point).

In the above figure, we have a lot of fluorescence data below 2.5 mg m − 3 - in fact, the majority of our data are below that value. In this case (and as often happens with geophysical data), it is more useful to plot (& perform statistics on) the log-transformed data, so we can better understand the relationship between the parameters.
Let’s plot the relationship between temperature and log-transformed fluorescence:
ggplot(data = fieldData, mapping = aes(x = temperature_degC, y = fluorescence_mg_m3)) +
geom_point() +
xlab('Temperature (deg C)') +
ylab('Fluorescence (mg/m^3)') +
xlim(7,18) + scale_y_log10(limits=c(0.07,12)) ## Warning: Removed 41 rows containing missing values (geom_point).

OK great - so we can visualize the relationship between fluorescence and temperature. But how do we include a linear regression?
To do a linear regression, we need to fit a linear model to our data.
Note: we are not going into the statistics here, or details of whether it’s appropriate to do a linear regression on these data - we are just showing you how you would do a linear regression with your own data if you so wished.
Let’s start with looking at fluorescence and temperature. To visualize
the linear model, we can add in geom_smooth(model=lm) to our plot:
ggplot(data = fieldData, mapping = aes(x = temperature_degC, y = fluorescence_mg_m3)) +
geom_point() +
geom_smooth(method='lm') +
xlab('Temperature (deg C)') +
ylab('Fluorescence (mg/m^3)') +
xlim(7,18) + ylim(0,12)## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 40 rows containing non-finite values (stat_smooth).
## Warning: Removed 40 rows containing missing values (geom_point).
## Warning: Removed 11 rows containing missing values (geom_smooth).

But how do we get the equation and goodness of fit statistics?
# fitting a linear model between the two variables
model1 <- lm(fieldData$fluorescence_mg_m3 ~ fieldData$temperature_degC)
#display results of linear model
summary(model1)##
## Call:
## lm(formula = fieldData$fluorescence_mg_m3 ~ fieldData$temperature_degC)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.4802 -0.5760 -0.1582 0.2458 8.3661
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.731450 0.091313 -29.91 <2e-16 ***
## fieldData$temperature_degC 0.331343 0.007558 43.84 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1 on 3046 degrees of freedom
## Multiple R-squared: 0.3869, Adjusted R-squared: 0.3867
## F-statistic: 1922 on 1 and 3046 DF, p-value: < 2.2e-16
How do we access these different values and assign them to objects?
# assigning the slope and intercept fromt the model to variables
intercept <- model1$coefficients[1]
slope <- model1$coefficients[2]
sum_model1 <- summary(model1)
rsquare <- sum_model1$r.squared
adjrsquare <- sum_model1$adj.r.squared
std_error <- sum_model1$sigmaOnce we have our linear model (i.e. the equation of our line), there are
other ways to add the line to the figure: geom_abline and
stat_function. Let’s look at stat_function. To use stat_function,
we need to provide it with a function that describes the equation of our
model. In this example, we used a linear model to find a straight line,
so we need to give stat_function a function that describes the
equation of a line (y = mx + b):
ggplot(data = fieldData, mapping = aes(x = temperature_degC, y = fluorescence_mg_m3)) +
geom_point() +
stat_function(fun = function(x) intercept + slope * x, size = 2, color = 'cyan') +
xlab('Temperature (deg C)') +
ylab('Fluorescence (mg/m^3)') +
xlim(7,18) + ylim(0,12)## Warning: Removed 40 rows containing missing values (geom_point).
## Warning: Removed 12 row(s) containing missing values (geom_path).

Why is this useful? Why don’t we just always use geom_smooth? One of
the benefits of stat_function is that we can use it to plot any kind
of line - it doesn’t need to be a straight line. For example, say we
wanted to fit a 2nd order polynomial (a quadratic) to our data we could
do the following:
# fitting a linear model between the two variables
model2 <- lm(fieldData$fluorescence_mg_m3 ~ fieldData$temperature_degC + I(fieldData$temperature_degC^2))
# assigning the coefficients from the models to the variables
p2 <- model2$coefficients[3]
p1 <- model2$coefficients[2]
p0 <- model2$coefficients[1]
ggplot(data = fieldData, mapping = aes(x = temperature_degC, y = fluorescence_mg_m3)) +
geom_point() +
stat_function(fun = function(x) intercept + slope * x, size = 2, aes(color='1st order')) +
stat_function(fun = function(x) p0 + p1 * x + p2 * (x^2), size = 2, aes(color='2nd order')) +
xlab('Temperature (deg C)') +
ylab('Fluorescence (mg/m^3)') +
xlim(7,18) + ylim(0,12)## Warning: Removed 40 rows containing missing values (geom_point).
## Warning: Removed 12 row(s) containing missing values (geom_path).

Or something even more complicated, such as a model between the temperature and log-transformed fluorescence:
modelLog <- lm(log10(fieldData$fluorescence_mg_m3) ~ fieldData$temperature_degC)
interceptLog <- modelLog$coefficients[1]
slopeLog <- modelLog$coefficients[2]
ggplot(data = fieldData, mapping = aes(x = temperature_degC, y = fluorescence_mg_m3)) +
geom_point() +
stat_function(fun = function(x) intercept + slope * x, size = 2, aes(color='1st order')) +
stat_function(fun = function(x) p0 + p1 * x + p2 * (x^2), size = 2, aes(color='2nd order')) +
stat_function(fun = function(x) 10**interceptLog * (10**(slopeLog * x)), size = 2, aes(color = 'power')) +
xlab('Temperature (deg C)') +
ylab('Fluorescence (mg/m^3)') +
xlim(7,18) + ylim(0,12)## Warning: Removed 40 rows containing missing values (geom_point).
## Warning: Removed 12 row(s) containing missing values (geom_path).
