Irregular data
NOTE: This file has been copied from an early draft of the tutorial and still needs to be edited slightly to be fully stand-alone.
We first need to load the tidyverse library:
library(tidyverse)
## -- Attaching packages ---------------------------------------- tidyverse 1.3.0 --
## v ggplot2 3.3.2 v purrr 0.3.4
## v tibble 3.0.3 v dplyr 1.0.2
## v tidyr 1.1.2 v stringr 1.4.0
## v readr 1.3.1 v forcats 0.5.0
## -- Conflicts ---------------------------------------------------------------------------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
and the data:
fieldData <- read_csv('DamariscottaRiverData.csv')
## Parsed with column specification:
## cols(
## date = col_double(),
## station = col_double(),
## depthbin = col_double(),
## year = col_double(),
## month = col_double(),
## day = col_double(),
## depth_m = col_double(),
## temperature_degC = col_double(),
## salinity_psu = col_double(),
## density_kg_m3 = col_double(),
## PAR = col_double(),
## fluorescence_mg_m3 = col_double(),
## oxygenConc_umol_kg = col_double(),
## oxygenSaturation_percent = col_double(),
## latitude = col_double()
## )
The approach in the previous section works well when your data are consistent in terms of the variables you want to compare. For the data we plotted above, there were five different cruises, and on each cruise, data were collected at the same four locations. We had data for every cruise and station. If we’d been missing data at one of the stations on one of the cruises, we’d just have a blank part on the plots we made. But what if we were missing a lot of data - would the above approach still be a good way to visualize our data?
To dig into this a bit further, let’s consider how chlorophyll fluorescence varies by depth at each station for one cruise.
We need to do a bit of data manipulation again to create a data frame we’ll use for plotting. In this case, we’re going to have station on the x-axis and depth on the y-axis, and we’ll consider the cruise that took place on September 8th 2016.
cruiseData <- filter(fieldData, date == 20160908)
We now have a data frame that includes columns for depth, station and chlorophyll fluorescence for just one cruise - so let’s use the same approach as before to create a contour plot:
ggplot(cruiseData,aes(x=station,y=depth_m)) +
geom_contour_filled(aes(z=fluorescence_mg_m3)) +
geom_point() +
labs(fill='chlorophyll fluorescence (mg m^-3)') +
scale_y_reverse() +
theme(panel.background = element_rect(fill = "white", colour = "white"))
## Warning: stat_contour(): Zero contours were generated
## Warning in min(x): no non-missing arguments to min; returning Inf
## Warning in max(x): no non-missing arguments to max; returning -Inf

We end up with no contours! What’s going on here? To draw the contours, R needs the y values to all be at the same intervals (and similarly for the x values), so that the points form a regular grid. For our data, the depths at each station are irregular and different from each other:
ggplot(cruiseData, aes(x=fluorescence_mg_m3, y=depth_m, color = factor(station))) +
geom_point() +
labs(color='Station') +
ylim(5,1) + xlim(3,9)
## Warning: Removed 174 rows containing missing values (geom_point).
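The same irregularity can also be checked numerically rather than by eye. The sketch below uses a small made-up data frame (toy, with invented depths, not the Damariscotta River data) to count how many stations share each exact depth, before and after rounding to the nearest meter.

```r
library(dplyr)

# Hypothetical toy profiles: two stations sampled at slightly different depths
toy <- tibble(
  station = c(1, 1, 1, 2, 2, 2),
  depth_m = c(1.02, 2.11, 2.95, 0.97, 1.88, 3.20)
)

# Count how many stations report each exact depth
shared_raw <- toy %>%
  group_by(depth_m) %>%
  summarize(n_stations = n_distinct(station))

# No exact depth is sampled at both stations, so there is no common grid
max(shared_raw$n_stations)
## [1] 1

# After rounding to the nearest meter, every depth is shared by both stations
shared_rounded <- toy %>%
  mutate(depthBin = round(depth_m)) %>%
  group_by(depthBin) %>%
  summarize(n_stations = n_distinct(station))

min(shared_rounded$n_stations)
## [1] 2
```

In this toy case, rounding is enough to put both stations on the same set of depths, which is the idea behind the binning that follows.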

So we need to sort the depth data onto a regular grid. To do this, we will group the data into depth bins (or depth ranges) and then calculate the mean for each depth bin. Let’s bin our data into 1 m intervals. Again, we’re making a decision here based on our particular data set and situation; the right bin size could be different for you.
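If 1 m bins are too fine or too coarse for your data, the same rounding idea can be adapted to other bin widths. Here is a minimal base R sketch with made-up depths; bin_width is an illustrative name and value, not something used elsewhere in this tutorial.

```r
# Hypothetical depths in meters, for illustration only
depth_m <- c(0.4, 1.2, 1.7, 2.6, 3.1)

# 1 m bins, as used in this tutorial
bin_1m <- round(depth_m)
bin_1m
## [1] 0 1 2 3 3

# 2 m bins: scale down, round, then scale back up
bin_width <- 2  # assumed example value
bin_2m <- round(depth_m / bin_width) * bin_width
bin_2m
## [1] 0 2 2 2 4
```

Wider bins give fewer, smoother averages; narrower bins keep more vertical detail but are more likely to leave empty cells.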
We are going to use a very similar process to earlier (when we considered surface chlorophyll fluorescence on all cruises in 2016).
- Round the depths to the nearest meter and include as a column in the data frame (use mutate)
- Separate the data into depth bin and station groups (use group_by)
- Take the average for each group of data (use summarize)
binned <- cruiseData %>%
  mutate(depthBin = round(depth_m)) %>%
  group_by(station, depthBin) %>%
  summarize(av_fluor = mean(fluorescence_mg_m3))
## `summarise()` regrouping output by 'station' (override with `.groups` argument)
head(binned)
## # A tibble: 6 x 3
## # Groups:   station [1]
##   station depthBin av_fluor
##     <dbl>    <dbl>    <dbl>
## 1       1        1     4.31
## 2       1        2     4.60
## 3       1        3     4.54
## 4       1        4     4.52
## 5       1        5     4.92
## 6       1        6     4.73
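The regrouping message from summarize is informational, and the result above is still grouped by station. As an optional variant, the sketch below runs the same binning steps on a small made-up data frame (toy, not the real cruise data), using the .groups argument to return an ungrouped result and na.rm = TRUE so that a missing reading does not turn the whole bin average into NA. Whether the real data contain missing fluorescence values is an assumption here, not something shown above.

```r
library(dplyr)

# Hypothetical toy data with one missing fluorescence value
toy <- tibble(
  station = c(1, 1, 1, 2, 2),
  depth_m = c(0.8, 1.1, 1.3, 1.0, 1.9),
  fluorescence_mg_m3 = c(4.0, 5.0, NA, 3.0, 6.0)
)

toy_binned <- toy %>%
  mutate(depthBin = round(depth_m)) %>%
  group_by(station, depthBin) %>%
  summarize(
    av_fluor = mean(fluorescence_mg_m3, na.rm = TRUE),  # skip NA readings
    .groups = "drop"                                     # return an ungrouped tibble
  )

toy_binned$av_fluor
## [1] 4.5 3.0 6.0
```

Note that a bin containing only NA readings would still give NaN with na.rm = TRUE, so it is worth checking the result for such cells.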
We’ve now got a data frame like we had before - let’s try geom_contour_filled again:
ggplot(binned,aes(x=station,y=depthBin)) +
geom_contour_filled(aes(z=av_fluor)) +
geom_point() +
labs(fill='chlorophyll fluorescence (mg m^-3)') +
scale_y_reverse()
This looks better, but geom_contour_filled does not interpolate across missing data points. We know we have data at station 4 below 50 m that aren’t represented in this plot. Can we use a different function to show those data too?
ggplot(binned,aes(x=station,y=depthBin)) +
geom_tile(aes(fill=av_fluor)) +
scale_fill_continuous() +
labs(fill='chlorophyll fluorescence (mg m^-3)') +
scale_y_reverse()
All the data are visualized when we plot the data this way.
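To see exactly which station and depth combinations have no data (the blank tiles in a plot like this), one option is to expand the binned data to a full grid with complete from tidyr. This is a sketch on a made-up data frame (toy_binned, with invented values), not the result from the real cruise.

```r
library(dplyr)
library(tidyr)

# Hypothetical binned data: station 2 is sampled deeper than station 1
toy_binned <- tibble(
  station  = c(1, 1, 2, 2, 2),
  depthBin = c(1, 2, 1, 2, 3),
  av_fluor = c(4.3, 4.6, 5.1, 4.9, 3.2)
)

# Expand to every station/depth combination; unsampled cells get NA
full_grid <- toy_binned %>%
  complete(station, depthBin)

# List the cells with no data
missing_cells <- filter(full_grid, is.na(av_fluor))
nrow(missing_cells)
## [1] 1
```

In the toy case, the only empty cell is station 1 at depth bin 3. If you try this on the binned data from above, call ungroup() first, since that data frame is still grouped by station and complete would otherwise fill in combinations within each group only.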