This lesson is still being designed and assembled (Pre-Alpha version)

Manipulating, analyzing and exporting data with tidyverse

Overview

Teaching: min
Exercises: min
Questions
  • What is the tidyverse? Why should I use it to manipulate data?

Objectives
  • Describe the purpose of the dplyr package.

  • Select certain columns in a data frame with the dplyr function select.

  • Select certain rows in a data frame according to filtering conditions with the dplyr function filter.

  • Link the output of one dplyr function to the input of another function with the ‘pipe’ operator %>%.

  • Add new columns to a data frame that are functions of existing columns with mutate.

  • Use summarize, group_by, and count to split a data frame into groups of observations, apply summary statistics for each group, and then combine the results.

  • Export a data frame to a .csv file.

Data Manipulation using dplyr

Bracket subsetting is handy, but it can be cumbersome and difficult to read, especially for complicated operations. dplyr. dplyr is a package for making tabular data manipulation easier.

Packages in R are basically sets of additional functions that let you do more stuff. The functions we’ve been using so far, like str() or data.frame(), come built into R; packages give you access to more of them. Before you use a package for the first time you need to install it on your machine, and then you should import it in every subsequent R session when you need it.

You should already have installed the tidyverse package. This is an “umbrella-package” that installs several packages useful for data analysis which work together well such as tidyr, dplyr, ggplot2, tibble, etc.

The tidyverse package tries to address 3 common issues that arise when doing data analysis with some of the functions that come with R:

  1. The results from a base R function sometimes depend on the type of data.

  2. Using R expressions in a non standard way, which can be confusing for new learners.

  3. Hidden arguments, having default operations that new learners are not aware of.

We have seen in our previous lesson that when building or importing a data frame, the columns that contain characters (i.e., text) are coerced (=converted) into the factor data type. We had to set stringsAsFactors to FALSE to avoid this hidden argument to convert our data type.

This time we will use the tidyverse package to read the data and avoid having to set stringsAsFactors to FALSE

To load the tidyverse package set, including dplyr, type:

## load the tidyverse packages, including dplyr
library("tidyverse")

What is dplyr?

The package dplyr provides easy tools for the most common data manipulation tasks like selecting columns, filtering rows, creating new columns, and creating summary tables.

To learn more about dplyr after the workshop, you may want to check out this handy data transformation with dplyr cheatsheet.

We’ll read in our data using the read_csv() function, from the tidyverse package readr, instead of read.csv().

iris <- read_csv("data/iris.csv")
Parsed with column specification:
cols(
  Sepal.Length = col_double(),
  Sepal.Width = col_double(),
  Petal.Length = col_double(),
  Petal.Width = col_double(),
  Species = col_character()
)
## inspect the data
str(iris)
Classes 'tbl_df', 'tbl' and 'data.frame':	150 obs. of  5 variables:
 $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
 $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
 $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
 $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
 $ Species     : chr  "setosa" "setosa" "setosa" "setosa" ...
 - attr(*, "spec")=List of 2
  ..$ cols   :List of 5
  .. ..$ Sepal.Length: list()
  .. .. ..- attr(*, "class")= chr  "collector_double" "collector"
  .. ..$ Sepal.Width : list()
  .. .. ..- attr(*, "class")= chr  "collector_double" "collector"
  .. ..$ Petal.Length: list()
  .. .. ..- attr(*, "class")= chr  "collector_double" "collector"
  .. ..$ Petal.Width : list()
  .. .. ..- attr(*, "class")= chr  "collector_double" "collector"
  .. ..$ Species     : list()
  .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
  ..$ default: list()
  .. ..- attr(*, "class")= chr  "collector_guess" "collector"
  ..- attr(*, "class")= chr "col_spec"

To preview the data, you can click on the name of the tibble in your environment or use the View function.

## preview the data
View(iris)

Either of these methods will open a window with the dataset in the same area as your script file.

Tibbles

Notice that the class of the data is now tbl_df

This is referred to as a “tibble”. Tibbles tweak some of the behaviors of the data frame objects we introduced in the previous episode. The data structure is very similar to a data frame. For our purposes the only differences are that:

  1. In addition to displaying the data type of each column under its name, it only prints the first few rows of data and only as many columns as fit on one screen.

  2. Columns of class character are never converted into factors.

We’re going to learn some of the most common dplyr functions:

Selecting columns and filtering rows

To select columns of a data frame, use select(). The first argument to this function is the data frame (iris), and the subsequent arguments are the columns to keep.

select(iris, Sepal.Length, Sepal.Width)
# A tibble: 150 x 2
   Sepal.Length Sepal.Width
          <dbl>       <dbl>
 1          5.1         3.5
 2          4.9         3  
 3          4.7         3.2
 4          4.6         3.1
 5          5           3.6
 6          5.4         3.9
 7          4.6         3.4
 8          5           3.4
 9          4.4         2.9
10          4.9         3.1
# … with 140 more rows

To select all columns except certain ones, put a “-“ in front of the variable to exclude it.

select(iris, -Sepal.Length, -Sepal.Width)
# A tibble: 150 x 3
   Petal.Length Petal.Width Species
          <dbl>       <dbl> <chr>  
 1          1.4         0.2 setosa 
 2          1.4         0.2 setosa 
 3          1.3         0.2 setosa 
 4          1.5         0.2 setosa 
 5          1.4         0.2 setosa 
 6          1.7         0.4 setosa 
 7          1.4         0.3 setosa 
 8          1.5         0.2 setosa 
 9          1.4         0.2 setosa 
10          1.5         0.1 setosa 
# … with 140 more rows

This will select all the variables in iris except Sepal.Length and Sepal.Width.

To choose rows based on a specific criteria, use filter():

filter(iris, Sepal.Length>5)
# A tibble: 118 x 5
   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
          <dbl>       <dbl>        <dbl>       <dbl> <chr>  
 1          5.1         3.5          1.4         0.2 setosa 
 2          5.4         3.9          1.7         0.4 setosa 
 3          5.4         3.7          1.5         0.2 setosa 
 4          5.8         4            1.2         0.2 setosa 
 5          5.7         4.4          1.5         0.4 setosa 
 6          5.4         3.9          1.3         0.4 setosa 
 7          5.1         3.5          1.4         0.3 setosa 
 8          5.7         3.8          1.7         0.3 setosa 
 9          5.1         3.8          1.5         0.3 setosa 
10          5.4         3.4          1.7         0.2 setosa 
# … with 108 more rows

Pipes

What if you want to select and filter at the same time? There are three ways to do this: use intermediate steps, nested functions, or pipes.

With intermediate steps, you create a temporary data frame and use that as input to the next function, like this:

iris2 <- filter(iris, Sepal.Length < 5)
iris_small <- select(iris2, Species, Sepal.Length, Sepal.Width)

This is readable, but can clutter up your workspace with lots of objects that you have to name individually. With multiple steps, that can be hard to keep track of.

You can also nest functions (i.e. one function inside of another), like this:

iris_sml <- select(filter(iris, Sepal.Length < 5), Species, Sepal.Length, Sepal.Width)

This is handy, but can be difficult to read if too many functions are nested, as R evaluates the expression from the inside out (in this case, filtering, then selecting).

The last option, pipes, are a recent addition to R. Pipes let you take the output of one function and send it directly to the next, which is useful when you need to do many things to the same dataset. Pipes in R look like %>% and are made available via the magrittr package, installed automatically with dplyr. If you use RStudio, you can type the pipe with Ctrl

iris %>%
  filter(Sepal.Length < 5) %>%
  select(Species, Sepal.Length, Sepal.Width)
# A tibble: 22 x 3
   Species Sepal.Length Sepal.Width
   <chr>          <dbl>       <dbl>
 1 setosa           4.9         3  
 2 setosa           4.7         3.2
 3 setosa           4.6         3.1
 4 setosa           4.6         3.4
 5 setosa           4.4         2.9
 6 setosa           4.9         3.1
 7 setosa           4.8         3.4
 8 setosa           4.8         3  
 9 setosa           4.3         3  
10 setosa           4.6         3.6
# … with 12 more rows

In the above code, we use the pipe to send the iris dataset first through filter() to keep rows where Sepal.Length is less than 5, then through select() to keep only the Sepal.Length, Sepal.Width, and species columns. Since %>% takes the object on its left and passes it as the first argument to the function on its right, we don’t need to explicitly include the data frame as an argument to the filter() and select() functions any more.

Some may find it helpful to read the pipe like the word “then”. For instance, in the above example, we took the data frame iris, then we filtered for rows with Sepal.Length < 5, then we selected columns species, Sepal.Length, and Sepal.Width. The dplyr functions by themselves are somewhat simple, but by combining them into linear workflows with the pipe, we can accomplish more complex manipulations of data frames.

If we want to create a new object with this smaller version of the data, we can assign it a new name:

iris_sml <- iris %>%
  filter(Sepal.Length < 5) %>%
  select(Species, Sepal.Length, Sepal.Width)

iris_sml
# A tibble: 22 x 3
   Species Sepal.Length Sepal.Width
   <chr>          <dbl>       <dbl>
 1 setosa           4.9         3  
 2 setosa           4.7         3.2
 3 setosa           4.6         3.1
 4 setosa           4.6         3.4
 5 setosa           4.4         2.9
 6 setosa           4.9         3.1
 7 setosa           4.8         3.4
 8 setosa           4.8         3  
 9 setosa           4.3         3  
10 setosa           4.6         3.6
# … with 12 more rows

Note that the final data frame is the leftmost part of this expression.

Introducing the a Plant Physiology Environmental dataset

While the iris dataset lets us illustrate many of the features of dplyr, it is not complex enough to illustrate all features. For the challenges and some examples in this lesson, we will be introducing a new dataset, phys_data

Description of phys_data

Plants have small openings on their leaf surfaces called “stomata” that permit the exchange of gases between the plant and the atmosphere during photosynthesis. During the day, plants keep their stomata open in order to obtain carbon dioxide from the atmosphere for photosynthesis. However, open stomata also allow water to escape from the leaf to the atmosphere through the process of transpiration. Plants typically close their stomata at night to preserve plant water when photosynthesis doesn’t occur. However, recent research has shown that many plants actually keep their stomata open at night, which allows nocturnal transpiration to occur. What drives nocturnal transpiration and if there is any ecological significance associated with this process is still unclear. To address this gap in knowledge, O’Keefe and Nippert (2018) aimed to understand the patterns and drivers of nocturnal transpiration in a variety of grassland plants. They measured rates of water loss in 8 different plant species, during the day and at night, over the course of two years. They also measured a variety of other plant physiological traits (photosynthesis, leaf water status) and environmental variables (air temperature, soil moisture, vapor pressure deficit) in order to determine if daytime and nocturnal transpiration are associated with other aspects of plant functioning or environmental drivers. They found that patterns of transpiration differed among species, with grasses exhibiting the highest rates of nocturnal water loss. Daytime and nocturnal transpiration had varying relationships with different environmental drivers, and nocturnal transpiration was strongly associated with predawn leaf water potential (an indicator of soil moisture around plant roots). We suggest that nocturnal transpiration is driven by water availability and may act as a competitive strategy in drought-tolerant grasses to use up water as quickly as possible, so as to create drier growing conditions for their less drought-tolerant neighbors.

Overview of the variables

Variable Description
Year Year of data collection
DOY Day of year of data collection
Species Species (“AG”=Andropogon gerardii, “PV”=Panicum virgatum, “SN”=Sorghastrum nutans, “VB”=Vernonia baldwinii, “SC”=Solidago canadensis, “AC”=Amorpha canescens, “RG”=Rhus glabra, “CD”=Cornus drummondii
Fgroup Plant functional group
Cond_night Nocturnal stomatal conductance (mol m-2s-1)
Trmmol_night Nocturnal transpiration (mmol m-2s-1)
Photo Photosynthesis (μmol CO2 m-2s-1)
Cond_day Diurnal stomatal conductance (mol m-2s-1)
Trmmol_day Diurnal transpiration (mmol m-2s-1)
Percent_cond Percent nocturnal of diurnal stomatal conductance
Percent_trmmol Percent nocturnal of diurnal transpiration
PD Predawn leaf water potential (MPa)
MD Midday leaf water potential (MPa)
Soil_moisture Average daily soil moisture (v/v)
TAIR_N Air temperature (°C) at average time of nocturnal physiology measurements (23:00 h)
VPD_N Vapor pressure deficit (kPa) at average time of nocturnal physiology measurements (23:00 h)
TAIR_D Air temperature (°C) at average time of diurnal physiology measurements (11:00 h)
VPD_D Vapor pressure deficit (kPa) at average time of diurnal physiology measurements (11:00 h)

The only difference is that the date is represented as month, day, and year rather than year and day of year.

Challenge 1

Using pipes, subset the phys_dates data to include observations collected in 2014 and retain only the columns Year, Species, and Photo.

Solution to Challenge 1

phys_dates %>% 
   filter(Year == 2014) %>%
   select(Year, Species, Photo)

Mutate

Frequently you’ll want to create new columns based on the values in existing columns, for example to do unit conversions, or to find the ratio of values in two columns. For this we’ll use mutate().

To create a new column of weight in kg:

iris %>%
  mutate(Sepal.Ratio = Sepal.Length/Sepal.Width)
# A tibble: 150 x 6
   Sepal.Length Sepal.Width Petal.Length Petal.Width Species Sepal.Ratio
          <dbl>       <dbl>        <dbl>       <dbl> <chr>         <dbl>
 1          5.1         3.5          1.4         0.2 setosa         1.46
 2          4.9         3            1.4         0.2 setosa         1.63
 3          4.7         3.2          1.3         0.2 setosa         1.47
 4          4.6         3.1          1.5         0.2 setosa         1.48
 5          5           3.6          1.4         0.2 setosa         1.39
 6          5.4         3.9          1.7         0.4 setosa         1.38
 7          4.6         3.4          1.4         0.3 setosa         1.35
 8          5           3.4          1.5         0.2 setosa         1.47
 9          4.4         2.9          1.4         0.2 setosa         1.52
10          4.9         3.1          1.5         0.1 setosa         1.58
# … with 140 more rows

You can also create a second new column based on the first new column within the same call of mutate():

iris %>%
  mutate(Sepal.Ratio = Sepal.Length/Sepal.Width,
         Petal.Ratio = Petal.Length/Petal.Width)
# A tibble: 150 x 7
   Sepal.Length Sepal.Width Petal.Length Petal.Width Species Sepal.Ratio
          <dbl>       <dbl>        <dbl>       <dbl> <chr>         <dbl>
 1          5.1         3.5          1.4         0.2 setosa         1.46
 2          4.9         3            1.4         0.2 setosa         1.63
 3          4.7         3.2          1.3         0.2 setosa         1.47
 4          4.6         3.1          1.5         0.2 setosa         1.48
 5          5           3.6          1.4         0.2 setosa         1.39
 6          5.4         3.9          1.7         0.4 setosa         1.38
 7          4.6         3.4          1.4         0.3 setosa         1.35
 8          5           3.4          1.5         0.2 setosa         1.47
 9          4.4         2.9          1.4         0.2 setosa         1.52
10          4.9         3.1          1.5         0.1 setosa         1.58
# … with 140 more rows, and 1 more variable: Petal.Ratio <dbl>

Challenge 2

Instantaneous Water-Use Efficiency (WUE) is an efficiency measure of how much carbon is gained by photosynthesis per water lost through transpiration. A higher WUE indicates that the plant is losing less water, which is better for the plant. WUE is caluclated by dividing photosynthesis by transpiration. Often, different species and functional groups have distinct WUE. Create a new data frame from the phys_dates data that meets the following criteria:

  • is named phys_wue
  • contains the following columns: Species, FGroup
  • contains a new column called wue_day (wue_day = Photo/Trmmol_day)
  • contains only rows that have a wue_day >1

Which species and functional types have the highest WUE?

Hint: think about how the commands should be ordered to produce this data frame!

Solution to Challenge 2

phys_wue <- phys_dates %>%
    mutate(wue_day = Photo/Trmmol_day) %>%
    filter(wue_day > 1) %>%
    select(Species, Fgroup, wue_day)

Exporting data

Now that you have learned how to use dplyr to extract information from or summarize your raw data, you may want to export these new data sets to share them with your collaborators or for archival.

Similar to the read_csv() function used for reading CSV files into R, there is a write_csv() function that generates CSV files from data frames.

Before using write_csv(), we are going to create a new folder, data_output, in our working directory that will store this generated dataset. We don’t want to write generated datasets in the same directory as our raw data. It’s good practice to keep them separate. The data folder should only contain the raw, unaltered data, and should be left alone to make sure we don’t delete or modify it. In contrast, our script will generate the contents of the data_output directory, so even if the files it contains are deleted, we can always re-generate them.

We can save it as a CSV file in a new folder called data_output folder. Let’s save the dataset that that we created dates in.

write_csv(phys_dates, path = "data_output/phys_dates.csv")

Split-apply-combine data analysis and the summarize() function

Many data analysis tasks can be approached using the split-apply-combine paradigm: split the data into groups, apply some analysis to each group, and then combine the results. dplyr makes this very easy through the use of the group_by() function.

The summarize() function

group_by() is often used together with summarize(), which collapses each group into a single-row summary of that group. group_by() takes as arguments the column names that contain the categorical variables for which you want to calculate the summary statistics. So to compute the mean Sepal.Length by Species:

iris %>%
  group_by(Species) %>%
  summarize(mean_sepal_length = mean(Sepal.Length))
# A tibble: 3 x 2
  Species    mean_sepal_length
  <chr>                  <dbl>
1 setosa                  5.01
2 versicolor              5.94
3 virginica               6.59

You can also group by multiple columns. For this, we’ll need to use the phys_dates dataset:

phys_dates %>%
  group_by(Year, Species) %>%
  summarize(mean_photo = mean(Photo))
# A tibble: 16 x 3
# Groups:   Year [2]
    Year Species mean_photo
   <int> <chr>        <dbl>
 1  2014 AC            12.0
 2  2014 AG            18.8
 3  2014 CD            11.4
 4  2014 PV            19.3
 5  2014 RG            15.8
 6  2014 SC            11.0
 7  2014 SN            19.3
 8  2014 VB            10.7
 9  2015 AC            12.7
10  2015 AG            18.1
11  2015 CD            13.3
12  2015 PV            24.7
13  2015 RG            17.4
14  2015 SC            19.8
15  2015 SN            29.5
16  2015 VB            12.9

If you want to display more data, you can use the print() function at the end of your chain with the argument n specifying the number of rows to display:

phys_dates %>%  
  group_by(Year, Species) %>%
  summarize(mean_photo = mean(Photo)) %>% 
  print(n = 15)
# A tibble: 16 x 3
# Groups:   Year [2]
    Year Species mean_photo
   <int> <chr>        <dbl>
 1  2014 AC            12.0
 2  2014 AG            18.8
 3  2014 CD            11.4
 4  2014 PV            19.3
 5  2014 RG            15.8
 6  2014 SC            11.0
 7  2014 SN            19.3
 8  2014 VB            10.7
 9  2015 AC            12.7
10  2015 AG            18.1
11  2015 CD            13.3
12  2015 PV            24.7
13  2015 RG            17.4
14  2015 SC            19.8
15  2015 SN            29.5
# … with 1 more row

Once the data are grouped, you can also summarize multiple variables at the same time (and not necessarily on the same variable). For instance, we could add a column indicating the minimum photosynthesis level by year and species:

phys_dates %>%
  group_by(Year, Species) %>%
  summarize(mean_photo = mean(Photo),
            min_photo = min(Photo))
# A tibble: 16 x 4
# Groups:   Year [2]
    Year Species mean_photo min_photo
   <int> <chr>        <dbl>     <dbl>
 1  2014 AC            12.0      6.60
 2  2014 AG            18.8     10.3 
 3  2014 CD            11.4      6.71
 4  2014 PV            19.3      6.56
 5  2014 RG            15.8      2.47
 6  2014 SC            11.0      1.82
 7  2014 SN            19.3      9.78
 8  2014 VB            10.7      3.13
 9  2015 AC            12.7      5.14
10  2015 AG            18.1      8.24
11  2015 CD            13.3      7.38
12  2015 PV            24.7     10.6 
13  2015 RG            17.4      8.35
14  2015 SC            19.8      6.58
15  2015 SN            29.5     11.7 
16  2015 VB            12.9      1.08

It is sometimes useful to rearrange the result of a query to inspect the values. For instance, we can sort on min_photo to put the less photosynthetic first:

phys_dates %>%
  group_by(Year, Species) %>%
  summarize(mean_photo = mean(Photo),
            min_photo = min(Photo)) %>%
  arrange(min_photo)
# A tibble: 16 x 4
# Groups:   Year [2]
    Year Species mean_photo min_photo
   <int> <chr>        <dbl>     <dbl>
 1  2015 VB            12.9      1.08
 2  2014 SC            11.0      1.82
 3  2014 RG            15.8      2.47
 4  2014 VB            10.7      3.13
 5  2015 AC            12.7      5.14
 6  2014 PV            19.3      6.56
 7  2015 SC            19.8      6.58
 8  2014 AC            12.0      6.60
 9  2014 CD            11.4      6.71
10  2015 CD            13.3      7.38
11  2015 AG            18.1      8.24
12  2015 RG            17.4      8.35
13  2014 SN            19.3      9.78
14  2014 AG            18.8     10.3 
15  2015 PV            24.7     10.6 
16  2015 SN            29.5     11.7 

To sort in descending order, we need to add the desc() function. If we want to sort the results by decreasing order of minimum photosynthesis:

phys_dates %>%
  group_by(Year, Species) %>%
  summarize(mean_photo = mean(Photo),
            min_photo = min(Photo)) %>%
  arrange(desc(min_photo))
# A tibble: 16 x 4
# Groups:   Year [2]
    Year Species mean_photo min_photo
   <int> <chr>        <dbl>     <dbl>
 1  2015 SN            29.5     11.7 
 2  2015 PV            24.7     10.6 
 3  2014 AG            18.8     10.3 
 4  2014 SN            19.3      9.78
 5  2015 RG            17.4      8.35
 6  2015 AG            18.1      8.24
 7  2015 CD            13.3      7.38
 8  2014 CD            11.4      6.71
 9  2014 AC            12.0      6.60
10  2015 SC            19.8      6.58
11  2014 PV            19.3      6.56
12  2015 AC            12.7      5.14
13  2014 VB            10.7      3.13
14  2014 RG            15.8      2.47
15  2014 SC            11.0      1.82
16  2015 VB            12.9      1.08

Challenge 3

The original paper containing the iris dataset distinguished the three types of irises based on the sepal and petal length and width, as well as the ratio of length to width for sepals and petals.

For this challenge, let’s replicate their analysis.

  • Calculate the mean sepal length and sepal width for each species
  • Calculate the mean petal lengths and petal widths for each species
  • Calculate the ratio of Sepal.Length to Sepal.Width
  • Calculate the ratio of Petal.Length to Petal.Width

Which of these values does the best job of differentiating the setsosa species from the others?

Solution to Challenge 3

iris %>% group_by(Species) %>% 
    summarize(meanSepal.Length = mean(Sepal.Length),
              meanSepal.Width = mean(Sepal.Width),
              Sepal.ratio=mean(Sepal.Length/Sepal.Width), 
              meanPetal.Length=mean(Petal.Length),
              meanPetal.Width=mean(Petal.Width),
              Petal.ratio = mean(Petal.Length/Petal.Width)
              )
# A tibble: 3 x 7
  Species meanSepal.Length meanSepal.Width Sepal.ratio meanPetal.Length
  <chr>              <dbl>           <dbl>       <dbl>            <dbl>
1 setosa              5.01            3.42        1.47             1.46
2 versic…             5.94            2.77        2.16             4.26
3 virgin…             6.59            2.97        2.23             5.55
# … with 2 more variables: meanPetal.Width <dbl>, Petal.ratio <dbl>

Petal ratio has the largest difference between setsosa and the other species.

Counting

When working with data, we often want to know the number of observations found for each factor or combination of factors. For this task, dplyr provides count(). For example, if we wanted to count the number of rows of data for each species, we would do:

iris %>%
    count(Species) 
# A tibble: 3 x 2
  Species        n
  <chr>      <int>
1 setosa        50
2 versicolor    50
3 virginica     50

The count() function is shorthand for something we’ve already seen: grouping by a variable, and summarizing it by counting the number of observations in that group. In other words, iris %>% count() is equivalent to:

iris %>%
    group_by(Species) %>%
    summarise(count = n())
# A tibble: 3 x 2
  Species    count
  <chr>      <int>
1 setosa        50
2 versicolor    50
3 virginica     50

If we wanted to count combination of factors, such as Species and Year, we would specify the first and the second factor as the arguments of count():

phys_dates %>%
  count(Year, Species) 
# A tibble: 16 x 3
    Year Species     n
   <int> <chr>   <int>
 1  2014 AC         18
 2  2014 AG         18
 3  2014 CD         18
 4  2014 PV         18
 5  2014 RG         18
 6  2014 SC         18
 7  2014 SN         15
 8  2014 VB         18
 9  2015 AC         18
10  2015 AG         18
11  2015 CD         18
12  2015 PV         18
13  2015 RG         18
14  2015 SC         18
15  2015 SN         18
16  2015 VB         15

With the above code, we can proceed with arrange() to sort the table according to a number of criteria so that we have a better comparison. For instance, we might want to arrange the table above in (i) an alphabetical order of the levels of the species and (ii) in descending order of the count:

phys_dates %>%
  count(Year, Species) %>%
  arrange(Species, desc(n))
# A tibble: 16 x 3
    Year Species     n
   <int> <chr>   <int>
 1  2014 AC         18
 2  2015 AC         18
 3  2014 AG         18
 4  2015 AG         18
 5  2014 CD         18
 6  2015 CD         18
 7  2014 PV         18
 8  2015 PV         18
 9  2014 RG         18
10  2015 RG         18
11  2014 SC         18
12  2015 SC         18
13  2015 SN         18
14  2014 SN         15
15  2014 VB         18
16  2015 VB         15

Challenge 4

How many individuals of each species were measured in each Year?

Solution to Challenge 4

phys_dates %>%
   group_by(Year)%>%
   count(Species) 
# A tibble: 16 x 3
# Groups:   Year [2]
    Year Species     n
   <int> <chr>   <int>
 1  2014 AC         18
 2  2014 AG         18
 3  2014 CD         18
 4  2014 PV         18
 5  2014 RG         18
 6  2014 SC         18
 7  2014 SN         15
 8  2014 VB         18
 9  2015 AC         18
10  2015 AG         18
11  2015 CD         18
12  2015 PV         18
13  2015 RG         18
14  2015 SC         18
15  2015 SN         18
16  2015 VB         15

Challenge 5

Use group_by() and summarize() to find the mean, min, and max photosynthesis level for each species. Also add the number of observations (hint: see ?n).

Solution to Challenge 5

phys_dates %>%
   group_by(Species) %>%
   summarize(
       mean_photo = mean(Photo),
       max_photo = max(Photo),
       min_photo = min(Photo),
       n = n()
   )
# A tibble: 8 x 5
 Species mean_photo max_photo min_photo     n
 <chr>        <dbl>     <dbl>     <dbl> <int>
1 AC            12.4      22.0      5.14    36
2 AG            18.4      29.9      8.24    36
3 CD            12.4      18.5      6.71    36
4 PV            22.0      35.5      6.56    36
5 RG            16.6      25.4      2.47    36
6 SC            15.4      32.7      1.82    36
7 SN            24.9      40.6      9.78    33
8 VB            11.7      27.0      1.08    33

Challenge 6

Which plant had the highest photosynthetic rates in each year? Return the columns Year, Species, and Photo.

Solution to Challenge 7

phys_dates %>%
    group_by(Year) %>%
    filter(Photo == max(Photo)) %>%
    select(Year, Species, Photo) %>%
    arrange(Year)
# A tibble: 2 x 3
# Groups:   Year [2]
   Year Species Photo
  <int> <chr>   <dbl>
1  2014 SN       30.4
2  2015 SN       40.6

Key Points

  • The tidyverse is built for data manipulation.

  • The read_csv function creates tibbles instead of data frames.

  • The select function picks particular columns based on names.

  • The filter function picks rows based on values.

  • The mutate function creates new columns based on the value of other columns.

  • The group_by and summarize functions can be used to create summary tables.

  • The write_csv function exports tibbles into a .csv file.