Manipulating, analyzing and exporting data with tidyverse
Overview
Teaching: min
Exercises: minQuestions
What is the tidyverse? Why should I use it to manipulate data?
Objectives
Describe the purpose of the
dplyrpackage.Select certain columns in a data frame with the
dplyrfunctionselect.Select certain rows in a data frame according to filtering conditions with the
dplyrfunctionfilter.Link the output of one
dplyrfunction to the input of another function with the ‘pipe’ operator%>%.Add new columns to a data frame that are functions of existing columns with
mutate.Use
summarize,group_by, andcountto split a data frame into groups of observations, apply summary statistics for each group, and then combine the results.Export a data frame to a .csv file.
Data Manipulation using dplyr
Bracket subsetting is handy, but it can be cumbersome and difficult to read, especially for complicated operations. dplyr. dplyr is a package for making tabular data manipulation easier.
Packages in R are basically sets of additional functions that let you do more stuff. The functions we’ve been using so far, like str() or data.frame(), come built into R; packages give you access to more of them. Before you use a package for the first time you need to install it on your machine, and then you should import it in every subsequent R session when you need it.
You should already have installed the tidyverse package. This is an “umbrella-package” that installs several packages useful for data analysis which work together well such as tidyr, dplyr, ggplot2, tibble, etc.
The tidyverse package tries to address 3 common issues that arise when doing data analysis with some of the functions that come with R:
- 
    
The results from a base R function sometimes depend on the type of data.
 - 
    
Using R expressions in a non standard way, which can be confusing for new learners.
 - 
    
Hidden arguments, having default operations that new learners are not aware of.
 
We have seen in our previous lesson that when building or importing a data frame, the columns that contain characters (i.e., text) are coerced (=converted) into the factor data type. We had to set stringsAsFactors to FALSE to avoid this hidden argument to convert our data type.
This time we will use the tidyverse package to read the data and avoid having to set stringsAsFactors to FALSE
To load the tidyverse package set, including dplyr, type:
## load the tidyverse packages, including dplyr
library("tidyverse")
What is dplyr?
The package dplyr provides easy tools for the most common data manipulation tasks like selecting columns, filtering rows, creating new columns, and creating summary tables.
To learn more about dplyr after the workshop, you may want to check out this handy data transformation with dplyr cheatsheet.
We’ll read in our data using the read_csv() function, from the tidyverse package readr, instead of read.csv().
iris <- read_csv("data/iris.csv")
Parsed with column specification:
cols(
  Sepal.Length = col_double(),
  Sepal.Width = col_double(),
  Petal.Length = col_double(),
  Petal.Width = col_double(),
  Species = col_character()
)
## inspect the data
str(iris)
Classes 'tbl_df', 'tbl' and 'data.frame':	150 obs. of  5 variables:
 $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
 $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
 $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
 $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
 $ Species     : chr  "setosa" "setosa" "setosa" "setosa" ...
 - attr(*, "spec")=List of 2
  ..$ cols   :List of 5
  .. ..$ Sepal.Length: list()
  .. .. ..- attr(*, "class")= chr  "collector_double" "collector"
  .. ..$ Sepal.Width : list()
  .. .. ..- attr(*, "class")= chr  "collector_double" "collector"
  .. ..$ Petal.Length: list()
  .. .. ..- attr(*, "class")= chr  "collector_double" "collector"
  .. ..$ Petal.Width : list()
  .. .. ..- attr(*, "class")= chr  "collector_double" "collector"
  .. ..$ Species     : list()
  .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
  ..$ default: list()
  .. ..- attr(*, "class")= chr  "collector_guess" "collector"
  ..- attr(*, "class")= chr "col_spec"
To preview the data, you can click on the name of the tibble in your environment or use the View function.
## preview the data
View(iris)
Either of these methods will open a window with the dataset in the same area as your script file.
Tibbles
Notice that the class of the data is now tbl_df
This is referred to as a “tibble”. Tibbles tweak some of the behaviors of the data frame objects we introduced in the previous episode. The data structure is very similar to a data frame. For our purposes the only differences are that:
- 
    
In addition to displaying the data type of each column under its name, it only prints the first few rows of data and only as many columns as fit on one screen.
 - 
    
Columns of class
characterare never converted into factors. 
We’re going to learn some of the most common dplyr functions:
select(): subset columnsfilter(): subset rows on conditionsmutate(): create new columns by using information from other columnsgroup_by()andsummarize(): create summary statisitcs on grouped dataarrange(): sort resultscount(): count discrete values
Selecting columns and filtering rows
To select columns of a data frame, use select(). The first argument
to this function is the data frame (iris), and the subsequent
arguments are the columns to keep.
select(iris, Sepal.Length, Sepal.Width)
# A tibble: 150 x 2
   Sepal.Length Sepal.Width
          <dbl>       <dbl>
 1          5.1         3.5
 2          4.9         3  
 3          4.7         3.2
 4          4.6         3.1
 5          5           3.6
 6          5.4         3.9
 7          4.6         3.4
 8          5           3.4
 9          4.4         2.9
10          4.9         3.1
# … with 140 more rows
To select all columns except certain ones, put a “-“ in front of the variable to exclude it.
select(iris, -Sepal.Length, -Sepal.Width)
# A tibble: 150 x 3
   Petal.Length Petal.Width Species
          <dbl>       <dbl> <chr>  
 1          1.4         0.2 setosa 
 2          1.4         0.2 setosa 
 3          1.3         0.2 setosa 
 4          1.5         0.2 setosa 
 5          1.4         0.2 setosa 
 6          1.7         0.4 setosa 
 7          1.4         0.3 setosa 
 8          1.5         0.2 setosa 
 9          1.4         0.2 setosa 
10          1.5         0.1 setosa 
# … with 140 more rows
This will select all the variables in iris except Sepal.Length
and Sepal.Width.
To choose rows based on a specific criteria, use filter():
filter(iris, Sepal.Length>5)
# A tibble: 118 x 5
   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
          <dbl>       <dbl>        <dbl>       <dbl> <chr>  
 1          5.1         3.5          1.4         0.2 setosa 
 2          5.4         3.9          1.7         0.4 setosa 
 3          5.4         3.7          1.5         0.2 setosa 
 4          5.8         4            1.2         0.2 setosa 
 5          5.7         4.4          1.5         0.4 setosa 
 6          5.4         3.9          1.3         0.4 setosa 
 7          5.1         3.5          1.4         0.3 setosa 
 8          5.7         3.8          1.7         0.3 setosa 
 9          5.1         3.8          1.5         0.3 setosa 
10          5.4         3.4          1.7         0.2 setosa 
# … with 108 more rows
Pipes
What if you want to select and filter at the same time? There are three ways to do this: use intermediate steps, nested functions, or pipes.
With intermediate steps, you create a temporary data frame and use that as input to the next function, like this:
iris2 <- filter(iris, Sepal.Length < 5)
iris_small <- select(iris2, Species, Sepal.Length, Sepal.Width)
This is readable, but can clutter up your workspace with lots of objects that you have to name individually. With multiple steps, that can be hard to keep track of.
You can also nest functions (i.e. one function inside of another), like this:
iris_sml <- select(filter(iris, Sepal.Length < 5), Species, Sepal.Length, Sepal.Width)
This is handy, but can be difficult to read if too many functions are nested, as R evaluates the expression from the inside out (in this case, filtering, then selecting).
The last option, pipes, are a recent addition to R. Pipes let you take
the output of one function and send it directly to the next, which is useful when you need to do many things to the same dataset.  Pipes in R look like %>% and are made available via the magrittr package, installed automatically with dplyr. If you use RStudio, you can type the pipe with Ctrl
- Shift + M if you have a PC or Cmd + Shift + M if you have a Mac.
 
iris %>%
  filter(Sepal.Length < 5) %>%
  select(Species, Sepal.Length, Sepal.Width)
# A tibble: 22 x 3
   Species Sepal.Length Sepal.Width
   <chr>          <dbl>       <dbl>
 1 setosa           4.9         3  
 2 setosa           4.7         3.2
 3 setosa           4.6         3.1
 4 setosa           4.6         3.4
 5 setosa           4.4         2.9
 6 setosa           4.9         3.1
 7 setosa           4.8         3.4
 8 setosa           4.8         3  
 9 setosa           4.3         3  
10 setosa           4.6         3.6
# … with 12 more rows
In the above code, we use the pipe to send the iris dataset first through filter() to keep rows where Sepal.Length is less than 5, then through select() to keep only the Sepal.Length, Sepal.Width, and species columns. Since %>% takes the object on its left and passes it as the first argument to the function on its right, we don’t need to explicitly include the data frame as an argument to the filter() and select() functions any more.
Some may find it helpful to read the pipe like the word “then”. For instance, in the above example, we took the data frame iris, then we filtered for rows with Sepal.Length < 5, then we selected columns species, Sepal.Length, and Sepal.Width. The dplyr functions by themselves are somewhat simple, but by combining them into linear workflows with the pipe, we can accomplish more complex manipulations of data frames.
If we want to create a new object with this smaller version of the data, we can assign it a new name:
iris_sml <- iris %>%
  filter(Sepal.Length < 5) %>%
  select(Species, Sepal.Length, Sepal.Width)
iris_sml
# A tibble: 22 x 3
   Species Sepal.Length Sepal.Width
   <chr>          <dbl>       <dbl>
 1 setosa           4.9         3  
 2 setosa           4.7         3.2
 3 setosa           4.6         3.1
 4 setosa           4.6         3.4
 5 setosa           4.4         2.9
 6 setosa           4.9         3.1
 7 setosa           4.8         3.4
 8 setosa           4.8         3  
 9 setosa           4.3         3  
10 setosa           4.6         3.6
# … with 12 more rows
Note that the final data frame is the leftmost part of this expression.
Introducing the a Plant Physiology Environmental dataset
While the iris dataset lets us illustrate many of the features of dplyr, it is not complex enough to illustrate all features. For the challenges and some examples in this lesson, we will be introducing a new dataset,  phys_data
Description of phys_data
Plants have small openings on their leaf surfaces called “stomata” that permit the exchange of gases between the plant and the atmosphere during photosynthesis. During the day, plants keep their stomata open in order to obtain carbon dioxide from the atmosphere for photosynthesis. However, open stomata also allow water to escape from the leaf to the atmosphere through the process of transpiration. Plants typically close their stomata at night to preserve plant water when photosynthesis doesn’t occur. However, recent research has shown that many plants actually keep their stomata open at night, which allows nocturnal transpiration to occur. What drives nocturnal transpiration and if there is any ecological significance associated with this process is still unclear. To address this gap in knowledge, O’Keefe and Nippert (2018) aimed to understand the patterns and drivers of nocturnal transpiration in a variety of grassland plants. They measured rates of water loss in 8 different plant species, during the day and at night, over the course of two years. They also measured a variety of other plant physiological traits (photosynthesis, leaf water status) and environmental variables (air temperature, soil moisture, vapor pressure deficit) in order to determine if daytime and nocturnal transpiration are associated with other aspects of plant functioning or environmental drivers. They found that patterns of transpiration differed among species, with grasses exhibiting the highest rates of nocturnal water loss. Daytime and nocturnal transpiration had varying relationships with different environmental drivers, and nocturnal transpiration was strongly associated with predawn leaf water potential (an indicator of soil moisture around plant roots). We suggest that nocturnal transpiration is driven by water availability and may act as a competitive strategy in drought-tolerant grasses to use up water as quickly as possible, so as to create drier growing conditions for their less drought-tolerant neighbors.
Overview of the variables
| Variable | Description | 
|---|---|
| Year | Year of data collection | 
| DOY | Day of year of data collection | 
| Species | Species (“AG”=Andropogon gerardii, “PV”=Panicum virgatum, “SN”=Sorghastrum nutans, “VB”=Vernonia baldwinii, “SC”=Solidago canadensis, “AC”=Amorpha canescens, “RG”=Rhus glabra, “CD”=Cornus drummondii | 
| Fgroup | Plant functional group | 
| Cond_night | Nocturnal stomatal conductance (mol m-2s-1) | 
| Trmmol_night | Nocturnal transpiration (mmol m-2s-1) | 
| Photo | Photosynthesis (μmol CO2 m-2s-1) | 
| Cond_day | Diurnal stomatal conductance (mol m-2s-1) | 
| Trmmol_day | Diurnal transpiration (mmol m-2s-1) | 
| Percent_cond | Percent nocturnal of diurnal stomatal conductance | 
| Percent_trmmol | Percent nocturnal of diurnal transpiration | 
| PD | Predawn leaf water potential (MPa) | 
| MD | Midday leaf water potential (MPa) | 
| Soil_moisture | Average daily soil moisture (v/v) | 
| TAIR_N | Air temperature (°C) at average time of nocturnal physiology measurements (23:00 h) | 
| VPD_N | Vapor pressure deficit (kPa) at average time of nocturnal physiology measurements (23:00 h) | 
| TAIR_D | Air temperature (°C) at average time of diurnal physiology measurements (11:00 h) | 
| VPD_D | Vapor pressure deficit (kPa) at average time of diurnal physiology measurements (11:00 h) | 
The only difference is that the date is represented as month, day, and year rather than year and day of year.
Challenge 1
Using pipes, subset the
phys_datesdata to include observations collected in 2014 and retain only the columnsYear,Species, andPhoto.Solution to Challenge 1
phys_dates %>% filter(Year == 2014) %>% select(Year, Species, Photo)
Mutate
Frequently you’ll want to create new columns based on the values in existing columns, for example to do unit conversions, or to find the ratio of values in two columns. For this we’ll use mutate().
To create a new column of weight in kg:
iris %>%
  mutate(Sepal.Ratio = Sepal.Length/Sepal.Width)
# A tibble: 150 x 6
   Sepal.Length Sepal.Width Petal.Length Petal.Width Species Sepal.Ratio
          <dbl>       <dbl>        <dbl>       <dbl> <chr>         <dbl>
 1          5.1         3.5          1.4         0.2 setosa         1.46
 2          4.9         3            1.4         0.2 setosa         1.63
 3          4.7         3.2          1.3         0.2 setosa         1.47
 4          4.6         3.1          1.5         0.2 setosa         1.48
 5          5           3.6          1.4         0.2 setosa         1.39
 6          5.4         3.9          1.7         0.4 setosa         1.38
 7          4.6         3.4          1.4         0.3 setosa         1.35
 8          5           3.4          1.5         0.2 setosa         1.47
 9          4.4         2.9          1.4         0.2 setosa         1.52
10          4.9         3.1          1.5         0.1 setosa         1.58
# … with 140 more rows
You can also create a second new column based on the first new column within the same call of mutate():
iris %>%
  mutate(Sepal.Ratio = Sepal.Length/Sepal.Width,
         Petal.Ratio = Petal.Length/Petal.Width)
# A tibble: 150 x 7
   Sepal.Length Sepal.Width Petal.Length Petal.Width Species Sepal.Ratio
          <dbl>       <dbl>        <dbl>       <dbl> <chr>         <dbl>
 1          5.1         3.5          1.4         0.2 setosa         1.46
 2          4.9         3            1.4         0.2 setosa         1.63
 3          4.7         3.2          1.3         0.2 setosa         1.47
 4          4.6         3.1          1.5         0.2 setosa         1.48
 5          5           3.6          1.4         0.2 setosa         1.39
 6          5.4         3.9          1.7         0.4 setosa         1.38
 7          4.6         3.4          1.4         0.3 setosa         1.35
 8          5           3.4          1.5         0.2 setosa         1.47
 9          4.4         2.9          1.4         0.2 setosa         1.52
10          4.9         3.1          1.5         0.1 setosa         1.58
# … with 140 more rows, and 1 more variable: Petal.Ratio <dbl>
Challenge 2
Instantaneous Water-Use Efficiency (WUE) is an efficiency measure of how much carbon is gained by photosynthesis per water lost through transpiration. A higher WUE indicates that the plant is losing less water, which is better for the plant. WUE is caluclated by dividing photosynthesis by transpiration. Often, different species and functional groups have distinct WUE. Create a new data frame from the
phys_datesdata that meets the following criteria:
- is named phys_wue
 - contains the following columns: Species, FGroup
 - contains a new column called wue_day (wue_day = Photo/Trmmol_day)
 - contains only rows that have a wue_day >1
 Which species and functional types have the highest WUE?
Hint: think about how the commands should be ordered to produce this data frame!
Solution to Challenge 2
phys_wue <- phys_dates %>% mutate(wue_day = Photo/Trmmol_day) %>% filter(wue_day > 1) %>% select(Species, Fgroup, wue_day)
Exporting data
Now that you have learned how to use dplyr to extract information from
or summarize your raw data, you may want to export these new data sets to share
them with your collaborators or for archival.
Similar to the read_csv() function used for reading CSV files into R, there is
a write_csv() function that generates CSV files from data frames.
Before using write_csv(), we are going to create a new folder, data_output,
in our working directory that will store this generated dataset. We don’t want
to write generated datasets in the same directory as our raw data. It’s good
practice to keep them separate. The data folder should only contain the raw,
unaltered data, and should be left alone to make sure we don’t delete or modify
it. In contrast, our script will generate the contents of the data_output
directory, so even if the files it contains are deleted, we can always
re-generate them.
We can save it as a CSV file in a new folder called data_output folder.
Let’s save the dataset that that we created dates in.
write_csv(phys_dates, path = "data_output/phys_dates.csv")
Split-apply-combine data analysis and the summarize() function
Many data analysis tasks can be approached using the split-apply-combine paradigm: split the data into groups, apply some analysis to each group, and then combine the results. dplyr makes this very easy through the use of the group_by() function.
The summarize() function
group_by() is often used together with summarize(), which collapses each group into a single-row summary of that group.  group_by() takes as arguments the column names that contain the categorical variables for which you want to calculate the summary statistics. So to compute the mean Sepal.Length by Species:
iris %>%
  group_by(Species) %>%
  summarize(mean_sepal_length = mean(Sepal.Length))
# A tibble: 3 x 2
  Species    mean_sepal_length
  <chr>                  <dbl>
1 setosa                  5.01
2 versicolor              5.94
3 virginica               6.59
You can also group by multiple columns. For this, we’ll need to use the phys_dates dataset:
phys_dates %>%
  group_by(Year, Species) %>%
  summarize(mean_photo = mean(Photo))
# A tibble: 16 x 3
# Groups:   Year [2]
    Year Species mean_photo
   <int> <chr>        <dbl>
 1  2014 AC            12.0
 2  2014 AG            18.8
 3  2014 CD            11.4
 4  2014 PV            19.3
 5  2014 RG            15.8
 6  2014 SC            11.0
 7  2014 SN            19.3
 8  2014 VB            10.7
 9  2015 AC            12.7
10  2015 AG            18.1
11  2015 CD            13.3
12  2015 PV            24.7
13  2015 RG            17.4
14  2015 SC            19.8
15  2015 SN            29.5
16  2015 VB            12.9
If you want to display more data, you can use the print() function
at the end of your chain with the argument n specifying the number of rows to display:
phys_dates %>%  
  group_by(Year, Species) %>%
  summarize(mean_photo = mean(Photo)) %>% 
  print(n = 15)
# A tibble: 16 x 3
# Groups:   Year [2]
    Year Species mean_photo
   <int> <chr>        <dbl>
 1  2014 AC            12.0
 2  2014 AG            18.8
 3  2014 CD            11.4
 4  2014 PV            19.3
 5  2014 RG            15.8
 6  2014 SC            11.0
 7  2014 SN            19.3
 8  2014 VB            10.7
 9  2015 AC            12.7
10  2015 AG            18.1
11  2015 CD            13.3
12  2015 PV            24.7
13  2015 RG            17.4
14  2015 SC            19.8
15  2015 SN            29.5
# … with 1 more row
Once the data are grouped, you can also summarize multiple variables at the same time (and not necessarily on the same variable). For instance, we could add a column indicating the minimum photosynthesis level by year and species:
phys_dates %>%
  group_by(Year, Species) %>%
  summarize(mean_photo = mean(Photo),
            min_photo = min(Photo))
# A tibble: 16 x 4
# Groups:   Year [2]
    Year Species mean_photo min_photo
   <int> <chr>        <dbl>     <dbl>
 1  2014 AC            12.0      6.60
 2  2014 AG            18.8     10.3 
 3  2014 CD            11.4      6.71
 4  2014 PV            19.3      6.56
 5  2014 RG            15.8      2.47
 6  2014 SC            11.0      1.82
 7  2014 SN            19.3      9.78
 8  2014 VB            10.7      3.13
 9  2015 AC            12.7      5.14
10  2015 AG            18.1      8.24
11  2015 CD            13.3      7.38
12  2015 PV            24.7     10.6 
13  2015 RG            17.4      8.35
14  2015 SC            19.8      6.58
15  2015 SN            29.5     11.7 
16  2015 VB            12.9      1.08
It is sometimes useful to rearrange the result of a query to inspect the values. For instance, we can sort on min_photo to put the less photosynthetic first:
phys_dates %>%
  group_by(Year, Species) %>%
  summarize(mean_photo = mean(Photo),
            min_photo = min(Photo)) %>%
  arrange(min_photo)
# A tibble: 16 x 4
# Groups:   Year [2]
    Year Species mean_photo min_photo
   <int> <chr>        <dbl>     <dbl>
 1  2015 VB            12.9      1.08
 2  2014 SC            11.0      1.82
 3  2014 RG            15.8      2.47
 4  2014 VB            10.7      3.13
 5  2015 AC            12.7      5.14
 6  2014 PV            19.3      6.56
 7  2015 SC            19.8      6.58
 8  2014 AC            12.0      6.60
 9  2014 CD            11.4      6.71
10  2015 CD            13.3      7.38
11  2015 AG            18.1      8.24
12  2015 RG            17.4      8.35
13  2014 SN            19.3      9.78
14  2014 AG            18.8     10.3 
15  2015 PV            24.7     10.6 
16  2015 SN            29.5     11.7 
To sort in descending order, we need to add the desc() function. If we want to sort the results by decreasing order of minimum photosynthesis:
phys_dates %>%
  group_by(Year, Species) %>%
  summarize(mean_photo = mean(Photo),
            min_photo = min(Photo)) %>%
  arrange(desc(min_photo))
# A tibble: 16 x 4
# Groups:   Year [2]
    Year Species mean_photo min_photo
   <int> <chr>        <dbl>     <dbl>
 1  2015 SN            29.5     11.7 
 2  2015 PV            24.7     10.6 
 3  2014 AG            18.8     10.3 
 4  2014 SN            19.3      9.78
 5  2015 RG            17.4      8.35
 6  2015 AG            18.1      8.24
 7  2015 CD            13.3      7.38
 8  2014 CD            11.4      6.71
 9  2014 AC            12.0      6.60
10  2015 SC            19.8      6.58
11  2014 PV            19.3      6.56
12  2015 AC            12.7      5.14
13  2014 VB            10.7      3.13
14  2014 RG            15.8      2.47
15  2014 SC            11.0      1.82
16  2015 VB            12.9      1.08
Challenge 3
The original paper containing the iris dataset distinguished the three types of irises based on the sepal and petal length and width, as well as the ratio of length to width for sepals and petals.
For this challenge, let’s replicate their analysis.
- Calculate the mean sepal length and sepal width for each species
 - Calculate the mean petal lengths and petal widths for each species
 - Calculate the ratio of Sepal.Length to Sepal.Width
 - Calculate the ratio of Petal.Length to Petal.Width
 Which of these values does the best job of differentiating the setsosa species from the others?
Solution to Challenge 3
iris %>% group_by(Species) %>% summarize(meanSepal.Length = mean(Sepal.Length), meanSepal.Width = mean(Sepal.Width), Sepal.ratio=mean(Sepal.Length/Sepal.Width), meanPetal.Length=mean(Petal.Length), meanPetal.Width=mean(Petal.Width), Petal.ratio = mean(Petal.Length/Petal.Width) )# A tibble: 3 x 7 Species meanSepal.Length meanSepal.Width Sepal.ratio meanPetal.Length <chr> <dbl> <dbl> <dbl> <dbl> 1 setosa 5.01 3.42 1.47 1.46 2 versic… 5.94 2.77 2.16 4.26 3 virgin… 6.59 2.97 2.23 5.55 # … with 2 more variables: meanPetal.Width <dbl>, Petal.ratio <dbl>Petal ratio has the largest difference between setsosa and the other species.
Counting
When working with data, we often want to know the number of observations found for each factor or combination of factors. For this task, dplyr provides count(). For example, if we wanted to count the number of rows of data for each species, we would do:
iris %>%
    count(Species) 
# A tibble: 3 x 2
  Species        n
  <chr>      <int>
1 setosa        50
2 versicolor    50
3 virginica     50
The count() function is shorthand for something we’ve already seen: grouping by a variable, and summarizing it by counting the number of observations in that group. In other words, iris %>% count() is equivalent to:
iris %>%
    group_by(Species) %>%
    summarise(count = n())
# A tibble: 3 x 2
  Species    count
  <chr>      <int>
1 setosa        50
2 versicolor    50
3 virginica     50
If we wanted to count combination of factors, such as Species and Year, we would specify the first and the second factor as the arguments of count():
phys_dates %>%
  count(Year, Species) 
# A tibble: 16 x 3
    Year Species     n
   <int> <chr>   <int>
 1  2014 AC         18
 2  2014 AG         18
 3  2014 CD         18
 4  2014 PV         18
 5  2014 RG         18
 6  2014 SC         18
 7  2014 SN         15
 8  2014 VB         18
 9  2015 AC         18
10  2015 AG         18
11  2015 CD         18
12  2015 PV         18
13  2015 RG         18
14  2015 SC         18
15  2015 SN         18
16  2015 VB         15
With the above code, we can proceed with arrange() to sort the table 
according to a number of criteria so that we have a better comparison. 
For instance, we might want to arrange the table above in (i) an alphabetical order of 
the levels of the species and (ii) in descending order of the count:
phys_dates %>%
  count(Year, Species) %>%
  arrange(Species, desc(n))
# A tibble: 16 x 3
    Year Species     n
   <int> <chr>   <int>
 1  2014 AC         18
 2  2015 AC         18
 3  2014 AG         18
 4  2015 AG         18
 5  2014 CD         18
 6  2015 CD         18
 7  2014 PV         18
 8  2015 PV         18
 9  2014 RG         18
10  2015 RG         18
11  2014 SC         18
12  2015 SC         18
13  2015 SN         18
14  2014 SN         15
15  2014 VB         18
16  2015 VB         15
Challenge 4
How many individuals of each species were measured in each
Year?Solution to Challenge 4
phys_dates %>% group_by(Year)%>% count(Species)# A tibble: 16 x 3 # Groups: Year [2] Year Species n <int> <chr> <int> 1 2014 AC 18 2 2014 AG 18 3 2014 CD 18 4 2014 PV 18 5 2014 RG 18 6 2014 SC 18 7 2014 SN 15 8 2014 VB 18 9 2015 AC 18 10 2015 AG 18 11 2015 CD 18 12 2015 PV 18 13 2015 RG 18 14 2015 SC 18 15 2015 SN 18 16 2015 VB 15
Challenge 5
Use
group_by()andsummarize()to find the mean, min, and max photosynthesis level for each species. Also add the number of observations (hint: see?n).Solution to Challenge 5
phys_dates %>% group_by(Species) %>% summarize( mean_photo = mean(Photo), max_photo = max(Photo), min_photo = min(Photo), n = n() )# A tibble: 8 x 5 Species mean_photo max_photo min_photo n <chr> <dbl> <dbl> <dbl> <int> 1 AC 12.4 22.0 5.14 36 2 AG 18.4 29.9 8.24 36 3 CD 12.4 18.5 6.71 36 4 PV 22.0 35.5 6.56 36 5 RG 16.6 25.4 2.47 36 6 SC 15.4 32.7 1.82 36 7 SN 24.9 40.6 9.78 33 8 VB 11.7 27.0 1.08 33
Challenge 6
Which plant had the highest photosynthetic rates in each year? Return the columns
Year,Species, andPhoto.Solution to Challenge 7
phys_dates %>% group_by(Year) %>% filter(Photo == max(Photo)) %>% select(Year, Species, Photo) %>% arrange(Year)# A tibble: 2 x 3 # Groups: Year [2] Year Species Photo <int> <chr> <dbl> 1 2014 SN 30.4 2 2015 SN 40.6
Key Points
The tidyverse is built for data manipulation.
The
read_csvfunction creates tibbles instead of data frames.The
selectfunction picks particular columns based on names.The
filterfunction picks rows based on values.The
mutatefunction creates new columns based on the value of other columns.The
group_byandsummarizefunctions can be used to create summary tables.The
write_csvfunction exports tibbles into a .csv file.