Manipulating, analyzing and exporting data with tidyverse
Overview
Teaching: min
Exercises: minQuestions
What is the tidyverse? Why should I use it to manipulate data?
Objectives
Describe the purpose of the
dplyr
package.Select certain columns in a data frame with the
dplyr
functionselect
.Select certain rows in a data frame according to filtering conditions with the
dplyr
functionfilter
.Link the output of one
dplyr
function to the input of another function with the ‘pipe’ operator%>%
.Add new columns to a data frame that are functions of existing columns with
mutate
.Use
summarize
,group_by
, andcount
to split a data frame into groups of observations, apply summary statistics for each group, and then combine the results.Export a data frame to a .csv file.
Data Manipulation using dplyr
Bracket subsetting is handy, but it can be cumbersome and difficult to read, especially for complicated operations. dplyr
. dplyr
is a package for making tabular data manipulation easier.
Packages in R are basically sets of additional functions that let you do more stuff. The functions we’ve been using so far, like str()
or data.frame()
, come built into R; packages give you access to more of them. Before you use a package for the first time you need to install it on your machine, and then you should import it in every subsequent R session when you need it.
You should already have installed the tidyverse
package. This is an “umbrella-package” that installs several packages useful for data analysis which work together well such as tidyr
, dplyr
, ggplot2
, tibble
, etc.
The tidyverse
package tries to address 3 common issues that arise when doing data analysis with some of the functions that come with R:
-
The results from a base R function sometimes depend on the type of data.
-
Using R expressions in a non standard way, which can be confusing for new learners.
-
Hidden arguments, having default operations that new learners are not aware of.
We have seen in our previous lesson that when building or importing a data frame, the columns that contain characters (i.e., text) are coerced (=converted) into the factor
data type. We had to set stringsAsFactors
to FALSE
to avoid this hidden argument to convert our data type.
This time we will use the tidyverse
package to read the data and avoid having to set stringsAsFactors
to FALSE
To load the tidyverse package set, including dplyr, type:
## load the tidyverse packages, including dplyr
library("tidyverse")
What is dplyr
?
The package dplyr
provides easy tools for the most common data manipulation tasks like selecting columns, filtering rows, creating new columns, and creating summary tables.
To learn more about dplyr
after the workshop, you may want to check out this handy data transformation with dplyr
cheatsheet.
We’ll read in our data using the read_csv()
function, from the tidyverse package readr
, instead of read.csv()
.
iris <- read_csv("data/iris.csv")
Parsed with column specification:
cols(
Sepal.Length = col_double(),
Sepal.Width = col_double(),
Petal.Length = col_double(),
Petal.Width = col_double(),
Species = col_character()
)
## inspect the data
str(iris)
Classes 'tbl_df', 'tbl' and 'data.frame': 150 obs. of 5 variables:
$ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
$ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
$ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
$ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
$ Species : chr "setosa" "setosa" "setosa" "setosa" ...
- attr(*, "spec")=List of 2
..$ cols :List of 5
.. ..$ Sepal.Length: list()
.. .. ..- attr(*, "class")= chr "collector_double" "collector"
.. ..$ Sepal.Width : list()
.. .. ..- attr(*, "class")= chr "collector_double" "collector"
.. ..$ Petal.Length: list()
.. .. ..- attr(*, "class")= chr "collector_double" "collector"
.. ..$ Petal.Width : list()
.. .. ..- attr(*, "class")= chr "collector_double" "collector"
.. ..$ Species : list()
.. .. ..- attr(*, "class")= chr "collector_character" "collector"
..$ default: list()
.. ..- attr(*, "class")= chr "collector_guess" "collector"
..- attr(*, "class")= chr "col_spec"
To preview the data, you can click on the name of the tibble in your environment or use the View
function.
## preview the data
View(iris)
Either of these methods will open a window with the dataset in the same area as your script file.
Tibbles
Notice that the class of the data is now tbl_df
This is referred to as a “tibble”. Tibbles tweak some of the behaviors of the data frame objects we introduced in the previous episode. The data structure is very similar to a data frame. For our purposes the only differences are that:
-
In addition to displaying the data type of each column under its name, it only prints the first few rows of data and only as many columns as fit on one screen.
-
Columns of class
character
are never converted into factors.
We’re going to learn some of the most common dplyr
functions:
select()
: subset columnsfilter()
: subset rows on conditionsmutate()
: create new columns by using information from other columnsgroup_by()
andsummarize()
: create summary statisitcs on grouped dataarrange()
: sort resultscount()
: count discrete values
Selecting columns and filtering rows
To select columns of a data frame, use select()
. The first argument
to this function is the data frame (iris
), and the subsequent
arguments are the columns to keep.
select(iris, Sepal.Length, Sepal.Width)
# A tibble: 150 x 2
Sepal.Length Sepal.Width
<dbl> <dbl>
1 5.1 3.5
2 4.9 3
3 4.7 3.2
4 4.6 3.1
5 5 3.6
6 5.4 3.9
7 4.6 3.4
8 5 3.4
9 4.4 2.9
10 4.9 3.1
# … with 140 more rows
To select all columns except certain ones, put a “-“ in front of the variable to exclude it.
select(iris, -Sepal.Length, -Sepal.Width)
# A tibble: 150 x 3
Petal.Length Petal.Width Species
<dbl> <dbl> <chr>
1 1.4 0.2 setosa
2 1.4 0.2 setosa
3 1.3 0.2 setosa
4 1.5 0.2 setosa
5 1.4 0.2 setosa
6 1.7 0.4 setosa
7 1.4 0.3 setosa
8 1.5 0.2 setosa
9 1.4 0.2 setosa
10 1.5 0.1 setosa
# … with 140 more rows
This will select all the variables in iris
except Sepal.Length
and Sepal.Width
.
To choose rows based on a specific criteria, use filter()
:
filter(iris, Sepal.Length>5)
# A tibble: 118 x 5
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
<dbl> <dbl> <dbl> <dbl> <chr>
1 5.1 3.5 1.4 0.2 setosa
2 5.4 3.9 1.7 0.4 setosa
3 5.4 3.7 1.5 0.2 setosa
4 5.8 4 1.2 0.2 setosa
5 5.7 4.4 1.5 0.4 setosa
6 5.4 3.9 1.3 0.4 setosa
7 5.1 3.5 1.4 0.3 setosa
8 5.7 3.8 1.7 0.3 setosa
9 5.1 3.8 1.5 0.3 setosa
10 5.4 3.4 1.7 0.2 setosa
# … with 108 more rows
Pipes
What if you want to select and filter at the same time? There are three ways to do this: use intermediate steps, nested functions, or pipes.
With intermediate steps, you create a temporary data frame and use that as input to the next function, like this:
iris2 <- filter(iris, Sepal.Length < 5)
iris_small <- select(iris2, Species, Sepal.Length, Sepal.Width)
This is readable, but can clutter up your workspace with lots of objects that you have to name individually. With multiple steps, that can be hard to keep track of.
You can also nest functions (i.e. one function inside of another), like this:
iris_sml <- select(filter(iris, Sepal.Length < 5), Species, Sepal.Length, Sepal.Width)
This is handy, but can be difficult to read if too many functions are nested, as R evaluates the expression from the inside out (in this case, filtering, then selecting).
The last option, pipes, are a recent addition to R. Pipes let you take
the output of one function and send it directly to the next, which is useful when you need to do many things to the same dataset. Pipes in R look like %>%
and are made available via the magrittr
package, installed automatically with dplyr
. If you use RStudio, you can type the pipe with Ctrl
- Shift + M if you have a PC or Cmd + Shift + M if you have a Mac.
iris %>%
filter(Sepal.Length < 5) %>%
select(Species, Sepal.Length, Sepal.Width)
# A tibble: 22 x 3
Species Sepal.Length Sepal.Width
<chr> <dbl> <dbl>
1 setosa 4.9 3
2 setosa 4.7 3.2
3 setosa 4.6 3.1
4 setosa 4.6 3.4
5 setosa 4.4 2.9
6 setosa 4.9 3.1
7 setosa 4.8 3.4
8 setosa 4.8 3
9 setosa 4.3 3
10 setosa 4.6 3.6
# … with 12 more rows
In the above code, we use the pipe to send the iris
dataset first through filter()
to keep rows where Sepal.Length
is less than 5, then through select()
to keep only the Sepal.Length
, Sepal.Width
, and species
columns. Since %>%
takes the object on its left and passes it as the first argument to the function on its right, we don’t need to explicitly include the data frame as an argument to the filter()
and select()
functions any more.
Some may find it helpful to read the pipe like the word “then”. For instance, in the above example, we took the data frame iris
, then we filter
ed for rows with Sepal.Length < 5
, then we select
ed columns species
, Sepal.Length
, and Sepal.Width
. The dplyr
functions by themselves are somewhat simple, but by combining them into linear workflows with the pipe, we can accomplish more complex manipulations of data frames.
If we want to create a new object with this smaller version of the data, we can assign it a new name:
iris_sml <- iris %>%
filter(Sepal.Length < 5) %>%
select(Species, Sepal.Length, Sepal.Width)
iris_sml
# A tibble: 22 x 3
Species Sepal.Length Sepal.Width
<chr> <dbl> <dbl>
1 setosa 4.9 3
2 setosa 4.7 3.2
3 setosa 4.6 3.1
4 setosa 4.6 3.4
5 setosa 4.4 2.9
6 setosa 4.9 3.1
7 setosa 4.8 3.4
8 setosa 4.8 3
9 setosa 4.3 3
10 setosa 4.6 3.6
# … with 12 more rows
Note that the final data frame is the leftmost part of this expression.
Introducing the a Plant Physiology Environmental dataset
While the iris dataset lets us illustrate many of the features of dplyr, it is not complex enough to illustrate all features. For the challenges and some examples in this lesson, we will be introducing a new dataset, phys_data
Description of phys_data
Plants have small openings on their leaf surfaces called “stomata” that permit the exchange of gases between the plant and the atmosphere during photosynthesis. During the day, plants keep their stomata open in order to obtain carbon dioxide from the atmosphere for photosynthesis. However, open stomata also allow water to escape from the leaf to the atmosphere through the process of transpiration. Plants typically close their stomata at night to preserve plant water when photosynthesis doesn’t occur. However, recent research has shown that many plants actually keep their stomata open at night, which allows nocturnal transpiration to occur. What drives nocturnal transpiration and if there is any ecological significance associated with this process is still unclear. To address this gap in knowledge, O’Keefe and Nippert (2018) aimed to understand the patterns and drivers of nocturnal transpiration in a variety of grassland plants. They measured rates of water loss in 8 different plant species, during the day and at night, over the course of two years. They also measured a variety of other plant physiological traits (photosynthesis, leaf water status) and environmental variables (air temperature, soil moisture, vapor pressure deficit) in order to determine if daytime and nocturnal transpiration are associated with other aspects of plant functioning or environmental drivers. They found that patterns of transpiration differed among species, with grasses exhibiting the highest rates of nocturnal water loss. Daytime and nocturnal transpiration had varying relationships with different environmental drivers, and nocturnal transpiration was strongly associated with predawn leaf water potential (an indicator of soil moisture around plant roots). We suggest that nocturnal transpiration is driven by water availability and may act as a competitive strategy in drought-tolerant grasses to use up water as quickly as possible, so as to create drier growing conditions for their less drought-tolerant neighbors.
Overview of the variables
Variable | Description |
---|---|
Year | Year of data collection |
DOY | Day of year of data collection |
Species | Species (“AG”=Andropogon gerardii, “PV”=Panicum virgatum, “SN”=Sorghastrum nutans, “VB”=Vernonia baldwinii, “SC”=Solidago canadensis, “AC”=Amorpha canescens, “RG”=Rhus glabra, “CD”=Cornus drummondii |
Fgroup | Plant functional group |
Cond_night | Nocturnal stomatal conductance (mol m-2s-1) |
Trmmol_night | Nocturnal transpiration (mmol m-2s-1) |
Photo | Photosynthesis (μmol CO2 m-2s-1) |
Cond_day | Diurnal stomatal conductance (mol m-2s-1) |
Trmmol_day | Diurnal transpiration (mmol m-2s-1) |
Percent_cond | Percent nocturnal of diurnal stomatal conductance |
Percent_trmmol | Percent nocturnal of diurnal transpiration |
PD | Predawn leaf water potential (MPa) |
MD | Midday leaf water potential (MPa) |
Soil_moisture | Average daily soil moisture (v/v) |
TAIR_N | Air temperature (°C) at average time of nocturnal physiology measurements (23:00 h) |
VPD_N | Vapor pressure deficit (kPa) at average time of nocturnal physiology measurements (23:00 h) |
TAIR_D | Air temperature (°C) at average time of diurnal physiology measurements (11:00 h) |
VPD_D | Vapor pressure deficit (kPa) at average time of diurnal physiology measurements (11:00 h) |
The only difference is that the date is represented as month, day, and year rather than year and day of year.
Challenge 1
Using pipes, subset the
phys_dates
data to include observations collected in 2014 and retain only the columnsYear
,Species
, andPhoto
.Solution to Challenge 1
phys_dates %>% filter(Year == 2014) %>% select(Year, Species, Photo)
Mutate
Frequently you’ll want to create new columns based on the values in existing columns, for example to do unit conversions, or to find the ratio of values in two columns. For this we’ll use mutate()
.
To create a new column of weight in kg:
iris %>%
mutate(Sepal.Ratio = Sepal.Length/Sepal.Width)
# A tibble: 150 x 6
Sepal.Length Sepal.Width Petal.Length Petal.Width Species Sepal.Ratio
<dbl> <dbl> <dbl> <dbl> <chr> <dbl>
1 5.1 3.5 1.4 0.2 setosa 1.46
2 4.9 3 1.4 0.2 setosa 1.63
3 4.7 3.2 1.3 0.2 setosa 1.47
4 4.6 3.1 1.5 0.2 setosa 1.48
5 5 3.6 1.4 0.2 setosa 1.39
6 5.4 3.9 1.7 0.4 setosa 1.38
7 4.6 3.4 1.4 0.3 setosa 1.35
8 5 3.4 1.5 0.2 setosa 1.47
9 4.4 2.9 1.4 0.2 setosa 1.52
10 4.9 3.1 1.5 0.1 setosa 1.58
# … with 140 more rows
You can also create a second new column based on the first new column within the same call of mutate()
:
iris %>%
mutate(Sepal.Ratio = Sepal.Length/Sepal.Width,
Petal.Ratio = Petal.Length/Petal.Width)
# A tibble: 150 x 7
Sepal.Length Sepal.Width Petal.Length Petal.Width Species Sepal.Ratio
<dbl> <dbl> <dbl> <dbl> <chr> <dbl>
1 5.1 3.5 1.4 0.2 setosa 1.46
2 4.9 3 1.4 0.2 setosa 1.63
3 4.7 3.2 1.3 0.2 setosa 1.47
4 4.6 3.1 1.5 0.2 setosa 1.48
5 5 3.6 1.4 0.2 setosa 1.39
6 5.4 3.9 1.7 0.4 setosa 1.38
7 4.6 3.4 1.4 0.3 setosa 1.35
8 5 3.4 1.5 0.2 setosa 1.47
9 4.4 2.9 1.4 0.2 setosa 1.52
10 4.9 3.1 1.5 0.1 setosa 1.58
# … with 140 more rows, and 1 more variable: Petal.Ratio <dbl>
Challenge 2
Instantaneous Water-Use Efficiency (WUE) is an efficiency measure of how much carbon is gained by photosynthesis per water lost through transpiration. A higher WUE indicates that the plant is losing less water, which is better for the plant. WUE is caluclated by dividing photosynthesis by transpiration. Often, different species and functional groups have distinct WUE. Create a new data frame from the
phys_dates
data that meets the following criteria:
- is named phys_wue
- contains the following columns: Species, FGroup
- contains a new column called wue_day (wue_day = Photo/Trmmol_day)
- contains only rows that have a wue_day >1
Which species and functional types have the highest WUE?
Hint: think about how the commands should be ordered to produce this data frame!
Solution to Challenge 2
phys_wue <- phys_dates %>% mutate(wue_day = Photo/Trmmol_day) %>% filter(wue_day > 1) %>% select(Species, Fgroup, wue_day)
Exporting data
Now that you have learned how to use dplyr
to extract information from
or summarize your raw data, you may want to export these new data sets to share
them with your collaborators or for archival.
Similar to the read_csv()
function used for reading CSV files into R, there is
a write_csv()
function that generates CSV files from data frames.
Before using write_csv()
, we are going to create a new folder, data_output
,
in our working directory that will store this generated dataset. We don’t want
to write generated datasets in the same directory as our raw data. It’s good
practice to keep them separate. The data
folder should only contain the raw,
unaltered data, and should be left alone to make sure we don’t delete or modify
it. In contrast, our script will generate the contents of the data_output
directory, so even if the files it contains are deleted, we can always
re-generate them.
We can save it as a CSV file in a new folder called data_output
folder.
Let’s save the dataset that that we created dates in.
write_csv(phys_dates, path = "data_output/phys_dates.csv")
Split-apply-combine data analysis and the summarize() function
Many data analysis tasks can be approached using the split-apply-combine paradigm: split the data into groups, apply some analysis to each group, and then combine the results. dplyr
makes this very easy through the use of the group_by()
function.
The summarize()
function
group_by()
is often used together with summarize()
, which collapses each group into a single-row summary of that group. group_by()
takes as arguments the column names that contain the categorical variables for which you want to calculate the summary statistics. So to compute the mean Sepal.Length
by Species
:
iris %>%
group_by(Species) %>%
summarize(mean_sepal_length = mean(Sepal.Length))
# A tibble: 3 x 2
Species mean_sepal_length
<chr> <dbl>
1 setosa 5.01
2 versicolor 5.94
3 virginica 6.59
You can also group by multiple columns. For this, we’ll need to use the phys_dates
dataset:
phys_dates %>%
group_by(Year, Species) %>%
summarize(mean_photo = mean(Photo))
# A tibble: 16 x 3
# Groups: Year [2]
Year Species mean_photo
<int> <chr> <dbl>
1 2014 AC 12.0
2 2014 AG 18.8
3 2014 CD 11.4
4 2014 PV 19.3
5 2014 RG 15.8
6 2014 SC 11.0
7 2014 SN 19.3
8 2014 VB 10.7
9 2015 AC 12.7
10 2015 AG 18.1
11 2015 CD 13.3
12 2015 PV 24.7
13 2015 RG 17.4
14 2015 SC 19.8
15 2015 SN 29.5
16 2015 VB 12.9
If you want to display more data, you can use the print()
function
at the end of your chain with the argument n
specifying the number of rows to display:
phys_dates %>%
group_by(Year, Species) %>%
summarize(mean_photo = mean(Photo)) %>%
print(n = 15)
# A tibble: 16 x 3
# Groups: Year [2]
Year Species mean_photo
<int> <chr> <dbl>
1 2014 AC 12.0
2 2014 AG 18.8
3 2014 CD 11.4
4 2014 PV 19.3
5 2014 RG 15.8
6 2014 SC 11.0
7 2014 SN 19.3
8 2014 VB 10.7
9 2015 AC 12.7
10 2015 AG 18.1
11 2015 CD 13.3
12 2015 PV 24.7
13 2015 RG 17.4
14 2015 SC 19.8
15 2015 SN 29.5
# … with 1 more row
Once the data are grouped, you can also summarize multiple variables at the same time (and not necessarily on the same variable). For instance, we could add a column indicating the minimum photosynthesis level by year and species:
phys_dates %>%
group_by(Year, Species) %>%
summarize(mean_photo = mean(Photo),
min_photo = min(Photo))
# A tibble: 16 x 4
# Groups: Year [2]
Year Species mean_photo min_photo
<int> <chr> <dbl> <dbl>
1 2014 AC 12.0 6.60
2 2014 AG 18.8 10.3
3 2014 CD 11.4 6.71
4 2014 PV 19.3 6.56
5 2014 RG 15.8 2.47
6 2014 SC 11.0 1.82
7 2014 SN 19.3 9.78
8 2014 VB 10.7 3.13
9 2015 AC 12.7 5.14
10 2015 AG 18.1 8.24
11 2015 CD 13.3 7.38
12 2015 PV 24.7 10.6
13 2015 RG 17.4 8.35
14 2015 SC 19.8 6.58
15 2015 SN 29.5 11.7
16 2015 VB 12.9 1.08
It is sometimes useful to rearrange the result of a query to inspect the values. For instance, we can sort on min_photo
to put the less photosynthetic first:
phys_dates %>%
group_by(Year, Species) %>%
summarize(mean_photo = mean(Photo),
min_photo = min(Photo)) %>%
arrange(min_photo)
# A tibble: 16 x 4
# Groups: Year [2]
Year Species mean_photo min_photo
<int> <chr> <dbl> <dbl>
1 2015 VB 12.9 1.08
2 2014 SC 11.0 1.82
3 2014 RG 15.8 2.47
4 2014 VB 10.7 3.13
5 2015 AC 12.7 5.14
6 2014 PV 19.3 6.56
7 2015 SC 19.8 6.58
8 2014 AC 12.0 6.60
9 2014 CD 11.4 6.71
10 2015 CD 13.3 7.38
11 2015 AG 18.1 8.24
12 2015 RG 17.4 8.35
13 2014 SN 19.3 9.78
14 2014 AG 18.8 10.3
15 2015 PV 24.7 10.6
16 2015 SN 29.5 11.7
To sort in descending order, we need to add the desc()
function. If we want to sort the results by decreasing order of minimum photosynthesis:
phys_dates %>%
group_by(Year, Species) %>%
summarize(mean_photo = mean(Photo),
min_photo = min(Photo)) %>%
arrange(desc(min_photo))
# A tibble: 16 x 4
# Groups: Year [2]
Year Species mean_photo min_photo
<int> <chr> <dbl> <dbl>
1 2015 SN 29.5 11.7
2 2015 PV 24.7 10.6
3 2014 AG 18.8 10.3
4 2014 SN 19.3 9.78
5 2015 RG 17.4 8.35
6 2015 AG 18.1 8.24
7 2015 CD 13.3 7.38
8 2014 CD 11.4 6.71
9 2014 AC 12.0 6.60
10 2015 SC 19.8 6.58
11 2014 PV 19.3 6.56
12 2015 AC 12.7 5.14
13 2014 VB 10.7 3.13
14 2014 RG 15.8 2.47
15 2014 SC 11.0 1.82
16 2015 VB 12.9 1.08
Challenge 3
The original paper containing the iris dataset distinguished the three types of irises based on the sepal and petal length and width, as well as the ratio of length to width for sepals and petals.
For this challenge, let’s replicate their analysis.
- Calculate the mean sepal length and sepal width for each species
- Calculate the mean petal lengths and petal widths for each species
- Calculate the ratio of Sepal.Length to Sepal.Width
- Calculate the ratio of Petal.Length to Petal.Width
Which of these values does the best job of differentiating the setsosa species from the others?
Solution to Challenge 3
iris %>% group_by(Species) %>% summarize(meanSepal.Length = mean(Sepal.Length), meanSepal.Width = mean(Sepal.Width), Sepal.ratio=mean(Sepal.Length/Sepal.Width), meanPetal.Length=mean(Petal.Length), meanPetal.Width=mean(Petal.Width), Petal.ratio = mean(Petal.Length/Petal.Width) )
# A tibble: 3 x 7 Species meanSepal.Length meanSepal.Width Sepal.ratio meanPetal.Length <chr> <dbl> <dbl> <dbl> <dbl> 1 setosa 5.01 3.42 1.47 1.46 2 versic… 5.94 2.77 2.16 4.26 3 virgin… 6.59 2.97 2.23 5.55 # … with 2 more variables: meanPetal.Width <dbl>, Petal.ratio <dbl>
Petal ratio has the largest difference between setsosa and the other species.
Counting
When working with data, we often want to know the number of observations found for each factor or combination of factors. For this task, dplyr
provides count()
. For example, if we wanted to count the number of rows of data for each species, we would do:
iris %>%
count(Species)
# A tibble: 3 x 2
Species n
<chr> <int>
1 setosa 50
2 versicolor 50
3 virginica 50
The count()
function is shorthand for something we’ve already seen: grouping by a variable, and summarizing it by counting the number of observations in that group. In other words, iris %>% count()
is equivalent to:
iris %>%
group_by(Species) %>%
summarise(count = n())
# A tibble: 3 x 2
Species count
<chr> <int>
1 setosa 50
2 versicolor 50
3 virginica 50
If we wanted to count combination of factors, such as Species
and Year
, we would specify the first and the second factor as the arguments of count()
:
phys_dates %>%
count(Year, Species)
# A tibble: 16 x 3
Year Species n
<int> <chr> <int>
1 2014 AC 18
2 2014 AG 18
3 2014 CD 18
4 2014 PV 18
5 2014 RG 18
6 2014 SC 18
7 2014 SN 15
8 2014 VB 18
9 2015 AC 18
10 2015 AG 18
11 2015 CD 18
12 2015 PV 18
13 2015 RG 18
14 2015 SC 18
15 2015 SN 18
16 2015 VB 15
With the above code, we can proceed with arrange()
to sort the table
according to a number of criteria so that we have a better comparison.
For instance, we might want to arrange the table above in (i) an alphabetical order of
the levels of the species and (ii) in descending order of the count:
phys_dates %>%
count(Year, Species) %>%
arrange(Species, desc(n))
# A tibble: 16 x 3
Year Species n
<int> <chr> <int>
1 2014 AC 18
2 2015 AC 18
3 2014 AG 18
4 2015 AG 18
5 2014 CD 18
6 2015 CD 18
7 2014 PV 18
8 2015 PV 18
9 2014 RG 18
10 2015 RG 18
11 2014 SC 18
12 2015 SC 18
13 2015 SN 18
14 2014 SN 15
15 2014 VB 18
16 2015 VB 15
Challenge 4
How many individuals of each species were measured in each
Year
?Solution to Challenge 4
phys_dates %>% group_by(Year)%>% count(Species)
# A tibble: 16 x 3 # Groups: Year [2] Year Species n <int> <chr> <int> 1 2014 AC 18 2 2014 AG 18 3 2014 CD 18 4 2014 PV 18 5 2014 RG 18 6 2014 SC 18 7 2014 SN 15 8 2014 VB 18 9 2015 AC 18 10 2015 AG 18 11 2015 CD 18 12 2015 PV 18 13 2015 RG 18 14 2015 SC 18 15 2015 SN 18 16 2015 VB 15
Challenge 5
Use
group_by()
andsummarize()
to find the mean, min, and max photosynthesis level for each species. Also add the number of observations (hint: see?n
).Solution to Challenge 5
phys_dates %>% group_by(Species) %>% summarize( mean_photo = mean(Photo), max_photo = max(Photo), min_photo = min(Photo), n = n() )
# A tibble: 8 x 5 Species mean_photo max_photo min_photo n <chr> <dbl> <dbl> <dbl> <int> 1 AC 12.4 22.0 5.14 36 2 AG 18.4 29.9 8.24 36 3 CD 12.4 18.5 6.71 36 4 PV 22.0 35.5 6.56 36 5 RG 16.6 25.4 2.47 36 6 SC 15.4 32.7 1.82 36 7 SN 24.9 40.6 9.78 33 8 VB 11.7 27.0 1.08 33
Challenge 6
Which plant had the highest photosynthetic rates in each year? Return the columns
Year
,Species
, andPhoto
.Solution to Challenge 7
phys_dates %>% group_by(Year) %>% filter(Photo == max(Photo)) %>% select(Year, Species, Photo) %>% arrange(Year)
# A tibble: 2 x 3 # Groups: Year [2] Year Species Photo <int> <chr> <dbl> 1 2014 SN 30.4 2 2015 SN 40.6
Key Points
The tidyverse is built for data manipulation.
The
read_csv
function creates tibbles instead of data frames.The
select
function picks particular columns based on names.The
filter
function picks rows based on values.The
mutate
function creates new columns based on the value of other columns.The
group_by
andsummarize
functions can be used to create summary tables.The
write_csv
function exports tibbles into a .csv file.