This post is a brief introduction to the dplyr package in R.
dplyr is a popular package for data manipulation. It has been developed by Hadley Wickham and Romain François. It is designed to be fast, highly expressive, and agnostic about how your data is stored.
dplyr has evolved from a previous package called plyr. Whereas plyr covers a varied group of inputs and outputs (e.g., arrays, dataframes, lists), dplyr concentrates mainly on dataframes and tibbles. dplyr can be thought of as a package-level successor to the ddply() function from plyr.
dplyr is installed as part of the Tidyverse collection of R packages used for data science. The Tidyverse contains other packages as well, like ggplot2, readr, tidyr, purrr etc.
Most datasets contain more information than they display. dplyr is a package that can help you access that information.
dplyr introduces a grammar of data manipulation. It has 5 simple functions that you can use to reveal new variables, new observations, and new ways to describe data. You can also use these functions to make a subset of the data and to do group-wise operations.
dplyr is fast because its key pieces are written in C++. This means that we get the speed of C++ together with the ease of the R language.
Let’s learn how to use dplyr for common data-related tasks.
How to install dplyr?
First, let’s install dplyr from your RStudio console.
install.packages("dplyr")
You can also install the tidyverse package in order to use dplyr. As dplyr is part of the Tidyverse, it gets installed automatically.
install.packages("tidyverse")
Let’s now install the hflights package, for example with install.packages("hflights"). We shall use this dataset to learn about dplyr.
Let’s load these two packages using the library() function. Once hflights is loaded, it is available as a dataframe in your R console. Read this blog post to learn more about dataframes.
library(dplyr)
library(hflights)
head() function
head() is a base R function rather than a dplyr one, but it works well alongside dplyr. Use it to see the first few rows of a dataframe.
head(hflights)
summary() function
summary() is also a base R function. It gives a higher-level view of a dataset: try it to get summary statistics for every column.
summary(hflights)
The variables are stored in the columns, whereas the observations are stored in the rows of this data set.
If you go through the full output of the summary() function, you can see that the hflights dataset is relatively large. It contains 227496 rows!
What is a tbl?
dplyr can help you look at a dataset thoroughly. It provides a new data structure for R, the tbl (pronounced "tibble").
A tbl is just a special type of dataframe.
tbl_df()
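To turn the hflights dataframe into a tibble (tbl), older versions of dplyr use tbl_df(hflights). In recent dplyr releases tbl_df() is deprecated, and as_tibble(hflights) is the recommended replacement.
hflights <- as_tibble(hflights)  # tbl_df(hflights) in older dplyr versions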
When we then print hflights in the R console, it shows only the first few rows and the columns that fit on screen, so the output stays neat. It also reports the dimensions of the dataset and the name and data type of each column. The best feature is that a tbl sizes its output according to the size of your screen!
Note that even after you convert a dataframe to a tbl, you can still modify the tbl as if it were a dataframe.
Whatever functions you can use on a dataframe can be used on a tbl as well.
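To make this concrete, here is a minimal sketch. It uses a tiny made-up dataframe named flights_df (an assumption for illustration, not the real hflights data) so that it runs without any extra downloads.

```r
library(dplyr)

# Toy stand-in for hflights (made-up values, not the real dataset)
flights_df <- data.frame(
  DepTime = c(1400, 1401, 1352),
  ArrTime = c(1500, 1501, 1502),
  Dest    = c("DFW", "DFW", "MIA")
)

# Convert the dataframe to a tibble
flights_tbl <- as_tibble(flights_df)

# The tibble still counts as a dataframe ...
is.data.frame(flights_tbl)

# ... so base R functions written for dataframes keep working
n_rows       <- nrow(flights_tbl)
mean_arrival <- mean(flights_tbl$ArrTime)
dfw_only     <- flights_tbl[flights_tbl$Dest == "DFW", ]
```

Here nrow(), mean() and the square-bracket subsetting are all plain base R, applied to the tibble without any conversion back.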
glimpse()
We can use glimpse() to see the datatypes and initial values for each column in your dataset.
glimpse(hflights)
Use the class function to verify that a tbl is internally a dataframe.
> class(hflights)
[1] "tbl_df" "tbl" "data.frame"
The five verbs of dplyr grammar
dplyr does more than just provide data structures. It provides a complete grammar for data manipulation. This grammar is built around five functions that do the basic tasks of data manipulation.
select – removes columns, and returns a subset of columns.
filter – removes rows, and returns a subset of rows.
arrange – reorders rows in a dataset.
mutate – uses the dataset to build new columns from existing data.
summarize – calculates summary statistics, and reduces each group to a single row.
You can even combine these functions and execute them in a chain, one after another.
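As a rough sketch of such a chain, the example below runs all five verbs one after another with the pipe operator %>%. The flights dataframe is made up for illustration (its column names follow hflights, its values do not), and group_by() is an extra dplyr function used here so that summarize() has groups to work on.

```r
library(dplyr)

# Made-up sample in the shape of hflights (values are illustrative only)
flights <- data.frame(
  UniqueCarrier = c("AA", "AA", "WN", "WN", "WN"),
  Distance      = c(224, 964, 239, 239, 650),
  AirTime       = c(40, 130, 45, 50, 100),
  ArrDelay      = c(-10, 5, 12, 0, 30)
)

result <- flights %>%
  select(UniqueCarrier, Distance, AirTime, ArrDelay) %>%  # keep four columns
  filter(Distance < 700) %>%                              # drop the one long flight
  mutate(Speed = Distance / AirTime * 60) %>%             # new column: distance per hour
  arrange(desc(ArrDelay)) %>%                             # most delayed flights first
  group_by(UniqueCarrier) %>%                             # one group per carrier
  summarize(n_flights = n(), mean_delay = mean(ArrDelay)) # one row per group

result
```

Each step receives the output of the previous one, so the chain reads top to bottom in the order the work is done.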
dplyr select
Let’s see how to use dplyr select. It is used to return a specific group of columns.
With dplyr select, you can use column names or integer indexes.
Using column names in dplyr select
Pass the tbl name followed by the column names to select.
Here the tbl name is hflights, and the column names are DepTime and ArrTime.
You do not need quotes around the column names.
select(hflights, DepTime, ArrTime)

dplyr functions recognize variable names as they are. There is no need to use $ as in base R syntax. This is true for each of the functions of the dplyr grammar.
Using indexes in dplyr select
Another approach is to use integer indexes within dplyr select.
select(hflights, 5, 6)

dplyr select – colon (:) and minus (-)
Use : to select a range of variables and - to exclude some variables.
The : and - operators can be used on indexes and on column names.
In the first code snippet given below, we are selecting the first five variables using indexes.
select(hflights, 1:5)
In the second code snippet, we are selecting the first five variables except for the second one, using the minus (-) operator.
select(hflights, 1:5, -2)
In the code snippet given below, we are selecting the variables from Year to DepTime, except for DayOfWeek, using the colon (:) and minus (-) operators.
select(hflights, Year:DepTime, -DayOfWeek)
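To check what these selections return without loading the full dataset, here is a small sketch. The toy dataframe is an assumption for illustration: it only borrows the first six column names of hflights, with made-up values.

```r
library(dplyr)

# Toy dataframe with the first six hflights column names (made-up values)
toy <- data.frame(
  Year = 2011, Month = 1, DayofMonth = 1:2, DayOfWeek = 6:7,
  DepTime = c(1400, 1401), ArrTime = c(1500, 1501)
)

by_index   <- names(select(toy, 1:5))                       # first five columns
minus_idx  <- names(select(toy, 1:5, -2))                   # first five, without Month
by_name    <- names(select(toy, Year:DepTime, -DayOfWeek))  # name range, one excluded

by_index
minus_idx
by_name
```

Wrapping each call in names() is just a compact way to see which columns survive.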

dplyr select does not modify the original dataset; it returns a modified copy.
If you need the modified copy later, you have to explicitly assign the result of select() to a variable.
This is a pattern common to each verb of the dplyr grammar.
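A minimal sketch of this behaviour, again on a made-up toy dataframe (the names toy and times are only for illustration):

```r
library(dplyr)

# Made-up data with three columns
toy <- data.frame(
  DepTime = c(1400, 1401),
  ArrTime = c(1500, 1501),
  Dest    = c("DFW", "MIA")
)

# The result is printed and then thrown away; toy itself is untouched
select(toy, DepTime, ArrTime)
cols_before <- ncol(toy)

# Assign the result to keep the modified copy
times <- select(toy, DepTime, ArrTime)
cols_after_original <- ncol(toy)
cols_in_copy        <- ncol(times)
```

After running this, toy still has all three columns, while times holds the two selected ones.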
dplyr select helper functions
starts_with(): select column names starting with a prefix.
ends_with(): select column names ending with a particular suffix
contains(): select column names that contain a literal string.
In this example, I am using dplyr select and its helper function “contains” to return the columns whose names contain the string “Time” or “Delay”:
select(hflights, contains("Time"), contains("Delay"))
matches(): select column names with matching regular expressions.
In this example, I am using dplyr select and its helper function “matches” to return the columns starting with the string “Dep”:
select(hflights, matches("^(Dep)"))
num_range(): used to select column names containing numbers, like col1, col2, col3.
hflights2 <- hflights
colnames(hflights2) <- sprintf("col%d", 1:21)
select(hflights2, num_range("col", 1:5))

one_of(): selects the variables named in a character vector. In current versions of dplyr it is superseded by all_of() and any_of().
everything(): all variables.
Hence, by using these dplyr select helper functions , your R code will become very concise.
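The helpers that have no example above can be tried out with the sketch below. It assumes a one-row toy dataframe that reuses a few hflights column names with made-up values.

```r
library(dplyr)

# One-row toy dataframe reusing some hflights column names (made-up values)
toy <- data.frame(
  DepTime = 1400, ArrTime = 1500, DepDelay = 0,
  ArrDelay = -10, TaxiIn = 7, Dest = "DFW"
)

dep_cols   <- names(select(toy, starts_with("Dep")))   # names beginning with "Dep"
delay_cols <- names(select(toy, ends_with("Delay")))   # names ending with "Delay"

wanted     <- c("Dest", "TaxiIn")                       # column names held in a character vector
named_cols <- names(select(toy, one_of(wanted)))

dest_first <- names(select(toy, Dest, everything()))    # move Dest to the front, keep the rest
```

The last line shows a common use of everything(): reordering columns without typing all of their names.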
To see the added value of the dplyr package, it is useful to compare its syntax with base R.
Both approaches can return the same columns.
But the elegance and ease of use of dplyr is a great plus.
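One possible comparison is sketched here. It is not the original example from this post, just an illustration on a made-up toy dataframe: both lines pick the columns whose names contain “Delay”.

```r
library(dplyr)

# Made-up data with a few hflights-style column names
toy <- data.frame(
  DepTime = 1400, ArrTime = 1500, DepDelay = 0, ArrDelay = -10, Dest = "DFW"
)

# Base R: build a logical vector from the names, then subset the columns
base_cols <- toy[, grepl("Delay", names(toy)), drop = FALSE]

# dplyr: say what you want directly
dplyr_cols <- select(toy, contains("Delay"))

same_columns <- identical(names(base_cols), names(dplyr_cols))
```

The base R line needs names(), grepl() and drop = FALSE to be safe, while the dplyr line reads almost like a sentence.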
How to create a lookup table in dplyr
In R, a lookup table is used to convert alphabetical codes into more meaningful strings.
Let’s work with a lookup table that comes in the form of a named vector.
When you subset the lookup table with a character vector, R returns the values of the lookup table that correspond to the names in that vector.
To see how this works, run the following code in the console:
countries <- c("IN","AU")
lookup_tbl <- c("IN" = "India","AU"="Australia")
lookup_tbl[countries]
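The same trick can be applied to a column of a dataframe. The sketch below is an illustration under assumptions: the carriers dataframe and the two airline names in carrier_lookup are made up for the example, and mutate() is used to add the decoded column.

```r
library(dplyr)

# Made-up data: a column of two-letter carrier codes
carriers <- data.frame(UniqueCarrier = c("AA", "WN", "AA"))

# Named vector used as a lookup table (names are the codes)
carrier_lookup <- c("AA" = "American", "WN" = "Southwest")

# Subset the lookup table with the code column; unname() drops the codes
# that would otherwise stick to the result as element names
carriers <- mutate(carriers, Carrier = unname(carrier_lookup[UniqueCarrier]))

carriers
```

Each code in UniqueCarrier is replaced by its longer name in the new Carrier column, while the original column stays in place.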
You can also use tbl$columnName syntax to save the column of a tbl as an object named arrtime, using R syntax.
> arrtime <- hflights$ArrTime
What are variables and observations in dplyr?
In a tibble, we call columns “variables” and rows “observations”.
select and mutate manipulate the variables in your dataset.
filter and arrange manipulate the observations. summarize manipulates groups of observations.
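One way to see this split is to look at how each verb changes the dimensions of a dataset. The sketch below assumes a made-up four-row dataframe; group_by() is an additional dplyr function used to form the groups for summarize().

```r
library(dplyr)

# Made-up data: 4 observations (rows), 3 variables (columns)
toy <- data.frame(
  Dest     = c("DFW", "MIA", "DFW", "AUS"),
  Distance = c(224, 964, 224, 140),
  ArrDelay = c(5, -3, 20, 0)
)

# Verbs that work on variables change the number of columns
d_select <- dim(select(toy, Dest, Distance))      # fewer columns, same rows
d_mutate <- dim(mutate(toy, Late = ArrDelay > 0)) # one more column, same rows

# Verbs that work on observations leave the columns alone
d_filter  <- dim(filter(toy, Distance > 200))     # fewer rows
d_arrange <- dim(arrange(toy, ArrDelay))          # same rows, new order

# summarize collapses each group of observations into one row
d_summary <- dim(summarize(group_by(toy, Dest), mean_delay = mean(ArrDelay)))
```

dim() returns the number of rows followed by the number of columns, so comparing the five results shows which side of the table each verb touches.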
You can examine the order of the variables in hflights with names(hflights) in the console.
> names(hflights)
 [1] "Year"              "Month"             "DayofMonth"        "DayOfWeek"
 [5] "DepTime"           "ArrTime"           "UniqueCarrier"     "FlightNum"
 [9] "TailNum"           "ActualElapsedTime" "AirTime"           "ArrDelay"
[13] "DepDelay"          "Origin"            "Dest"              "Distance"
[17] "TaxiIn"            "TaxiOut"           "Cancelled"         "CancellationCode"
[21] "Diverted"
In my data science course, we will learn how to manipulate data using dplyr, and how to use the dplyr tbl structure and its pipe operator, two features that save a lot of time.
We will also learn to use dplyr to access data stored in a database. As you can see, with dplyr, R has become faster, bigger and better.
If you have any questions on dplyr, feel free to ask in the comments section below.