Intro to the dplyr package in R

This post gives a brief introduction to the dplyr package in R.

dplyr is a popular package for data manipulation, developed by Hadley Wickham and Romain François. It is designed to be fast, highly expressive, and agnostic about how your data is stored.

dplyr evolved from an earlier package called plyr. Whereas plyr covers a varied set of inputs and outputs (e.g., arrays, data frames, lists), dplyr concentrates mainly on data frames and tibbles. You can think of dplyr as a package-level enhancement of the ddply() function from plyr.

dplyr is installed as part of the Tidyverse, a collection of R packages used for data science. The Tidyverse contains other packages as well, such as ggplot2, readr, tidyr and purrr.

Most datasets contain more information than they display. dplyr is a package that can help you access that information.

dplyr introduces a grammar of data manipulation. It has five simple functions that you can use to reveal new variables, new observations, and new ways to describe data. You can also use these functions to subset the data and to do group-wise operations.

dplyr is fast, and its key pieces are written in C++. This means that we get the speed of C++ together with the ease of the R language.

Let’s learn how to use dplyr for common data manipulation tasks.

How to install dplyr

First, let’s install dplyr from the RStudio console.

install dplyr package

You can also install the Tidyverse package in order to use dplyr. As dplyr is part of the Tidyverse, it will be installed automatically.

Let’s now install the hflights package. We shall use its dataset to learn about dplyr.

install hflights package

Let’s load these two packages using the library() function. Once hflights is loaded, it is available as a data frame in your R console. A separate blog post explains data frames in more detail.

load hflights and dplyr packages
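The install and load steps described above can be sketched roughly as follows. This is a minimal sketch, assuming a working CRAN mirror; the requireNamespace() guard is my own addition so that the packages are only installed when they are missing.

```r
# Install each package only if it is not already available.
for (pkg in c("dplyr", "hflights")) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)
  }
}
# Alternative: install.packages("tidyverse") installs dplyr with the rest of the Tidyverse.

# Attach both packages; the hflights data frame becomes available by name.
library(dplyr)
library(hflights)
```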

head() function

Use the head() function to see the first few rows of a data frame. head() comes with base R rather than dplyr, so it works on any data frame.

head function in R
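A minimal sketch of this step, assuming the hflights package is installed; the variable names first_rows and first_three are only there for illustration.

```r
library(hflights)

first_rows  <- head(hflights)      # first six rows (the default)
first_three <- head(hflights, 3)   # ask for a specific number of rows
first_rows
```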

summary() function

The summary() function gives a higher-level overview of a dataset. Like head(), it is part of base R. Try using summary() to find summary statistics for each column.

summary() function in dplyr

The variables are stored in the columns, whereas the observations are stored in the rows of this data set.

If you read through the full output of the summary() function, you can see that the hflights data set is relatively large. It contains 227,496 rows!
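To check this yourself, something along these lines should work (a sketch, assuming hflights is installed; dim() is used here as a quicker way to read off the size than scanning the summary output):

```r
library(hflights)

summary(hflights)            # per-column summary statistics
size <- dim(hflights)        # c(number of rows, number of columns)
size
```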

What is a tbl?

dplyr can help you look at a dataset thoroughly. It provides a new data structure for R, the tbl (pronounced "tibble").

A tbl is just a different type of dataframe.

tbl_df()

To turn the hflights data frame into a tibble (tbl), run tbl_df(hflights) and assign the result back to hflights. In current dplyr releases tbl_df() is deprecated, and as_tibble(hflights) does the same job.

Then, when you print hflights in the R console, it shows only the first rows and the columns that fit, so the output stays neat. It also reports the dimensions of the dataset, plus the name and data type of each column. The best feature is that a tbl sizes its output according to the width of your screen!

generate tbl in R
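A minimal sketch of the conversion, assuming a recent dplyr where as_tibble() is the recommended replacement for tbl_df(). The name flights_tbl is my own choice; the post itself overwrites hflights.

```r
library(dplyr)
library(hflights)

# Convert the plain data frame into a tibble.
flights_tbl <- as_tibble(hflights)

# Printing a tibble shows its dimensions, column types, and only what fits on screen.
flights_tbl
```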

Note that even after you convert a data frame to a tbl, you can still modify the tbl as if it were a data frame.

Functions that work on a data frame generally work on a tbl as well.

glimpse()

Use glimpse() to see the data type and the first few values of each column in your dataset.

Use the class() function to verify that a tbl is still a data frame underneath.
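Both checks can be sketched like this (assuming hflights has been converted with as_tibble(); flights_tbl is an illustrative name):

```r
library(dplyr)
library(hflights)

flights_tbl <- as_tibble(hflights)

glimpse(flights_tbl)               # one line per column: type plus first values
tbl_classes <- class(flights_tbl)  # "data.frame" is among the classes
tbl_classes
```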

The five verbs of dplyr grammar

dplyr does more than just provide data structures. It provides a complete grammar for data manipulation. This grammar is built around five functions that do the basic tasks of data manipulation.

select: returns a subset of the columns, removing the rest.

filter: returns a subset of the rows, removing the rest.

arrange: reorders the rows in a dataset.

mutate: builds new columns from the existing data and adds them to the dataset.

summarize: calculates summary statistics, reducing each group to a single row.

You can even combine these functions and execute them in a chain, one after another.
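As a rough illustration of such a chain, here is a sketch that uses the pipe operator (%>%) to pass the result of one verb to the next. The choice of columns, the 60-minute threshold, and the names late_flights, GainedTime and overview are my own assumptions for the example.

```r
library(dplyr)
library(hflights)

late_flights <- hflights %>%
  select(UniqueCarrier, Dest, DepDelay, ArrDelay) %>%  # keep four columns
  filter(ArrDelay > 60) %>%                            # keep flights over an hour late
  mutate(GainedTime = DepDelay - ArrDelay) %>%         # add a derived column
  arrange(desc(ArrDelay))                              # longest delays first

# Reduce the result to a single summary row.
overview <- summarize(late_flights,
                      n_flights     = n(),
                      avg_arr_delay = mean(ArrDelay))
overview
```

Each step receives the output of the previous one, so the code reads top to bottom in the order the operations happen.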

dplyr select

Let’s see how to use dplyr select. It returns a specific set of columns.

With dplyr select, you can refer to columns by name or by integer index.

Using column names in dplyr select

Pass the tbl name first, followed by the column names, to select().

Here the tbl name is hflights, and the column names are DepTime and ArrTime.

You do not need to put the column names in quotes.

using select in dplyr

dplyr functions recognize variable names as they are. There is no need to use $ as in base R syntax. This is true for each of the functions of the dplyr grammar.
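A minimal sketch of selecting by name (the result name by_name is only for illustration):

```r
library(dplyr)
library(hflights)

# First argument is the dataset, the rest are bare (unquoted) column names.
by_name <- select(hflights, DepTime, ArrTime)
head(by_name)
```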

Using indexes in dplyr select

Another approach is to use integer indexes within dplyr select.

using select with indexes in dplyr
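The same selection by position might look like this. It is a sketch that relies on DepTime and ArrTime being the fifth and sixth columns of hflights.

```r
library(dplyr)
library(hflights)

# Columns 5 and 6 of hflights are DepTime and ArrTime.
by_index <- select(hflights, 5, 6)
names(by_index)
```

Selecting by name is usually more robust, because positions change when columns are added or reordered.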

dplyr select: colon (:) and minus (-)

Use : to select a range of variables and - to exclude some variables.

The : and - operators can be used on both indexes and column names.

In the first code snippet given below, we are selecting the first five variables using indexes.

In the second code snippet, we are selecting the first five variables except for the second one using minus (-) operator.

 using select with index in dplyr
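The two snippets described above can be sketched as follows (result names are illustrative):

```r
library(dplyr)
library(hflights)

first_five      <- select(hflights, 1:5)       # columns 1 to 5
without_second  <- select(hflights, 1:5, -2)   # columns 1 to 5, minus column 2

names(first_five)
names(without_second)
```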

In the code snippet given below, we are selecting the variables from Year till DepTime, except for the DayOfWeek using colon (:) and minus (-) operator.

using select with column names in dplyr
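With column names, the same idea might be written like this (a sketch; year_to_dep is an illustrative name):

```r
library(dplyr)
library(hflights)

# Range from Year to DepTime, then drop DayOfWeek from that range.
year_to_dep <- select(hflights, Year:DepTime, -DayOfWeek)
names(year_to_dep)
```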

dplyr select does not modify the original dataset; it returns a modified copy. This pattern is common to every verb of the dplyr grammar.

If you need the modified copy later, you have to explicitly assign the result of select() to a variable.

saving the selected columns into another tbl
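A short sketch of saving the result, which also shows that the original is left untouched (the name times is my own):

```r
library(dplyr)
library(hflights)

cols_before <- ncol(hflights)

# Store the selected columns in a new object.
times <- select(hflights, DepTime, ArrTime)

ncol(times)      # the copy has two columns
ncol(hflights)   # the original still has all of its columns
```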

dplyr select helper functions

starts_with(): select column names starting with a prefix.

helper function starts_with in dplyr
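For example, a sketch that picks the taxi-related columns (the "Taxi" prefix is my own choice for illustration):

```r
library(dplyr)
library(hflights)

taxi_cols <- select(hflights, starts_with("Taxi"))
names(taxi_cols)
```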

ends_with(): select column names ending with a particular suffix.

helper function ends_with in dplyr
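A sketch along the same lines, using "Delay" as an illustrative suffix:

```r
library(dplyr)
library(hflights)

delay_cols <- select(hflights, ends_with("Delay"))
names(delay_cols)
```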

contains(): select column names that contain a literal string.

In this example, I am using dplyr select and its helper function "contains" to return the columns containing the string "Time" or "Delay":

helper function chaining in select dplyr
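One way to write this, as a sketch, is to pass two contains() calls to select(), which keeps the columns matched by either helper:

```r
library(dplyr)
library(hflights)

time_or_delay <- select(hflights, contains("Time"), contains("Delay"))
names(time_or_delay)
```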

matches(): select column names that match a regular expression.

In this example, I am using dplyr select and its helper function "matches" to return the columns starting with the string "Dep":

dplyr select matches
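A sketch of this selection, where the pattern "^Dep" anchors the match to the start of the column name:

```r
library(dplyr)
library(hflights)

dep_cols <- select(hflights, matches("^Dep"))
names(dep_cols)
```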

num_range(): select column names made of a prefix followed by a number, like col1, col2, col3.

Using num_range dplyr select
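hflights has no numbered columns, so this sketch uses a small made-up tibble called scores (entirely hypothetical data) to show how num_range() behaves:

```r
library(dplyr)

# Hypothetical data with numbered column names.
scores <- tibble(id   = 1:2,
                 col1 = c(10, 20),
                 col2 = c(30, 40),
                 col3 = c(50, 60))

# Prefix "col" combined with the numbers 1 and 2 gives col1 and col2.
first_two <- select(scores, num_range("col", 1:2))
names(first_two)
```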

one_of(): select the variables named in a character vector.

everything(): select all variables.
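A sketch of these last two helpers follows. The vector wanted and the choice of columns are my own. Note that in recent releases one_of() is superseded by all_of() and any_of(), although it still works.

```r
library(dplyr)
library(hflights)

# Column names stored as strings in a character vector.
wanted <- c("Dest", "Distance")
picked <- select(hflights, one_of(wanted))

# everything() is handy for reordering: Dest first, then all remaining columns.
dest_first <- select(hflights, Dest, everything())

names(picked)
names(dest_first)[1:3]
```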

With these dplyr select helper functions, your R code becomes much more concise.

To see the added value of the dplyr package, it is useful to compare its syntax with base R.

Both of the lines below return the same columns.

comparison of r syntax with dplyr syntax
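The original comparison is not reproduced here, so the following is only a sketch of what such a pair of lines could look like, using DepTime and ArrTime as an assumed example:

```r
library(dplyr)
library(hflights)

# Base R: quoted names inside square brackets.
base_cols <- hflights[, c("DepTime", "ArrTime")]

# dplyr: bare column names passed to select().
dplyr_cols <- select(hflights, DepTime, ArrTime)

names(base_cols)
names(dplyr_cols)
```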

The elegance and ease of use of the dplyr version is a great plus.

How to create a lookup table in dplyr

In R, a lookup table can be used to convert short alphabetical codes into more meaningful strings.

Let’s work with a lookup table that comes in the form of a named vector.

When you subset the lookup table with a character vector, R returns the values of the lookup table whose names match the elements of that vector.

To see how this works, run the following code in the console:
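The original snippet is not reproduced here, so this is a minimal sketch of the idea. The names lut and codes are hypothetical, and the lookup table only covers three carrier codes for illustration.

```r
# Named vector: the names are the codes, the values are the readable labels.
lut <- c("AA" = "American",
         "AS" = "Alaska",
         "B6" = "JetBlue")

# Subsetting by name returns the matching labels, in the order requested.
codes  <- c("AS", "AA", "AS")
labels <- lut[codes]
labels
```

The same subsetting works with a whole column, for example lut[hflights$UniqueCarrier], where any code missing from the lookup table comes back as NA.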

You can also use the base R tbl$columnName syntax to save a single column of a tbl as its own object, for example an object named arrtime.

What are variables and observations in dplyr?

In a tibble, we call the columns "variables" and the rows "observations".

select and mutate manipulate the variables in your dataset.

filter and arrange manipulate the observations. summarize manipulates groups of observations.

You can examine the order of the variables in hflights with names(hflights) in the console.

I hope this introduction to dplyr has been useful to you. Feel free to add your comments on this post.

In my data science course, we will learn how to manipulate data using dplyr, and how to use its tbl structure and its pipe operator, two features that save a lot of time.

We will also learn to use dplyr to access data stored in a database. As you can see, dplyr makes working with data in R faster and more convenient.

If you have any questions on dplyr, feel free to ask in the comments section below.
