Data manipulation with dplyr part 3

In this post, we will discuss about usage of group_by with dplyr. We will solve some complex coding  challenges with dplyr. Post that, we will learn how  to work with external data present in databases using dplyr.

Get group-wise insights using group_by

summarize is the only verb that creates a new dataset , with new observations. The value in the new row are all observations on your dataset.

If we can make observations on entire dataset, then you can make observations on groups of rows as well within our dataset.

You can do this within dplyr using group_by

group_by groups the values within a dataframe or tbl by the values of the variable in the dataset.

Syntax of group_by:
In the pipe syntax, the operation will look like below.
You can include more than one column in group by.

group_by() lets you define groups within your data set.

Its influence becomes clear when calling summarize() on a grouped dataset.

Summarizing statistics are calculated for the different groups separately.

Let us create an ordered per-carrier summary of hflights by combining group_by(), summarize() and arrange().

We will use group_by() to group hflights by UniqueCarrier, and then summarize() the grouped tbl with two summary variables:

p_canc, the percentage of cancelled flights.

avg_delay, the average arrival delay of flights whose delay does not equal NA.

Finally, we shall order the carriers in the summary from low to high by their average arrival delay.

We will use percentage of flights cancelled to break any ties.

 

using group_by, summarize and arrange with dplyr

Combine group_by with mutate

You can also combine group_by() with mutate().

When you mutate grouped data, mutate() will calculate the new variables independently for each group.

This is particularly useful when mutate() uses the rank() function, that calculates within-group rankings.

rank() takes a group of values and calculates the rank of each value within the group, e.g.

As with arrange(), rank() ranks values from the smallest to the largest.

Let’s filter the hflights tbl to only keep observations for which ArrDelay is not NA and positive.

We will use group_by() on the result to group by UniqueCarrier.

Next, we will use summarize() to calculate the average ArrDelay per carrier.

Let’s call this summary variable avg. Then, we will feed the result into a mutate() call.

During this process, we create a new variable, rank, calculated as rank(avg).

Finally, we will arrange by this new rank variable.

summarize in dplyr with filter, group_by, summarize, mutate and rank

By now you’ve learned the fundamentals of dplyr: the five data manipulation verbs and the additional group_by() function to discover interesting group-wise statistics.

Let’s review the concepts you have learned about till now !

Challenge 1

We will find out how many airplanes flew to only one destination.For simplicity, you can include cancelled flights in your answers, so you shouldn’t filter based on the Cancelled column.

The solution tbl  needs to have a single column, named nplanes and a single row.

The solution for this challenge is the code mentioned in the snippet below:

Challenge 2

We will now find out the most visited destination for each carrier.

The solution tbl you print out should contain four columns: UniqueCarrier, Dest, n – how often a carrier visited a particular destination, and rank, how each destination ranks per carrier.

rank should be 1 for every row, as you want to find the most visited destination for each carrier.

The solution for this challenge is  the code mentioned in the snippet below:-

dplyr challenge 2 solution

How to use dplyr to work with data.table

dplyr is designed to work with any type of tabular data in R.Usually tabular data comes in a dataframe. But sometimes it can come in a different form, like data.table or a database.

You can use the same five verbs to manipulate any of these structures : select, mutate, filter, arrange, group_by.

You can create a tbl version of data.table or database using tbl_dt or tbl.

Let’s install the package data.table and then load hflights as a data.table.

 

installing data.table package in R

Now, hflights2 is a copy of hflights that is saved as a data table.

hflights2 contains all of the same information as hflights, but the information is stored in a different data structure. You can see this structure by typing hflights2 at the command line.

Let’s find the number of unique carriers using hflights2 , which is pre-loaded as a data.table

How to use dplyr to work with databases

Now we will discover how easy it is to manipulate data present in databases with dplyr.

Let’s suppose we have a postgres database with data about flights that departed from New York City in 2013. The data is similar to the data in hflights, but it does not contain information about cancellations or diversions. With the tbl() function, we can create a reference to a table present in this database.

Try to understand the code that creates nycflights, a reference to a database table.

 

Although nycflights is a reference to data that lives outside of R, you can use the dplyr commands on them as usual. Behind the scenes, dplyr will convert the commands to the database’s native language (in this case, postgres), and return the results.

This allows you to pull data that is too large to fit in R: only the fraction of the data that you need will actually be downloaded into R, which will usually fit into R without memory issues.

Use glimpse() to check out nycflights. Although nycflights is a reference to a tbl in a remote database, there is no difference in syntax. Look carefully: the variable names in nycflights differ from the ones in hflights!

 

Group nycflights data by carrier, then summarize() with two variables: n_flights, the number of flights flown by each carrier and avg_delay, the average arrival delay of flights flown by each carrier. Finally, arrange the carriers by average delay from low to high.

As you can see, dplyr can connect to a database and load the data in R. As a result, you can work with much larger datasets than it would have otherwise been possible using R alone.

This is the last post in my series of posts on the dplyr package. Hopefully it would have been useful to you.Feel free to comment on this article in the section below.

Recent Posts

Menu