Data manipulation with dplyr part 2

In this post, we will discuss advanced data manipulation concepts with dplyr. First, we will discuss about aggregate functions in dplyr, and then discuss about using pipes for complex data manipulation.

dplyr aggregate functions

dplyr provides several helpful aggregate functions of its own, in addition to the ones that are already defined in R. These include:

first(x) – The first element of vector x.
last(x) – The last element of vector x.
nth(x, n) – The nth element of vector x.
n() – The number of rows in the data.frame or group of observations that summarize() describes.
n_distinct(x) – The number of unique values in vector x.

Next to these dplyr-specific functions, you can also turn a logical test into an aggregating function with sum() or mean().

A logical test returns a vector of TRUE’s and FALSE’s. When you apply sum() or mean() to such a vector, R coerces each TRUE to a 1 and each FALSE to a 0.

sum() then represents the total number of observations that passed the test; mean() represents the proportion.

Let’s print out a summary of hflights with the following variables:
n_obs: the total number of observations,
n_carrier: the total number of carriers,
n_dest: the total number of destinations,

summarize in dplyr with aggregate functions

Let’s print out a summary of all flights flown by American Airlines, with the following variables:
n_flights: the total number of flights (each observation is a flight),
n_canc: the total number of cancelled flights,
avg_delay: the average arrival delay of flights whose delay is not NA (na.rm = TRUE).

how to use pipes in dplyr

We have used multiple dplyr verbs in conjunction till now.But linking the verbs together can make the code verbose. You need to link the functions together by saving a new object after applying each function.

But this method is not very efficient, because it requires some extra typing to create these objects.

Also, each object takes up  some space in your computer’s memory.

If you have a large dataset, that can slow down your analysis.

You can avoid passing objects between functions by chaining them together , like below

R will evaluate this piece of code from the innermost parantheses, making this code is difficult to read.

dplyr provides a simpler way for functions to link together, using an operator known as the pipe.

The pipe operator looks like this %>% . It comes from magrittr package developed by Stefan Bache.

You dont need to import the magrittr package to work with the pipe operator. dplyr imports it for you.

pipe can be understood as an operator that you place between an object and a function.

pipe takes the object on the left and passes it to the function.

So when you use pipe, you don’t pass the object.

The pipe provides an ideal way to link together dplyr functions, because each function takes a dataframe or a tbl as the first argument.

We can write our code from before like this.

Not only the code is concise, but it is easy to read.

The pipe character may seem unusual at first, but it quickly becomes second nature.

As another example of the %>%, have a look at the following two commands that are completely equivalent:

The %>% operator allows you to extract the first argument of a function from the arguments list and put it in front of it.

Let’s use dplyr functions and the pipe operator to transform the following English sentences into R code:

Take the hflights data set and then …
Add a variable named diff that is the result of subtracting TaxiIn from TaxiOut, and then …
Pick all of the rows whose diff value does not equal NA, and then …
Summarize the data set with a value named avg that is the mean diff value.

As you can see, one can answer sophisticated questions by combining the verbs of dplyr.

Now, we  will examine whether it sometimes makes sense to drive instead of fly. We will begin by making a data set that contains relevant variables.

Then, we will find flights whose equivalent average velocity is lower than the velocity when traveling by car.

In the following code snippet, we will carry out a series of dplyr verbs on the hflights dataset.We will use the %>% operator to chain them all together.

Let’s use mutate() on the hflights dataset and add two variables:

RealTime: the actual elapsed time plus 100 minutes (for the overhead that flying involves) and
mph: calculated as 60 times Distance divided by RealTime, then

filter() to keep observations that have an mph that is not NA and that is below 70, finally

summarize() the result by creating four summary variables:
n_less, the number of observations,
n_dest, the number of destinations,
min_dist, the minimum distance and
max_dist, the maximum distance.

summarize in dplyr with relevant variables

This example suggested that some flights might be less efficient than driving in terms of speed.

But is speed all that matters? Flying imposes burdens on a traveler that driving does not.

For example, airplane tickets are very expensive. Air travellers also need to limit what they bring on their trip and arrange for a pick up or a drop off. Given these burdens we might demand that a flight provide a large speed advantage over driving.

Let’s define preferable flights as flights that are at least 50% faster than driving, i.e. that travel 105 mph or greater in real time. Also, assume that cancelled or diverted flights are less preferable than driving.

The mutate() call from the previous exercise is already coded up; can you add more pipes and verbs to the command?

filter() the result of mutate to:
Keep observations that have an mph under 105 or for which Cancelled equals 1 or for which Diverted equals 1.
summarize() the result by creating four summary variables:
n_non, the number of observations,
n_dest, the number of destinations,
min_dist, the minimum distance and
max_dist, the maximum distance.

 

summarize in dplyr with mutate and filter operations

Please feel free to add your comments to this article.

 

 

Recent Posts

Menu