In this post, we will learn how to perform data manipulation with dplyr package.

We will learn to use mutate, filter, arrange and summarize verbs in dplyr.

mutate

mutate does the opposite of select. It reveals info that your dataset already contains.

If you’ve got a dataset that includes length, width and height, then we could use mutate to figure out the volume, because volume = length * breadth * height.

mutate builds a variable from data that is currently present in the dataset.

Syntax of mutate

mutate(df (or) tbl, <new variable> = <some operation using existing variables>)

As with select, mutate doesn’t modify the existing dataframe/tbl.

It just creates its copy of the existing dataframe with the changes.mutate() creates new columns which are added to a copy of the dataset.

Lets us add a new variable ActualGroundTime to a copy of hflights and save the result as g1.

1 |
g1 <- mutate(hflights, ActualGroundTime = ActualElapsedTime - AirTime ) |

Upto now, you’ve added variables to hflights one at a time, but you can also use mutate to add multiple variables at the same time.

To create more than one variable, place a comma between each variable that you specify inside mutate.

mutate() even allows you to use a new variable while creating a next variable in the same call. In this example, the new variable x is directly reused to create the new variable y:

1 |
mutate(my_df, x = a + b, y = x + c) |

filter filters out the rows in a dataset based on their contents.

You can think of filter as row equivalent of select.

filter generates a new version of a dataset that only includes a few rows.

Syntax of filter

filter(df (or) tbl,<one or more logical tests>)

filter will return all of the rows that pass the logical test.

As with other dplyr verbs, filter returns a new copy of the dataset.

To use filter, you need to know how to use logical operators.

R comes with a set of logical operators that you can use inside filter.

x < y, TRUE if x is *less than* y

x <= y, TRUE if x is *less than or equal to* y

x == y, TRUE if x *equals* y

x != y, TRUE if x *does not equal* y

x >= y, TRUE if x *is greater than or equal to* y

x > y, TRUE if x is *greater than* y

x %in% c(a, b, c), TRUE if x is in the vector c(a, b, c)

1 |
filter(hflights, Distance > 3900) |

1 |
filter(hflights, UniqueCarrier %in% c("AA","XE", "B6")) |

R also comes with a set of boolean operators that you can use to combine multiple logical tests into a single test. These include & (and), | (or), and !(not).

Instead of using the & operator, you can also pass several logical tests to filter, separated by commas.

The following two calls are completely equivalent:

1 |
filter(df, a > 0 & b > 0) |

1 |
filter(df, a > 0, b > 0) |

Next, is.na() will also come in handy.

This example keeps the observations in df for which the variable x is *not* NA:

1 |
filter(df, !is.na(x)) |

1 |
filter(hflights, DepTime < 500 | ArrTime > 2200) |

Using a combination of filter, mutate and select, let us generate a new dataset from the hflights dataset that contains some useful information on flights that had JFK airport as their destination.

- First, use filter() to select the flights that had JFK as their destination and save this result to c1.
- Second, add a new column named Date to c1:
- paste()together the Year, Month and DayofMonth variables, separate them by a “-” by using the sep attribute of paste(). Save the resulting data frame as c2.
- Finally, select some columns to provide an overview: Date, DepTime, ArrTime and TailNum, in this order. Do not assign the resulting data frame to a variable; just print it to the console.

With select(), mutate() and filter(), you can already reveal interesting information from a dataset. Through a combination of these expressions or by the use of a one-liner, try to answer the following question:

*How many weekend flights flew a distance of more than 1000 miles but had a total taxiing time below 15 minutes?*

1 |
filter(hflights,Distance>1000&(DayOfWeek == 7|DayOfWeek == 6)&(TaxiIn + TaxiOut < 15) ) |

**arrange**

arrange reorders the rows of a dataset based on their content.

Syntax of arrange

arrange(df/tbl, <columnName to arrange by>)

By default, it will arrange the given dataset in ascending order.

If multiple rows have same value for the given column, you can try to arrange by adding one more column.

arrange() can be used to rearrange rows according to any type of data.

If you pass arrange() a character variable, for example, R will rearrange the rows in alphabetical order according to values of the variable.

If you pass a factor variable, R will rearrange the rows according to the order of the levels in your factor (running levels() on the variable reveals this order).

Let’s write write some arrange() expressions to display its contents appropriately.

Arrange hflights, by departure delays so that the shortest departure delay is at the top of the data set.

Arrange hflights so that flights by the same carrier appear next to each other. Within each carrier, flights that have smaller departure delays appear before flights that have higher departure delays.

By default, arrange() arranges the rows from smallest to largest. Rows with the smallest value of the variable will appear at the top of the data set.

You can reverse this behavior with the desc() function. arrange() will reorder the rows from largest to smallest values of a variable if you wrap the variable name in desc() before passing it to arrange().

Arrange hflights so that flights by the same carrier appear next to each other and within each carrier, flights that have larger departure delays appear before flights that have smaller departure delays.

Arrange the flights in hflights by their total delay (the sum of DepDelay and ArrDelay). Try to do this directly inside arrange().

1 |
arrange(hflights,DepDelay + ArrDelay) |

summarize

summarize uses your data to create a new dataset containing summary statistics.

You can decide which statistics to include.

Syntax of summarize

summarize(df/tbl, new column name = expression)

Below code will take a dataframe, and calculate sum of column A, mean of column B, and variance of column B

summarize(df/tbl, sum = sum(A),avg = mean(B), var= var(B))

You can use any functions to calculate summary statistics, but they should obey one rule.They should take a vector as input and return single number as output. These type of functions can be referred to as aggregate functions.

summarize(), the last of the 5 verbs in dplyr, follows the same syntax as mutate(), but the resulting dataset consists of a single row instead of an entire new column in the case of mutate().

In contrast to the four other data manipulation functions, summarize() does not return an altered copy of the dataset it is summarizing; instead, it builds a new dataset that contains only the summarizing statistics.

Use summarize() to print out a summary of hflights containing two variables: min_dist, the shortest distance flown, and max_dist, the longest distance flown.

1 |
summarize(hflights, min_dist=min(Distance),max_dist=max(Distance)) |

Print out a summary of hflights with a single variable, max_div: the longest Distance for diverted flights.

1 2 3 |
> summarize(filter(hflights,Diverted == 1),max_div = max(Distance)) max_div 1 3904 |

You can use any function you like in summarize() so long as the function can take a vector of data and return a single number. R contains many aggregating functions, as dplyr calls them:

- min(x) – minimum value of vector x.
- max(x) – maximum value of vector x.
- mean(x) – mean value of vector x.
- median(x) – median value of vector x.
- quantile(x, p) – pth quantile of vector x.
- sd(x) – standard deviation of vector x.
- var(x) – variance of vector x.
- IQR(x) – Inter Quartile Range (IQR) of vector x.
- diff(range(x)) – total range of vector x.

Let us remove rows that have NAs in the arrival delay column and save the resulting dataset to temp1.

1 |
temp1 <- filter(hflights, !is.na(ArrDelay)==1) |

Print out a summary of temp1 with the following variables (in this order):

earliest: the minimum arrival delay,

average: the average arrival delay,

latest: the longest arrival delay,

sd: the standard deviation for arrival delays.

1 2 3 4 |
> summarize(temp1,earliest = min(ArrDelay),average = mean(ArrDelay),latest = max(ArrDelay), sd = sd(ArrDelay)) earliest average latest sd 1 -70 7.094334 978 30.70852 |

Filter hflights such that only rows that have no NA TaxiIn and no NA TaxiOut are kept; save this temporary result to temp2.

Print out a summary of temp2, with one variable, max_taxi_diff: the biggest absolute difference in time between TaxiIn and TaxiOut for a single flight.

1 2 3 4 |
> temp2 <- filter(hflights, !is.na(TaxiIn) == 1, !is.na(TaxiOut) == 1) > summarize(temp2, max_taxi_diff = max(abs(TaxiIn - TaxiOut))) max_taxi_diff 1 160 |