Data manipulation with dplyr

In this post, we will learn how to perform data manipulation with dplyr package.We will learn to use mutate, filter, arrange and summarize verbs in dplyr.

mutate

mutate does the opposite of select. It reveals info that your dataset already contains.

If you’ve got a dataset that includes length, width and height, then we could use mutate to figure out the volume, because volume = length * breadth * height.

mutate builds a variable from data that is currently present in the dataset.

Syntax of mutate

mutate(df (or) tbl, <new variable> = <some operation using existing variables>)

As with select, mutate doesn’t modify the existing dataframe/tbl.

It just creates its copy of the existing dataframe with the changes.mutate() creates new columns which are added to a copy of the dataset.

Lets us add a new variable ActualGroundTime to a copy of hflights and save the result as g1.

mutate in dplyr

copy of tbl after mutate

Upto now, you’ve added variables to hflights one at a time, but you can also use mutate to add multiple variables at the same time.

To create more than one variable, place a comma between each variable that you specify inside mutate.

mutate() even allows you to use a new variable while creating a next variable in the same call. In this example, the new variable x is directly reused to create the new variable y:

filter

filter filters out the rows in a dataset based on their contents.

You can think of filter as row equivalent of select.

filter generates a new version of a dataset that only includes a few rows.

Syntax of filter

filter(df (or) tbl,<one or more logical tests>)

filter will return all of the rows that pass the logical test.

using filter in tbl

As with other dplyr verbs, filter returns a new copy of the dataset.

To use filter, you need to know how to use  logical operators.

R comes with a set of logical operators that you can use inside filter.

x < y, TRUE if x is less than y

x <= y, TRUE if x is less than or equal to y

x == y, TRUE if x equals y

x != y, TRUE if x does not equal y

x >= y, TRUE if x is greater than or equal to y

x > y, TRUE if x is greater than y

x %in% c(a, b, c), TRUE if x is in the vector c(a, b, c)

using filter on tbl

example of filter on tbl

R also includes a group of boolean operators which you could use to unite several logical evaluations to one evaluation. These boolean operators are | (or),  & (and) ,  !(not ).

As opposed to working with the & operator, we might even pass a few logical conditions to filter, with each condition being separated by commas.

The next two calls to filter are entirely same:

Another really helpful function is is.na(). It is used to check whether a particular variable is NA or not.

The below code retains those observations in the dataframe where the variable x isn’t NA.

Using R’s boolean and logical  operators, let’s select only the observations in which a flight left before 5:00 am (500) or arrived after 10:00 pm (2200).

filter example using logical and boolean operator

With a combination of filter, mutate and select , let’s examine the hflights dataset for insights on flights which had the destination airport as JFK.

  1. Firstly, we shall use filter to choose the flights which had  destination as JFK and save its output to c1.
  2. Secondly, add a column named Date into c1.
  3. paste() together the Year, Month and DayofMonth variables, use “-” as separator. Save the resultant data frame as c2.
  4. Eventually, select few columns to get a quick overview : Date, DepTime, ArrTime and TailNum, in this sequence. Don’t save the resulting data frame to a variable; simply publish it into the console.

data manipulation with dplyr

Using select(), mutate() and filter(), you have seen that interesting information can be revealed from a dataset.

Through a combination of those expressions or by using a one-liner, let’s attempt to answer the next question.Let’s now find out how many weekend flights flew a distance of over 1000 km but had a net taxiing time under 15 minutes.

arrange

arrange reorders the rows of a dataset based on their content.

Syntax of arrange

arrange(df/tbl, <columnName to arrange by>)

By default, it will arrange the given dataset in ascending order.

If multiple rows have same value for the given column, you can try to arrange by adding one more column.

arrange() may be used to rearrange rows based on any kind of data.

Should you pass arrange () a character variable, for instance, R will arrange the rows in alphabetical sequence based on values of this variable.

Should you pass a factor variable, R will automatically rearrange the rows based on the sequence of the levels in your factor (running levels() on the variable shows this order ).

Let us write few arrange () expressions to show a dataframe contents appropriately.

arrange() hflights by departure delays to ensure that the lowest departure delay is on top of the dataset.

data manipulation with arrange in dplyr

Arrange hflights such that flights from the same carrier are next to each other and within each carrier, flights which have bigger departure delays appear after flights which have smaller departure delays.

example of using arrange in dplyr

By default, arrange() arranges the rows from smallest to largest. Rows with the smallest value of the variable will appear at the top of the data set.

You can reverse this behavior with the desc() function. arrange() will reorder the rows from largest to smallest values of a variable if you wrap the variable name in desc() before passing it to arrange().

Arrange hflights so that flights by the same carrier appear next to each other and within each carrier, flights that have larger departure delays appear before flights that have smaller departure delays.

using arrange along with desc in dplyr

Arrange the flights in hflights by their total delay (the sum of DepDelay and ArrDelay). Try to do this directly inside arrange().

summarize

summarize uses your data to create a new dataset containing summary statistics.

You can decide which statistics to include.

Syntax of summarize

summarize(df/tbl, new column name = expression)

Below code will take a dataframe, and calculate sum of column A, mean of column B, and variance of column B

summarize(df/tbl, sum = sum(A),avg = mean(B), var= var(B))

You can use any functions to calculate summary statistics, but they should obey one rule.They should take a vector as input and return single number as output. These type of functions can be referred to as aggregate functions.

summarize(), the last of the 5 verbs in dplyr, follows the same syntax as mutate(), but the resulting dataset consists of a single row instead of an entire new column in the case of mutate().

In contrast to the four other data manipulation functions, summarize() does not return an altered copy of the dataset it is summarizing; instead, it builds a new dataset that contains only the summarizing statistics.

Use summarize() to print out a summary of hflights containing two variables: min_dist, the shortest distance flown, and max_dist, the longest distance flown.

Print out a summary of hflights with a single variable, max_div: the longest Distance for diverted flights.

It is possible to use any function in summarize() provided that the function can take a vector of data return a specific number.

R includes many aggregating functions, like mentioned below:

  • min(vect) – find minimum value of vector vect.
  • max(vect) – find maximum value of vector vect.
  • mean(vect) – find mean value of vector vect.
  • median(vect) – find median value of vector vect.
  • quantile(vect, n) – nth quantile of vector vect.
  • sd(vect) – standard deviation of vector vect.
  • var(vect) – variance of vector vect.
  • IQR(vect) – Inter Quartile Range (IQR) of vector vect.
  • diff(range(vect)) – total range of vector vect.

Let’s eliminate rows which have NAs from the arrival delay column and then save the resulting dataset into temp1.

Let’s find the summary of temp1 using the following variables (in this sequence ):

earliest: the minimal arrival delay,
average: the average arrival delay,
latest: the longest arrival delay,
sd: the standard deviation of arrival delays.

Filter hflights for rows which don’t have any NA TaxiIn and  no NA TaxiOut are retained; save this temporary result of temp.

Print a summary of temp, with a single variable, max_taxi_diff: the largest absolute difference in time between TaxiIn and TaxiOut for one flight.

That’s it for now on data manipulation with dplyr. If you would like to learn these concepts in depth, please subscribe to my data science online course.

Recent Posts

Menu