Tidying data with tidyr package in R

Overview of tidyr

tidyr is an R package written by Hadley Wickham that helps you apply the principles of tidy data.


How to install tidyr

Since tidyr is part of the tidyverse group of packages, it can be installed either by installing the whole tidyverse or by installing tidyr on its own.
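The two installation routes can be sketched as follows. This is a minimal sketch that assumes a CRAN mirror is reachable; the requireNamespace() check is only a convenience added here so the install is skipped when tidyr is already present.

```r
# Option 1: install the whole tidyverse, which includes tidyr
# (left commented out because it downloads many packages)
# install.packages("tidyverse")

# Option 2: install only tidyr, and only if it is not installed yet
if (!requireNamespace("tidyr", quietly = TRUE)) {
  install.packages("tidyr")
}

# Load the package so its functions are available in this session
library(tidyr)
```

Either route gives you the same tidyr functions; the tidyverse option simply brings along related packages as well.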

 

We will focus on a subset of functions in tidyr that will allow you to do some of the most common data cleaning tasks.

Gather columns into key-value pairs: gather() function

One of the most important functions in tidyr is gather(). It should be used when the columns of your dataframe are not variables and you want to collapse them into key-value pairs.
The gather() function makes wide datasets long. Let’s see how with an example.

Suppose we have a wide dataframe called wide_df. We wish to make it a long dataframe by turning the column names A, B and C into values of a new variable called my_key, using the gather function. No information is lost in this process.

We still have a value for each combination of X and Y with A, B and C, but these values are now stacked vertically in the column we have labelled my_val. We refer to this process as gathering the columns A, B and C into key-value pairs. We pass -col to make it clear that we want to gather all columns except the first one, which is labelled col. In general, the gather function takes four arguments:

data: your dataset, usually a dataframe
key: the name of the new column that will contain the so-called “keys”
value: the name of the new column that will contain the “values”
...: the three dots stand for either the names of the columns that you wish to gather or the names of the columns that you wish to leave out, each prefixed with a minus (-) sign.
None of these arguments require quotes around the variable names.
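The gather call described above might look like the sketch below. The contents of wide_df are hypothetical values made up for illustration; only the column names col, A, B and C and the new names my_key and my_val come from the text. Note that gather() is marked as superseded in recent tidyr releases, but it is still available.

```r
library(tidyr)

# Hypothetical wide data: the numbers are invented for illustration
wide_df <- data.frame(
  col = c("X", "Y"),
  A = c(1, 4),
  B = c(2, 5),
  C = c(3, 6),
  stringsAsFactors = FALSE
)

# Gather every column except `col` into key-value pairs:
# old column names go to my_key, their cell values go to my_val
long_df <- gather(wide_df, my_key, my_val, -col)

print(long_df)
```

Naming the columns to gather explicitly, as in gather(wide_df, my_key, my_val, A, B, C), should give the same result as excluding col with a minus sign.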

spread() function

The spread() function does the opposite of gather(). It takes key-value pairs and distributes them across multiple columns. It makes long datasets wide.

Let's do the opposite of the previous example and spread the key-value pairs held in the my_key and my_val columns back into separate columns, using the spread function.

The first argument of the spread function is the name of the dataset.
The second argument is the name of the keys column. The third argument is the name of the values column.

You can see that the result is the wide_df we saw in the previous example.
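A minimal sketch of the spread call follows. The long dataframe is rebuilt by hand so that the example runs on its own; its values are hypothetical, and the name long_df is chosen here purely for illustration.

```r
library(tidyr)

# Hypothetical long data, in the shape produced by gathering A, B and C
long_df <- data.frame(
  col = c("X", "Y", "X", "Y", "X", "Y"),
  my_key = c("A", "A", "B", "B", "C", "C"),
  my_val = c(1, 4, 2, 5, 3, 6),
  stringsAsFactors = FALSE
)

# Spread the keys into column names and the values into the cells
wide_again <- spread(long_df, my_key, my_val)

print(wide_again)
```

Each distinct value in my_key becomes its own column, so the result has one row per value of col.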

The separate function

It is often useful to separate data in a single column into multiple columns. This can be done easily with the separate function.

Here we have a small dataset.
It contains a column called year_mo that holds two pieces of information, the year and the month, separated by a dash (-). We wish to split this single column into two columns, one for the year and one for the month, using the separate function.

We provide three arguments to separate.

1) The name of the dataframe
2) The name of the column to split
3) A character vector containing the names of the new columns

It is important to note that while the first two arguments are unquoted, the third must be an actual character vector to work properly. By default, separate assumes you want to split on any non-alphanumeric character, such as a space, a period (.), a forward slash (/) or, in this example, the dash (-) between the year and month.
If this is not what you want, you can provide a fourth argument called sep (separator), which specifies exactly which character to split on.
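The separate call can be sketched as below. The dataframe name treatments and all of its values are hypothetical; only the column name year_mo and the new column names year and month come from the text.

```r
library(tidyr)

# Hypothetical data: patient IDs, dates and responses are invented
treatments <- data.frame(
  patient = c(1, 2, 3),
  year_mo = c("2010-10", "2012-08", "2014-12"),
  response = c(41, 48, 55),
  stringsAsFactors = FALSE
)

# Default behaviour: split on any non-alphanumeric character
split_df <- separate(treatments, year_mo, c("year", "month"))

# Same split, naming the separator explicitly
split_dash <- separate(treatments, year_mo, c("year", "month"), sep = "-")

print(split_df)
```

The new year and month columns come back as character vectors, so convert them with as.integer() if you need numbers.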

The unite function

Finally, just as spread does the opposite of gather, the unite function does the opposite of separate. Here we take the result from the last example and simply join the year and month columns back together again to form the year_month column.

The unite function takes three arguments:

The first argument is the dataframe, the second is the unquoted name of the new column to be formed, and the remaining arguments, represented by ..., are the unquoted names of all the columns to be joined.

If we don’t specify the sep argument, the default separator is an underscore (_). If we want to separate year and month by a dash (-), we can add sep = "-" to the function call.
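Both forms of the unite call can be sketched as follows. The dataframe name treatments and its values are hypothetical; the column names year, month and year_month follow the text.

```r
library(tidyr)

# Hypothetical data with year and month already in separate columns
treatments <- data.frame(
  patient = c(1, 2, 3),
  year = c("2010", "2012", "2014"),
  month = c("10", "08", "12"),
  response = c(41, 48, 55),
  stringsAsFactors = FALSE
)

# Default separator is an underscore, e.g. "2010_10"
united_default <- unite(treatments, year_month, year, month)

# Use a dash instead, e.g. "2010-10"
united_dash <- unite(treatments, year_month, year, month, sep = "-")

print(united_dash)
```

By default unite drops the input columns year and month from the result and puts the new column in their place.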

Summary of tidyr functions

You have now seen examples of some of the most useful functions from the tidyr package.

gather: gathers columns into key-value pairs
spread: spreads key-value pairs across columns
separate: separates one column into multiple columns
unite: unites multiple columns into one column

In case of any questions on tidyr package, feel free to ask in the comments section below.
