In this post, we will discuss how to create lookup tables and how to use the select verb in dplyr. We will also see how to use helper functions within select.
How to create a lookup table in dplyr
In R, a lookup table is used to convert alphabetical codes into more meaningful strings.
Let’s work with a lookup table that comes in the form of a named vector.
When you subset the lookup table with a character vector, R returns the values of the lookup table whose names match the elements of that vector.
To see how this works, run the following code in the console:
countries <- c("IN", "AU")
lookup_tbl <- c("IN" = "India", "AU" = "Australia")
# Subset the lookup table by name to translate the codes
lookup_tbl[countries]
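As a slightly fuller sketch of the same idea, the snippet below looks up a vector that contains repeated codes and one code that is missing from the table. The vector `codes` and the code "XX" are made up for illustration; only base R is used.

```r
# Named vector acting as the lookup table (same values as in the post)
lookup_tbl <- c("IN" = "India", "AU" = "Australia")

# Hypothetical input: repeated codes plus one code not in the table
codes <- c("AU", "IN", "AU", "XX")

# Subsetting by name returns one value per code; unknown codes give NA.
# unname() drops the names so only the translated values remain.
full_names <- unname(lookup_tbl[codes])
print(full_names)
```

Because the lookup is plain vector subsetting, it works the same way on a whole column of codes.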
You can also use the tbl$columnName syntax from base R to save a column of a tbl as a separate object, here named arrtime.
> arrtime <- hflights$ArrTime
The five verbs of dplyr grammar
dplyr does more than just provide data structures. It provides a complete grammar for data manipulation. This grammar is built around five functions that perform the basic tasks of data manipulation.
select: removes columns and returns a subset of columns.
filter: removes rows and returns a subset of rows.
arrange: reorders rows in a dataset.
mutate: uses the dataset to build new columns from existing data.
summarize: calculates summary statistics and reduces each group to a single row.
You can even combine these functions and execute them in a chain, one after another.
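One common way to chain the verbs is the %>% pipe, which dplyr re-exports from magrittr. The following is a minimal sketch, assuming the dplyr and hflights packages are installed. The choice of columns and the name `carrier_delays` are illustrative, and group_by is an extra dplyr function beyond the five verbs listed above.

```r
library(dplyr)
library(hflights)  # assumption: the hflights data package is installed

# Each step receives the result of the previous one
carrier_delays <- hflights %>%
  select(UniqueCarrier, ArrDelay) %>%          # keep two variables
  filter(!is.na(ArrDelay)) %>%                 # drop rows with a missing delay
  group_by(UniqueCarrier) %>%                  # one group per carrier
  summarize(mean_delay = mean(ArrDelay)) %>%   # one row per group
  arrange(desc(mean_delay))                    # largest mean delay first

print(carrier_delays)
```

Reading the chain top to bottom mirrors the order in which the steps are executed.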
What are variables and observations in dplyr?
In a tbl, columns are called “variables” and rows are called “observations”.
select and mutate manipulate the variables in your dataset.
filter and arrange manipulate the observations, while summarize manipulates groups of observations.
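To make the distinction concrete, here is a small sketch that compares dimensions before and after select and filter. It assumes the dplyr and hflights packages are installed. The condition Month == 1 and the object names are only illustrations.

```r
library(dplyr)
library(hflights)  # assumption: the hflights data package is installed

# Rows are observations, columns are variables
dim(hflights)

# select works on variables: same rows, fewer columns
fewer_vars <- select(hflights, Year, Month)

# filter works on observations: same columns, fewer rows
fewer_obs <- filter(hflights, Month == 1)

dim(fewer_vars)
dim(fewer_obs)
```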
You can examine the order of the variables in hflights with names(hflights) in the console.
 "Year" "Month" "DayofMonth" "DayOfWeek"
 "DepTime" "ArrTime" "UniqueCarrier" "FlightNum"
 "TailNum" "ActualElapsedTime" "AirTime" "ArrDelay"
 "DepDelay" "Origin" "Dest" "Distance"
 "TaxiIn" "TaxiOut" "Cancelled" "CancellationCode"
How to use select in dplyr
Let’s see how to use select to return a specific group of columns.
You can use column names or integer indexes while using select.
Using column names
Pass the tbl name followed by the column names to select. You do not need to put the column names in quotes.
dplyr functions recognise variable names as they are, so there is no need to use $ as in base R syntax. This is true for each of the functions of the dplyr grammar.
You can also use integer indexes within select.
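Both styles can be sketched as follows, assuming the dplyr and hflights packages are installed. The columns chosen are only an example; the indexes 14 to 16 correspond to Origin, Dest and Distance in the names(hflights) output shown earlier.

```r
library(dplyr)
library(hflights)  # assumption: the hflights data package is installed

# Select by unquoted column names
by_name <- select(hflights, Origin, Dest, Distance)

# Select the same columns by their integer positions
by_index <- select(hflights, 14, 15, 16)

names(by_name)
names(by_index)
```

Names are usually easier to read and do not break if the column order changes, while indexes are shorter to type.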
Usage of colon (:) and minus (-) within dplyr
Use : to select a range of variables and - to exclude some variables.
The : and - operators can be used on both indexes and column names.
With indexes, 1:5 selects the first five variables.
Adding -2 to that selection returns the first five variables except for the second one.
With column names, Year:DepTime combined with -DayOfWeek selects the variables from Year to DepTime, except for DayOfWeek.
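The three selections described above could be written as follows. This is a sketch assuming the dplyr and hflights packages are installed; the object names are illustrative.

```r
library(dplyr)
library(hflights)  # assumption: the hflights data package is installed

# First five variables, by index
first_five <- select(hflights, 1:5)

# First five variables except the second one
no_second <- select(hflights, 1:5, -2)

# Variables from Year to DepTime, except DayOfWeek
range_minus <- select(hflights, Year:DepTime, -DayOfWeek)

names(first_five)
names(no_second)
names(range_minus)
```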
select does not modify the original dataset; it returns a modified copy.
You have to explicitly assign the result of select() to a variable to store it.
This pattern is common to every verb of the dplyr grammar. dplyr functions do not change the original dataset. If you need to use a modified copy, save it to a variable.
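A short sketch of this behaviour, assuming the dplyr and hflights packages are installed (the name `dates` is illustrative):

```r
library(dplyr)
library(hflights)  # assumption: the hflights data package is installed

cols_before <- ncol(hflights)

# Store the result; without the assignment it would only be printed
dates <- select(hflights, Year, Month, DayofMonth)

ncol(dates)                     # 3
ncol(hflights) == cols_before   # TRUE: the original is unchanged
```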
select Helper functions in dplyr
starts_with(): starts with a prefix
ends_with(): ends with a suffix
contains(): contains a literal string
matches(): matches a regular expression
num_range(): a numerical range like x01, x02, x03.
one_of(): variables named in a character vector.
everything(): all variables.
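A few of these helpers can be sketched on hflights as follows, assuming the dplyr and hflights packages are installed. The patterns and object names are illustrative. Note that these helpers ignore case by default.

```r
library(dplyr)
library(hflights)  # assumption: the hflights data package is installed

taxi  <- select(hflights, starts_with("Taxi"))          # TaxiIn, TaxiOut
delay <- select(hflights, ends_with("Delay"))           # ArrDelay, DepDelay
day   <- select(hflights, contains("Day"))              # DayofMonth, DayOfWeek
times <- select(hflights, matches("^(Arr|Dep)Time$"))   # DepTime, ArrTime
all_cols <- select(hflights, everything())              # every variable

names(taxi)
names(times)
```

num_range() is left out because hflights has no numbered column names such as x01, x02.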
You can also combine several helper functions in a single select call. This is often the most concise way to return a set of related columns, such as DepTime, ArrTime, ActualElapsedTime, AirTime, ArrDelay and DepDelay.
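One possible way to get exactly those six columns is to combine two ends_with() calls, since the names end in either "Time" or "Delay". This is a sketch assuming the dplyr and hflights packages are installed; it is not necessarily the only concise solution.

```r
library(dplyr)
library(hflights)  # assumption: the hflights data package is installed

# Columns ending in "Time" first, then columns ending in "Delay"
time_delay <- select(hflights, ends_with("Time"), ends_with("Delay"))

names(time_delay)
```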
To see the added value of the dplyr package, it is useful to compare its syntax with base R. Both approaches can return the same columns.
But the elegance and ease of use of dplyr are a great plus.
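As an illustration of that comparison, the sketch below selects the same three columns once with base R and once with dplyr. It assumes the dplyr and hflights packages are installed. The columns and object names are chosen only for this example.

```r
library(dplyr)
library(hflights)  # assumption: the hflights data package is installed

# Base R: quoted names inside a character vector
base_r <- hflights[c("TaxiIn", "TaxiOut", "Distance")]

# dplyr: unquoted names, or a helper for the shared prefix
with_dplyr  <- select(hflights, TaxiIn, TaxiOut, Distance)
with_helper <- select(hflights, starts_with("Taxi"), Distance)

names(base_r)
names(with_dplyr)
names(with_helper)
```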
Feel free to add your comments on this post.