In this post, we start with an introduction to R for data science. We will cover the important R language concepts that an aspiring data scientist should learn.
These concepts are also a part of my data science online training using R.
What is a vector in R?
Vectors are one-dimensional arrays that can hold numeric data, character data, or logical data.
In R, you create a vector using the combine function c(). You place the vector elements separated by a comma between the parentheses.
You can name the vector elements using the names() function. Take a look at this example:
employee_vector <- c("Robert", "Data Analyst")
names(employee_vector) <- c("Name", "Job")
This code defines a vector employee_vector and then names the two elements.
The first element is named Name, and the second element is named Job.
Let’s create another vector, called earnings_vector.
To name the elements of earnings_vector, we will create another vector containing the names, called days_vector, and assign days_vector to names(earnings_vector).
earnings_vector <- c(140, 150, 120, 220, 240)
# The variable days_vector
days_vector <- c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday")
# Assign the names to earnings_vector
names(earnings_vector) <- days_vector
How to select a single element of a vector?
To select a single element of a vector (and later of matrices, data frames, …), you can use square brackets. Between the square brackets, you indicate which element to select.
For example, to select the first element of the earnings_vector, you type earnings_vector[1].
Notice that the first element in a vector has index 1, not 0 as in many other programming languages.
Another way to select the elements in earnings_vector is by using the names of the vector elements (Monday, Tuesday, …) instead of their numeric positions.
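A minimal sketch of both selection styles. It re-creates earnings_vector from above so that the snippet runs on its own; the variable names first_day and monday are only illustrative.

```r
earnings_vector <- c(140, 150, 120, 220, 240)
names(earnings_vector) <- c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday")

# Select by position (the first element has index 1)
first_day <- earnings_vector[1]

# Select the same element by its name
monday <- earnings_vector["Monday"]

print(first_day)
print(monday)
```

Both expressions return the same named element.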
To select multiple elements from a vector, you can add square brackets at the end of it.
You can indicate between the brackets what elements should be selected.
For example, suppose you want to select the earnings on the first and the fifth day of the week in earnings_vector: use the vector c(1, 5) between the square brackets.
You can also use the element names to select multiple elements, for example:
earnings_vector[c(1, 5)]
earnings_vector[c("Monday", "Friday")]
How to select multiple contiguous elements of a vector?
Selecting multiple contiguous elements of earnings_vector by listing every index is not very convenient.
For example, to select the first three elements of earnings_vector, it is more convenient to use earnings_vector[1:3] instead of earnings_vector[c(1, 2, 3)]. Both expressions yield the same result. (Writing earnings_vector[c(1:3)] also works, but the c() is redundant there.)
How to select vector elements using a logical vector?
So far, we have selected vector elements by their index or name. You can also select them with a logical vector.
For example, suppose you need to select only those earnings from earnings_vector that are greater than 200.
First, generate a logical vector containing TRUE or FALSE for the condition you want each element to satisfy.
Next, select elements from earnings_vector using this logical vector.
Here, the comparison earnings_vector > 200 produces the logical vector, which we store in selection_vector.
R knows what to do when you pass a logical vector in square brackets: it will only select the elements that correspond to TRUE in selection_vector.
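The two steps described above can be sketched as follows. The snippet re-creates earnings_vector so that it runs on its own; high_earnings is an illustrative name.

```r
earnings_vector <- c(140, 150, 120, 220, 240)
names(earnings_vector) <- c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday")

# Step 1: logical vector, TRUE where the earnings exceed 200
selection_vector <- earnings_vector > 200

# Step 2: keep only the elements that correspond to TRUE
high_earnings <- earnings_vector[selection_vector]

print(selection_vector)
print(high_earnings)
```

The two steps can also be combined into one expression, earnings_vector[earnings_vector > 200].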
How to create a matrix in R?
In R, a matrix is a collection of elements of the same data type (numeric, character, or logical) arranged into a fixed number of rows and columns.
Since you are only working with rows and columns, a matrix is called two-dimensional.
You can construct a matrix in R with the matrix() function. Consider the following example:
matrix(1:9, byrow = TRUE, nrow = 3)
In the matrix() function:
The first argument is the collection of elements that R will arrange into the rows and columns of the matrix. Here, we use 1:9 which is a shortcut for c(1, 2, 3, 4, 5, 6, 7, 8, 9).
The argument byrow = TRUE indicates that the matrix is filled row by row. If we want the matrix to be filled column by column, we set byrow = FALSE.
The third argument nrow indicates that the matrix should have three rows.
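To see the effect of byrow, here is a small sketch that builds the same nine numbers both ways. The names by_row and by_col are illustrative.

```r
# Filled row by row: the first row is 1 2 3
by_row <- matrix(1:9, byrow = TRUE, nrow = 3)

# Filled column by column: the first column is 1 2 3
by_col <- matrix(1:9, byrow = FALSE, nrow = 3)

print(by_row)
print(by_col)
```

One matrix is the transpose of the other.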
Let’s try to create another matrix. But this time, we will create three numeric vectors first, and then will use them to create the matrix.
earnings1 <- c(100, 110)
earnings2 <- c(120, 130)
earnings3 <- c(140, 150)
earnings_vector <- c(earnings1, earnings2, earnings3)
earnings_matrix <- matrix(earnings_vector, byrow = TRUE, nrow = 3)
Similar to vectors, you can add names for the rows and the columns of a matrix.
rownames(earnings_matrix) <- c("Andrew", "David", "Robert")
colnames(earnings_matrix) <- c("Year1", "Year2")
In R, the function rowSums() calculates the totals for each row of a matrix, and the function colSums() calculates the totals at column level.
Both rowSums and colSums return new vectors.
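A short sketch of both functions on the earnings matrix, which is rebuilt here so that the snippet is self-contained. The result names are illustrative.

```r
earnings_matrix <- matrix(c(100, 110, 120, 130, 140, 150), byrow = TRUE, nrow = 3)
rownames(earnings_matrix) <- c("Andrew", "David", "Robert")
colnames(earnings_matrix) <- c("Year1", "Year2")

# Total earnings per person (one value per row)
total_per_person <- rowSums(earnings_matrix)

# Total earnings per year (one value per column)
total_per_year <- colSums(earnings_matrix)

print(total_per_person)
print(total_per_year)
```

The returned vectors carry the row and column names of the matrix.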
How to add another column to an existing matrix in R?
You can add a column or multiple columns to a matrix with the cbind() function, which merges matrices and/or vectors together by column. For example:
changed_matrix1 <- cbind(matrix1, matrix2, vector1 ...)
How to add another row to an existing matrix in R?
Similarly, you can use rbind() to add rows to an existing matrix. It merges matrices and/or vectors together by row.
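Here is a small sketch of cbind() and rbind() on the earnings matrix. The third-year figures and the extra employee are made-up values used only for illustration.

```r
earnings_matrix <- matrix(c(100, 110, 120, 130, 140, 150), byrow = TRUE, nrow = 3)
rownames(earnings_matrix) <- c("Andrew", "David", "Robert")
colnames(earnings_matrix) <- c("Year1", "Year2")

# Hypothetical third-year earnings, added as a new column
Year3 <- c(115, 135, 155)
with_year3 <- cbind(earnings_matrix, Year3)

# Hypothetical new employee, added as a new row
Alex <- c(105, 125, 145)
with_alex <- rbind(with_year3, Alex)

print(with_alex)
```

Because the new vectors are passed as named variables, their names become the new column and row names.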
How to select matrix elements in R?
You can use the square brackets [ ] to select one or multiple elements from a matrix. While vectors have one dimension, matrices have two dimensions. You should therefore use a comma to separate the rows you want to select from the columns. For example:
earnings_matrix[2,1] selects the element at the second row and first column.
earnings_matrix[2:3,1:2] results in a matrix with the data on the rows 2, 3 and columns 1, 2.
If you want to select all elements of a row or a column, no number is needed before or after the comma, respectively:
earnings_matrix[,2] selects all elements of the second column.
earnings_matrix[2,] selects all elements of the second row.
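The selections above can be tried out with the sketch below. The matrix is rebuilt so that the snippet runs on its own. It has three rows and two columns, so the indexes stay within those bounds.

```r
earnings_matrix <- matrix(c(100, 110, 120, 130, 140, 150), byrow = TRUE, nrow = 3)
rownames(earnings_matrix) <- c("Andrew", "David", "Robert")
colnames(earnings_matrix) <- c("Year1", "Year2")

single_value <- earnings_matrix[2, 1]      # second row, first column
sub_matrix <- earnings_matrix[2:3, 1:2]    # rows 2-3, columns 1-2
second_column <- earnings_matrix[, 2]      # all rows of column 2
second_row <- earnings_matrix[2, ]         # all columns of row 2

print(single_value)
print(sub_matrix)
print(second_column)
print(second_row)
```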
Dataframe in R
Let’s learn about data frames now. We will create a data frame that describes the main characteristics of students in a college.
Assume the main features of a student are:
Student Name.
Roll No.
Marks.
If the student has received Financial Aid or not (TRUE or FALSE).
We will construct a data frame with the data.frame() function. As arguments, you pass the vectors containing the attributes: they will become the different columns of your data frame.
Because every column has the same length, the vectors you pass should also have the same length. But don’t forget that it is possible that they contain different types of data.
Pass the vectors as arguments to data.frame().
# Definition of vectors
student_name <- c("Andrew", "David", "Robert", "Alex")
roll_no <- c(7, 17, 25, 5)
marks <- c(90, 80, 95, 65)
financial_aid <- c(FALSE, TRUE, FALSE, TRUE)
# Create a data frame from the vectors
students_df <- data.frame(student_name, roll_no, marks, financial_aid)
All the elements that you put in a matrix should be of the same type. This is not required in a data frame.
Within a data frame, different columns may have different data types, but within a specific column, all the values should have the same data type.
Finding structure of a dataframe
When you work with large data sets and data frames, your first task is to develop a clear understanding of their structure and main elements. The function str() shows you the structure of a data frame.
Use the function head() to see the first few observations of a data frame.
Similarly, the function tail() prints out the last few observations in your dataframe.
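A quick sketch of the three functions on the students data frame, rebuilt here so that the snippet is self-contained. The second argument of head() and tail() sets how many rows to show.

```r
students_df <- data.frame(
  student_name = c("Andrew", "David", "Robert", "Alex"),
  roll_no = c(7, 17, 25, 5),
  marks = c(90, 80, 95, 65),
  financial_aid = c(FALSE, TRUE, FALSE, TRUE)
)

# Column names, types and the first few values of each column
str(students_df)

# First two and last two observations
first_two <- head(students_df, 2)
last_two <- tail(students_df, 2)

print(first_two)
print(last_two)
```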
Selection of dataframe elements
Like vectors and matrices, you select elements from a data frame using square brackets [ ]. By using a comma, you can select the rows and the columns respectively.
For example:
students_df[1,2] selects the value at the first row and second column in students_df.
students_df[1:3,2:4] selects rows 1, 2, 3 and columns 2, 3, 4 in students_df.
students_df[1, ] selects all elements of the first row.
Instead of using numbers to select elements of a data frame, you can also use the column names, like this:
students_df[1:3,"roll_no"]
However, there is a short-cut. If your columns have names, you can use the $ sign:
students_df$roll_no
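The different selection styles can be compared with the sketch below. The data frame is rebuilt so that the snippet runs on its own, and the result names are illustrative.

```r
students_df <- data.frame(
  student_name = c("Andrew", "David", "Robert", "Alex"),
  roll_no = c(7, 17, 25, 5),
  marks = c(90, 80, 95, 65),
  financial_aid = c(FALSE, TRUE, FALSE, TRUE)
)

one_value <- students_df[1, 2]                   # first row, second column
by_name <- students_df[1:3, "roll_no"]           # rows 1-3 of the roll_no column
whole_column <- students_df$roll_no              # the complete roll_no column
first_row <- students_df[1, ]                    # all columns of the first row

print(one_value)
print(by_name)
print(whole_column)
print(first_row)
```

Selecting a single column this way returns a plain vector, while selecting a whole row returns a one-row data frame.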
Now, let us use the function subset() to select dataframe elements.
subset(students_df, select = "roll_no")
subset(students_df, select = c("roll_no", "marks"))
subset(students_df, marks > 90)
Not only is the subset() function more concise, it is probably also more understandable for people who read your code.
Sorting the data frame
Let’s rearrange our students_df data frame so that it starts with the student having the lowest marks and ends with the student having the highest marks. In other words, we sort on the marks column.
Let’s use order() to sort our dataframe.
positions <- order(students_df$marks)
students_df[positions, ]
Creating a list
A list in R is used to gather a variety of objects under one name in an ordered manner. These objects can be matrices, vectors, data frames, even other lists, etc.
To create a list you use the function list():
my_list <- list(component1, component2 ...)
The arguments to the list function are the list components. Remember, these components can be matrices, vectors, other lists, …
Let’s create a list, named list1, that contains the variables earnings_vector, earnings_matrix and students_df as list components.
list1 <- list(earnings_vector, earnings_matrix, students_df)
Creating a named list
You can name the components of a list when you create it. (To name the components after the list has been created, you can use the names() function, as we did with vectors.)
my_list <- list(name1 = your_comp1, name2 = your_comp2)
This creates a list with components that are named name1, name2, and so on.
Let’s recreate the list we created earlier, this time with names.
list2 <- list(vect = earnings_vector, matr = earnings_matrix, stud = students_df)
Selecting elements from a list
To select a component of a list, use its numbered position inside double square brackets, or its name after the $ sign. For example, the first component of list2 can be accessed using list2[[1]] or list2$vect.
> list2$vect
[1] 100 110 120 130 140 150
> list2[[1]]
[1] 100 110 120 130 140 150
Remember, in order to select elements from vectors, you use single square brackets [ ].
After selecting components, you often need to select specific elements out of these components.
For example, with list2[[1]][1] you select the first element from the first component.
> list2[[1]][1]
[1] 100
> list2[[1]][2]
[1] 110
Adding more information to the list
To add components to a list, you can use the c() function.
In the example below, we add a vector my_vec to the existing list list1. The vector is wrapped in list() so that it is appended as a single component; without the wrapper, c() would append each element of my_vec as a separate component.
> my_vec <- c(10, 20, 30, 40, 50)
> list1 <- c(list1, list(my_vec))
This extends the original list, list1, with the component my_vec, which is appended to the end of the list. If you want to give the new list item a name, such as my_name, add the name inside list(), as shown in the code snippet below.
list1 <- c(list1, list(my_name = my_vec))
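One thing worth checking when appending with c() is how many components the list ends up with. The sketch below uses a small placeholder list (not the list1 from above) and compares appending a bare vector with appending a vector wrapped in list().

```r
base_list <- list("a", TRUE)       # placeholder list with two components
my_vec <- c(10, 20, 30, 40, 50)

# A bare vector is split up: each number becomes its own component
flattened <- c(base_list, my_vec)

# Wrapped in list(), the vector stays together as one component
appended <- c(base_list, list(my_vec))

# The same, with a name for the new component
named <- c(base_list, list(my_name = my_vec))

print(length(flattened))
print(length(appended))
print(named$my_name)
```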
What are the relational operators in R?
== Equality
!= Inequality
< Less than
> Greater than
>= Greater than or equal to
<= Less than or equal to
Note that R is case sensitive: “E” is not equal to “e”.
You may also use expressions with relational operators, for example (1 + 2) > 4.
Let’s compare a logical with a numeric value: are TRUE and 1 equal? They are, because TRUE is coerced to 1 for the comparison, so TRUE == 1 returns TRUE.
R’s ability to compare different data structures does not stop at single values or vectors.
Matrices and relational operators also work together seamlessly: the comparison is applied element by element.
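A few comparisons to try out, as a minimal sketch. The earnings vector and the matrix m are illustrative values.

```r
case_check <- "E" == "e"         # FALSE: R is case sensitive
expr_check <- (1 + 2) > 4        # FALSE: 3 is not greater than 4
logical_check <- TRUE == 1       # TRUE: TRUE is coerced to 1

# Comparing a vector gives one result per element
earnings <- c(140, 150, 120, 220, 240)
vector_check <- earnings > 200

# The same holds for a matrix
m <- matrix(1:6, nrow = 2)
matrix_check <- m >= 4

print(vector_check)
print(matrix_check)
```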
What are the logical operators in R?
AND(&)
OR(|)
NOT(!)
Note
The expression 5 < x < 10 to check if x is between 5 and 10 will not work as intended: R evaluates 5 < x first and then compares the resulting TRUE or FALSE with 10.
You’ll need to use x > 5 & x < 10 for that.
What is the difference between & and &&, | and ||?
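& works element by element on whole vectors, whereas && expects single (length-one) values and returns a single TRUE or FALSE. Older versions of R silently used only the first element of each vector with &&; since R 4.3.0, passing longer vectors is an error.
| and || differ in the same way: | works element by element on whole vectors, whereas || expects single values.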
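The difference can be sketched as follows. The vectors x and y and the speed value are made up for illustration; the double operators are only given single values here, which works in every R version.

```r
x <- c(TRUE, FALSE, TRUE)
y <- c(TRUE, TRUE, FALSE)

# Single operators compare the vectors element by element
elementwise_and <- x & y
elementwise_or <- x | y

# Double operators take single values, as is typical in an if() condition
speed <- 64
in_range <- speed > 50 && speed < 80
out_of_range <- speed < 50 || speed > 80

print(elementwise_and)
print(elementwise_or)
print(in_range)
print(out_of_range)
```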
In R, to find the last value of a vector, we use the tail function.
Let’s define a vector and find its last value.
Using this last value, we will perform a combination of logical and relational operations to answer the below questions.
Is last under 5 or above 10?
Is last between 15 and 20, excluding 15 but including 20?
example <- c(26, 9, 23, 50, 12, 7, 24)
last <- tail(example, 1)
# Is last under 5 or above 10?
last < 5 | last > 10
# Is last between 15 (exclusive) and 20 (inclusive)?
last > 15 & last <= 20
Reverse the result using NOT (!) operator
In addition to the & and | operators, there is the ! operator, which negates a logical value. Here is an R expression that uses !:
x <- 50
y <- 70
!(!(x < 40) & !!!(y > 120))
R evaluates the innermost parentheses first. The expression x < 40 returns FALSE, and applying NOT to FALSE gives TRUE. Likewise, y > 120 returns FALSE, and applying NOT three times gives TRUE. TRUE & TRUE is TRUE, and the outer NOT operator then turns it into FALSE.
So the output of the above expression will be FALSE.
How to write a while loop in R?
Syntax
Let’s get started with building a while loop from the ground up. Have a look at its syntax:
while (condition) {
expression
}
Remember that the condition part should become FALSE at some point during the execution. Otherwise, the while loop will go on indefinitely.
As an example, consider a while loop whose condition checks whether current_speed is higher than threshold_speed. Inside the body of the while loop, we print a warning message and keep decreasing current_speed by 20 units. This step is crucial; otherwise your while loop will never stop.
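The loop described above could look like the sketch below. The starting speed of 200, the threshold of 100 and the iterations counter are assumptions made for this illustration.

```r
current_speed <- 200      # assumed starting value
threshold_speed <- 100    # assumed limit
iterations <- 0           # counts how often the body runs

while (current_speed > threshold_speed) {
  print(paste("Slow down! Current speed:", current_speed))
  # Without this decrement the condition would stay TRUE forever
  current_speed <- current_speed - 20
  iterations <- iterations + 1
}

print(current_speed)
```

The loop stops as soon as current_speed is no longer above threshold_speed.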
How to write a break statement in R?
The break statement is a control statement. It is used to stop a loop during execution.
When R encounters it, the loop is abandoned completely.
For example, you can use a break statement to leave the while loop as soon as current_speed reaches 140.
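A minimal sketch of such a loop. The starting speed of 200 and the threshold of 100 are assumed values; only the stop at 140 comes from the description above.

```r
current_speed <- 200      # assumed starting value
threshold_speed <- 100    # assumed limit

while (current_speed > threshold_speed) {
  if (current_speed == 140) {
    print("Speed reached 140, leaving the loop")
    break                 # abandon the loop immediately
  }
  print(paste("Slow down! Current speed:", current_speed))
  current_speed <- current_speed - 20
}

print(current_speed)
```

Because of the break, the loop ends while the condition current_speed > threshold_speed is still TRUE.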
How to write a for loop in R?
Let’s see how to write a for loop in R.
Syntax
for(var in seq){
expression
}
A for loop works on both lists and vectors.
Note that while the break statement stops the loop entirely, the next statement skips the rest of the current iteration and proceeds to the next one.
Consider the following two for loops, which are equivalent in R:
evens <- c(2, 4, 6, 8, 10, 12)
# loop version 1
for (e in evens) {
  print(e)
}
# loop version 2
for (i in 1:length(evens)) {
  print(evens[i])
}
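To illustrate next and break inside a for loop, here is a small sketch. The conditions (skip 6, stop above 10) and the printed vector are chosen only for this example.

```r
evens <- c(2, 4, 6, 8, 10, 12)
printed <- c()            # collects the values that were printed

for (e in evens) {
  if (e == 6) {
    next                  # skip 6 and continue with the next element
  }
  if (e > 10) {
    break                 # stop the loop before printing 12
  }
  print(e)
  printed <- c(printed, e)
}
```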
How to loop over a list in R?
Looping over a list is similar to looping over a vector. There are again two different approaches here:
evens_list <- list(2, 4, 6, 8, 10, 12)
# loop version 1
for (e in evens_list) {
  print(e)
}
# loop version 2
for (i in 1:length(evens_list)) {
  print(evens_list[[i]])
}
Notice that you need double square brackets – [[ ]] – to select the list elements in loop version 2.
How to loop over a matrix in R?
Let’s define a matrix and loop over each element of it. We’ll need a for loop inside a for loop, often called a nested loop. Simply use the following syntax:
for (var1 in sequence1) {
for (var2 in sequence2) {
expression
}
}
The outer loop should loop over the rows, with loop index i (use 1:nrow(matrix1)).
The inner loop should loop over the columns, with loop index j (use 1:ncol(matrix1)).
Inside the inner loop, let’s make use of print() and paste() to print out information in the following format:
“On row i and column j the matrix has x”, where x is the value on that position.
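The nested loop described above can be sketched as follows. The contents of matrix1 and the messages vector are assumptions for this illustration.

```r
matrix1 <- matrix(1:6, byrow = TRUE, nrow = 2)   # assumed example matrix
messages <- character(0)                          # collects the printed lines

# Outer loop over the rows, inner loop over the columns
for (i in 1:nrow(matrix1)) {
  for (j in 1:ncol(matrix1)) {
    msg <- paste("On row", i, "and column", j, "the matrix has", matrix1[i, j])
    print(msg)
    messages <- c(messages, msg)
  }
}
```

The inner loop runs completely for each row, so the elements are visited row by row.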
Let’s now build a for loop from scratch
Suppose we have a variable, sentence, that has been split with the strsplit() function into separate letters stored in a vector chars.
Let’s write code that counts the number of r’s that come before the first u in this sentence.
Initialize the variable rcount to 0.
Then, inside the for loop:
if char equals “r”, increase the value of rcount by 1.
if char equals “u”, leave the for loop entirely with a break.
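The steps above can be sketched as follows. The text stored in sentence is a made-up example, since the original sentence is not shown in this post.

```r
# Hypothetical sentence, chosen only for this illustration
sentence <- "r programmers rarely regret learning about useful functions"

# Split the sentence into a vector of single characters
chars <- strsplit(sentence, split = "")[[1]]

rcount <- 0
for (char in chars) {
  if (char == "r") {
    rcount <- rcount + 1   # count every r seen so far
  }
  if (char == "u") {
    break                  # stop at the first u
  }
}

print(rcount)
```

Note that strsplit() returns a list, which is why [[1]] is needed to get the character vector.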
How to find documentation of functions in R?
We have already come across a few functions, such as list(), print() and mean(). Before using an R function, you should be clear on which arguments it expects.
All the relevant details such as a description, usage, and arguments can be found in the documentation.
For example, to consult the documentation on the mean() function, you can use either help(mean) or ?mean in the R console.
To see the arguments of a built-in function or user defined function, use args().
So, to see arguments of the mean() function, we need to use args(mean).
The documentation on the mean() function gives us quite a bit of information:
The mean() function computes the arithmetic mean.
The most general method takes multiple arguments: x and ... (the ellipsis).
The x argument should be a vector containing numeric, logical or time-related information.
> evens <- c(2, 4, 6, 8, 10, 12, 14)
> avg_evens <- mean(evens)
> avg_evens
[1] 8
What are optional arguments in R built-in functions?
Many built-in functions in R have optional arguments, such as na.rm.
By default, na.rm is set to FALSE.
If you want a built-in function to ignore NA values, set na.rm = TRUE.
Let’s take the mean() function to understand optional arguments.
Check the documentation on the mean() function again.
The ‘Default S3 method’ is
mean(x, trim = 0, na.rm = FALSE, ...)
The ... is called the ellipsis. It is a way for R to pass arguments along without the function having to name them explicitly. The ellipsis will be discussed in more detail later.
Notice that both trim and na.rm are optional arguments and have default values.
Using trim in the mean() function will change the output of the function.
When the trim argument is not zero, mean() sorts the values of x and drops a fraction (equal to trim) of the observations from each end before computing the mean.
For example, in one run a trim value of 0.5 gave an output of 1.5, whereas a trim value of 0.1 gave an output of 13.25!
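To make the effect of trim concrete, here is a small sketch. The vector x is a made-up example with one extreme value and is not the data behind the numbers quoted above.

```r
x <- c(1, 2, 3, 4, 100)   # illustrative data with one outlier

plain_mean <- mean(x)                 # uses every value
trimmed_mean <- mean(x, trim = 0.2)   # drops the lowest and highest 20% first
median_like <- mean(x, trim = 0.5)    # trim = 0.5 gives the median

print(plain_mean)
print(trimmed_mean)
print(median_like)
```

With five values, trim = 0.2 removes one value from each end, so the outlier no longer influences the result.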
Let’s see the optional argument na.rm in action. What happens if your vector contains missing values (NA)? Executing mean() will return NA, whereas if we add the argument na.rm = TRUE, the NA values are dropped and the mean is computed from the remaining values.
> evens <- c(2, 4, 6, 8, 10, 12, 14, NA)
> mean(evens)
[1] NA
> mean(evens, na.rm = TRUE)
[1] 8
You already know that R functions return objects that you can then use somewhere else. This makes it easy to use functions inside functions. Here, I have used the paste() function within the print() function. Note that paste() already inserts a space between its arguments.
> even <- 20
> print(paste("You selected the number", even))
[1] "You selected the number 20"
How to define a function in R?
Have a look at the following syntax for defining a function:
my_function <- function(arg1, arg2) {
  body
}
Notice that this syntax uses the assignment operator (<-) just as if you were assigning a vector to a variable for example.
Creating a function in R basically is the assignment of a function object to a variable!
In the example above, you’re creating a new R variable my_function, that becomes available in the workspace as soon as you execute the definition. From then on, you can use the my_function as a function.
Let’s create a function pow_three(): it takes one argument and returns that number cubed (x * x * x). We then call this newly defined function with 5 as input.
# Create a function pow_three()
pow_three <- function(x) {
  x * x * x
}
# Use the function
pow_three(5)
Write your own function
Sometimes your function may not require an input. Let’s say you want to write a function that gives us the random outcome of throwing a fair die.
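A minimal sketch of such a function. The name throw_die and the use of sample() are one possible way to write it; set.seed() is only there to make the demo repeatable.

```r
# A function without arguments: returns a random number from 1 to 6
throw_die <- function() {
  sample(1:6, size = 1)
}

set.seed(42)          # optional, makes the random result repeatable
roll <- throw_die()
print(roll)
```

Even though the function takes no input, the parentheses are still required when calling it.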
You can define default argument values in your own R functions as well. You can use the following syntax to do so:
my_function <- function(arg1, arg2 = val2) {
  body
}
Let’s modify the function that we wrote before, pow_three, to add a default argument.
pow_three <- function(x, print_info = TRUE) {
  y <- x ^ 3
  if (print_info == TRUE) {
    print(paste(x, "to the power three equals", y))
  }
  return(y)
}
What is function scoping in R?
Let’s discuss function scoping. Variables that are defined inside a function are not accessible outside that function. Try running the following code and see if you understand the results:
> pow_three <- function(x) {
+   y <- x ^ 3
+   return(y)
+ }
> pow_three(4)
[1] 64
> y
Error: object 'y' not found
y was defined inside the pow_three() function and therefore it is not accessible outside of that function. This is also true for the function’s arguments of course – x in this case.
Does R support call by value or call by reference?
R does not support call by reference. This means that an R function cannot change the variable that you pass to it. Consider a simple function, double(), that takes one argument x.
Inside the double() function, the argument x gets overwritten with twice its value.
Afterwards this new x is returned. If you call this function with a variable a set equal to 3, you obtain 6. But did the value of a change? If R were to pass a to double() by reference, the override of the x inside the function would ripple through to the variable a, outside the function.
However, R passes by value, so the R objects you pass to a function can never change unless you do an explicit assignment. a remains equal to 3, even after calling double(a).
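The behaviour described above can be checked with the sketch below. The body of double() is reconstructed from the description, so treat it as an assumption rather than the original listing.

```r
# Note: defining double() here masks base R's double() in this session
double <- function(x) {
  x <- 2 * x    # overwrites the local copy of the argument only
  x
}

a <- 3
result <- double(a)

print(result)   # the doubled value returned by the function
print(a)        # the variable outside the function is unchanged
```

To actually change a, you would need an explicit assignment such as a <- double(a).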
What is an R Package?
The functions that we have seen so far, such as mean() and list(), are part of R packages. So what is a package?
An R package is a bundle of code, data, documentation and tests. Examples are base, ggvis, etc.
Use search() to see the list of packages attached to your session.
When R starts, it attaches 7 packages to the session by default. We can see that the base package is attached automatically.
Some packages, like ggvis, are not installed by default.
To install them, use install.packages().
The packages will be downloaded from the CRAN website.
CRAN means Comprehensive R Archive Network.
How to load an R package?
Loading a package means making it available in your session by attaching it to the search list.
You can load packages using the library() or require() functions. library() attaches packages to the search list of your R session.
library("ggvis")
Using search() again will now show that ggvis is attached to your session.
Another function to load a package is require().
What is the difference between library() and require()?
library() will throw an error if the package is not installed.
require() will not throw an error; it will just print a warning and return FALSE.
Note that the library() and require() functions are not very picky when it comes to argument types: both library(ggplot2) and library("ggplot2") work perfectly fine for loading a package.
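The difference can be observed with the sketch below. The package name notARealPackage12345 is hypothetical and assumed not to be installed; stats is used for the positive case because it ships with R.

```r
missing_pkg <- "notARealPackage12345"   # hypothetical, assumed not installed

# require() returns FALSE (with a warning) when the package is unavailable
loaded <- suppressWarnings(
  require(missing_pkg, character.only = TRUE, quietly = TRUE)
)

# library() raises an error instead, which is caught here
library_result <- tryCatch({
  library(missing_pkg, character.only = TRUE)
  "attached"
}, error = function(e) "error")

# For an available package, require() returns TRUE
stats_loaded <- require("stats")

print(loaded)
print(library_result)
print(stats_loaded)
```

The character.only = TRUE argument is needed here because the package name is passed in a variable rather than typed directly.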
That’s all for now on the introduction to R for data science. Let’s talk about advanced concepts during my data science course in Hyderabad.
Please work out all of these examples and feel free to ask questions in the section below.