Introduction to R for Data Science

In this post, we start with an introduction to R for data science. We will cover the core R language concepts that every aspiring data scientist should learn.

These concepts are also a part of my data science online training using R.

What is a vector in R?

Vectors are one-dimensional arrays that can hold numeric data, character data, or logical data.

In R, you create a vector using the combine function c(). You place the vector elements separated by a comma between the parentheses.

You can give names to the vector elements using the names() function. Take a look at this example:

This code defines a vector employee_vector and then names its two elements.

The first element is named Name, and the second element is named Job.
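As a minimal sketch of the employee_vector example described above (the two values are hypothetical, since the original data is not shown in the text):

```r
# Hypothetical values; only the names "Name" and "Job" come from the post.
employee_vector <- c("John Doe", "Data Analyst")

# Attach a name to each element.
names(employee_vector) <- c("Name", "Job")

print(employee_vector)
```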

 

Let’s create another vector, called earnings_vector.

To name the elements of earnings_vector, we will create another vector called days_vector that contains the names, and then assign days_vector to names(earnings_vector).

create vector in R
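The steps above can be sketched as follows. The earnings figures are made-up illustration values, not the author's data:

```r
# Hypothetical daily earnings for one working week.
earnings_vector <- c(140, 250, 120, 310, 190)

# A second vector that holds the names.
days_vector <- c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday")

# Assign the names to the elements of earnings_vector.
names(earnings_vector) <- days_vector

print(earnings_vector)
```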

How to select a single element of a vector?

To select a single element of a vector (and later of matrices, data frames, …), you can use square brackets. Between the square brackets, you indicate which element to select.

For example, to select the first element of the earnings_vector, you type earnings_vector[1].

Notice that the first element in a vector has index 1, not 0 as in many other programming languages.

Another way to select the elements in earnings_vector is by using the names of the vector elements (Monday, Tuesday, …) instead of their numeric positions.

select elements of vector in R
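A short sketch of both selection styles, assuming a named earnings_vector with hypothetical values:

```r
# Hypothetical named vector (names are set directly inside c()).
earnings_vector <- c(Monday = 140, Tuesday = 250, Wednesday = 120,
                     Thursday = 310, Friday = 190)

# Select by position: indexing starts at 1.
first_by_index <- earnings_vector[1]

# Select by name.
first_by_name <- earnings_vector["Monday"]

print(first_by_index)
print(first_by_name)
```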

To select multiple elements from a vector, you also use square brackets after the vector name.

Between the brackets, you pass a vector that indicates which elements should be selected.

For example, suppose you want to select the earnings on the first and the fifth day of the week in earnings_vector: use the vector c(1, 5) between the square brackets, as in earnings_vector[c(1, 5)].

You can also use the element names to select multiple elements, for example:

selecting multiple elements of a vector
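A minimal sketch of selecting several elements at once, again with hypothetical earnings:

```r
earnings_vector <- c(Monday = 140, Tuesday = 250, Wednesday = 120,
                     Thursday = 310, Friday = 190)

# First and fifth element, selected by position.
by_position <- earnings_vector[c(1, 5)]

# The same elements, selected by name.
by_name <- earnings_vector[c("Monday", "Friday")]

print(by_position)
```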

How to select multiple contiguous elements of a vector?

Listing every index individually is not very convenient when the elements you want are next to each other.

For example, to select the first three elements of earnings_vector, it is more convenient to use earnings_vector[1:3] (or, equivalently, earnings_vector[c(1:3)]) instead of earnings_vector[c(1,2,3)]. These expressions all yield the same result.

selecting multiple contiguous elements
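The equivalence can be checked with a small sketch (hypothetical values):

```r
earnings_vector <- c(Monday = 140, Tuesday = 250, Wednesday = 120,
                     Thursday = 310, Friday = 190)

# The colon operator builds the sequence 1, 2, 3.
with_range <- earnings_vector[1:3]

# Spelling out every index gives the same selection.
with_indexes <- earnings_vector[c(1, 2, 3)]

print(with_range)
```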

How to select vector elements using a logical vector?

So far, we have selected vector elements by their index or name. You can also select vector elements with a logical condition.

For example, suppose you need to select only those earnings in earnings_vector that are greater than 200.

First, you need to generate a logical vector containing TRUE or FALSE, for the condition you want each element to satisfy.

Next, select elements from the earnings_vector using this logical vector.

Refer to the code below:

selecting elements using logical vector

R knows what to do when you pass a logical vector in square brackets: it will only select the elements that correspond to TRUE in selection_vector.
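The two steps can be sketched like this. The earnings are hypothetical, and selection_vector is the name used in the text above:

```r
earnings_vector <- c(Monday = 140, Tuesday = 250, Wednesday = 120,
                     Thursday = 310, Friday = 190)

# Step 1: one TRUE/FALSE value per element.
selection_vector <- earnings_vector > 200

# Step 2: keep only the elements where selection_vector is TRUE.
high_earnings <- earnings_vector[selection_vector]

print(selection_vector)
print(high_earnings)
```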

How to create a matrix in R?

In R, a matrix is a collection of elements of the same data type (numeric, character, or logical) arranged into a fixed number of rows and columns.

Since you are only working with rows and columns, a matrix is called two-dimensional.

You can construct a matrix in R with the matrix() function. Consider the following example:

how to create matrix in R

In the matrix() function:

The first argument is the collection of elements that R will arrange into the rows and columns of the matrix. Here, we use 1:9 which is a shortcut for c(1, 2, 3, 4, 5, 6, 7, 8, 9).

The argument byrow = TRUE indicates that the matrix is filled row by row. If we want the matrix to be filled column by column, we set byrow = FALSE, which is also the default.

The third argument nrow indicates that the matrix should have three rows.
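A minimal sketch of the matrix() call described above, next to the column-wise variant for comparison:

```r
# Fill a 3 x 3 matrix with 1..9, row by row.
by_row <- matrix(1:9, byrow = TRUE, nrow = 3)

# Same elements, filled column by column (byrow = FALSE is the default).
by_col <- matrix(1:9, byrow = FALSE, nrow = 3)

print(by_row)
print(by_col)
```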

Let’s try to create another matrix. But this time, we will create three numeric vectors first, and then will use them to create the matrix.

how to create matrix using multiple vectors in R

Similar to vectors, you can add names for the rows and the columns of a matrix.

how to add row and column names to matrix

In R, the function rowSums() calculates the totals for each row of a matrix, and the function colSums() calculates the totals at column level.

Both rowSums and colSums return new vectors.

how to calculate sum of columns using colSums and sum of rows using rowSums in R
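The last three steps (building a matrix from vectors, naming rows and columns, and computing totals) might look like the sketch below. The vector names and all numbers are hypothetical:

```r
# Three hypothetical numeric vectors, one per week.
week1 <- c(140, 250, 120)
week2 <- c(310, 190, 220)
week3 <- c(175, 205, 260)

# Combine them into one 3 x 3 matrix, one week per row.
earnings_matrix <- matrix(c(week1, week2, week3), byrow = TRUE, nrow = 3)

# Name the rows and columns.
rownames(earnings_matrix) <- c("Week1", "Week2", "Week3")
colnames(earnings_matrix) <- c("Mon", "Tue", "Wed")

# Totals per row and per column; both are returned as named vectors.
row_totals <- rowSums(earnings_matrix)
col_totals <- colSums(earnings_matrix)

print(earnings_matrix)
print(row_totals)
print(col_totals)
```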

How to add another column to an existing matrix in R?

You can add a column or multiple columns to a matrix with the cbind() function, which merges matrices and/or vectors together by column. For example:

how to add new column to existing matrix in R

How to add another row to an existing matrix in R?

Similarly you may use rbind() to add rows to an existing matrix.

how to add rows to an existing matrix in R
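A small sketch of cbind() and rbind() on a hypothetical 2 x 3 earnings matrix:

```r
earnings_matrix <- matrix(c(140, 250, 120,
                            310, 190, 220),
                          byrow = TRUE, nrow = 2,
                          dimnames = list(c("Week1", "Week2"),
                                          c("Mon", "Tue", "Wed")))

# Add a column: the argument name becomes the column name.
with_total <- cbind(earnings_matrix, Total = rowSums(earnings_matrix))

# Add a row: the argument name becomes the row name.
with_week3 <- rbind(earnings_matrix, Week3 = c(175, 205, 260))

print(with_total)
print(with_week3)
```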

How to select matrix elements in R?

You can use the square brackets [ ] to select one or multiple elements from a matrix. While vectors have one dimension, matrices have two dimensions. You should therefore use a comma to separate the rows you want to select from the columns. For example:

earnings_matrix[2,3] selects the element at the second row and third column.
earnings_matrix[2:4,2:3] results in a matrix with the data on the rows 2, 3, 4 and columns 2, 3.
If you want to select all elements of a row or a column, no number is needed before or after the comma, respectively:

earnings_matrix[,2] selects all elements of the second column.
earnings_matrix[2,] selects all elements of the second row.

how to select matrix elements in R
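The four selections listed above can be tried on a small stand-in matrix. The 4 x 3 matrix below is an assumption made so that rows 2 to 4 exist:

```r
# Hypothetical 4 x 3 matrix filled row by row with 1..12.
earnings_matrix <- matrix(1:12, byrow = TRUE, nrow = 4)

single_value <- earnings_matrix[2, 3]      # row 2, column 3
sub_matrix   <- earnings_matrix[2:4, 2:3]  # rows 2-4, columns 2-3
second_col   <- earnings_matrix[, 2]       # every row of column 2
second_row   <- earnings_matrix[2, ]       # every column of row 2

print(sub_matrix)
```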

Dataframe in R

Let’s learn about data frames now. We will create a data frame that describes the main characteristics of students in a college.

Assume the main features of a student are:

Student Name.
Roll No.
Marks.
If the student has received Financial Aid or not (TRUE or FALSE).

We will construct a data frame with the data.frame() function. As arguments, you pass the vectors containing the attributes: they will become the different columns of your data frame.

Because every column has the same length, the vectors you pass should also have the same length. But don’t forget that it is possible that they contain different types of data.

Pass the vectors as arguments to data.frame().

dataframe in R
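A minimal sketch of such a data frame. The student records are invented, and the column names (student_name, marks, financial_aid) are assumptions, apart from roll_no, which the post uses later:

```r
# One vector per attribute; all vectors have the same length.
student_name  <- c("Asha", "Ravi", "Meena", "Kiran")
roll_no       <- c(101, 102, 103, 104)
marks         <- c(82, 67, 91, 74)
financial_aid <- c(TRUE, FALSE, FALSE, TRUE)

# Each vector becomes a column of the data frame.
students_df <- data.frame(student_name, roll_no, marks, financial_aid)

print(students_df)
```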

All the elements that you put in a matrix must be of the same type. This is not the case for a data frame.

Within a data frame, different columns may have different data types, but within a single column, all values must have the same data type.

Finding structure of a dataframe

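When you work with large data sets and data frames, your first task is to develop a clear understanding of their structure and main elements. The function str() shows you the structure of a data frame: the number of observations and variables, and the name, data type and first few values of each column.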

str() in R

Use the function head() to see the first few observations of a data frame.

Similarly, the function tail() prints out the last few observations in your dataframe.
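A short sketch of str(), head() and tail() on a hypothetical students_df. Both head() and tail() accept the number of rows as a second argument:

```r
students_df <- data.frame(
  student_name  = c("Asha", "Ravi", "Meena", "Kiran"),
  roll_no       = c(101, 102, 103, 104),
  marks         = c(82, 67, 91, 74),
  financial_aid = c(TRUE, FALSE, FALSE, TRUE)
)

str(students_df)                     # compact overview of columns and types

first_rows <- head(students_df, 2)   # first two observations
last_rows  <- tail(students_df, 2)   # last two observations

print(first_rows)
print(last_rows)
```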

Selection of dataframe elements

Like vectors and matrices, you select elements from a data frame using square brackets [ ]. By using a comma, you can select the rows and the columns respectively.

For example:

students_df[1,2] selects the value at the first row and second column in students_df.
students_df[1:3,2:4] selects rows 1, 2, 3 and columns 2, 3, 4 in students_df.

students_df[1, ] selects all elements of the first row.

Instead of using numeric positions to select elements of a data frame, you can also use the column names, like this:

students_df[1:3,"roll_no"]

However, there is a short-cut. If your columns have names, you can use the $ sign:

students_df$roll_no

select dataframe elements in R
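The selections above can be sketched on a hypothetical students_df:

```r
students_df <- data.frame(
  student_name  = c("Asha", "Ravi", "Meena", "Kiran"),
  roll_no       = c(101, 102, 103, 104),
  marks         = c(82, 67, 91, 74),
  financial_aid = c(TRUE, FALSE, FALSE, TRUE)
)

one_value   <- students_df[1, 2]             # row 1, column 2
block       <- students_df[1:3, 2:4]         # rows 1-3, columns 2-4
first_row   <- students_df[1, ]              # whole first row
by_name     <- students_df[1:3, "roll_no"]   # column chosen by name
with_dollar <- students_df$roll_no           # whole column via $

print(block)
```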

Now, let us use the function subset() to select dataframe elements.

using subset to select dataframe elements

Not only is the subset() function more concise, it is probably also more understandable for people who read your code.
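A sketch comparing subset() with the equivalent bracket notation. The condition marks > 80 and the data are hypothetical:

```r
students_df <- data.frame(
  student_name  = c("Asha", "Ravi", "Meena", "Kiran"),
  roll_no       = c(101, 102, 103, 104),
  marks         = c(82, 67, 91, 74),
  financial_aid = c(TRUE, FALSE, FALSE, TRUE)
)

# subset(): column names can be used directly in the condition.
with_subset <- subset(students_df, subset = marks > 80)

# Bracket notation: the data frame name has to be repeated.
with_brackets <- students_df[students_df$marks > 80, ]

print(with_subset)
```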

Sorting the data frame

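Let’s rearrange our students_df data frame so that it starts with the student with the lowest marks and ends with the student with the highest marks. In other words, we sort on the marks column.

Let’s use order() to sort our data frame. order() returns the positions of the elements in sorted order, and we use those positions as the row index.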

sort a dataframe in R using order()
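A minimal sketch of sorting with order(), using hypothetical student data:

```r
students_df <- data.frame(
  student_name = c("Asha", "Ravi", "Meena", "Kiran"),
  roll_no      = c(101, 102, 103, 104),
  marks        = c(82, 67, 91, 74)
)

# Row positions that put marks in ascending order.
positions <- order(students_df$marks)

# Use the positions as the row index to get the sorted data frame.
sorted_df <- students_df[positions, ]

print(positions)
print(sorted_df)
```

Passing decreasing = TRUE to order() would sort from highest to lowest instead.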

Creating a list

A list in R is used to gather a variety of objects under one name in an ordered manner. These objects can be matrices, vectors, data frames, even other lists, etc.

To create a list you use the function list():

The arguments to the list function are the list components. Remember, these components can be matrices, vectors, other lists, …

Let’s create a list, named list1, that contains the variables earnings_vector, earnings_matrix and students_df as list components.

create a list in R
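A sketch of list1, using small stand-in objects for earnings_vector, earnings_matrix and students_df (all values hypothetical):

```r
earnings_vector <- c(Monday = 140, Tuesday = 250)
earnings_matrix <- matrix(1:4, nrow = 2)
students_df     <- data.frame(roll_no = c(101, 102), marks = c(82, 67))

# A list can hold objects of different kinds and sizes.
list1 <- list(earnings_vector, earnings_matrix, students_df)

str(list1)
```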

Creating a named list

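You can name the list components when you create the list, for example list(name1 = comp1, name2 = comp2).

This creates a list with components that are named name1, name2, and so on. If you want to name the components after the list has been created, you can use the names() function, like we did with vectors.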

Let’s recreate the list we had created earlier to use names.

create a named list in R
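Both ways of naming can be sketched as follows. The component names vect, mat and df are assumptions (the text only mentions vect), and the data is hypothetical:

```r
earnings_vector <- c(Monday = 140, Tuesday = 250)
earnings_matrix <- matrix(1:4, nrow = 2)
students_df     <- data.frame(roll_no = c(101, 102), marks = c(82, 67))

# Option 1: name the components while creating the list.
list2 <- list(vect = earnings_vector, mat = earnings_matrix, df = students_df)

# Option 2: create the list first, then assign names.
list3 <- list(earnings_vector, earnings_matrix, students_df)
names(list3) <- c("vect", "mat", "df")

print(names(list2))
```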

Selecting elements from a list

To select a component of a list, use the numbered position of that component inside double square brackets [[ ]]. For example, the first component of list2 can be accessed using list2[[1]] or, because that component is named vect, using list2$vect.

Remember, in order to select elements from vectors, you use single square brackets [ ].

After selecting components, you often need to select specific elements out of these components.

For example, with list2[[1]][1] you select the first element from the first component.
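A sketch of these selections on a hypothetical list2 whose first component is named vect:

```r
list2 <- list(vect = c(Monday = 140, Tuesday = 250),
              mat  = matrix(1:4, nrow = 2))

by_position <- list2[[1]]      # the component itself (a vector)
by_name     <- list2$vect      # same component, selected by name
first_value <- list2[[1]][1]   # first element inside that component

# Single brackets return a shorter list, not the component itself.
still_a_list <- list2[1]

print(first_value)
```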

Adding more information to the list

To add elements to lists you can use the c() function.

In the example below, we are adding a vector my_vec to an already existing list list1.

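This will simply extend the original list, list1, with the component my_vec. This component gets appended to the end of the list. If you want to give the new list item a name, you just add the name, for example c(list1, my_name = my_vec). Note that c() appends each element of a plain vector as a separate component, so if my_vec has more than one element and should stay together as a single component, wrap it in list() first, as in c(list1, list(my_name = my_vec)).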
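A small sketch of the difference, with hypothetical contents for list1 and my_vec:

```r
list1  <- list(c(140, 250), "some text")
my_vec <- c(1, 2, 3)

# Wrapped in list(): my_vec is appended as ONE named component.
extended <- c(list1, list(my_name = my_vec))

# Not wrapped: each element of my_vec becomes its own component.
flattened <- c(list1, my_name = my_vec)

print(length(extended))
print(length(flattened))
```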

 

What are the relational operators in R?

== Equality

Equality operator in R

!= Inequality

Inequality operator in R

< Less than

Less than operator in R

> Greater than

Greater than operator in R

>= Greater than or equal to

Greater than or equal to operator

<= Less than or equal to

Less than or equal to operator in R

Note that R is case sensitive: “E” is not equal to “e”.

R is case sensitive

You may also use expressions while using relational operators.

using expressions while using relational operators in R

Let’s compare a logical value to a numeric one: are TRUE and 1 equal?

compare logical to numeric in R

R’s ability to deal with different data structures for comparisons does not stop at vectors or variables only.

Matrices and relational operators also work together seamlessly.

matrix comparison using relational operator
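The comparisons discussed in this section can be sketched together (all values hypothetical):

```r
eq_numbers  <- 3 == (2 + 1)               # equality
neq_numbers <- 3 != 4                     # inequality
case_check  <- "E" == "e"                 # R is case sensitive
expr_check  <- (-6 * 5 + 2) >= (-10 + 1)  # expressions on both sides
logical_num <- TRUE == 1                  # TRUE is coerced to 1

# Relational operators work element by element on matrices too.
earnings_matrix <- matrix(c(140, 250, 120, 310), nrow = 2)
above_200 <- earnings_matrix > 200

print(above_200)
```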

What are the logical operators in R?

AND(&)

logical AND operator

OR(|)

logical OR operator

NOT(!)

logical NOT operator in R

Note

The expression 5 < x < 10 to check if x is between 5 and 10 will not work.

You’ll need to use  x > 5 & x < 10 for that.

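What is the difference between & and &&, | and ||?

& works element-wise on whole vectors, whereas && compares single values only.

Likewise, | works element-wise on whole vectors, whereas || compares single values only. In older versions of R, && and || silently used just the first element of each vector; since R 4.3.0, passing them a vector with more than one element is an error.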
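A sketch of the logical operators and of the single-value forms. The value of x and the two logical vectors are hypothetical:

```r
x <- 7

# Checking a range needs two comparisons joined with &.
in_range <- x > 5 & x < 10

# NOT flips a logical value.
negated <- !in_range

# & and | compare vectors element by element.
a <- c(TRUE, FALSE, TRUE)
b <- c(TRUE, TRUE, FALSE)
and_result <- a & b
or_result  <- a | b

# && and || are meant for single values, e.g. inside if ().
single_and <- (x > 5) && (x < 10)
single_or  <- (x < 5) || (x > 10)

print(and_result)
print(or_result)
```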

In R, to find the last value of a vector, we can use the tail() function, for example tail(x, 1).

Let’s define a vector and find its last value.

Using this last value, we will perform a combination of logical and relational operations to answer the questions below.

Is last under 5 or above 10?
Is last between 15 and 20, excluding 15 but including 20?

relational and logical operators in R
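A sketch that answers both questions for a hypothetical vector (the name views and its values are assumptions):

```r
views <- c(16, 9, 13, 5, 2, 17, 14)

# The last value of the vector.
last <- tail(views, 1)

# Is last under 5 or above 10?
under5_or_above10 <- last < 5 | last > 10

# Is last between 15 (exclusive) and 20 (inclusive)?
between_15_and_20 <- last > 15 & last <= 20

print(last)
print(under5_or_above10)
print(between_15_and_20)
```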

Reverse the result using the NOT (!) operator
In addition to the & and | operators, there is the ! operator, which negates a logical value. Here are some R expressions that use !

When R encounters parentheses, it evaluates the innermost expression first. So the expression x < 40 returns FALSE, and applying NOT to FALSE returns TRUE. The outer NOT operator then evaluates !TRUE, which returns FALSE.

So the output of the above expression will be FALSE.
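The original expression is not shown in the text. A sketch that is consistent with the walkthrough, assuming x is 50 and the expression is a doubly negated comparison, would be:

```r
x <- 50  # hypothetical value; any x of 40 or more behaves the same way

inner  <- x < 40         # FALSE
middle <- !(x < 40)      # !FALSE  -> TRUE
outer  <- !(!(x < 40))   # !TRUE   -> FALSE

print(outer)
```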

How to write a while loop in R?

Syntax

Let’s get started with building a while loop from the ground up. Have a look at its syntax:

while (condition) {
  expression
}

Remember that the condition part should become FALSE at some point during the execution. Otherwise, the while loop will go on indefinitely.

In the example below, the condition of the while loop checks whether current_speed is higher than threshold_speed. Inside the body of the while loop, we print a warning message and decrease current_speed by 20 units. This step is crucial; otherwise the while loop will never stop.

while loop in R
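A minimal sketch of that loop. The starting speed of 100 and the threshold of 40 are hypothetical:

```r
current_speed   <- 100
threshold_speed <- 40

while (current_speed > threshold_speed) {
  print(paste("Slow down! Current speed:", current_speed))
  # Without this decrease the condition would stay TRUE forever.
  current_speed <- current_speed - 20
}

print(current_speed)
```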

How to write a break statement in R?

The break statement is a control statement. It is used to stop the while loop during execution.

When R encounters it, the while loop is abandoned completely.

In the example below, we use the break statement to exit the while loop when current_speed reaches 140.

break statement in R
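A sketch of break inside a while loop. The starting speed of 200 and the threshold of 40 are assumptions; only the exit value of 140 comes from the text:

```r
current_speed   <- 200
threshold_speed <- 40

while (current_speed > threshold_speed) {
  if (current_speed == 140) {
    break  # leave the loop immediately
  }
  print(paste("Slow down! Current speed:", current_speed))
  current_speed <- current_speed - 20
}

print(current_speed)
```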

How to write a for loop in R?

Let’s see how to write a for loop in R.

Syntax

for (var in seq) {
  expression
}

A for loop works on both lists and vectors.

Note that while the break statement can be used to stop a loop execution, the next statement can be used to proceed to next iteration during the loop execution.

To refresh your memory, consider the following for loops that are equivalent in R:

for loop in R
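Two equivalent ways to loop over a vector, plus a next example, can be sketched like this (the vector primes is hypothetical):

```r
primes <- c(2, 3, 5, 7, 11)

# Version 1: loop directly over the elements.
total_v1 <- 0
for (p in primes) {
  print(p)
  total_v1 <- total_v1 + p
}

# Version 2: loop over the positions and index into the vector.
total_v2 <- 0
for (i in seq_along(primes)) {
  print(primes[i])
  total_v2 <- total_v2 + primes[i]
}

# next skips the rest of the current iteration (here: even numbers).
odd_total <- 0
for (p in primes) {
  if (p %% 2 == 0) {
    next
  }
  odd_total <- odd_total + p
}
```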

How to loop over a list in R?

Looping over a list is similar to looping over a vector. There are again two different approaches here:

Notice that you need double square brackets – [[ ]] – to select the list elements in loop version 2.
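The two approaches might look like the sketch below. The list info and its contents are hypothetical:

```r
info <- list(course = "R basics", year = 2020, online = TRUE)

# Loop version 1: iterate over the components directly.
seen_v1 <- character(0)
for (item in info) {
  print(item)
  seen_v1 <- c(seen_v1, as.character(item))
}

# Loop version 2: iterate over positions; note the double brackets.
seen_v2 <- character(0)
for (i in seq_along(info)) {
  print(info[[i]])
  seen_v2 <- c(seen_v2, as.character(info[[i]]))
}
```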

How to loop over a matrix in R?

Let’s define a matrix and loop over each element of it. We’ll need a for loop inside a for loop, often called a nested loop. Simply use the following syntax:

for (var1 in sequence1) {
  for (var2 in sequence2) {
    expression
  }
}

The outer loop should loop over the rows, with loop index i (use 1:nrow(matrix1)).
The inner loop should loop over the columns, with loop index j (use 1:ncol(matrix1)).

Inside the inner loop, let’s make use of print() and paste() to print out information in the following format:

“On row i and column j the matrix has x”, where x is the value on that position.

looping over a matrix in R
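A sketch of the nested loop described above, using a hypothetical 2 x 3 matrix1. The messages are also collected in a vector so the result can be inspected afterwards:

```r
matrix1 <- matrix(1:6, byrow = TRUE, nrow = 2)

messages <- character(0)
for (i in 1:nrow(matrix1)) {       # outer loop: rows
  for (j in 1:ncol(matrix1)) {     # inner loop: columns
    msg <- paste("On row", i, "and column", j,
                 "the matrix has", matrix1[i, j])
    print(msg)
    messages <- c(messages, msg)
  }
}
```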

Let’s now build a for loop from scratch

We have defined a variable, sentence. Using the strsplit() function, this string has been split into individual characters, which are stored in a vector chars.

Let’s write code that counts the number of r’s that come before the first u in this sentence.

Initialize the variable rcount as 0.
Finish the for loop:
if char equals “r”, increase the value of rcount by 1.
if char equals “u”, leave the for loop entirely with a break.

another for loop example in R
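The steps listed above can be sketched as follows. The sentence itself is a made-up example, since the original one is not shown:

```r
sentence <- "r programming is really useful for researchers"

# strsplit() returns a list; [[1]] extracts the character vector.
chars <- strsplit(sentence, split = "")[[1]]

rcount <- 0
for (char in chars) {
  if (char == "r") {
    rcount <- rcount + 1
  }
  if (char == "u") {
    break  # stop at the first "u"
  }
}

print(rcount)
```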

How to find documentation of functions in R?

We have already used a few functions, such as list(), print() and mean(). Before using an R function, you should be clear on which arguments it expects.

All the relevant details such as a description, usage, and arguments can be found in the documentation.

For example, to consult the documentation on the mean() function, you can use  either help(mean) or ?mean in the R console.

To see the arguments of a built-in function or user defined function, use args().

So, to see arguments of the mean() function, we need to use args(mean).

finding documentation on functions in R
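A short sketch of these calls. help() opens a help page, so it is guarded here to run only in an interactive session:

```r
# Open the documentation (equivalent to typing ?mean in the console).
if (interactive()) {
  help(mean)
}

# Show the arguments of mean().
print(args(mean))

# The argument names can also be read programmatically.
mean_args <- names(formals(mean))
print(mean_args)
```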

The documentation on the mean() function gives us quite a lot of information:

The mean() function computes the arithmetic mean.
The most general method takes multiple arguments: x and ... (the ellipsis).
The x argument should be a vector containing numeric, logical or time-related information.

What are optional arguments in R built-in functions?

Many built-in functions in R have optional arguments, such as na.rm.

By default, na.rm is set to FALSE.

If you want such a function to ignore NA values, set na.rm = TRUE. The NA values are then removed before the computation.

Let’s take the mean() function to understand optional arguments.

Check the documentation on the mean() function again.

The ‘Default S3 method’ is

mean(x, trim = 0, na.rm = FALSE, ...)

The ... is called the ellipsis. It is a way for R to pass arguments along without the function having to name them explicitly. The ellipsis will be discussed in more detail later.

Notice that both trim and na.rm are optional arguments and have default values.

Using trim in the mean() function will change the output of the function.

When the trim argument is not zero, mean() drops a fraction (equal to trim) of the observations from each end of the sorted vector x before computing the mean. With trim = 0.5, the result is the median.

In the example below, a trim value of 0.5 gives an output of 1.5, whereas a trim value of 0.1 gives an output of 13.25!
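A sketch of trim in action. The vector below is hypothetical and is not the data behind the figures quoted above:

```r
x <- c(1, 2, 3, 4, 100)

plain_mean   <- mean(x)               # the outlier 100 pulls the mean up
trimmed_mean <- mean(x, trim = 0.2)   # drops 1 value from each end
half_trimmed <- mean(x, trim = 0.5)   # same as the median

print(plain_mean)
print(trimmed_mean)
print(half_trimmed)
```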

 

Let’s see the optional argument na.rm in action. What happens if your vector contains missing values (NA)? Executing mean() returns NA, whereas if we add the argument na.rm = TRUE, the missing values are ignored and the mean of the remaining values is returned.

You already know that R functions return objects that you can then use somewhere else. This makes it easy to use functions inside functions. Here, I have used the paste() function within the print() function.
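Both points can be sketched with a hypothetical vector that contains one NA:

```r
values <- c(1, 2, NA, 4)

with_na    <- mean(values)                 # NA: missing value propagates
without_na <- mean(values, na.rm = TRUE)   # mean of 1, 2 and 4

# Functions inside functions: paste() builds the text, print() shows it.
print(paste("Mean without NA values:", round(without_na, 2)))
```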

How to define a function in R?

Have a look at the following syntax for defining a function:

Notice that this syntax uses the assignment operator (<-) just as if you were assigning a vector to a variable for example.

Creating a function in R basically is the assignment of a function object to a variable!

In the example above, you’re creating a new R variable my_function, that becomes available in the workspace as soon as you execute the definition. From then on, you can use the my_function as a function.

Let’s create a function pow_three(): it takes one argument and returns that number cubed (that number raised to the third power), and then call this newly defined function with 5 as input.

how to declare function in R
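A minimal sketch of pow_three() as described above:

```r
# Assign a function object to the variable pow_three.
pow_three <- function(x) {
  y <- x^3
  return(y)
}

# Call the function with 5 as input.
result <- pow_three(5)
print(result)
```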

Write your own function
Sometimes your function may not require an input. Let’s say you want to write a function that gives us the random outcome of throwing a fair die.

define a function without arguments in R
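One way to sketch such a function is with sample(). The name throw_die is an assumption:

```r
# A function without arguments: the parentheses stay empty.
throw_die <- function() {
  sample(1:6, size = 1)  # pick one of the six faces at random
}

outcome <- throw_die()
print(outcome)
```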

You can define default argument values in your own R functions as well. You can use the following syntax to do so:

Let’s modify the function that we wrote before, pow_three, to add a default argument.

define a function with default arguments
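A sketch of pow_three() with a default argument. The argument name print_info is an assumption, not taken from the post:

```r
# print_info defaults to TRUE, so callers may leave it out.
pow_three <- function(x, print_info = TRUE) {
  y <- x^3
  if (print_info) {
    print(paste(x, "to the power three equals", y))
  }
  return(y)
}

with_default <- pow_three(5)                      # prints the message
without_info <- pow_three(5, print_info = FALSE)  # stays silent
```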

What is function scoping in R?

Let’s discuss function scoping. Variables that are defined inside a function are not accessible outside that function. Try running the following code and see if you understand the results:

y was defined inside the pow_three() function and therefore it is not accessible outside of that function. This is also true for the function’s arguments of course – x in this case.
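A sketch of this behaviour. It uses exists() to check for y instead of printing y directly, because printing an undefined variable would stop the script with an error:

```r
pow_three <- function(x) {
  y <- x^3   # y only lives inside the function
  return(y)
}

result <- pow_three(2)

# After the call, y is not defined outside the function.
y_visible <- exists("y", inherits = FALSE)

print(result)
print(y_visible)
```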

Does R support call by value or call by reference?

R passes arguments by value rather than by reference. This means that an R function cannot change the variable that you pass to it. Let’s look at a simple example:

call by value example in R

Inside the double() function, the argument x gets overwritten with twice its value.

Afterwards this new x is returned. If you call this function with a variable a set equal to 3, you obtain 6. But did the value of a change? If R were to pass a to double() by reference, the override of the x inside the function would ripple through to the variable a, outside the function.

However, R passes by value, so the R objects you pass to a function can never change unless you do an explicit assignment. a remains equal to 3, even after calling double(a).
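The behaviour can be checked with a short sketch. It keeps the function name double() from the text, which masks base R's own double() for the rest of the session:

```r
# Note: this definition hides base::double() in the current session.
double <- function(x) {
  x <- 2 * x   # only the local copy of x changes
  x
}

a <- 3
b <- double(a)

print(a)  # unchanged
print(b)
```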

What is an R Package?

The functions that we have seen so far, such as mean() and list(), are part of R packages. So what is a package?

An R package is a bundle of code, data, documentation and tests. Examples are base and ggvis.

Use search() to list the packages that are currently attached to your session.

how to find list of packages installed in R
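A short sketch of search(), with installed.packages() shown next to it for contrast, since attached and installed are not the same thing:

```r
# Packages (and environments) currently attached to the session.
attached <- search()
print(attached)

# Names of all packages installed in the library, attached or not.
installed <- rownames(installed.packages())
print(head(installed))
```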

When R starts, it attaches 7 packages to the session by default. We can see that the base package is attached automatically.

Some packages, like ggvis, are not installed by default.

To install them, use install.packages()

how to install packages in R
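A hedged sketch of installing a package only when it is missing. The helper install_if_missing() is hypothetical and not part of R; the actual installation call is left commented out because it needs an internet connection:

```r
# Hypothetical helper: install a package only if it is not available yet.
install_if_missing <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)  # downloads from the configured CRAN mirror
  }
  invisible(NULL)
}

# Uncomment to install ggvis when it is missing:
# install_if_missing("ggvis")
```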

The packages will be downloaded from CRAN website.

CRAN means Comprehensive R Archive Network.

How to load an R package?

Loading a package means attaching it to the search list, which makes its functions available in your session.

You can load packages using the library() or require() functions. library() attaches the package to the search list of your R session.

Using search() again will now show that ggvis is attached to your session.

how to load R package

Another function to load a package is require().

What is the difference between library() and require()?

library() throws an error if the package cannot be found, for example because it is not installed.

require() does not throw an error in that case; it issues a warning and returns FALSE.

Note that the library() and require() functions are not very picky when it comes down to argument types: both library(ggplot2) and library("ggplot2") work perfectly fine for loading a package.
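The difference can be sketched with a package name that is assumed not to exist (notARealPackage12345 is made up for this purpose), and with stats standing in for a package that is always available:

```r
# Both forms attach the same package.
library(stats)
library("stats")

fake_pkg <- "notARealPackage12345"  # hypothetical, assumed not installed

# require() returns FALSE (with a warning) instead of stopping.
loaded <- suppressWarnings(
  require(fake_pkg, character.only = TRUE, quietly = TRUE)
)

# library() raises an error, which is caught here to keep the script running.
outcome <- tryCatch({
  library(fake_pkg, character.only = TRUE)
  "attached"
}, error = function(e) "error")

print(loaded)
print(outcome)
```

Because require() returns a logical value, it is sometimes used inside if () to react to a missing package, whereas library() is the usual choice at the top of a script.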

That’s all for now on the introduction to R for data science. Let’s talk about advanced concepts during my data science course in Hyderabad.

Please work out all of these examples and feel free to ask questions in the section below.

 

 

 
