10 techniques to load data into R

In this post, we will learn the techniques to load data into R from a variety of sources.


Import data from CSV in R

The read.csv() function can be used to import CSV (comma-separated values) data from a flat file into an R dataframe. read.csv() lives in the utils package, which is loaded by default, so it need not be loaded explicitly. Note that read.csv() is a wrapper function that calls read.table() in the background.

read.csv()

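A minimal sketch, assuming a hypothetical file pools.csv sitting in your working directory:

pools <- read.csv("pools.csv")
head(pools)   # peek at the first few rows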

If the file is present in your current working directory, use the filename directly as the value of the file parameter. If not, use a path built with the file.path() function. The advantage of file.path() is that it lets you refer to a file path in a platform-independent way.

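For example, assuming the same hypothetical pools.csv lives in a data subdirectory:

path <- file.path("data", "pools.csv")   # joins the path components in a platform-independent way
pools <- read.csv(path)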

Find the structure of the dataframe using str()

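A quick sketch, continuing with the hypothetical pools dataframe:

str(pools)   # shows the number of observations, the variables and the class of each column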

If you look at the structure of the dataframe created above, you may find that the character columns have been imported as factors instead of character data. This is because the default behaviour of R (in versions before 4.0.0) is to convert character data into factors, which can make tasks such as replacing values harder.

The stringsAsFactors Argument of read.csv()

The stringsAsFactors argument controls whether string columns are imported as character data or as factors, which R uses to store categorical variables. By default (in R versions before 4.0.0), stringsAsFactors is set to TRUE, which converts all the string data in your file into factors within your dataframe.

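A sketch with the same hypothetical file, keeping strings as character data:

pools <- read.csv("pools.csv", stringsAsFactors = FALSE)
str(pools)   # character columns now stay character instead of becoming factors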

The header argument of read.csv()

By default, read.csv() sets the header parameter to TRUE, which means the very first row of the .csv file is treated as header information (the column names). If your data doesn't have a header row, set the header parameter to FALSE.

The sep argument of read.csv()

The default delimiter for read.csv() is a comma, but you can handle other delimiters by supplying the sep argument (e.g., sep = ";" lets a semicolon-separated file be read properly).
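Both arguments in one hedged sketch, assuming hypothetical files without a header row and with semicolons respectively:

pools <- read.csv("pools_no_header.csv", header = FALSE)   # first row is data, not column names
pools <- read.csv("pools_semicolon.csv", sep = ";")        # semicolon-separated values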

read.csv2()

read.csv2() is the same as read.csv(), with the only difference being that it accounts for regional/locale differences in representing decimal numbers. In some locales, the number 4.97 is written as 4,97. We can't use a comma as the field delimiter in that case, so a semicolon is used as the field delimiter instead (read.csv2() uses dec = "," and sep = ";" by default).


Import data from a text file in R

Apart from .csv files, there are also .txt files, which are essentially plain text files. You can import these with read.delim(). read.delim() can be used to read tab-delimited data from a text file into an R dataframe. It is a wrapper function that calls read.table() in the background.

read.delim()


By default, it sets the sep argument to "\t" (fields in a record are separated by tabs) and the header argument to TRUE (the very first row contains the column names).
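A minimal sketch, assuming a hypothetical tab-delimited file pools.txt:

pools <- read.delim("pools.txt")                  # sep = "\t" and header = TRUE by default
pools <- read.delim("pools.txt", header = FALSE)  # if the first row does not contain column names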

read.delim2()

read.delim2() is the same as read.delim(), with the only difference being that it accounts for regional/locale differences in representing decimal numbers. In some locales, the number 3.97 is written as 3,97; read.delim2() therefore reads the decimal separator as a comma (dec = ","), while fields are still separated by tabs.

read.table()

If you are working with more exotic flat file formats, you will want to use read.table(). Contrary to read.csv() and read.delim(), the header parameter defaults to FALSE and the sep parameter defaults to "" (any whitespace). Both read.csv() and read.delim() call read.table() in the background.


Besides column names, it is also possible to specify the column types or column classes of the resulting dataframe. You can do this by supplying the colClasses argument alongside col.names, passing a vector of strings reflecting the class of each column.

This strategy is useful when some columns should be factors and others should be characters. You no longer need to bother with stringsAsFactors; simply state for every column what its class should be.

When a column is set to "NULL" in the colClasses vector, that column is skipped and will not be loaded into the dataframe. Refer to the example below:

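A hedged sketch, assuming a hypothetical slash-separated file people.txt with three columns and no header row:

people <- read.table("people.txt",
                     sep = "/",
                     col.names = c("name", "age", "city"),
                     colClasses = c("character", "numeric", "NULL"))
str(people)   # only name and age are imported; city was skipped via "NULL"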

All these utils package functions are part of core R functionality, but they are a bit slow. There are specialized packages for importing data into R; readr is one such package.

Using readr package to load data

The readr package is written by Hadley Wickham. It outputs a tibble, an improved version of the dataframe.

To install the readr package, use install.packages("readr"). To load readr into your R session, use library(readr).

read_csv()

To import data into R from a csv file, we used read.csv() from the utils package before. Now let's use read_csv() from the readr package.

Remember that we don't need the stringsAsFactors parameter when using read_csv(); readr never converts strings to factors.

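A minimal sketch with the hypothetical pools.csv again:

library(readr)
pools <- read_csv("pools.csv")   # returns a tibble; strings stay as characters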

read_tsv()

To import data into R from a tab-delimited file, we used read.delim() from the utils package before. Now let's use read_tsv() from the readr package.
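A sketch with the hypothetical tab-delimited pools.txt:

pools <- read_tsv("pools.txt")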

read_delim()

While read.table() is the main function in the utils package, read_delim() is the main function in the readr package.

Both read_csv() and read_tsv() are wrapper functions on top of read_delim().

Note that file and delim are mandatory arguments for read_delim.
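For example, assuming a hypothetical slash-separated file people.txt:

people <- read_delim("people.txt", delim = "/")   # file and delim are the two mandatory arguments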

col_names, skip, n_max and col_types  options are commonly used across all the functions of readr package while importing data into R. Refer to the example usage below.

col_names

For all readr-based functions, you can use the additional arguments below.

Use the col_names argument to specify column names, or set it to FALSE to let readr generate names (X1, X2, ...) itself.
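A sketch, assuming a hypothetical pools_no_header.csv without a header row:

read_csv("pools_no_header.csv", col_names = FALSE)                          # readr generates X1, X2, ...
read_csv("pools_no_header.csv", col_names = c("name", "type", "visitors")) # supply your own names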

skip and n_max

By using skip and n_max you’ll be able to adjust which section of your file you want to import into R.

skip defines the number of lines you want to leave out of the file before actually beginning to import data. n_max defines the number of lines you are actually importing.

Let’s assume you have a CSV file containing 20 lines, and you have defined skip = 3 and n_max = 4,  then you are just importing the lines 4, 5, 6 and 7 of this document.

Remember that when using skip, you also skip the initial line, which could contain the column titles!
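Continuing the 20-line example, a hedged sketch that re-supplies the column names because the header line is skipped as well:

fragment <- read_csv("pools.csv", skip = 3, n_max = 4,
                     col_names = c("name", "type", "visitors"))   # imports lines 4 to 7 as data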

col_types

While importing data into R using the readr package, you can also define what datatypes the columns of your imported dataframe should have. You can achieve this with col_types. If set to NULL, the default, all the functions from the readr package will guess the right datatypes themselves. You can manually define the datatypes with a short string, where every character refers to the class of a column: l for logical, i for integer, d for double, and c for character. Use _ to skip a column.
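A sketch, again assuming the hypothetical three-column pools.csv:

read_csv("pools.csv", col_types = "cci")   # character, character, integer
read_csv("pools.csv", col_types = "c_i")   # skip the second column entirely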


col_types with collector functions

Another way of setting the column datatypes is by using collectors. Collector functions can be passed in a list() to the col_types argument of the read_ functions.

Refer to the collector documentation for a full list of collector functions that can be used.

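A hedged sketch with collector functions, assuming the hypothetical pools.csv whose type column only takes the values "indoor" and "outdoor":

read_csv("pools.csv",
         col_types = list(col_character(),
                          col_factor(levels = c("indoor", "outdoor")),
                          col_integer()))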

How to use data.table package to import data into R

One of the fastest packages for importing data into R is data.table. It is written by Matt Dowle and Arun Srinivasan. It contains a powerful function fread() to import data into R.

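Installing and loading the package works like any other CRAN package:

install.packages("data.table")
library(data.table)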

fread()

fread() can read datasets with or without column names. It can also detect column data types and separators automatically, without you specifying them. It is very fast as well. You can think of it as an improved version of read.table().

You ought to know about two parameters of this function, select and drop, used to select or drop variables of interest.

Suppose you have a dataset with 5 variables and you would like to retain only the first and second variables, named "a" and "b". Any of the choices sketched below will work:
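A hedged sketch, assuming a hypothetical file sales.csv whose five columns are named a to e:

dt <- fread("sales.csv", select = c("a", "b"))    # keep columns by name
dt <- fread("sales.csv", select = c(1, 2))        # or by position
dt <- fread("sales.csv", drop = c("c", "d", "e")) # or drop everything else
class(dt)                                         # "data.table" "data.frame"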

Note that fread() returns an object with both the "data.table" and "data.frame" classes.

Import data from Excel using readxl

To import data from an Excel spreadsheet, use the readxl package.

Before you can begin importing from Excel, you need to find out which sheets are available in the workbook. You can use the excel_sheets() function for this.

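A minimal sketch, assuming a hypothetical workbook cities.xlsx:

library(readxl)
excel_sheets("cities.xlsx")   # returns a character vector of worksheet names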

Once you know the worksheet names, you can import them using read_excel(), either by name or by worksheet number. read_excel() returns a dataframe.
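For example, assuming the hypothetical cities.xlsx has a sheet called "year_2000":

read_excel("cities.xlsx", sheet = "year_2000")   # by name
read_excel("cities.xlsx", sheet = 1)             # by position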

If there are multiple worksheets in an Excel workbook, calling read_excel() once per worksheet and then merging the results into a list can be a tedious exercise. We can use lapply() in that case, as sketched below.
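A sketch that reads every sheet of the hypothetical workbook into one list of dataframes:

path <- "cities.xlsx"
all_sheets <- lapply(excel_sheets(path), function(s) read_excel(path, sheet = s))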

In addition to the path and sheet arguments, there are other arguments that you can specify for the read_excel() function.

The col_names argument

col_names can be TRUE (the default), FALSE, or a character vector.
If TRUE, the first row of the Excel sheet contains the column names. If FALSE, R assigns column names itself. It can also be set to a character vector of column names.

The col_types argument

An Excel sheet can contain different data types, including text, numerics and dates.
How these data types are interpreted when converted to an R dataframe depends on col_types. By default, col_types is NULL, which means R chooses the datatypes itself when col_types is not specified.

You can manually specify col_types as well. Suppose you want to import all columns of a worksheet as text; then use the "text" keyword. Other keywords to enforce a data type are "numeric", "date" and "blank". If you use "blank", R will simply ignore that column. This is useful when you have an Excel file with a lot of columns and you only need some of them.

The skip argument

skip specifies the number of rows of your Excel sheet that R has to skip before reading it into a dataframe.
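A hedged sketch combining the three arguments on the hypothetical cities.xlsx (note that newer readxl versions use "skip"/"guess" instead of "blank"):

read_excel("cities.xlsx", sheet = 1,
           skip = 2,                                    # ignore the first two rows
           col_names = c("city", "population", "founded"),
           col_types = c("text", "numeric", "date"))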

The readxl package is still under active development. excel_sheets() and read_excel() are the only two functions you need to learn. On top of that, readxl is very fast and its arguments are similar to those of the readr package.

Loading data from Excel into R using gdata package

One of the alternatives for reading data from Excel is the gdata package. It is maintained by Gregory Warnes. It is an entire suite of tools for data manipulation; it supercharges base R to make data manipulation less painful.

Let's learn how to install gdata and then explore its functions.

There is a function read.xls() in gdata to read Excel data. By default, the xls format is supported; you can import xlsx data by installing an additional driver. gdata has no function like readxl's excel_sheets() to list the worksheets.

To import data from Excel, gdata uses Perl. The Perl code converts the data in the XLS file to a CSV file.
Next, the CSV file is imported into an R dataframe using the read.csv() function from the utils package. Hence, the performance of read.xls() will be slower compared to read_excel().

read.csv() is a wrapper on top of read.table(), so all the arguments of read.table(), like header, stringsAsFactors, col.names etc., will also work in read.xls().
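A hedged sketch, assuming a hypothetical cities.xls workbook:

install.packages("gdata")
library(gdata)
cities <- read.xls("cities.xls", sheet = 1,
                   stringsAsFactors = FALSE)   # read.table arguments pass straight through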

If you have multiple worksheets in the workbook and you want to merge them into a single dataframe, you need to read each worksheet into a separate dataframe and then use cbind() to merge these dataframes.

As you can see, using read.xls() is a slow and tedious way to import Excel data. If you have a huge file to import, this is quite some extra work!

Importing data from Excel into R dataframes using XLConnect

Written and maintained by Martin Studer, XLConnect is a comprehensive package for working with Excel files through R. You can think of XLConnect as a bridge between Excel and R. Practically anything that you would do in Excel can be done in R using this package. Editing Excel sheets, formatting data, adapting calculation sheets: you name it, and XLConnect has a function for it!

XLConnect works with both xls and xlsx files. XLConnect uses Java internally.

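Installation works like any other CRAN package:

install.packages("XLConnect")
library(XLConnect)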

If you have problems with installation, install Oracle’s Java Development Kit (JDK) or google the error 🙂 There are quite a few people using this package, so help is never far away.

To load a workbook into R, use loadWorkbook(), passing the name of the Excel file you want an interface to. It returns a workbook object.

This workbook object is the bridge between R and Excel. You can use it to get information about the Excel file it links to. To get the names of the different sheets, use getSheets().

getSheets() of the XLConnect package returns a character vector containing the worksheet names. This gives the same output as excel_sheets() of the readxl package.

Apart from the sheet names, you can also read data from the sheets using readWorksheet(). The cool thing about readWorksheet() is that you can easily specify from which rows and columns you want to read, by using startRow, endRow, startCol and endCol.
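A hedged sketch, again assuming the hypothetical cities.xlsx with a sheet "year_2000":

book <- loadWorkbook("cities.xlsx")
getSheets(book)                            # character vector of sheet names
readWorksheet(book, sheet = "year_2000",
              startRow = 1, endRow = 10,
              startCol = 1, endCol = 3)     # only the first 10 rows and 3 columns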

Adapting sheets

XLConnect has more to offer than just importing Excel data into R. The package is an easy tool for modifying the content of your workbook from R.
Suppose you want to add the content of an R dataframe df to a new worksheet called data_summary, as sketched below.
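A minimal sketch, assuming df already exists and book is the workbook object from above:

createSheet(book, name = "data_summary")
writeWorksheet(book, df, sheet = "data_summary")
saveWorkbook(book, file = "cities.xlsx")   # nothing is written to disk until you save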

You can rename existing worksheets using the renameSheet() function.

You can remove existing worksheets using the removeSheet() function.
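For example, continuing with the same workbook object:

renameSheet(book, sheet = "data_summary", newName = "summary_2000")
removeSheet(book, sheet = "summary_2000")
saveWorkbook(book, file = "cities.xlsx")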

Though these are pretty basic operations that you could easily do in Excel as well, the cool thing is that you can program all these tasks in R in a reproducible way. Apart from the functions discussed here, there are methods for styling cells, working with formulas, merging and splitting cells, and more.

importing data into R from a database

Till now, we have seen how to import data from flat files and Excel files. In a professional setting, though, you will also encounter data stored in relational databases.

Depending on the database you want to connect to, you have to use a different R package. For connecting to a MySQL database, you need the RMySQL package. For PostgreSQL, use the RPostgreSQL package; for an Oracle database, use the ROracle package, and so on.

How to interact with a database, which R functions you use to access and manipulate the data in a database, is specified in another R package called DBI. In technical terms, DBI is an interface and RMySQL is the implementation.

The first step in connecting to a MySQL database is to create a DBIConnection object. Do this with a dbConnect() call.

dbConnect() creates a connection between your R session and the SQL database.
Go through a few important functions below to import data from a database.
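A hedged sketch with made-up credentials; dbname, host, user and password all depend on your own database:

library(DBI)
con <- dbConnect(RMySQL::MySQL(),
                 dbname = "company",
                 host = "database.example.com",
                 port = 3306,
                 user = "analyst",
                 password = "secret")
dbListTables(con)                            # which tables does the database contain?
employees <- dbReadTable(con, "employees")   # import an entire (hypothetical) table as a dataframe
# dbDisconnect(con) closes the connection when you are done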

importing selective data by using SQL queries from inside R

When you work as a data scientist, you will be working with huge databases that can contain tables with millions of rows. With dbReadTable(), you are importing an entire table. For tables containing a small number of records, that may not be a problem.

But what if the table you are trying to import contains millions of records and you need only a fraction of the data for your analysis? If the data selection happens inside the database itself, R only needs to import the selected data.

To import selective data, use either the subset() function or the dbGetQuery() function.
There is a difference in how these two approaches execute: subset() imports the entire table first and then does the selection in R.

With dbGetQuery(), you send a query to the database, the query runs on the database side, and only the results are imported into R. If you are dealing with huge tables, using dbGetQuery() is far more efficient. Note the syntax of dbGetQuery() in the sketch below.
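A hedged sketch, assuming the hypothetical employees table has name and salary columns:

# selection happens in R, after the whole table was transferred
high_paid <- subset(dbReadTable(con, "employees"), salary > 50000)
# selection happens in the database; only matching rows are transferred
high_paid <- dbGetQuery(con, "SELECT * FROM employees WHERE salary > 50000")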

dbGetQuery() is a virtual function in the DBI package and is implemented by the RMySQL package. Internally, dbGetQuery() calls these three functions:

dbSendQuery(con, "SQLQuery") sends the query to the database engine.
dbFetch() fetches the result of the executed query. You can specify the maximum number of records to retrieve per fetch, which is useful when you want to process a huge number of records chunk by chunk.
dbClearResult() clears the result set.
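A sketch of the chunked approach, using the same hypothetical table:

res <- dbSendQuery(con, "SELECT * FROM employees WHERE salary > 50000")
while (!dbHasCompleted(res)) {
  chunk <- dbFetch(res, n = 100)   # at most 100 records per fetch
  print(nrow(chunk))               # process each chunk here
}
dbClearResult(res)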

importing data from web into R dataframes

reading csv files from the web

More and more of the information that you want to work with as a data scientist resides on the web. If you have URLs that point to flat files on the web, you can use readr functions like read_csv() and read_tsv() to import those flat files into R dataframes.

You can also use read.csv() function from utils package to read csv files from the web.
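A minimal sketch with a hypothetical URL:

url_csv <- "http://example.com/data/pools.csv"
pools <- read_csv(url_csv)                            # readr reads straight from the URL
pools <- read.csv(url_csv, stringsAsFactors = FALSE)  # utils works on URLs as well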

reading excel files from the web

If you try to use the read_excel() function of the readxl package to read an Excel file stored on the web, it will not work. This is because read_excel() can read local files only.

A workaround is to use download.file(), a function from the utils package, to download the Excel file to your local directory first. download.file() performs a GET request to the server hosting the file and can download any kind of file over http, https, etc.

In the example below, ~ refers to your home directory and fileName is the name under which the downloaded file is saved. You can then use read_excel() on the downloaded file.
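A hedged sketch, with a made-up URL and file name:

url_xlsx <- "http://example.com/data/cities.xlsx"
fileName <- file.path("~", "cities.xlsx")
download.file(url_xlsx, destfile = fileName, mode = "wb")   # mode = "wb" keeps binary files intact on Windows
cities <- read_excel(fileName)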

You may also use the read.xls() function of the gdata package to read Excel files directly through a URL on the web.

what is an RData file?

An RData file is an efficient format for storing R data. You can load a local RData file with the load() function. A remote RData file can be loaded by wrapping its URL in the url() function.
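A sketch of both cases, with hypothetical paths:

load("cities.RData")                                 # local file: objects appear in your workspace
load(url("http://example.com/data/cities.RData"))    # remote file via url()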

using httr to download a file from the web

Downloading a file means sending a GET request and receiving the file. The httr package, written by Hadley Wickham, provides a convenient function GET() to execute the GET request. The result is a response object, using which you can get the status code, content type, and the actual content.

Use the content() function to read the content of a response object. content() has an argument called as, which lets you retrieve the content as a raw object, as a character vector, or as a parsed R object.
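A sketch against a hypothetical URL:

library(httr)
resp <- GET("http://example.com/data/cities.csv")
status_code(resp)              # 200 when the request succeeded
content(resp, as = "raw")      # raw bytes
content(resp, as = "text")     # one character string
content(resp)                  # httr guesses the type and parses it into an R object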

APIs and JSON

The JSON format is very simple. It is concise and well structured, human readable, and easy for computers to interpret and generate. This makes it perfect for communicating with web APIs.

To convert JSON to an R data structure, you can use jsonlite, a package developed by Jeroen Ooms. The fromJSON() function converts strings that represent JSON into R lists (or dataframes). fromJSON() works whether you pass it a string of JSON, a path to a local file containing JSON, or a URL that returns JSON.
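A minimal sketch with a hypothetical JSON string and URL:

library(jsonlite)
fromJSON('{"city": "Pune", "population": 3100000}')   # JSON string -> named list
fromJSON("http://example.com/api/cities.json")        # also accepts a file path or URL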

how to convert R data to JSON format

You can use the toJSON() function to convert R data to JSON format. When you pass an R object to this function, it returns a character string representing the JSON.

JSON can be in minified format or pretty format (with indentation). By default, toJSON() returns minified JSON. You can change this behaviour by setting the pretty argument of toJSON() to TRUE. You can also use prettify() or minify() to convert an existing JSON string to pretty or minified format.

mtcars is a dataset that ships with R. Let's see how to convert this data to pretty format and minified format.
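A minimal sketch:

mtcars_json <- toJSON(mtcars)            # minified by default
toJSON(mtcars, pretty = TRUE)            # indented, human-readable
prettify(mtcars_json)                    # pretty-print an existing JSON string
minify(toJSON(mtcars, pretty = TRUE))    # and back to minified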

Importing data from statistical software packages

Till now, we haven't seen how to import data from statistical software packages. The most commonly used are:

SAS – Statistical Analysis Software. Produces the file types .sas7bdat and .sas7bcat
STATA – STatistics and daTA. Uses .dta
SPSS – Statistical Package for the Social Sciences. Generates the file types .sav and .por

Which of these packages is used depends on the field of study and on personal preference. No matter which package the data comes from, R is prepared for whatever file comes along.

R has two packages, haven and foreign, which can import data from statistical software packages.
haven was written by Hadley Wickham and foreign by the R Core Team. haven is more consistent, easier and faster to use; foreign supports a larger range of data formats.

haven

This package can deal with SAS, STATA and SPSS data files, as it wraps the ReadStat C library developed by Evan Miller. Its read functions accept a single argument, the path to the file, and the result is an R dataframe!

To load SAS data into R, we can use the read_sas() function.

haven can import STATA 13 and STATA 14 data files. To load STATA data into R, we can use read_stata() or read_dta().

To load SPSS data into R, use the functions read_sav() or read_por(), depending on the file type.
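A hedged sketch with hypothetical file names:

library(haven)
sales   <- read_sas("sales.sas7bdat")   # SAS
traffic <- read_dta("traffic.dta")      # STATA (read_stata() is an alias)
survey  <- read_sav("survey.sav")       # SPSS (.por files use read_por())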

With STATA and SPSS files, you might see some columns being imported as "labelled" vectors in the dataframe. This is done to retain the labelling information.

In order to continue working with the data efficiently in R, change these labelled vectors into a standard R class, such as a factor. You can use haven's as_factor() function for this.
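For example, assuming a hypothetical labelled column education in the survey dataframe:

survey$education <- as_factor(survey$education)   # labelled vector -> regular R factor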

foreign

Written by the R Core Team. Although the foreign package is less consistent in naming and usage, it is a very comprehensive tool that can work with all kinds of data formats. It can handle SAS, STATA, SPSS, Systat and Weka formats, among others.

foreign cannot import .sas7bdat files. So how can you import .sas7bdat files other than with the haven package? You can use the sas7bdat package.

The foreign package can handle SAS libraries in the .xport format.

When it comes to STATA, foreign can read .dta files of STATA versions 5 to 12. You can use the read.dta() function for this. Its most useful arguments are listed below.

convert.factors: converts labelled STATA values to R factors. If set to TRUE, it automatically converts "labelled" data to factors. Remember, we used haven's as_factor() to do this manually.
convert.dates: converts STATA dates and times to R Date and POSIXct objects.
missing.type: STATA 8 and later support 27 types of missing values, while R has only one, NA. If missing.type is set to FALSE, all types of STATA missing values are converted to NA. If set to TRUE, a list describing how the missing values are coded for each variable is included in the attributes of the returned dataframe.
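A hedged sketch with a hypothetical traffic.dta file:

library(foreign)
traffic <- read.dta("traffic.dta",
                    convert.factors = TRUE,
                    convert.dates = TRUE,
                    missing.type = FALSE)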

You can import SPSS data using the read.spss() function. Its key arguments are:

use.value.labels: TRUE by default. Specifies whether variables with value labels (labelled variables) in SPSS should be converted to R factors. It is similar to the convert.factors argument of read.dta().
to.data.frame: FALSE by default. Tells R whether to return the SPSS data as a dataframe instead of a list.
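A sketch with a hypothetical survey.sav file:

survey <- read.spss("survey.sav",
                    use.value.labels = TRUE,
                    to.data.frame = TRUE)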

Hope this article has been informative. Please share it if you liked it 🙂
