Identifying dirty data and techniques to clean it in R

In this post, we will learn to identify dirty data and the techniques used to clean it in R.

Let us take a look at an example of a dirty dataset – a weather dataset collected in Boston, USA over a period of 12 months.

sample dataset for cleaning in R

At first glance, this data might appear clean: it is arranged into rows and columns, the columns are named, and the rows are numbered. The data is already in the format of an RDBMS table. But a closer look reveals many problems that need to be solved before building a predictive analytics model on it.

The column X, on the extreme left, seems to be meaningless. What do the columns with the labels X1, X2, X3…etc represent? If they indicate days of a month, it means that time is present in both columns and rows.

The different weather metric names present in each row of the measure column need to become separate columns. You can also find NAs at the tail of the data.

Why should we care about cleaning data?

Dirty data is everywhere. Most of the real-world datasets we work with were dirty to begin with; by the time we start working with them, someone has usually already cleaned them.

When you start to do data analysis or modeling, the availability of clean data is of utmost importance. Hence you need to learn the different techniques to clean messy data.

With the advent of big data, it is critical to understand that data cleaning is an important part of any data science project. You can categorize any data science project into 4 simple steps.

  1. Acquiring or collecting the data.
  2. Data Cleaning.
  3. Modeling or Analyzing data.
  4. Reporting insights to the relevant audience.

If we attempt to skip the data cleaning step, we will run into issues in any data science project, as raw data is tough to deal with, even using tools like Python or R.

Data cleaning is not only an essential component of any data science project but also the one that takes most of the time. As this New York Times article reported in 2014 –

Data scientists, according to interviews and expert estimates, spend from 50 percent to 80 percent of their time mired in this more mundane labor of collecting and preparing unruly digital data, before it can be explored for useful nuggets.

Exploring raw data

The data cleaning process starts with exploring the raw data. Data exploration is a 3-step procedure:

1. Understand the structure of the data

Before you start data cleaning, you need to understand the structure of the data. Several functions are available in R to understand the structure of data.

functions used for data cleaning in R

  • class(df) – We use the class() function to check whether the weather object is a dataframe, i.e., a 2-d table with rows and columns. Every column in a dataframe should have a single data type only – character, numeric, etc.
  • dim(df) – We can see the dimensions of the weather dataframe using the dim() function. dim() shows the number of rows (286) followed by the number of columns (35).
  • names(df) – We can see the dataframe column names using the names() function.
  • str(df) – The str() (structure) function is an important function in R. Any R object can be passed to it, and it will show a summary of the object’s internal structure. When a dataframe is passed to it, str() displays
    • how many rows (observations) and columns (variables) are present.
    • column names and their respective data types.
    • a preview of the data present in each column.
  • library() – The library() function is used to load R packages into your current working environment.

glimpse function in R

  • glimpse(df) – The dplyr package has a function called glimpse(), which is an improved version of str(). The glimpse() function shows the same information as str(), but you can see as many values per column as will neatly fit on your terminal.

summary function in R

  • summary(df) – The summary() function provides a statistical summary per column.
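Putting these together, here is a minimal sketch of a first look at the data, assuming the weather dataframe described in this post has already been loaded into your R session:

library(dplyr)    # provides glimpse()

class(weather)    # "data.frame"
dim(weather)      # 286 rows, 35 columns
names(weather)    # the column names
str(weather)      # dimensions, column types and a preview of each column
glimpse(weather)  # a more readable version of str()
summary(weather)  # a statistical summary per column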

2. Looking at your data

We have seen summaries of the data using the R functions above. But we can explore the data more thoroughly by looking at it directly.

head() function

By default, the head() function shows the top 6 rows of a dataframe. You can specify how many rows to display from the top by adding the argument n.

tail() function

By default, the tail() function shows the last 6 rows of a dataframe. You can specify how many rows to display from the bottom by adding the argument n.
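A minimal sketch of both functions, again assuming the weather dataframe is loaded:

head(weather)          # first 6 rows by default
head(weather, n = 10)  # first 10 rows

tail(weather)          # last 6 rows by default
tail(weather, n = 8)   # last 8 rows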

3. Visualizing the data

Viewing the head and tail of data may give you only a few insights. An alternate way to find issues with data is by visualizing it.

Use the hist() function to plot the histogram of a vector or a dataframe column. The resulting histogram gives you an understanding of the variable's distribution.

hist() - visualizing data

histogram in R

If you want to view the relationship between two variables or vectors, you can generate a scatterplot with the plot() function.

plot() - generating scatterplot
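As a quick, self-contained sketch (using R's built-in mtcars dataset rather than the weather data shown in the screenshots above):

hist(mtcars$mpg)             # distribution of a single numeric column
plot(mtcars$wt, mtcars$mpg)  # scatterplot of the relationship between two columns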

Tidying data

The concepts of tidy data have been around for decades. They may seem familiar if you have ever worked with relational databases.

In 2014, Hadley Wickham published a paper called Tidy Data in the Journal of Statistical Software, which summarizes these concepts in a clear and concise way. We will introduce the concepts of tidy data and review some simple and practical methods of implementing them in R.

Here is an example of tidy data.

Example of tidy data

For each of four people, we have a name, age and eye_color. Looking across the first row, we see that Aditya is 33 years old and has blue eyes. This row is called an observation.

Looking down the second column, we see the distribution of ages among our 4 subjects. This column is called a variable or attribute. Each individual age is called a value of the age variable.

If we were to give this dataframe a name, we would call it people, since each observation describes the characteristics of a single person, also known as an observational unit.

If some rows, instead of describing individual people, described characteristics of their pets or of individual movies, then the table would contain more than one type of observational unit. The following is always true of tidy data: observations are represented as rows, variables as columns, and there is exactly one type of observational unit per table.

Said differently, a dataset is a collection of values. Each value belongs to both a variable and an observation. A variable contains all values that measure the same attribute across units and an observation contains values measured on the same unit across attributes.
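For illustration, a tidy people dataframe along these lines could be built as follows (apart from Aditya's row, the names, ages and eye colors here are made up):

people <- data.frame(name      = c("Aditya", "Bindu", "Chris", "Diana"),
                     age       = c(33, 29, 41, 25),
                     eye_color = c("blue", "brown", "brown", "other"))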

Symptoms of messy data

Now that we have seen the principles of tidy data, let's look at a simple example of messy data and try to figure out what is wrong with it.

Example of messy data

In this example, observations are still in rows, and we only have one type of observational unit, people. It even appears that each column is a variable. Notice, however, that the columns brown, blue and other are actually values of what was previously the eye_color variable. This is a common symptom of messy data: column headers are values, not variable names.

When talking about datasets, it is often convenient to refer to them as either wide or long. Although these definitions are imprecise, they generally refer to situations when you have more columns than rows, or more rows than columns respectively.

example of wide dataset

However, a less strict interpretation is simply that a wide dataset tends to represent key attributes of the data horizontally in the table instead of vertically, similar to the last example which you saw. The opposite is true for long datasets.

example of long dataset

Refer to this blog post on introduction to tidyr to learn about using it to clean up common symptoms of messy data.

Addressing common symptoms of messy data

We will now take a look at a few common symptoms of messy data. In other words, these examples will show some of the most common violations of the principles of tidy data.

Column headers are values, not variable names

You saw this example before. For each of 4 people, we have information on name, age, eye color. In the below representation of data, eye color is specified with a 1 or 0 for brown, blue or other. But these column headers are actually values of what should be a single eye_color variable.

Example of messy data

Here is a tidy version of the same data, which can be achieved using the gather() function from tidyr.

Example of tidy data
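A minimal sketch of that transformation with tidyr's gather() (the values below are illustrative, not the exact data from the screenshots): gather() collapses the indicator columns into key/value pairs, after which we keep only the rows where the indicator equals 1.

library(tidyr)
library(dplyr)

people_messy <- data.frame(name  = c("Aditya", "Bindu", "Chris", "Diana"),
                           age   = c(33, 29, 41, 25),
                           brown = c(0, 1, 1, 0),
                           blue  = c(1, 0, 0, 0),
                           other = c(0, 0, 0, 1))

people_tidy <- people_messy %>%
  gather(eye_color, flag, brown:other) %>%  # indicator columns become key/value pairs
  filter(flag == 1) %>%                     # keep only the matching eye color per person
  select(-flag)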

Variables are stored in both rows and columns

Here, for each person, we show the number of pets they have using n_dogs, n_cats, and n_birds. This is essentially the opposite of the problem in the last example, since the values in the measure column, n_dogs and so on, should really be variables and thus be represented as column headers.

example for dataset where variables are stored in both rows and columns

So prior to making the adjustment, we have variables represented in both rows and columns. But after the adjustment, each column represents exactly one variable. As you may have guessed by now, this result can be achieved with the spread() function from tidyr.

example of data cleaning using spread function
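A minimal sketch of this adjustment with tidyr's spread() (the data below is illustrative):

library(tidyr)

pets_messy <- data.frame(name    = rep(c("Aditya", "Bindu"), each = 3),
                         measure = rep(c("n_dogs", "n_cats", "n_birds"), times = 2),
                         value   = c(1, 0, 2, 0, 3, 1))

# each unique value of "measure" becomes its own column
pets_tidy <- spread(pets_messy, measure, value)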

Multiple variables are stored in one column

Another common issue arises when multiple variables are stored in a single column. Here the sex and age of each person have been combined in a single column called sex_age.

Example dataset for multiple variables stored in single column

To get to a tidy dataset, we simply split the sex_age column into two columns sex and age, which could be easily achieved using tidyr’s separate function.
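A minimal sketch, assuming the two values in sex_age are joined by an underscore:

library(tidyr)

people <- data.frame(name    = c("Aditya", "Bindu"),
                     sex_age = c("M_33", "F_29"))

separate(people, sex_age, into = c("sex", "age"), sep = "_")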

Other common symptoms

Finally, there are two additional messy data symptoms that arise frequently enough to mention here. The first is when you have a single type of observational unit, for example people, stored in more than one table.

The second is when you have more than one type of observational unit, such as people and pets, stored in the same table, people_pets. Let's see this scenario with an example.

messy data example

In addition to having names and ages, we also have information on each of their pets, including pet type and pet name. A tell-tale sign that something is wrong is that we have one row for each pet, which requires us to have multiple copies of the name and age of each person. For example, Bindu's name and age are repeated 3 times here.

This duplication is inefficient and is often an indication that a table should be broken into multiple tables, each representing only one type of observational unit.

In this case, there should be a table of people and a table of pets. However, there must be some way of connecting pets to their owners, so we would be sure to list the owner's name in the pets table.

If you've ever worked with an RDBMS, this concept, called a primary key, may be familiar to you.
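As a rough sketch of the idea (column names and values here are illustrative), the people_pets table could be split like this, keeping the owner's name in the pets table as the link back to people:

library(dplyr)

people_pets <- data.frame(name     = c("Bindu", "Bindu", "Bindu", "Aditya"),
                          age      = c(29, 29, 29, 33),
                          pet_type = c("dog", "cat", "bird", "dog"),
                          pet_name = c("Rex", "Misty", "Kiwi", "Bruno"))

people <- distinct(select(people_pets, name, age))               # one row per person
pets   <- select(people_pets, owner = name, pet_type, pet_name)  # one row per pet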

Preparing data for analysis

Till now, you have seen how to tidy your data – making sure observations are in rows, variables are in columns, and each table represents one type of observational unit. Once this is done, the next step is to prepare your data for analysis.

Often, preparing your data for analysis involves working with different types of variables. This is to make sure that everything is stored and displayed in a proper format.

Types of variables in R

Let's quickly review some examples of each type of R variable.

The class() function is used to inspect the type of variable you are dealing with. In addition to being called on single values or scalars, class() can also take vectors or dataframe columns as input.
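For example:

class(TRUE)        # "logical"
class("hello")     # "character"
class(3.14)        # "numeric"
class(5L)          # "integer"
class(Sys.Date())  # "Date"

class(c(1.5, 2.7, 3.1))  # vectors work too: "numeric"
class(mtcars$cyl)        # and so do dataframe columns: "numeric"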

Type conversions in R

Often, you need to adjust the class of, say, a particular column of data. Let's walk through some simple examples of type conversions, also called coercions.
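A few simple coercions:

as.numeric("13.7")           # 13.7
as.character(2019)           # "2019"
as.integer(4.8)              # 4 (truncates towards zero)
as.logical("TRUE")           # TRUE
as.factor(c("a", "b", "a"))  # a factor with levels "a" and "b"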

Overview of lubridate package

One important data type we have not seen so far is dates. Working with dates in R can be messy, but the lubridate package by Garrett Grolemund and Hadley Wickham helps considerably.

As with tidyr, I won't introduce you to the full functionality of lubridate, as it's quite a powerful package with many features. It is useful to examine just a few examples of commonly used functions for coercing strings to dates.

 
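Below is a minimal sketch of these functions (the dates and times used here are illustrative):

library(lubridate)

ymd("2019-03-31")               # year-month-day
ymd("2019/03/31")               # the same date in a different format
mdy("03-31-2019")               # month-day-year
dmy("31-03-2019")               # day-month-year
hms("17:33:29")                 # hour-minute-second
mdy_hm("03/31/2019 17:33")      # a date with hours and minutes, no seconds
ymd_hms("2019-03-31 17:33:29")  # a full datetime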

The first two calls to the ymd() function (year-month-day) show its versatility. We present the same date, 31st March 2019, in two different formats, but wind up with the same standardized output in year-month-day format, followed by the default time zone of UTC.

While parsing dates, lubridate will always do its best to determine the format of the date, making it much more flexible and convenient than the date functionality contained in base R.

The third example shows two different functions, mdy() (month-day-year) and dmy() (day-month-year), which again give us the standardized year-month-day output.

Next, the hms() function (hour-minute-second) reads in a time and produces a standardized output in hour-minute-second format.

The mdy_hm() function (month-day-year-hour-minute) reads in a date with no seconds and produces a standardized output.

Finally, we parse a datetime using the ymd_hms() function (year-month-day-hour-minute-second).

The common theme among all of these functions is that their names represent the format of the date or datetime input they are designed to parse.

For example, ymd() takes the date in year-month-day format. However, not every combination exists, so you may need to consult the lubridate package documentation from time to time.

String manipulation using the stringr package

In addition to working with different types of variables, knowing how to manipulate strings is a crucial skill when cleaning data in R.

Similar to the way the lubridate package makes working with dates easier, the stringr package written by Hadley Wickham makes working with strings easier. It provides a suite of useful functions, all sharing a consistent interface.

Let’s look at some simple examples of  stringr functions that will come in handy throughout your data cleaning adventures.

str_trim()

The str_trim() function trims all leading and trailing whitespace from either a single character string or a vector of character strings.
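For example:

library(stringr)

str_trim("   this string has surrounding whitespace   ")  # "this string has surrounding whitespace"
str_trim(c("  R ", "data cleaning  ", "   tidy"))          # works on vectors too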

str_pad()

str_pad() is somewhat the opposite of str_trim(). It helps you add characters or whitespace to one or both ends of a character string.

Assume that each employee in your company has a unique numeric seven-digit employee id, often with leading zeros. The spreadsheet software your co-worker used to do some analysis treated the employee id as just another number and, because of this, dropped the leading zeros.

You now need to add them back, which you can do by passing the employee id to str_pad() along with the arguments width = 7, side = "left" and pad = "0".

This says that the id should be 7 digits in total, and if it is not, zeros are added on the left side. Pretty handy, right?

You may also use str_pad() on vectors; refer to the example below.
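A minimal sketch (the employee ids are made up):

library(stringr)

str_pad("24493", width = 7, side = "left", pad = "0")  # "0024493"

ids <- c("24493", "832", "1067153")
str_pad(ids, width = 7, side = "left", pad = "0")      # pads each element of the vector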

str_detect()

We can use the str_detect() function to detect the presence of a particular string in a vector. The result is a vector of the same length as the input vector, with TRUE everywhere there is a match and FALSE everywhere else.
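For example, using a small made-up vector of names:

library(stringr)

names_vec <- c("Aditya", "Bindu", "Kishore", "Ranjit")
str_detect(names_vec, "Kishore")  # FALSE FALSE TRUE FALSE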

str_replace()

Instead of simply detecting the presence of a string, we can go one step further and replace the string with something else. In this case, you can replace the string "Kishore" with "Ranjit" using the str_replace() function.

Like every function in stringr, the first argument is the input vector. The second argument is the string being replaced, and the third is what we are replacing it with.
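Using the same kind of made-up vector:

library(stringr)

names_vec <- c("Aditya", "Bindu", "Kishore")
str_replace(names_vec, "Kishore", "Ranjit")  # "Aditya" "Bindu" "Ranjit"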

A couple of simple but often useful functions from base R are:

tolower() – Make strings lowercase
toupper() – Make strings uppercase
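For example:

tolower("Data Cleaning in R")  # "data cleaning in r"
toupper("Data Cleaning in R")  # "DATA CLEANING IN R"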

Missing and special values

When cleaning data, you will often come across missing or special values. It turns out that missing data in particular is a very deep topic. Let's look at a few useful tools for finding and dealing with missing and special values.

Not all missing data is created equal. Data may be missing for many reasons, and often the reason it is missing determines how you should deal with it.

In R, missing data is represented by NA (Not Available). But you have to be careful, as missing data may appear in many other forms.

If your data is imported from Excel, missing data may appear as #N/A. If your source is SPSS or SAS, missing data may appear as a dot (.). Sometimes missing data may also appear as an empty string.

Special values in R

There are a couple of special values to watch out for as well, although they are generally less common than NA. These are Inf – an infinite value (possibly indicative of outliers?) and NaN – Not a Number (rethink how a variable is computed?).

Inf may occasionally, but certainly not always, be indicative of extreme outliers in your data. NaN may mean you need to think through how something is being computed. You may ask: am I accidentally dividing something by zero in my data?
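A couple of quick examples:

1 / 0             # Inf (division by zero)
0 / 0             # NaN (an undefined result)
is.finite(1 / 0)  # FALSE
is.nan(0 / 0)     # TRUE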

Finding missing values

I am going to walk you through a few ways of finding missing values.

We use a small dataset in the examples, but these methods become particularly helpful when you can’t easily view the entire dataset by just printing it to the console.

cleaning data by finding missing values in R

We begin by creating a 4x3 dataframe called df.

Passing df to the function is.na() returns a dataframe of the same size as df, but with TRUE anywhere NA was found, and FALSE anywhere it wasn’t.

This is interesting, but it may not be useful if we are dealing with a much larger dataset. Instead, we may be curious to know whether there are any NAs anywhere in the data.

By wrapping is.na(df) in the any() function, we are asking whether there are any TRUEs in the result from is.na(df); in this case, there are.

Once we know that there are NAs, a natural follow-up question is: how many?

Under the hood, TRUE and FALSE are represented numerically by 1 and 0, respectively. This allows us to sum the results from is.na(df) to count the number of TRUEs, which is 3.

Also helpful is the summary() function. It tells you the number of NAs for each variable in the dataset. Here we clearly see two of the NAs are in column A and one NA is in column B.

You can also use the table() function on a dataframe column to find information on NAs.

Use the which() function to locate the values in a column that satisfy a particular logical condition; for example, combining which() with is.na() gives the indices of the missing values.
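Here is a minimal sketch of all of these steps; the dataframe below is illustrative, but it matches the description above (two NAs in column A, one in column B):

df <- data.frame(A = c(1, NA, 8, NA),
                 B = c(3, NA, 88, 23),
                 C = c(2, 45, 3, 1))

is.na(df)           # TRUE wherever a value is missing
any(is.na(df))      # TRUE: there is at least one NA
sum(is.na(df))      # 3: total number of NAs
summary(df)         # per-column summaries, including NA counts
table(is.na(df$A))  # count of missing vs non-missing values in column A
which(is.na(df$A))  # 2 4: row indices of the NAs in column A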

Dealing with missing values

If you have come to the decision that removing the missing data is appropriate, there are a lot of ways to go about it.

The complete.cases() function is used to see which rows have no missing values. It returns a vector with one element (either TRUE or FALSE) for each row of data: TRUE if there are no missing values in the row, i.e., the observation is considered a complete case, and FALSE otherwise.

This vector can be used to subset the original dataframe, keeping only the rows with no missing values.

Alternatively, the na.omit() function does both steps by automatically removing any rows with missing values.
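Continuing with the same illustrative df:

complete.cases(df)        # TRUE for rows with no missing values
df[complete.cases(df), ]  # keep only the complete rows
na.omit(df)               # the same result in a single step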

Outliers and obvious errors

We will now explore outliers and obvious errors. Below is a box plot displaying the distribution of simulated data. If this is the first time you are seeing a box plot, the idea is pretty simple.

Values are represented along the X axis. The box in the middle represents the middle 50% of values, and the thick black vertical line at the center of the box is the median. The lines extending horizontally out from the box capture all but the most distant values.

Any values beyond a certain distance from the majority of data are represented as single points.

I have purposefully added some extreme values or outliers to the data, and these values are clearly represented as standalone points, two to the right of the box and one on the left.

using horizontal boxplot to view outliers as part of data cleaning in R
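A rough sketch of how such a plot could be produced (the simulated values here are illustrative, not the exact data behind the figure):

set.seed(42)
x <- c(rnorm(100, mean = 45, sd = 10), -5, 95, 100)  # typical values plus three outliers
boxplot(x, horizontal = TRUE)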

In the real world, outliers may be caused by a number of different things. They may represent extreme but perfectly valid measurements, such as a human being who is more than 100 years old.

They could be caused by variability in measurements, due to imperfect equipment. They could be caused by experimental error, or even something as simple as a mistake when entering data in a computer spreadsheet.

As a data scientist, it will often be your job to figure out why there are outliers in your data. Depending on what you find, they may be discarded or retained. This decision could have a significant impact, on your final analysis.

Looking back at the box plot example, how do things change if I told you these values represent the ages of people? The extreme value on the left is less than zero, so something must be wrong. This is a case of an obvious error in the data, because there is no such thing as a negative age.

Obvious errors

So errors may appear in many forms, including values so extreme that they can't be plausible, for example a person aged 243 years, or values that simply make no sense, like a negative age.

As with outliers, it is important to assess the cause of errors in the data. They may be caused by measurement error or data entry error, or they could be intentional, if, for example, the person entering the data was sinister enough to use -1 to represent missing data!

In most cases, the obvious error should be removed or replaced.

I have simulated another small dataset called df, containing some normally distributed variables and a couple of extreme values.

exploring a dataset containing outliers

We can use summary() to quickly see that the value 300 is extreme relative to most other values in column B, and that there is a negative value, -1, in column C.

view summary of dataset before doing data cleaning

But our view of the distribution of values in column B is limited in the summary() output. For a more complete view, we can look at a histogram of B using the hist() function.

identify outliers by visualising it with a histogram

Here, it's clear that all values are concentrated well below 100, but there is a single value at 300, suggesting that perhaps an extra 0 was accidentally added to what should have been a 30.

Notice that in the call to hist() we can control the number of buckets, or breaks, into which the data are divided using the breaks argument.

data cleaning by visualising outliers in a histogram

Here, we apply a box plot to the full dataset df. R generates a box for each column of the data. Again, we clearly see the extreme value 300 in column B and the negative value -1 in column C.

using boxplot to visualise outliers during data cleaning
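A rough sketch of this exploration (the simulated values are illustrative):

set.seed(123)
df <- data.frame(A = rnorm(100, mean = 50, sd = 10),
                 B = c(rnorm(99, mean = 50, sd = 10), 300),  # one extreme value
                 C = c(rnorm(99, mean = 25, sd = 1), -1))    # one impossible negative value

summary(df)              # 300 and -1 stand out as the max of B and the min of C
hist(df$B, breaks = 20)  # the breaks argument controls the number of bins
boxplot(df)              # one box per column; extreme values show up as points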

Cleaning a real-world messy data set in R

Let's use everything we have learned till now to take a real-world dataset containing weather data from raw to ready for analysis. There are two main issues to be addressed here from a tidy data perspective.

The first issue is that the column names X1-X31 are actually values of what should be a new variable called "day". Let's use tidyr's gather() function to create two new columns, day and value, from the columns X1 to X31.

 data cleaning - column names are values
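A minimal sketch of this step, assuming the raw data is in a dataframe called weather (weather2 is an assumed name for the intermediate result; na.rm = TRUE drops the NAs created by day columns that do not exist in shorter months):

library(tidyr)

weather2 <- gather(weather, day, value, X1:X31, na.rm = TRUE)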

The second issue is that we have variable names represented as values. Specifically, the unique values in the measure column should each have their own column. Let's use tidyr's spread() function to do this.

variable names represented as values - data cleaning
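A minimal sketch (weather3 is an assumed name for the result):

weather3 <- spread(weather2, measure, value)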

Prepare the data for analysis

Now that you’ve tidied the data, the next step is to prepare data for analysis.
We will remove the X from the day column using stringr's str_replace().

Then we will combine the year, month and day columns into a new column called date, using the separator "-". We shall use tidyr's unite() function to do this.

preparing data for analysis by cleaning it
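A minimal sketch of these two steps (weather4 is the name referenced below):

library(stringr)
library(tidyr)

weather3$day <- str_replace(weather3$day, "X", "")              # drop the leading "X"
weather4 <- unite(weather3, date, year, month, day, sep = "-")  # combine into one date column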

Now, let's coerce the date column to a proper date format using the ymd() function from the lubridate package.

We shall now reorder the columns of weather4 using dplyr's select() function, saving the result as weather5.
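A sketch of these steps, assuming the raw column names Events, CloudCover and WindDirDegrees (weather5 is the name used below):

library(lubridate)
library(dplyr)

weather4$date <- ymd(weather4$date)                                    # coerce to a proper Date
weather5 <- select(weather4, date, Events, CloudCover:WindDirDegrees)  # reorder the columns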

Let us see the structure of the weather5 dataframe. Looking at the output, you can see that the measurement columns are of character data type where they should be numeric. Try to convert the PrecipitationIn column to numeric, and you can see that NAs are introduced.

This is because there is a character "T" (signifying a trace amount) present in the PrecipitationIn column. When R tried to coerce "T" to a numeric, it created an NA, as it was unable to do so.

So, let us replace "T" with "0":

weather5$PrecipitationIn <- str_replace(weather5$PrecipitationIn, "T", "0")

The next step is to convert all the columns from CloudCover to WindDirDegrees to numeric using a mutate_at() call, and save the result to weather6.

weather6 <- mutate_at(weather5, vars(CloudCover:WindDirDegrees), funs(as.numeric))

It is time to find missing, extreme and unexpected values in our weather6 dataframe.

We will use sum() and is.na() to find the number of NA values in our dataframe.

sum(is.na(weather6))

Now let's use the summary() function to identify which variables contain missing values. From this, we can see that the Max.Gust.SpeedMPH column contains NAs.

Let's use the which() function to identify the row numbers (indices) where the Max.Gust.SpeedMPH column contains NAs, and then view the full rows for the missing Max.Gust.SpeedMPH values.
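For example (ind is just a temporary index vector):

ind <- which(is.na(weather6$Max.Gust.SpeedMPH))  # rows with missing gust speed
weather6[ind, ]                                  # view those rows in full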

Besides missing values, we need to find out whether there are any extreme values in our dataset. Again, we can use the summary() function to identify extreme values.

You can see that the Max.Humidity column has a maximum value of 1000. Let's use the which() function to find that row index.

You find that an extra zero was accidentally added. Let's clean up this value.

You may have noticed another obvious error in the dataset, in the Mean.VisibilityMiles column. The minimum value for this column is -1. How can visibility in miles be -1? Let's correct this as well by looking at the full row for this day, checking the other variables for that day to find a sensible value for Mean.VisibilityMiles.
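A sketch of both corrections (the replacement values are assumptions based on the reasoning above, not values taken from the post):

ind <- which(weather6$Max.Humidity == 1000)  # locate the typo
weather6$Max.Humidity[ind] <- 100            # an extra zero was added, so 1000 becomes 100

ind <- which(weather6$Mean.VisibilityMiles == -1)
weather6[ind, ]                              # inspect the full row for that day
weather6$Mean.VisibilityMiles[ind] <- 10     # replace with a plausible value for that day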

Check other extreme values using visualisation

You can check for extreme values using the hist() function, in addition to using the trusted summary() function. Let us view the histograms of MeanDew.PointF, Min.TemperatureF and Mean.TemperatureF to compare the distributions of these variables.
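For example:

hist(weather6$MeanDew.PointF)
hist(weather6$Min.TemperatureF)
hist(weather6$Mean.TemperatureF)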

Finishing touches

There are a couple of naming conventions in R. It is recommended to use lowercase letters and to separate words with underscores when column names contain multiple words.

As of now, this is not the case in our weather dataset, so let us clean up the column names. I have created a vector containing column names that obey these conventions, called new_colnames. Assign it to names(weather6).

You may have also noted that the events column contains empty strings (""). Let's clean up this column and replace the empty strings with "None".
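A sketch of these finishing touches (new_colnames is assumed to be a character vector of snake_case names, one per column of weather6, defined beforehand):

names(weather6) <- new_colnames

weather6$events[weather6$events == ""] <- "None"  # make the lack of events explicit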

Clean weather data

Your data is clean now! Let's take another look at the clean data. Here we see the head and tail of the data.

Each row is a single observation or day of the year. Each column is a variable. The first column date contains a neatly represented date in year-month-day format.

Each of the remaining columns represents a different weather metric. We replaced the empty cells in the events column, since the empty strings that were there did not represent missing data but rather the lack of precipitation events on a given day. It is better to be explicit in that scenario.

Most modeling tools would happily accommodate data in this format, unlike what we started with in the beginning.

Summary of the data cleaning tasks completed

We started off by inspecting the weather data. Then we tidied it according to the principles of tidy data. Next, we improved the representation of dates using lubridate, dealt with incorrect variable codings by way of type conversions, located and dealt with missing data, identified and corrected obvious errors, and finally visualized the result.
