Data visualization using ggplot2

In this post, we will learn the basics of data visualization using ggplot2 in R. Data visualization makes it easier to identify patterns in your data and plan analyses accordingly.

Data visualization is one of the most essential skills for a data scientist. We will explore the principles of data visualization and learn to write R code to visualize trends in data in different ways.

ggplot2 is one of the most popular tidyverse packages used for data visualization. The ggplot2 package is so popular among R users because of its consistent syntax and the efficiency with which you can use it to create high-quality visualizations.

The gg in ggplot2 stands for “Grammar of Graphics”, which refers to a system for data visualization first described by Leland Wilkinson.

Hadley Wickham, chief data scientist at RStudio, used the principles of the Grammar of Graphics to develop ggplot2 to allow systematic, consistent, time-efficient creation of data visualizations.

Syntax of ggplot

In ggplot2, you add each new layer to the graph using a + at the end of the preceding line of code. This syntax is consistent for any type of visualization you’ll create using ggplot2.

Using ggplot2, you can add geometric objects of different types to a graph depending on what type of data you’re working with and the relationships between variables you’re looking to explore.
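The layered syntax described above can be sketched as follows. This is a minimal example, assuming the mpg dataset that ships with ggplot2 (it is not part of this post's gapminder data); the axis labels are illustrative choices.

```r
library(ggplot2)

# Start the plot: the data and the aesthetic mappings (which column goes on which axis)
p_syntax <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  # Add a geometric object that draws one point per row
  geom_point() +
  # Add axis labels; each new layer follows a + at the end of the previous line
  labs(x = "Engine displacement (litres)", y = "Highway miles per gallon")

print(p_syntax)
```

Swapping geom_point() for another geom changes the type of graph while the rest of the code stays the same.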

Let’s load the gapminder dataset and the dplyr package first, and then we will start visualizing the trends within this dataset.

install gapminder and dplyr packages in R

gapminder is a special type of data frame called a tibble. It has 1704 rows and 6 columns. Let’s derive some insights from it. For example, you can see that life expectancy in Afghanistan went up between 1952 and 1997.

We will use dplyr and ggplot2 together to visualize the data. Let’s load ggplot2 now.
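A minimal sketch of loading and inspecting the data, assuming the gapminder and dplyr packages are installed from CRAN. The variable name afghanistan is just an illustrative choice.

```r
# install.packages(c("gapminder", "dplyr"))  # run once if the packages are missing
library(gapminder)
library(dplyr)

# Number of rows and columns in the tibble
dim(gapminder)

# Life expectancy in Afghanistan for each year in the dataset
afghanistan <- gapminder %>%
  filter(country == "Afghanistan") %>%
  select(year, lifeExp)

print(afghanistan)
```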

Scatterplots

Let’s filter the gapminder data for the year 2007 and draw a scatterplot using ggplot2, with gdpPercap on the x-axis and lifeExp on the y-axis.

You can observe that higher income countries have higher life expectancy.
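The scatterplot described above might be written like this. It is a sketch; the names gapminder_2007 and p_scatter are assumptions chosen for illustration.

```r
library(gapminder)
library(dplyr)
library(ggplot2)

# Keep only the observations for the year 2007
gapminder_2007 <- gapminder %>%
  filter(year == 2007)

# GDP per capita on the x-axis, life expectancy on the y-axis
p_scatter <- ggplot(gapminder_2007, aes(x = gdpPercap, y = lifeExp)) +
  geom_point()

print(p_scatter)
```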

scatterplot using ggplot2

One problem with this plot is that many observations are crammed into the leftmost part of the x-axis. This is because gdpPercap spans several orders of magnitude, with some countries above 10,000 dollars and others in the hundreds of dollars. In such a scenario, it is useful to work with a logarithmic scale.

Logarithmic scale

When one of your axes has a distribution like the one described in the previous example, it is useful to work with a logarithmic scale: a scale where each fixed distance represents a multiplication of the value.

If you change the x-axis to a base-10 log scale, each unit on the x-axis represents a tenfold change in GDP per capita.

To achieve this, we add a call to scale_x_log10().

Let’s observe how the plot changes when we use a log scale on both the x- and y-axes.
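Here is a sketch of both variants: a log scale on the x-axis only, and on both axes using scale_y_log10(), the y-axis counterpart of scale_x_log10(). The object names are illustrative.

```r
library(gapminder)
library(dplyr)
library(ggplot2)

gapminder_2007 <- gapminder %>%
  filter(year == 2007)

# Log scale on the x-axis only
p_log_x <- ggplot(gapminder_2007, aes(x = gdpPercap, y = lifeExp)) +
  geom_point() +
  scale_x_log10()

# Log scale on both axes
p_log_both <- p_log_x +
  scale_y_log10()

print(p_log_x)
print(p_log_both)
```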

 

scatterplot with log scale using ggplot2

Line Type

To create a line graph with different styles of lines, map a variable to the line type inside the aes() call using lty =. The argument lty stands for “line type”, and linetype = is the full name of the same aesthetic. To change which line types are used, you’ll need to add another layer: scale_linetype_manual().
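A small sketch of line types in practice. The choice of three countries and of the specific line styles is an assumption made purely for illustration.

```r
library(gapminder)
library(dplyr)
library(ggplot2)

# A few countries so that the lines are easy to tell apart
three_countries <- gapminder %>%
  filter(country %in% c("Brazil", "India", "Japan"))

# Map country to the linetype aesthetic, then pick the styles by hand
p_linetype <- ggplot(three_countries, aes(x = year, y = lifeExp, linetype = country)) +
  geom_line() +
  scale_linetype_manual(values = c("Brazil" = "solid",
                                   "India"  = "dashed",
                                   "Japan"  = "dotted"))

print(p_linetype)
```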

Scale Limits

When you want to home in on an interesting subset of your data for further investigation, one way to do so is to set scale limits. Changing the scale limits changes the range of your axes so that only a portion of your data is displayed. The xlim() and ylim() functions set the limits of the x- and y-axes.
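A sketch of setting scale limits with xlim() and ylim(). The particular ranges are arbitrary values picked for illustration.

```r
library(gapminder)
library(dplyr)
library(ggplot2)

gapminder_2007 <- gapminder %>%
  filter(year == 2007)

# Show only countries with GDP per capita up to 10,000 dollars
# and life expectancy between 40 and 80 years
p_limits <- ggplot(gapminder_2007, aes(x = gdpPercap, y = lifeExp)) +
  geom_point() +
  xlim(0, 10000) +
  ylim(40, 80)

# ggplot2 warns here that rows outside the limits were removed
print(p_limits)
```

Note that xlim() and ylim() drop the observations that fall outside the limits. If you only want to zoom the view without discarding data, coord_cartesian(xlim = ..., ylim = ...) is an alternative worth trying.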

Coloring

You can add a few more aesthetics to this plot. Let’s color each dot by continent, and set the size of each dot by pop.

The color aesthetic here renders the graph in the default colors of ggplot2.
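The extra aesthetics might look like this in code. It is a sketch that also keeps the log scale on the x-axis so the points stay spread out.

```r
library(gapminder)
library(dplyr)
library(ggplot2)

gapminder_2007 <- gapminder %>%
  filter(year == 2007)

# color maps continent to the dot color, size maps population to the dot size
p_color <- ggplot(gapminder_2007,
                  aes(x = gdpPercap, y = lifeExp, color = continent, size = pop)) +
  geom_point() +
  scale_x_log10()

print(p_color)
```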

add aesthetics using ggplot2

So far, you have used the default colors and line types when working in the aes() layer. However, ggplot2 allows you to customize these arguments extensively:

As is often the case when creating graphs with ggplot2, modifying line colors and types involves adding another layer to your graph. To change the colors used in a graph, you’d add a layer called scale_color_manual().

To change the background color of your graph, use theme(panel.background = element_rect(fill = "background color")), replacing "background color" with the color you want.
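Both customizations can be sketched together. The specific colors below are assumptions chosen for illustration; any valid R color names or hex codes would work.

```r
library(gapminder)
library(dplyr)
library(ggplot2)

gapminder_2007 <- gapminder %>%
  filter(year == 2007)

# One hand-picked color per continent (illustrative choices)
continent_colors <- c("Africa"   = "firebrick",
                      "Americas" = "darkorange",
                      "Asia"     = "forestgreen",
                      "Europe"   = "steelblue",
                      "Oceania"  = "purple")

p_custom <- ggplot(gapminder_2007, aes(x = gdpPercap, y = lifeExp, color = continent)) +
  geom_point() +
  scale_x_log10() +
  # Replace the default colors with the manual palette
  scale_color_manual(values = continent_colors) +
  # Change the panel background color
  theme(panel.background = element_rect(fill = "grey95"))

print(p_custom)
```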

Faceting

ggplot2 allows you to divide your plot into a number of sub-plots, usually by a categorical variable. This is called faceting. You facet a plot by using facet_wrap(), and you specify the variable to facet by after a tilde (~) inside facet_wrap().
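A minimal faceting sketch, splitting the 2007 scatterplot into one sub-plot per continent:

```r
library(gapminder)
library(dplyr)
library(ggplot2)

gapminder_2007 <- gapminder %>%
  filter(year == 2007)

# One panel per continent; the variable to facet by follows the tilde
p_facet <- ggplot(gapminder_2007, aes(x = gdpPercap, y = lifeExp)) +
  geom_point() +
  scale_x_log10() +
  facet_wrap(~ continent)

print(p_facet)
```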

using faceting in ggplot2

Add title to graph

If you want to add a title to your graph, add a ggtitle() layer.

You can also set the title of the graph, the x-axis label, and the y-axis label in a single call using labs().
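Both approaches are sketched below on the 2007 scatterplot. The title and label text are made up for illustration.

```r
library(gapminder)
library(dplyr)
library(ggplot2)

gapminder_2007 <- gapminder %>%
  filter(year == 2007)

p_base <- ggplot(gapminder_2007, aes(x = gdpPercap, y = lifeExp)) +
  geom_point() +
  scale_x_log10()

# Option 1: ggtitle() sets only the title
p_title <- p_base +
  ggtitle("Life expectancy vs. GDP per capita, 2007")

# Option 2: labs() sets the title and the axis labels together
p_labs <- p_base +
  labs(title = "Life expectancy vs. GDP per capita, 2007",
       x = "GDP per capita (log scale)",
       y = "Life expectancy (years)")

print(p_title)
print(p_labs)
```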

 

Add title to your graph

Types of graphs in ggplot2

Scatterplots

We have used geom_point() earlier to make a scatterplot.

Line plots

Line plots are used for visualizing a trend over time. Use geom_line() to make a line plot.

Let’s create a line plot showing the change in medianGdpPercap over different years.
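A sketch of this line plot. The column medianGdpPercap does not exist in gapminder itself, so the example assumes it is computed with dplyr by taking the median of gdpPercap within each year.

```r
library(gapminder)
library(dplyr)
library(ggplot2)

# Median GDP per capita across all countries, one row per year
by_year <- gapminder %>%
  group_by(year) %>%
  summarize(medianGdpPercap = median(gdpPercap))

# Trend of the median over time; expand_limits() makes the y-axis start at zero
p_line <- ggplot(by_year, aes(x = year, y = medianGdpPercap)) +
  geom_line() +
  expand_limits(y = 0)

print(p_line)
```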

line plot using ggplot2

Bar charts

Bar charts represent grouped data summaries using bars with heights proportional to the values of a summary variable, such as the average or the median.

The layer that distinguishes a bar chart from other graphs is the layer in which you specify the geometric shape used to display the data. While before you used geom_line(), now you’ll use geom_bar() or geom_col().

geom_col(…) is equivalent to geom_bar(stat = "identity", …)

We specify stat = "identity" within the geom_bar() layer because, by default, geom_bar() creates a bar graph where the height of each bar corresponds to the number of observations at each value of the x-variable. Using stat = "identity" overrides the default behavior and makes the height of each bar equal to the value of the y-variable.

In a bar plot, the x-axis usually contains a categorical variable. Unlike scatterplots or line plots, bar plots always start at zero.
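The three variants discussed above can be sketched side by side. The summary by_continent (median GDP per capita per continent in 2007) is an assumption chosen for illustration.

```r
library(gapminder)
library(dplyr)
library(ggplot2)

gapminder_2007 <- gapminder %>%
  filter(year == 2007)

# One summary value per continent
by_continent <- gapminder_2007 %>%
  group_by(continent) %>%
  summarize(medianGdpPercap = median(gdpPercap))

# geom_col(): bar height is the value of the y-variable
p_col <- ggplot(by_continent, aes(x = continent, y = medianGdpPercap)) +
  geom_col()

# Same chart written with geom_bar() and stat = "identity"
p_bar_identity <- ggplot(by_continent, aes(x = continent, y = medianGdpPercap)) +
  geom_bar(stat = "identity")

# Default geom_bar(): only x is mapped, bar height is the number of rows per continent
p_bar_count <- ggplot(gapminder_2007, aes(x = continent)) +
  geom_bar()

print(p_col)
print(p_bar_identity)
print(p_bar_count)
```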

bar plot using ggplot2

Histograms

Unlike bar charts and line graphs, histograms are used to understand characteristics of one variable rather than the relationship between two variables.

Histograms depict the frequency with which values of a variable occur, otherwise known as the distribution of the variable.

The syntax for the data and aesthetics layers is similar to what you have used to generate line graphs and bar charts.

Use geom_histogram() to create a histogram. It can be used to investigate one dimension of the data at a time.

A histogram has only one aesthetic, the x-axis, and the width of each bin is chosen automatically. The bin width has a large effect on how the histogram conveys the distribution, and it can be controlled using the binwidth parameter.

histogram using ggplot2

Remember that when you create a histogram, the count shown on the y-axis is calculated for you.

The geom_histogram() layer specifies creation of a histogram for the variable mapped to the x-axis. An argument such as binwidth = 1 specifies the size of the categories used to bin the values of that variable.

Within the geom_histogram() layer, you can use two different arguments to control how the variable is binned.

The binwidth parameter allows you to specify the size of the bins, and is useful for instances where you want categories to span specific intervals.

The bins parameter allows you to specify the number of bins, which can be useful to experiment with when deciding how much detail you want to use to display your data.

If you don’t use any arguments within the geom_histogram() layer, ggplot2 will use a default number of bins (30).

Another option for mapping a variable to different values is to use the argument fill = instead of color =. Instead of outlines, fill = depicts bars filled in with different colors.
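The binning and fill options can be sketched on the distribution of life expectancy in 2007. The bin width of 5 years and the count of 20 bins are arbitrary values picked for illustration.

```r
library(gapminder)
library(dplyr)
library(ggplot2)

gapminder_2007 <- gapminder %>%
  filter(year == 2007)

# binwidth: each bin spans 5 years of life expectancy
p_hist_width <- ggplot(gapminder_2007, aes(x = lifeExp)) +
  geom_histogram(binwidth = 5)

# bins: ask for a fixed number of bins instead
p_hist_bins <- ggplot(gapminder_2007, aes(x = lifeExp)) +
  geom_histogram(bins = 20)

# fill: bars filled with a different color per continent
p_hist_fill <- ggplot(gapminder_2007, aes(x = lifeExp, fill = continent)) +
  geom_histogram(binwidth = 5)

print(p_hist_width)
print(p_hist_bins)
print(p_hist_fill)
```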

You’ll often use histograms in your data science career for initial explorations of your data. Knowing how to visualize and interpret distributions will become increasingly important later on when you learn about statistics and modeling.

Box plots

Like bar graphs, box plots provide a summary of data by group. Like histograms, they provide information about how data are spread.

A box plot helps to visualize both the center of and the variation in your data.

Use geom_boxplot() to create a box plot.
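A minimal box plot sketch, comparing the distribution of life expectancy across continents in 2007:

```r
library(gapminder)
library(dplyr)
library(ggplot2)

gapminder_2007 <- gapminder %>%
  filter(year == 2007)

# One box per continent: the group goes on the x-axis, the values on the y-axis
p_box <- ggplot(gapminder_2007, aes(x = continent, y = lifeExp)) +
  geom_boxplot()

print(p_box)
```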

 

boxplot using ggplot2

A box plot has two aesthetics: the grouping variable on the x-axis and the numeric variable being summarized on the y-axis.

The black line in the middle of each box indicates the median of the distribution. The top and bottom lines of each box represent the 75th percentile and 25th percentile of the group.

The lines extending up and down from the box are called whiskers. The dots above or below the whiskers represent outliers, observations that lie unusually far from the rest of the distribution.

When should you use these different types of plots? You will probably explore different options for visualizing each new data set, and doing so is a good idea. However, here are some general guidelines:

Bar charts may be used for showing a quick summary of your data, such as averages or counts of the number of instances of a value that occur for a given variable.

Histograms are useful for visualizing distributions of data when you want to know the shape of a distribution (in other words, where most values are clustered).

Box plots provide an informative summary of the shape, spread, and center of your data.
