
Data visualization using ggplot histogram


In this article we will explore what a histogram is, how to create one with ggplot2, and the various ways to customize it.

What is a histogram?

A histogram is a type of graph commonly used to visualize the univariate distribution of numeric data. The data is grouped into bins, each of which shows how many data points fall within a range of values. The resulting shape can be used to read off useful information about the data, such as its central location, spread and shape. It can also be used to find outliers and gaps in the data.

A basic histogram for age looks as below.

ggplot histogram

From the above histogram it can be seen that most people fall within the 50-60 age range, while there are fewer people in the 70-80 and 90-100 ranges. There is also a gap in the histogram at 80-90, which indicates that data for that age range might be missing or unavailable. So, a histogram like this can be used to visualize useful information about a continuous numeric variable. Let’s look more closely at histograms, how to create them and the various options for customizing them.

Histogram and Bar Charts

Histograms are sometimes confused with bar charts. Although a histogram looks similar to a bar chart, the major difference is that a histogram plots the frequency of occurrences in a continuous data set that has been divided into classes, called bins. Bar charts, on the other hand, are used to plot categorical data.

Creating histogram using ggplot2

To create a histogram, first install and load the ggplot2 package.
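A minimal sketch of that setup step, assuming a CRAN mirror is reachable. The guard is an addition for convenience and simply skips the installation when the package is already present:

```r
# Install ggplot2 only if it is missing, then attach it for this session
if (!requireNamespace("ggplot2", quietly = TRUE)) {
  install.packages("ggplot2", repos = "https://cloud.r-project.org")
}
library(ggplot2)
```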

We will be using the dataset below to create and explain the histograms. It has two columns, cond and rating. The variable cond is categorical with two categories, A and B, and rating is a continuous numeric variable.

The dataset looks as below.
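The original data is not reproduced in this text, so the sketch below simulates a stand-in with the same two columns. The name dat, the seed, the 200 rows per category and the normal distributions are all assumptions made for illustration only:

```r
# Simulated stand-in for the article's dataset (assumption, not the original data)
set.seed(1234)
dat <- data.frame(
  cond   = factor(rep(c("A", "B"), each = 200)),   # categorical, two levels
  rating = c(rnorm(200), rnorm(200, mean = 0.8))   # continuous numeric
)

head(dat)  # first few rows
str(dat)   # column types: one factor, one numeric
```

The later sketches in this article rebuild the same data frame so that each one can be run on its own.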

Using ggplot2, histograms can be created in two ways, with

  • qplot() and
  • geom_histogram()

Histogram using qplot()

A histogram can be created with qplot() by passing a single numeric argument, as below.
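As a rough sketch, assuming the simulated data frame dat described earlier (a stand-in, not the original data), the call could look like this. Note that recent ggplot2 releases mark qplot() as deprecated, so it may print a deprecation warning:

```r
library(ggplot2)

# Simulated stand-in data (assumption)
set.seed(1234)
dat <- data.frame(
  cond   = factor(rep(c("A", "B"), each = 200)),
  rating = c(rnorm(200), rnorm(200, mean = 0.8))
)

# With a single numeric vector, qplot() defaults to a histogram
p_qplot <- qplot(dat$rating)
print(p_qplot)
```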

ggplot2 histogram qplot

Histogram using geom_histogram()

With geom_histogram(), the histogram is likewise created by supplying just the numeric variable.
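A minimal sketch of this form, again using the simulated stand-in data frame dat (an assumption, not the original dataset):

```r
library(ggplot2)

# Simulated stand-in data (assumption)
set.seed(1234)
dat <- data.frame(
  cond   = factor(rep(c("A", "B"), each = 200)),
  rating = c(rnorm(200), rnorm(200, mean = 0.8))
)

# Map the numeric variable to x and add the histogram layer
p_hist <- ggplot(dat, aes(x = rating)) +
  geom_histogram()
print(p_hist)
```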

ggplot2 histogram using geom_histogram

Although the plots from both approaches look similar, in practice geom_histogram() is more widely used, since the options for qplot() are more confusing and qplot() has been deprecated in recent ggplot2 releases.

Note that creating the histograms above triggered the following message:

stat_bin() using bins = 30. Pick better value with binwidth.

It can be addressed by setting the binwidth explicitly.

Adjusting binwidth

To construct a histogram, the first step is to bin the range of values, i.e. divide the entire range into a series of intervals, and then count how many values fall into each interval.

So, a histogram forms bins from numeric data, and the area of each bin indicates the frequency of occurrences. Changing the bin size therefore changes the overall appearance of the histogram and can suggest a different distribution and spread for the same values.

Note that the height of a bin does not necessarily indicate how many values fall within it. It is the height multiplied by the width of the bin that indicates the frequency of occurrences. Only when the bins (bars) are equally wide, as they are by default in geom_histogram(), does the height alone represent the frequency.

In ggplot2, the bin size can be changed using the binwidth argument. Now let’s explore how changing the bin size affects the histogram by creating two histograms with different bin sizes.

Let’s first create a histogram with a binwidth of 0.5 units.

Then let’s create the second histogram with a binwidth of 0.1 units.
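The two plots could be built along these lines. This is a sketch on the simulated stand-in data frame dat (an assumption), and the object names p_wide and p_narrow are illustrative:

```r
library(ggplot2)

# Simulated stand-in data (assumption)
set.seed(1234)
dat <- data.frame(
  cond   = factor(rep(c("A", "B"), each = 200)),
  rating = c(rnorm(200), rnorm(200, mean = 0.8))
)

# Wider bins give a smoother, coarser picture
p_wide <- ggplot(dat, aes(x = rating)) +
  geom_histogram(binwidth = 0.5)

# Narrower bins show more detail but also more noise
p_narrow <- ggplot(dat, aes(x = rating)) +
  geom_histogram(binwidth = 0.1)

print(p_wide)
print(p_narrow)
```

Setting binwidth explicitly also silences the message about the default of 30 bins.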

Now let’s compare the histograms.

Using binwidth in ggplot histogram

As we can see, changing the bin size produces histograms that suggest different distributions and spreads for the same data. So, choosing the right bin size is important to get useful information from the histogram.

Customizing histogram

Now let’s see how to customize the histogram by changing the outline, colors, title, axis labels etc.

Changing histogram outline and fill colors

The outline and fill color of a histogram can be changed using the color and fill arguments of geom_histogram().

The color argument sets the outline color, and fill sets the color used inside the bins.

For the basic histogram above, let’s change the outline color to red and the fill color to grey.
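A sketch of that change on the simulated stand-in data frame dat (an assumption). The binwidth of 0.5 is also an illustrative choice, and ggplot2 accepts both the colour and color spellings:

```r
library(ggplot2)

# Simulated stand-in data (assumption)
set.seed(1234)
dat <- data.frame(
  cond   = factor(rep(c("A", "B"), each = 200)),
  rating = c(rnorm(200), rnorm(200, mean = 0.8))
)

# Constant colours go outside aes(): red outline, grey fill
p_col <- ggplot(dat, aes(x = rating)) +
  geom_histogram(binwidth = 0.5, colour = "red", fill = "grey")
print(p_col)
```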

histogram outline and fill color in ggplot

Adding title

A title can be added to a histogram using the ggtitle() function of ggplot2. Let’s set the title of the above histogram to “histogram with ggplot2”.
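For instance, on the simulated stand-in data frame dat (an assumption), the title layer could be added like this:

```r
library(ggplot2)

# Simulated stand-in data (assumption)
set.seed(1234)
dat <- data.frame(
  cond   = factor(rep(c("A", "B"), each = 200)),
  rating = c(rnorm(200), rnorm(200, mean = 0.8))
)

# ggtitle() is added to the plot like any other layer
p_title <- ggplot(dat, aes(x = rating)) +
  geom_histogram(binwidth = 0.5, colour = "red", fill = "grey") +
  ggtitle("histogram with ggplot2")
print(p_title)
```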

adding title to histogram

Customizing axis labels

Axis labels can be customized using scale_x_continuous() and scale_y_continuous(). To change a label, pass the desired text as a string to the name argument.
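A short sketch on the simulated stand-in data frame dat (an assumption). The label texts "Rating" and "Count" are placeholders chosen for this example:

```r
library(ggplot2)

# Simulated stand-in data (assumption)
set.seed(1234)
dat <- data.frame(
  cond   = factor(rep(c("A", "B"), each = 200)),
  rating = c(rnorm(200), rnorm(200, mean = 0.8))
)

# The name argument of each scale becomes the axis label
p_labels <- ggplot(dat, aes(x = rating)) +
  geom_histogram(binwidth = 0.5) +
  scale_x_continuous(name = "Rating") +
  scale_y_continuous(name = "Count")
print(p_labels)
```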

customizing axis labels in ggplot histogram

Changing axis ticks

Let’s change the x-axis ticks to appear at every 3 units rather than 2 by passing breaks = seq(-4, 4, 3) to scale_x_continuous(). The arguments of seq() give the start point, the end point and the increment, respectively.

Let’s also set where the y-axis begins and ends by adding the argument limits = c(0, 100) to scale_y_continuous().
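Both settings combined might look as follows. This is a sketch on the simulated stand-in data frame dat (an assumption) with an illustrative binwidth of 0.5. Bars taller than the upper limit would be dropped with a warning, so the limit has to suit the data:

```r
library(ggplot2)

# Simulated stand-in data (assumption)
set.seed(1234)
dat <- data.frame(
  cond   = factor(rep(c("A", "B"), each = 200)),
  rating = c(rnorm(200), rnorm(200, mean = 0.8))
)

# seq(-4, 4, 3) steps from -4 towards 4 in increments of 3
x_breaks <- seq(-4, 4, 3)

p_ticks <- ggplot(dat, aes(x = rating)) +
  geom_histogram(binwidth = 0.5) +
  scale_x_continuous(breaks = x_breaks) +   # tick positions on the x-axis
  scale_y_continuous(limits = c(0, 100))    # fixed start and end of the y-axis
print(p_ticks)
```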

The histogram with new axis ticks looks as below.

histogram with customized axis ticks

Using transformed scales

Let’s transform the x and y axes and see how the transformation affects the ggplot histogram.

Transforming x-axis

Let’s first transform the x-axis by taking the square root of its values using scale_x_sqrt().
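A sketch of the x-axis transformation on the simulated stand-in data frame dat (an assumption). Because this stand-in contains negative ratings, R is expected to warn about NaNs and removed rows when the plot is built:

```r
library(ggplot2)

# Simulated stand-in data (assumption); it includes negative ratings
set.seed(1234)
dat <- data.frame(
  cond   = factor(rep(c("A", "B"), each = 200)),
  rating = c(rnorm(200), rnorm(200, mean = 0.8))
)

# Square-root x scale: negative ratings cannot be transformed and are dropped
p_xsqrt <- ggplot(dat, aes(x = rating)) +
  geom_histogram() +
  scale_x_sqrt()
print(p_xsqrt)
```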

The histogram with new transformed x-axis looks as below.

transforming x axis in histogram

When this transformation is applied, the negative ratings have no real square root, so they become non-finite values and are removed with a warning.

Hence the part of the data with negative x-values is not displayed in the above histogram.

Transforming y-axis

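Let’s now transform the y-axis, first by taking the square root of its values and then by reversing it.
This can be done using scale_y_sqrt() and scale_y_reverse() as below. Since a plot holds only one y scale at a time, each is added to its own copy of the histogram.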
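One way to sketch the two variants on the simulated stand-in data frame dat (an assumption). The object names p_base, p_ysqrt and p_yrev are illustrative:

```r
library(ggplot2)

# Simulated stand-in data (assumption)
set.seed(1234)
dat <- data.frame(
  cond   = factor(rep(c("A", "B"), each = 200)),
  rating = c(rnorm(200), rnorm(200, mean = 0.8))
)

# Shared base histogram
p_base <- ggplot(dat, aes(x = rating)) +
  geom_histogram(binwidth = 0.5)

p_ysqrt <- p_base + scale_y_sqrt()     # counts drawn on a square-root scale
p_yrev  <- p_base + scale_y_reverse()  # y-axis flipped, bars hang from the top

print(p_ysqrt)
print(p_yrev)
```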

The histograms for the transformed y-axis look as below.

Note that for transformed scales, binwidth applies to the transformed data, and the bins have constant width on the transformed scale.

transforming y axis in histogram

Adding lines to a histogram

Vertical and horizontal lines can be added to a histogram using geom_vline() and geom_hline() of ggplot2.

Now let’s see how to add a vertical line along the mean rating to the above histogram.
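A sketch of the mean line on the simulated stand-in data frame dat (an assumption). The red dashed styling is an illustrative choice, and the mean is computed once and passed as a constant so that a single line is drawn:

```r
library(ggplot2)

# Simulated stand-in data (assumption)
set.seed(1234)
dat <- data.frame(
  cond   = factor(rep(c("A", "B"), each = 200)),
  rating = c(rnorm(200), rnorm(200, mean = 0.8))
)

mean_rating <- mean(dat$rating)

# Vertical line at the mean rating, drawn on top of the bars
p_mean <- ggplot(dat, aes(x = rating)) +
  geom_histogram(binwidth = 0.5, colour = "black", fill = "white") +
  geom_vline(xintercept = mean_rating, colour = "red", linetype = "dashed")
print(p_mean)
```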

And the histogram looks as below.

adding mean line to histogram

Histogram with density

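We can also create histograms with density instead of count on the y-axis. This is done by mapping y to the computed density inside geom_histogram(), as aes(y = ..density..). Recent ggplot2 releases prefer the equivalent spelling aes(y = after_stat(density)).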
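A sketch using the newer after_stat() spelling (available from ggplot2 3.3.0) on the simulated stand-in data frame dat (an assumption). With older versions, ..density.. plays the same role:

```r
library(ggplot2)

# Simulated stand-in data (assumption)
set.seed(1234)
dat <- data.frame(
  cond   = factor(rep(c("A", "B"), each = 200)),
  rating = c(rnorm(200), rnorm(200, mean = 0.8))
)

# Put the computed density, not the count, on the y-axis
p_dens <- ggplot(dat, aes(x = rating)) +
  geom_histogram(aes(y = after_stat(density)), binwidth = 0.5)
print(p_dens)
```

On the density scale the bar areas add up to 1, which is what allows a density curve to be drawn over the same axes.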

As we can see the histogram has been plotted with density instead of count on the y axis.

histogram with density

Let’s customize this further by adding a density curve to the above histogram.

Adding a normal density curve

We can also overlay a density curve on top of the histogram to judge how closely the data follows a normal distribution. To do so, we add geom_density() with the alpha and fill parameters, which set the transparency and fill color of the curve. We have used alpha=.2 and yellow as the fill color in this case. Strictly speaking, geom_density() draws a smoothed kernel density estimate of the data rather than a fitted normal curve, so it is a bell-shaped result that suggests normality. Note that the curve will not line up with the bars if count is used instead of density on the y-axis, because the two are on different scales.

The code to overlay the density curve looks as given below.
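A sketch of the overlay on the simulated stand-in data frame dat (an assumption). The dashed stat_function() layer is an addition that is not described in the article: it draws a normal density with the sample mean and standard deviation, for comparison with the kernel estimate from geom_density():

```r
library(ggplot2)

# Simulated stand-in data (assumption)
set.seed(1234)
dat <- data.frame(
  cond   = factor(rep(c("A", "B"), each = 200)),
  rating = c(rnorm(200), rnorm(200, mean = 0.8))
)

p_curve <- ggplot(dat, aes(x = rating)) +
  # Histogram on the density scale so the curves share its y-axis
  geom_histogram(aes(y = after_stat(density)), binwidth = 0.5,
                 colour = "black", fill = "white") +
  # Kernel density estimate, semi-transparent yellow as in the text
  geom_density(alpha = 0.2, fill = "yellow") +
  # Added for comparison: normal density with the sample mean and sd
  stat_function(fun = dnorm,
                args = list(mean = mean(dat$rating), sd = sd(dat$rating)),
                linetype = "dashed")
print(p_curve)
```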

geom_density in ggplot histogram

As we can see, the density curve is roughly bell-shaped, which suggests that the data is close to normally distributed.

Customizing gradient

We can also add a gradient to the color scheme that varies with the frequency of the values, using scale_fill_gradient(). To add the gradient, also change the aes(y = ..count..) argument in geom_histogram() to aes(fill = ..count..) so that the color is based on the count values. Let’s set the color to yellow for the lower counts and red for the higher ones.

The code to customize gradient looks as below.
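A sketch on the simulated stand-in data frame dat (an assumption), written with after_stat(count), which recent ggplot2 releases prefer over ..count..:

```r
library(ggplot2)

# Simulated stand-in data (assumption)
set.seed(1234)
dat <- data.frame(
  cond   = factor(rep(c("A", "B"), each = 200)),
  rating = c(rnorm(200), rnorm(200, mean = 0.8))
)

# Fill each bar by its own count, from yellow (low) to red (high)
p_grad <- ggplot(dat, aes(x = rating)) +
  geom_histogram(aes(fill = after_stat(count)), binwidth = 0.5) +
  scale_fill_gradient(name = "count", low = "yellow", high = "red")
print(p_grad)
```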

customize gradient

As we can see, in the above histogram the color is changed from yellow to red based on the count of values.

Histogram with categories

Using ggplot2, it is possible to create more than one histogram in the same plot. Now let’s see how to create a stacked histogram for the two categories A and B in the cond column of the dataset.

Stacked histograms can be created by mapping the fill aesthetic inside aes() in the ggplot() call. Let’s map fill to cond and see what the histogram looks like.
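A minimal sketch on the simulated stand-in data frame dat (an assumption). Stacking is the default position for geom_histogram(), so no position argument is needed:

```r
library(ggplot2)

# Simulated stand-in data (assumption)
set.seed(1234)
dat <- data.frame(
  cond   = factor(rep(c("A", "B"), each = 200)),
  rating = c(rnorm(200), rnorm(200, mean = 0.8))
)

# Mapping fill to cond splits the bars by category and stacks them
p_stack <- ggplot(dat, aes(x = rating, fill = cond)) +
  geom_histogram(binwidth = 0.5)
print(p_stack)
```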

Histogram with categories

We can see that two histograms have been created for the two categories A and B, differentiated by color. By default, ggplot creates a stacked histogram as above. Let’s customize this further by creating overlaid and interleaved histograms using the position argument of geom_histogram().

Overlaid histogram

Overlaid histograms are created by setting the argument position="identity". We have also set alpha=.5 for transparency, so that bars hidden behind others remain visible.
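On the simulated stand-in data frame dat (an assumption), an overlaid version could be sketched as:

```r
library(ggplot2)

# Simulated stand-in data (assumption)
set.seed(1234)
dat <- data.frame(
  cond   = factor(rep(c("A", "B"), each = 200)),
  rating = c(rnorm(200), rnorm(200, mean = 0.8))
)

# position = "identity" draws both categories from the baseline;
# alpha = 0.5 keeps the overlapping bars visible
p_overlay <- ggplot(dat, aes(x = rating, fill = cond)) +
  geom_histogram(binwidth = 0.5, alpha = 0.5, position = "identity")
print(p_overlay)
```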

Overlaid histogram

Interleaved histogram

Interleaved histograms can be created by setting the position argument to position="dodge".
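The same idea as a sketch on the simulated stand-in data frame dat (an assumption):

```r
library(ggplot2)

# Simulated stand-in data (assumption)
set.seed(1234)
dat <- data.frame(
  cond   = factor(rep(c("A", "B"), each = 200)),
  rating = c(rnorm(200), rnorm(200, mean = 0.8))
)

# position = "dodge" places the bars of A and B side by side within each bin
p_dodge <- ggplot(dat, aes(x = rating, fill = cond)) +
  geom_histogram(binwidth = 0.5, position = "dodge")
print(p_dodge)
```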

interleaved histogram with ggplot

Using facets

Facets can be created for histogram plots using facet_grid(). Here let’s create a facet grid for the histograms of the categories A and B of cond by adding facet_grid(cond ~ .) to the ggplot call.
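A sketch on the simulated stand-in data frame dat (an assumption). The formula cond ~ . puts one row of the grid per category:

```r
library(ggplot2)

# Simulated stand-in data (assumption)
set.seed(1234)
dat <- data.frame(
  cond   = factor(rep(c("A", "B"), each = 200)),
  rating = c(rnorm(200), rnorm(200, mean = 0.8))
)

# One panel per level of cond, stacked vertically with a shared x-axis
p_facet <- ggplot(dat, aes(x = rating)) +
  geom_histogram(binwidth = 0.5, colour = "black", fill = "white") +
  facet_grid(cond ~ .)
print(p_facet)
```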

facets in histogram plots

As we can see, we have created a facet grid with two histograms for the categories A and B of cond. This can be used when histograms need to be compared or when more than one histogram needs to be plotted in the same graph.

Summary

In this article we have discussed how to create histograms using ggplot2 and the various options for customizing them. We first created a basic histogram using qplot() and geom_histogram() of ggplot2.

We then discussed bin size and how it affects the appearance of a histogram. We customized the histogram by adding a title, axis labels, ticks, a gradient and a mean line. We also discussed the density curve and created a histogram with a density curve to see how closely the data follows a normal distribution.

We then moved on to multiple histograms by creating stacked, interleaved and overlaid histograms for the two categories A and B. Finally, we created a facet grid with two histogram plots.

We hope this article helped you get a good understanding of ggplot2 histograms. Do let us know your feedback about this article below.
