
# Data visualization using ggplot histogram

In this article we will explore what a histogram is, how to create one using ggplot2, and the various ways to customize it.

## What is a histogram?

A histogram is a type of graph commonly used to visualize the univariate distribution of numeric data. The data is displayed as bins, each representing the number of data points that fall within a range of values. The bins and the resulting distribution reveal useful information about the data, such as its central location, spread, and shape. A histogram can also be used to spot outliers and gaps in the data.

A basic histogram for age looks as below.

From the above histogram it can be seen that most people fall within the 50-60 age range, and there are fewer people in the 70-80 and 90-100 ranges. There is also a gap in the histogram for the 80-90 range, which indicates that data for that age range might be missing or not available. A histogram like this can therefore be used to visualize useful information about a continuous numeric variable. Let’s look at how to create histograms and the various options for customizing them.

### Histogram and Bar Charts

Histograms are sometimes confused with bar charts. Although a histogram looks similar to a bar chart, the major difference is that a histogram plots the frequency of occurrences in a continuous data set that has been divided into classes, called bins. Bar charts, on the other hand, are used to plot categorical data.

## Creating histogram using ggplot2

To create a histogram, first install and load the ggplot2 package.

We will use the dataset below to create and explain the histograms. The dataset has two columns, cond and rating. The variable cond is categorical with two categories, A and B, and rating is a continuous numeric variable.

The dataset looks as below.
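The original data is not reproduced here, so the sketch below builds a simulated stand-in with the same two columns. The name `dat`, the seed, the 200 rows per category and the shift of 0.8 for category B are all assumptions made for illustration, not the article's actual values.

```r
# install.packages("ggplot2")  # run once if the package is not installed
library(ggplot2)

# Assumed stand-in for the article's dataset (simulated, not the original data)
set.seed(1234)
dat <- data.frame(
  cond   = factor(rep(c("A", "B"), each = 200)),   # categorical: A or B
  rating = c(rnorm(200), rnorm(200, mean = 0.8))   # continuous numeric
)

head(dat)  # first few rows
str(dat)   # column types
```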

With ggplot2, histograms can be created in two ways:

• qplot()
• geom_histogram()

### Histogram using qplot()

A histogram can be created with qplot() by passing a single numeric argument.
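A minimal sketch of the qplot() call, assuming the simulated `dat` stand-in (not the article's data). Note that qplot() was deprecated in ggplot2 3.4.0, so recent versions may print a deprecation warning when this runs.

```r
library(ggplot2)

# Assumed stand-in for the article's dataset (simulated, not the original data)
set.seed(1234)
dat <- data.frame(
  cond   = factor(rep(c("A", "B"), each = 200)),
  rating = c(rnorm(200), rnorm(200, mean = 0.8))
)

# With a single numeric vector, qplot() defaults to a histogram
p <- qplot(dat$rating)
p
```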

### Histogram using geom_histogram()

A histogram using geom_histogram() is also created by passing just the numeric variable.
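The equivalent geom_histogram() call might look like the sketch below, again assuming the simulated `dat` stand-in. The numeric variable is mapped to the x aesthetic and the layer does the binning.

```r
library(ggplot2)

# Assumed stand-in for the article's dataset (simulated, not the original data)
set.seed(1234)
dat <- data.frame(
  cond   = factor(rep(c("A", "B"), each = 200)),
  rating = c(rnorm(200), rnorm(200, mean = 0.8))
)

# Map the numeric variable to x; geom_histogram() bins and counts it
p <- ggplot(dat, aes(x = rating)) +
  geom_histogram()
p
```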

Although both plots look similar, in practice geom_histogram() is more widely used because the options for qplot() are more confusing.

Note that creating these histograms triggers the following message, which should be addressed by setting a suitable binwidth:

`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

To construct a histogram, the first step is to bin the range of values i.e., divide the entire range of values into a series of intervals and then count how many values fall into each interval.

So, a histogram basically forms bins from numeric data where the area of the bin indicates the frequency of occurrences. Hence changing the bin size would result in changing the overall appearance and would result in histograms with different distribution and spread of the values.

Note that the height of a bin does not necessarily indicate the number of occurrences within that bin. It is the area of the bin, its height multiplied by its width, that indicates the frequency of occurrences. So only when the bins (bars) are equally spaced does the height of a bin represent the frequency of occurrences.
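The relationship between height, width and frequency can be checked with base R. This is a sketch on assumed simulated values with deliberately unequal bin widths (the break points are arbitrary choices for illustration): the density height times the bin width times the sample size recovers the count in each bin.

```r
# Assumed example data (simulated, not the article's data)
set.seed(1234)
x <- rnorm(200)

# Unequal-width bins: narrow near the center, wide in the tails
breaks <- c(-5, -1, -0.5, 0, 0.5, 1, 5)
h <- hist(x, breaks = breaks, plot = FALSE)

widths <- diff(h$breaks)                     # width of each bin
recovered <- h$density * widths * length(x)  # area * n = count

# Heights (densities) differ from counts, but the areas match them
data.frame(width = widths, height = h$density,
           count = h$counts, recovered = recovered)
```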

In ggplot2, the bin size can be changed using the binwidth argument. Now let’s explore how changing the bin size affects the histogram by creating two histograms with different bin sizes.

Let’s first create a histogram with a binwidth of 0.5 units.

Next, let’s create the second histogram with a binwidth of 0.1 units.
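The two histograms could be built as in this sketch, assuming the simulated `dat` stand-in. The object names `p_wide` and `p_narrow` are hypothetical.

```r
library(ggplot2)

# Assumed stand-in for the article's dataset (simulated, not the original data)
set.seed(1234)
dat <- data.frame(
  cond   = factor(rep(c("A", "B"), each = 200)),
  rating = c(rnorm(200), rnorm(200, mean = 0.8))
)

# Wider bins: smoother outline, less detail
p_wide <- ggplot(dat, aes(x = rating)) +
  geom_histogram(binwidth = 0.5)

# Narrower bins: more detail, but also more noise
p_narrow <- ggplot(dat, aes(x = rating)) +
  geom_histogram(binwidth = 0.1)

p_wide
p_narrow
```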

Now let’s compare the histograms.

As we can see changing the binsize has created histograms with different distribution and spread of data. So, choosing the right binsize is important to get useful information from the histogram.

## Customizing histogram

Now let’s see how to customize the histogram by changing the outline, colors, title, axis labels etc.

### Changing histogram outline and fill colors

The outline and color of a histogram can be changed using the color and fill arguments of geom_histogram().

Color represents the outline color and fill represents the color to be filled inside the bins.

For the above basic histogram, let’s change the outline color to red and the fill color to grey.
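A minimal sketch of the color change, assuming the simulated `dat` stand-in and an illustrative binwidth of 0.5:

```r
library(ggplot2)

# Assumed stand-in for the article's dataset (simulated, not the original data)
set.seed(1234)
dat <- data.frame(
  cond   = factor(rep(c("A", "B"), each = 200)),
  rating = c(rnorm(200), rnorm(200, mean = 0.8))
)

# color sets the bar outline, fill sets the bar interior
p <- ggplot(dat, aes(x = rating)) +
  geom_histogram(binwidth = 0.5, color = "red", fill = "grey")
p
```

Because color and fill are given outside aes(), they are fixed values applied to every bar rather than mappings from the data.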

A title can be added to a histogram using ggtitle() from ggplot2. Let’s set the title of the above histogram to “histogram with ggplot2”.
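Adding the title might look like this sketch, assuming the simulated `dat` stand-in and an illustrative binwidth of 0.5:

```r
library(ggplot2)

# Assumed stand-in for the article's dataset (simulated, not the original data)
set.seed(1234)
dat <- data.frame(
  cond   = factor(rep(c("A", "B"), each = 200)),
  rating = c(rnorm(200), rnorm(200, mean = 0.8))
)

# ggtitle() adds a plot title above the panel
p <- ggplot(dat, aes(x = rating)) +
  geom_histogram(binwidth = 0.5, color = "red", fill = "grey") +
  ggtitle("histogram with ggplot2")
p
```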

### Customizing axis labels

Labels can be customized using scale_x_continuous() and scale_y_continuous(). We add the desired name to the name argument as a string to change the labels.
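As a sketch, assuming the simulated `dat` stand-in, the axis names can be set as below. The label strings "Rating" and "Count" are illustrative choices, not taken from the article.

```r
library(ggplot2)

# Assumed stand-in for the article's dataset (simulated, not the original data)
set.seed(1234)
dat <- data.frame(
  cond   = factor(rep(c("A", "B"), each = 200)),
  rating = c(rnorm(200), rnorm(200, mean = 0.8))
)

# The name argument of each scale becomes the axis label
p <- ggplot(dat, aes(x = rating)) +
  geom_histogram(binwidth = 0.5) +
  scale_x_continuous(name = "Rating") +
  scale_y_continuous(name = "Count")
p
```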

### Changing axis ticks

Let’s change the x-axis ticks to appear at every 3 units rather than 2 by passing breaks = seq(-4, 4, 3) to scale_x_continuous(). The arguments of seq() give the start point, the end point, and the increment, respectively.

Let’s also set where the y-axis begins and ends by adding the argument limits = c(0, 100) to scale_y_continuous().
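Both changes can be combined as in this sketch, assuming the simulated `dat` stand-in and an illustrative binwidth of 0.5. With real data it is worth checking that no bar is taller than the upper limit, since values outside scale limits are dropped by default.

```r
library(ggplot2)

# Assumed stand-in for the article's dataset (simulated, not the original data)
set.seed(1234)
dat <- data.frame(
  cond   = factor(rep(c("A", "B"), each = 200)),
  rating = c(rnorm(200), rnorm(200, mean = 0.8))
)

# seq(from, to, by): ticks start at -4 and step by 3 without passing 4
tick_positions <- seq(-4, 4, 3)
print(tick_positions)

p <- ggplot(dat, aes(x = rating)) +
  geom_histogram(binwidth = 0.5) +
  scale_x_continuous(breaks = tick_positions) +
  scale_y_continuous(limits = c(0, 100))
p
```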

The histogram with new axis ticks looks as below.

### Using transformed scales

Let’s transform the x and y axes and see how each transformation affects the ggplot histogram.

#### Transforming x-axis

Let’s first transform the x-axis to a square-root scale using scale_x_sqrt().
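A sketch of the square-root x-axis, assuming the simulated `dat` stand-in and an illustrative binwidth of 0.1. Expect warnings when it runs, because the simulated ratings include negative values.

```r
library(ggplot2)

# Assumed stand-in for the article's dataset (simulated, not the original data)
set.seed(1234)
dat <- data.frame(
  cond   = factor(rep(c("A", "B"), each = 200)),
  rating = c(rnorm(200), rnorm(200, mean = 0.8))
)

# scale_x_sqrt() transforms x before binning, so binwidth is in sqrt units
p <- ggplot(dat, aes(x = rating)) +
  geom_histogram(binwidth = 0.1) +
  scale_x_sqrt()
p
```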

The histogram with new transformed x-axis looks as below.

When this transformation is applied, the non-finite values it produces are removed: the square root of a negative number is NaN, so negative ratings are dropped with a warning.

Hence negative x-values are not displayed in the above histogram.

#### Transforming y-axis

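Let’s now transform the y-axis, first by taking the square root of the counts and then by reversing the axis.
This can be done using scale_y_sqrt() and scale_y_reverse() respectively. A plot keeps only one y scale, so each is applied to its own plot; adding both to the same plot makes the later one replace the earlier.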
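The two y-axis transformations might be sketched as follows, assuming the simulated `dat` stand-in. The names `base`, `p_sqrt` and `p_rev` are hypothetical.

```r
library(ggplot2)

# Assumed stand-in for the article's dataset (simulated, not the original data)
set.seed(1234)
dat <- data.frame(
  cond   = factor(rep(c("A", "B"), each = 200)),
  rating = c(rnorm(200), rnorm(200, mean = 0.8))
)

# Shared base plot, reused for both transformations
base <- ggplot(dat, aes(x = rating)) +
  geom_histogram(binwidth = 0.5)

p_sqrt <- base + scale_y_sqrt()     # bar heights on a square-root scale
p_rev  <- base + scale_y_reverse()  # y-axis flipped, bars hang downward

p_sqrt
p_rev
```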

The histograms with the transformed y-axis look as below.

Note that for the transformed scales, binwidth applies to the transformed data and the bins have constant width on the transformed scale.

### Adding lines to a histogram

Vertical and horizontal lines can be added to a histogram using geom_vline() and geom_hline() of ggplot2.

Now let’s see how to add a vertical line along the mean rating to the above histogram.
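A sketch of the mean line, assuming the simulated `dat` stand-in. The red dashed styling is an illustrative choice.

```r
library(ggplot2)

# Assumed stand-in for the article's dataset (simulated, not the original data)
set.seed(1234)
dat <- data.frame(
  cond   = factor(rep(c("A", "B"), each = 200)),
  rating = c(rnorm(200), rnorm(200, mean = 0.8))
)

mean_rating <- mean(dat$rating)

# geom_vline() draws a vertical line at the given x position
p <- ggplot(dat, aes(x = rating)) +
  geom_histogram(binwidth = 0.5, color = "black", fill = "white") +
  geom_vline(xintercept = mean_rating, color = "red", linetype = "dashed")
p
```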

The resulting histogram looks as below.

### Histogram with density

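We can also create histograms with density instead of count on the y-axis. This can be done by mapping the y aesthetic inside geom_histogram() as aes(y = ..density..). In ggplot2 3.4.0 and later the dot-dot notation is deprecated in favor of aes(y = after_stat(density)).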
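A minimal sketch of a density histogram, assuming the simulated `dat` stand-in. It uses after_stat(density), which assumes ggplot2 3.3.0 or later; on older versions ..density.. would be needed instead.

```r
library(ggplot2)

# Assumed stand-in for the article's dataset (simulated, not the original data)
set.seed(1234)
dat <- data.frame(
  cond   = factor(rep(c("A", "B"), each = 200)),
  rating = c(rnorm(200), rnorm(200, mean = 0.8))
)

# Map y to the computed density so the bar areas sum to 1
p <- ggplot(dat, aes(x = rating)) +
  geom_histogram(aes(y = after_stat(density)), binwidth = 0.5)
p
```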

As we can see the histogram has been plotted with density instead of count on the y axis.

Let’s customize this further by overlaying a density curve on the above histogram.

#### Adding a density curve

We can overlay a density curve on top of the histogram to judge how closely the data follows a normal distribution. The curve is added with geom_density(), which draws a kernel density estimate of the data rather than a fitted normal curve, so the comparison with a bell shape is visual. The alpha and fill parameters set the transparency and fill color of the curve; we have used alpha=.2 and yellow in this case. Note that the curve is drawn on the density scale, so it will not line up with the bars if the histogram uses count instead of density on the y-axis.

The code to overlay the density curve is given below.
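One way the overlay could be written, assuming the simulated `dat` stand-in and ggplot2 3.3.0 or later for after_stat():

```r
library(ggplot2)

# Assumed stand-in for the article's dataset (simulated, not the original data)
set.seed(1234)
dat <- data.frame(
  cond   = factor(rep(c("A", "B"), each = 200)),
  rating = c(rnorm(200), rnorm(200, mean = 0.8))
)

# Histogram on the density scale plus a semi-transparent density curve
p <- ggplot(dat, aes(x = rating)) +
  geom_histogram(aes(y = after_stat(density)), binwidth = 0.5,
                 color = "black", fill = "white") +
  geom_density(alpha = 0.2, fill = "yellow")
p
```

If a theoretical normal curve is wanted instead, stat_function() with dnorm and the sample mean and standard deviation is a common alternative; that is a different technique from the one described above.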

As we can see, the above histogram appears to follow a normal distribution closely.

We can also add a gradient to the color scheme that varies with the frequency of the values using scale_fill_gradient(). To add the gradient, also change aes(y = ..count..) in geom_histogram() to aes(fill = ..count..) so that the fill color depends on the count. Let’s use yellow for low counts and red for high ones.

The code to customize the gradient is given below.
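A sketch of the gradient fill, assuming the simulated `dat` stand-in. It uses after_stat(count), the current spelling of ..count.. (assumes ggplot2 3.3.0 or later), and the legend title "Count" is an illustrative choice.

```r
library(ggplot2)

# Assumed stand-in for the article's dataset (simulated, not the original data)
set.seed(1234)
dat <- data.frame(
  cond   = factor(rep(c("A", "B"), each = 200)),
  rating = c(rnorm(200), rnorm(200, mean = 0.8))
)

# Fill each bar by its count: yellow for low counts, red for high counts
p <- ggplot(dat, aes(x = rating)) +
  geom_histogram(aes(fill = after_stat(count)), binwidth = 0.5) +
  scale_fill_gradient("Count", low = "yellow", high = "red")
p
```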

As we can see, in the above histogram the color is changed from yellow to red based on the count of values.

### Histogram with categories

Using ggplot2 it is possible to create more than one histogram in the same plot. Now let’s see how to create a stacked histogram for the two categories A and B in the cond column in the dataset.

Stacked histograms can be created by mapping the fill aesthetic in ggplot(). Let’s set fill to cond and see how the histogram looks.
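A minimal sketch of the stacked histogram, assuming the simulated `dat` stand-in and an illustrative binwidth of 0.5:

```r
library(ggplot2)

# Assumed stand-in for the article's dataset (simulated, not the original data)
set.seed(1234)
dat <- data.frame(
  cond   = factor(rep(c("A", "B"), each = 200)),
  rating = c(rnorm(200), rnorm(200, mean = 0.8))
)

# Mapping fill to cond splits the bars by category; the default is stacking
p <- ggplot(dat, aes(x = rating, fill = cond)) +
  geom_histogram(binwidth = 0.5)
p
```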

We can see that two histograms have been created for the two categories A and B, differentiated by color. By default, ggplot2 creates a stacked histogram as above. Let’s customize this further by creating overlaid and interleaved histograms using the position argument of geom_histogram().

#### Overlaid histogram

Overlaid histograms are created by setting the argument position = "identity". We have also set alpha = .5 for transparency.
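As a sketch under the same assumptions (simulated `dat`, illustrative binwidth of 0.5):

```r
library(ggplot2)

# Assumed stand-in for the article's dataset (simulated, not the original data)
set.seed(1234)
dat <- data.frame(
  cond   = factor(rep(c("A", "B"), each = 200)),
  rating = c(rnorm(200), rnorm(200, mean = 0.8))
)

# position = "identity" draws both groups from the baseline, one over the other
p <- ggplot(dat, aes(x = rating, fill = cond)) +
  geom_histogram(binwidth = 0.5, alpha = 0.5, position = "identity")
p
```

The transparency matters here: without it, the group drawn last would hide the bars behind it.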

#### Interleaved histogram

Interleaved histograms can be created by setting the position argument to position = "dodge".
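The interleaved version differs only in the position argument. A sketch, assuming the simulated `dat` stand-in:

```r
library(ggplot2)

# Assumed stand-in for the article's dataset (simulated, not the original data)
set.seed(1234)
dat <- data.frame(
  cond   = factor(rep(c("A", "B"), each = 200)),
  rating = c(rnorm(200), rnorm(200, mean = 0.8))
)

# position = "dodge" places the bars of A and B side by side within each bin
p <- ggplot(dat, aes(x = rating, fill = cond)) +
  geom_histogram(binwidth = 0.5, position = "dodge")
p
```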

#### Using facets

Facets can be created for histogram plots using facet_grid(). Here let’s create a facet grid for the histograms of categories A and B of cond by adding facet_grid(cond ~ .) to the plot.

As we can see, we have created a facet grid with two histograms for the categories A and B of cond. This can be used when histograms need to be compared or when more than one histogram needs to be plotted in the same graph.
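A sketch of the faceted plot, assuming the simulated `dat` stand-in. The black outline and white fill are illustrative styling choices.

```r
library(ggplot2)

# Assumed stand-in for the article's dataset (simulated, not the original data)
set.seed(1234)
dat <- data.frame(
  cond   = factor(rep(c("A", "B"), each = 200)),
  rating = c(rnorm(200), rnorm(200, mean = 0.8))
)

# cond ~ . puts one panel per category in rows, sharing the x-axis
p <- ggplot(dat, aes(x = rating)) +
  geom_histogram(binwidth = 0.5, color = "black", fill = "white") +
  facet_grid(cond ~ .)
p
```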

## Summary

In this article we have discussed how to create histograms using ggplot2 and its various customization options. We first created a basic histogram using qplot() and geom_histogram() of ggplot2.

We then discussed bin size and how it affects the appearance of a histogram. Next we customized the histogram by adding a title, axis labels, ticks, a gradient, and a mean line. We also discussed density and overlaid a density curve on a histogram to see how closely the data follows a normal distribution.

We then moved on to multiple histograms by creating stacked, interleaved, and overlaid histograms for the two categories A and B. Finally, we created a facet grid with two histogram plots.
