Ultimate Guide to Creating Scatterplots with Seaborn

If you're looking for a smart way to plot informative charts such as scatter plots, you are going to love Seaborn.

What is Seaborn?

Seaborn is a data visualization library for Python. Using seaborn we can draw attractive and informative statistical graphics. Seaborn is built on matplotlib, another Python data visualization library.

You may ask why we need seaborn when we already have matplotlib. The answer is that seaborn works with data frames more easily than matplotlib, because it is closely integrated with pandas. Seaborn also offers several high-level interfaces and customizable themes on top of what matplotlib provides.

What is a Scatterplot?

A scatter plot is a graph that shows how one variable changes with another: each data point is drawn at the position given by its values on the two variables. Scatter plots are drawn with two variables as input.

In statistics, the relationship between two variables is called the correlation between the variables.

We often need to visualize the correlation between two quantitative variables, and for this purpose we use scatter plots.
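To put a number on that relationship, here is a minimal sketch of one common measure, the Pearson correlation coefficient, written in plain Python. The function name pearson_r and the sample values are made up for illustration; the article itself does not say which correlation measure it has in mind.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations from the means (unnormalized covariance).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Made-up bills and tips, for illustration only.
bills = [10.0, 20.0, 30.0, 40.0]
tips = [1.0, 2.0, 3.0, 4.0]
r = pearson_r(bills, tips)  # the data is perfectly linear, so r is 1 up to rounding
```

A value near 1 or -1 shows up in a scatter plot as points lying close to a straight line, while a value near 0 shows up as a shapeless cloud.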

Seaborn provides a number of ways to create scatter plots that give insight into the data.

How to install Seaborn?

To install seaborn we can use either pip or conda, as below:

pip install seaborn

or

conda install seaborn

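However, seaborn has the prerequisites below. These minimum versions are the ones that applied when this guide was written; recent seaborn releases require newer versions.

NumPy version >= 1.9.3

SciPy version >= 0.14.0

Matplotlib version >= 1.4.3

Pandas version >= 0.15.2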

After installation we can import seaborn as below:

import seaborn as sns

Here sns is an alias for seaborn.
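If you are unsure which versions you have, the sketch below prints the installed version of each package using only the standard library. The helper name installed_version is hypothetical, and the sketch assumes Python 3.8 or newer, where importlib.metadata is available.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string of a package, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Print one line per package; a missing package shows up as None.
for name in ["numpy", "scipy", "matplotlib", "pandas", "seaborn"]:
    print(name, installed_version(name))
```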

Steps to create scatterplots with Seaborn

The basic steps for creating scatter plots with Seaborn are as below:

1. Import libraries:

To create a scatterplot we first need to import seaborn, as below. We use it both to load the data, which in this case is the well-known tips dataset, and to draw the plots.

import seaborn as sns

2. Get the data

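The seaborn library offers built-in data sets, and one of them is the tips dataset. We can load it into the tips_data variable using the load_dataset function. Note that load_dataset fetches the data from seaborn's online example repository, so it needs an internet connection the first time. The code is as below.

tips_data = sns.load_dataset("tips")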

3. Plot the basic graph

We can draw a basic scatterplot of the data in two columns, tip and total_bill, using the seaborn function scatterplot. As the code below shows, we pass it three arguments: x, y and data.

sns.scatterplot(x='tip', y='total_bill', data=tips_data)
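For readers without network access, the same call can be tried on a hand-made data frame. The sketch below is only an illustration: the rows are made up to mimic two columns of the tips dataset, the Agg backend is an assumption so that it runs without a display, and the explicit ax argument is used just to keep the example self-contained.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: render off-screen; drop this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Made-up rows that only mimic two columns of the tips dataset.
demo = pd.DataFrame({
    "total_bill": [10.5, 21.0, 15.3, 30.2, 18.7, 25.9],
    "tip": [1.5, 3.0, 2.0, 5.0, 3.5, 4.0],
})

fig, ax = plt.subplots()
sns.scatterplot(x="tip", y="total_bill", data=demo, ax=ax)

# seaborn uses the column names as axis labels.
x_label, y_label = ax.get_xlabel(), ax.get_ylabel()
```

Calling plt.show() or fig.savefig(...) afterwards displays or stores the figure.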

4. Add the marker:

In the image above we are using the default marker. The symbol used to represent each data point is called a marker, and the default marker here is a blue circle. The first customization we are going to try is changing the marker to 'D' (a diamond). We can do so by adding marker='D' as a parameter to the scatterplot function, as the code below demonstrates.

sns.scatterplot(x='tip', y='total_bill', data=tips_data, marker='D')

5. Add the hue

The next thing we can add to scatter plots is the hue parameter. By passing a third column of the data (time in this case) as hue, we can bring a third variable into the scatter plot. Here we can see how tip and total bill relate to whether it was lunch time or dinner time. Note the legend titled time, which says that blue data points are for lunch and orange ones are for dinner.

sns.scatterplot(x='tip', y='total_bill', data=tips_data, marker='D', hue='time')
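As a rough offline illustration of hue, the sketch below colours made-up rows by a time column and reads the legend entries back from the axes. The data frame, the Agg backend and the explicit ax argument are assumptions made to keep the example self-contained.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: render off-screen; drop this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Made-up rows that only mimic three columns of the tips dataset.
demo = pd.DataFrame({
    "total_bill": [10.5, 21.0, 15.3, 30.2, 18.7, 25.9],
    "tip": [1.5, 3.0, 2.0, 5.0, 3.5, 4.0],
    "time": ["Lunch", "Dinner", "Lunch", "Dinner", "Lunch", "Dinner"],
})

fig, ax = plt.subplots()
sns.scatterplot(x="tip", y="total_bill", data=demo, hue="time", ax=ax)

# Each level of the hue column gets its own legend entry.
legend_labels = [text.get_text() for text in ax.get_legend().get_texts()]
```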

6. Add the colour palette:

The next thing we can add to scatter plots is the palette parameter. The palette sets the colours used for the different levels of the hue variable. In this case the palette value is set to deep.

sns.scatterplot(x='tip', y='total_bill', data=tips_data, marker='D', hue='time', palette='deep')
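Besides a palette name, palette can also be given as a dictionary that maps each hue level to a colour. The sketch below tries this on made-up rows; the colour choices, the data frame, the Agg backend and the explicit ax argument are all assumptions for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: render off-screen; drop this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Made-up rows that only mimic three columns of the tips dataset.
demo = pd.DataFrame({
    "total_bill": [10.5, 21.0, 15.3, 30.2, 18.7, 25.9],
    "tip": [1.5, 3.0, 2.0, 5.0, 3.5, 4.0],
    "time": ["Lunch", "Dinner", "Lunch", "Dinner", "Lunch", "Dinner"],
})

# One colour per level of the hue variable (arbitrary choices).
colours = {"Lunch": "tab:blue", "Dinner": "tab:red"}

fig, ax = plt.subplots()
sns.scatterplot(x="tip", y="total_bill", data=demo,
                hue="time", palette=colours, ax=ax)

# The points carry one face colour per row; collect the distinct ones.
distinct_colours = {tuple(c) for c in ax.collections[0].get_facecolors()}
```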

7. Add the size parameter:

sns.scatterplot(x='tip', y='total_bill', data=tips_data, marker='D', hue='time', size='size')

We can also add the size parameter to scatter plots, as shown in the line of code above. The size parameter names a grouping variable that will produce points with different sizes. It can be either categorical or numeric.
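The range of point sizes can be controlled with the sizes parameter, which the article does not use. The sketch below is an illustration on made-up rows, with the (20, 200) range chosen arbitrarily; the Agg backend and the explicit ax argument are assumptions as well.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: render off-screen; drop this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Made-up rows; "size" stands for the party size, as in the tips dataset.
demo = pd.DataFrame({
    "total_bill": [10.5, 21.0, 15.3, 30.2, 18.7, 25.9],
    "tip": [1.5, 3.0, 2.0, 5.0, 3.5, 4.0],
    "size": [2, 3, 2, 4, 2, 6],
})

fig, ax = plt.subplots()
sns.scatterplot(x="tip", y="total_bill", data=demo,
                size="size", sizes=(20, 200), ax=ax)

# One marker area per row; larger parties get larger points.
point_sizes = list(ax.collections[0].get_sizes())
```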

8. Add the style parameter:

sns.scatterplot(x='tip', y='total_bill', data=tips_data, marker='D', hue='time', style='day')

We can also add the style parameter to scatter plots, as shown in the line of code above. The style parameter names a grouping variable that will produce points with different markers. The style variable can have a numeric type but will always be treated as categorical.
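To choose which marker each level gets, scatterplot also accepts a markers argument, which the article does not use. A minimal sketch on made-up rows follows; the marker choices, the data frame, the Agg backend and the explicit ax argument are assumptions for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: render off-screen; drop this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Made-up rows that only mimic three columns of the tips dataset.
demo = pd.DataFrame({
    "total_bill": [10.5, 21.0, 15.3, 30.2, 18.7, 25.9],
    "tip": [1.5, 3.0, 2.0, 5.0, 3.5, 4.0],
    "day": ["Sat", "Sun", "Sat", "Sun", "Sat", "Sun"],
})

fig, ax = plt.subplots()
# Map each level of the style variable to a marker (both are filled markers).
sns.scatterplot(x="tip", y="total_bill", data=demo,
                style="day", markers={"Sat": "o", "Sun": "X"}, ax=ax)

legend_labels = [text.get_text() for text in ax.get_legend().get_texts()]
```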

9. Add the legend parameter:

sns.scatterplot(x='tip', y='total_bill', data=tips_data, marker='D', hue='time', style='day', legend=False)

We can also pass the legend parameter to scatter plots, as shown in the line of code above. Using the legend parameter we can request a full legend (legend='full') or turn the legend off (legend=False).
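The two settings can be compared side by side. The sketch below draws the same made-up data twice and checks whether each axes ends up with a legend; the data frame, the Agg backend and the two-panel layout are assumptions for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: render off-screen; drop this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Made-up rows that only mimic three columns of the tips dataset.
demo = pd.DataFrame({
    "total_bill": [10.5, 21.0, 15.3, 30.2, 18.7, 25.9],
    "tip": [1.5, 3.0, 2.0, 5.0, 3.5, 4.0],
    "time": ["Lunch", "Dinner", "Lunch", "Dinner", "Lunch", "Dinner"],
})

fig, (ax_on, ax_off) = plt.subplots(1, 2)
sns.scatterplot(x="tip", y="total_bill", data=demo,
                hue="time", legend="full", ax=ax_on)
sns.scatterplot(x="tip", y="total_bill", data=demo,
                hue="time", legend=False, ax=ax_off)

# True where a legend was drawn, False where it was suppressed.
has_legend = [ax_on.get_legend() is not None, ax_off.get_legend() is not None]
```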

10. Add the opacity parameter:

sns.scatterplot(x='tip', y='total_bill', data=tips_data, alpha=0.3)

We can also add an opacity parameter, named alpha, to scatter plots, as shown in the line of code above. Using alpha we can control the opacity of the data points. It is a float between 0 and 1: the lower the value, the more transparent the points, which helps when many points overlap.
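As a quick check that the value reaches the plotted points, the sketch below draws made-up rows with alpha=0.3 and reads the opacity back from the plotted collection. The data frame, the Agg backend and the explicit ax argument are assumptions for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: render off-screen; drop this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Made-up rows that only mimic two columns of the tips dataset.
demo = pd.DataFrame({
    "total_bill": [10.5, 21.0, 15.3, 30.2, 18.7, 25.9],
    "tip": [1.5, 3.0, 2.0, 5.0, 3.5, 4.0],
})

fig, ax = plt.subplots()
sns.scatterplot(x="tip", y="total_bill", data=demo, alpha=0.3, ax=ax)

# The opacity is stored on the collection that holds the points.
point_alpha = ax.collections[0].get_alpha()
```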

How to draw a Scatterplot using the regplot function:

The scatterplot function is not the only way to draw a scatterplot with seaborn. We can create scatter plots using the regplot function as well. However, because regplot is built for regression, by default it draws a regression line through the data, as shown in the figure below.

sns.regplot(x='tip', y='total_bill', data=tips_data)

1. Adding fit_reg parameter:

Though the regplot function of seaborn adds a line to the data points by default, we can remove that line from the plot using the fit_reg parameter. We just need to set this parameter to False, as shown below.

sns.regplot(x='tip', y='total_bill', data=tips_data, fit_reg=False)
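The effect of fit_reg can be checked by counting the lines on the axes. The sketch below draws made-up rows with and without the regression line; the data frame, the Agg backend and the two-panel layout are assumptions for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: render off-screen; drop this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Made-up rows that only mimic two columns of the tips dataset.
demo = pd.DataFrame({
    "total_bill": [10.5, 21.0, 15.3, 30.2, 18.7, 25.9, 12.4, 28.1],
    "tip": [1.5, 3.0, 2.0, 5.0, 3.5, 4.0, 1.8, 4.4],
})

fig, (ax_line, ax_plain) = plt.subplots(1, 2)
sns.regplot(x="tip", y="total_bill", data=demo, ax=ax_line)
sns.regplot(x="tip", y="total_bill", data=demo, fit_reg=False, ax=ax_plain)

# The regression line is the only Line2D object that regplot adds here.
line_counts = (len(ax_line.lines), len(ax_plain.lines))
```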

2. Adding color parameter:

Another important parameter that can be used with the regplot function is color. You can change the color of the data points using a color value such as g for green or r for red. The full set of accepted color values is listed in the matplotlib documentation. This is shown in the code and image below.

sns.regplot(x='tip', y='total_bill', data=tips_data, color='g')

sns.regplot(x='tip', y='total_bill', data=tips_data, color='r')

3. Adding marker parameter:

Conveniently, the marker parameter works in the regplot function as well.

In the code below we use + as the marker, so each data point is drawn as a + instead of a circle.

sns.regplot(x='tip', y='total_bill', data=tips_data, color='r', marker='+')

4. Adding CI parameter:

CI stands for confidence interval. In statistics, a confidence interval is an interval estimate computed from the observed data. It has an associated confidence level that quantifies how confident we are that the true parameter is captured by the interval. In regplot, ci sets the size of the confidence interval, as a percentage, drawn as a translucent band around the regression line. The code below shows how to add the ci parameter to the regplot function.

sns.regplot(x='tip', y='total_bill', data=tips_data, color='g', ci=90)
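Passing ci=None switches the band off altogether. The sketch below compares the two settings on made-up rows by counting the collections on each axes (the points, plus the band when one is drawn); the data frame, the Agg backend and the two-panel layout are assumptions for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: render off-screen; drop this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Made-up rows that only mimic two columns of the tips dataset.
demo = pd.DataFrame({
    "total_bill": [10.5, 21.0, 15.3, 30.2, 18.7, 25.9, 12.4, 28.1],
    "tip": [1.5, 3.0, 2.0, 5.0, 3.5, 4.0, 1.8, 4.4],
})

fig, (ax_band, ax_bare) = plt.subplots(1, 2)
sns.regplot(x="tip", y="total_bill", data=demo, ci=90, ax=ax_band)
sns.regplot(x="tip", y="total_bill", data=demo, ci=None, ax=ax_bare)

# Scatter points plus confidence band on the left, scatter points only on the right.
collection_counts = (len(ax_band.collections), len(ax_bare.collections))
```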

5. Adding x_bins parameter:

We can also plot a continuous x variable divided into discrete bins. For this we use the x_bins parameter and pass an integer such as 5 to it. The binning affects only how the scatter points are drawn; the regression is still fit to the original data.

sns.regplot(x='tip', y='total_bill', data=tips_data, x_bins=5)
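To see what binning does to the plotted points, the sketch below bins eight made-up rows into at most three groups and counts the points that end up on the axes. The data frame, the bin count, the Agg backend and the explicit ax argument are assumptions for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: render off-screen; drop this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Made-up rows that only mimic two columns of the tips dataset.
demo = pd.DataFrame({
    "total_bill": [10.5, 21.0, 15.3, 30.2, 18.7, 25.9, 12.4, 28.1],
    "tip": [1.5, 3.0, 2.0, 5.0, 3.5, 4.0, 1.8, 4.4],
})

fig, ax = plt.subplots()
sns.regplot(x="tip", y="total_bill", data=demo, x_bins=3, ax=ax)

# The first collection holds the binned points: one per occupied bin, not one per row.
binned_points = len(ax.collections[0].get_offsets())
```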

6. Adding x_estimator parameter:

When x takes discrete values, we can plot the mean of y and its confidence interval at each unique x value instead of every individual point. For this we use the x_estimator parameter and pass np.mean as its value, which requires importing numpy. The code below shows how to add the x_estimator parameter to the regplot function.

import numpy as np

sns.regplot(x='tip', y='total_bill', data=tips_data, x_estimator=np.mean)
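x_estimator is easiest to read when x has only a few distinct values. The sketch below uses a made-up party-size column as x, which is a different choice of columns from the article's, and counts the plotted points; the data frame, the Agg backend and the explicit ax argument are assumptions for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: render off-screen; drop this line in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Made-up rows; "size" stands for the party size, as in the tips dataset.
demo = pd.DataFrame({
    "size": [2, 3, 2, 4, 2, 4, 3, 2],
    "tip": [1.5, 3.0, 2.0, 5.0, 3.5, 4.0, 1.8, 4.4],
})

fig, ax = plt.subplots()
sns.regplot(x="size", y="tip", data=demo, x_estimator=np.mean, ax=ax)

# One plotted point per unique x value, placed at the mean tip for that party size.
mean_points = len(ax.collections[0].get_offsets())
```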

Scatter plot using the lmplot function:

scatterplot and regplot are not the only functions that can be used to draw a scatterplot with seaborn. Seaborn provides another function, lmplot, for the same purpose. However, because lmplot is also built for regression, by default it draws a regression line through the data, as shown in the figure below.

sns.lmplot(x='tip', y='total_bill', data=tips_data)
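One practical difference from regplot is that lmplot creates its own figure and returns a FacetGrid, so it can split the data into panels. The sketch below illustrates this with the col parameter, which the article does not use; the data frame and the Agg backend are assumptions for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: render off-screen; drop this line in a notebook
import pandas as pd
import seaborn as sns

# Made-up rows that only mimic three columns of the tips dataset.
demo = pd.DataFrame({
    "total_bill": [10.5, 21.0, 15.3, 30.2, 18.7, 25.9, 12.4, 28.1],
    "tip": [1.5, 3.0, 2.0, 5.0, 3.5, 4.0, 1.8, 4.4],
    "time": ["Lunch", "Dinner", "Lunch", "Dinner",
             "Lunch", "Dinner", "Lunch", "Dinner"],
})

# col="time" draws one panel per level of the time column.
grid = sns.lmplot(x="tip", y="total_bill", data=demo, col="time")

grid_type = type(grid).__name__
panel_count = grid.axes.size
```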

1. Adding fit_reg parameter to lmplot:

Though the lmplot function of seaborn adds a line to the data points by default, we can remove that line from the plot using the fit_reg parameter. We just need to set this parameter to False. The code below shows how to add the fit_reg parameter to the lmplot function.

sns.lmplot(x='tip', y='total_bill', data=tips_data, fit_reg=False)

2. Adding x_estimator parameter:

Using lmplot we can also plot the mean of y and its confidence interval at each unique value of a discrete x variable. For this we use the x_estimator parameter and pass np.mean as its value, which requires importing numpy. The code below shows how to add the x_estimator parameter to the lmplot function.

import numpy as np

sns.lmplot(x='tip', y='total_bill', data=tips_data, x_estimator=np.mean)

3. Adding x_bins parameter:

Using lmplot we can also plot a continuous x variable divided into discrete bins. For this we use the x_bins parameter and pass an integer such as 5 to it. The code below shows how to add the x_bins parameter to the lmplot function.

sns.lmplot(x='tip', y='total_bill', data=tips_data, x_bins=5)

Conclusion:

Thus we saw how seaborn can be used to draw graphs from datasets for exploratory data analysis. One reason seaborn is so popular is that labels from a data frame are automatically propagated to graphs: as you saw, the column name tip appears on the x-axis and the column name total_bill on the y-axis. With plain matplotlib we would have to set these axis labels ourselves.

Hence we can say that seaborn is one of the best libraries available to data scientists for exploratory data analysis.
