
Confusion Matrix in R


What is a Confusion Matrix?

A confusion matrix, also known as an error matrix, is a table layout used to visualize the performance of a classification model on data for which the true values are already known.

A typical confusion matrix looks as below:

[Figure: a typical 2 × 2 confusion matrix, with the actual classes along one dimension and the predicted classes along the other]

As seen above, a confusion matrix has two dimensions: the actual class and the predicted class. Each row of the matrix represents the instances in a predicted class, while each column represents the instances in an actual class (or vice versa).

Confusion matrix for Bank Marketing dataset

In this section I will explain the confusion matrix of the Bank Marketing dataset. 

The dataset can be downloaded from https://archive.ics.uci.edu/ml/machine-learning-databases/00222/.

Here I have used bank.csv, which contains 10% of the examples (4,521 rows) and 17 attributes.

The goal of the Bank Marketing dataset is to predict if the client will subscribe to the term deposit.

The dataset contains the following categorical variables:

  1. job – job type of the client,
  2. marital – marital status of the client,
  3. education – education level,
  4. default – whether the credit is in default (yes/no),
  5. housing – whether there is a housing loan (yes/no),
  6. loan – whether the customer currently has a personal loan (yes/no),
  7. contact – type of contact,
  8. poutcome – result of the previous marketing campaign contact, and
  9. y – whether the client actually subscribed to the term deposit (yes/no).

Here Attributes (1) through (8) are input variables, and (9) is considered the outcome.

The outcome “y” is either yes (meaning the customer will subscribe to the term deposit) or no (meaning the customer won’t subscribe). For example, the confusion matrix of a Naive Bayes classifier run on 100 clients to predict whether they would subscribe to the term deposit looks as below:

                 Predicted: yes   Predicted: no   Total
  Actual: yes          3                8           11
  Actual: no           2               87           89

From the above table it can be seen that of the 11 clients who actually subscribed to the term deposit, the model predicted 3 subscribed and 8 not subscribed. Similarly, of the 89 clients who did not subscribe to the term deposit, the model predicted 2 subscribed and 87 not subscribed.

Now let’s see the basic terms derived from a confusion matrix that can be used to analyze our classifier results.

Terminologies from a confusion matrix

True positives (TP) are the number of positive instances the classifier correctly identified as positive. For the above bank marketing case this is the number of correct classifications of the “subscribed” class, i.e. potential clients that are willing to subscribe to a term deposit, which is 3 in this case.

False positives (FP) are the number of instances the classifier identified as positive but that are, in reality, negative. For the above case this is the number of incorrect classifications into the “subscribed” class: potential clients that are not willing to subscribe to a term deposit but whom the model predicted as “subscribed”.

In the above case, 2 customers have been predicted as “subscribed” but in reality belong to the “non-subscribed” class.

True negatives (TN) are the number of negative instances the classifier correctly identified as negative. For the above case, this is the number of correct classifications of the “Not Subscribed” class, i.e. potential clients that are not willing to subscribe to a term deposit, which is 87 in this case.

False negatives (FN) are the number of instances classified as negative but that are, in reality, positive. For the above case, this is the number of incorrect classifications into the “Not Subscribed” class: potential clients that are willing to subscribe to a term deposit but whom the model classified as “Not Subscribed”.

In the above case, 8 customers have been predicted as not willing to subscribe to the term deposit but in reality belong to the “subscribed” class.

TP and TN are the correct guesses. A good classifier should have large TP and TN and small (ideally zero) numbers for FP and FN.

Accuracy: the percentage of correctly classified instances among all instances. It is defined as the sum of TP and TN divided by the total number of instances:

Accuracy = (TP + TN) / (TP + TN + FP + FN)

A good model should have a high accuracy score, but a high accuracy score alone does not guarantee that the model performs well.

Let’s see the other measures that can be used to better evaluate the performance of a classifier.

True positive rate (TPR), also called recall, shows the percentage of positive instances correctly classified as positive:

TPR = TP / (TP + FN)

In the above case TPR is the percentage of customers correctly predicted as “subscribed”.

False positive rate (FPR) shows the percentage of negative instances incorrectly classified as positive. The FPR is also called the false alarm rate or the type I error rate:

FPR = FP / (FP + TN)

In the above case FPR is the percentage of customers of the “non-subscribed” class who have been incorrectly classified as “subscribed”.

The false negative rate (FNR) shows what percent of positives the classifier marked as negatives. It is also known as the miss rate or the type II error rate:

FNR = FN / (TP + FN)

Note that the sum of TPR and FNR is 1.

In the above case FNR is the percentage of customers who are willing to subscribe to the term deposit but the model has predicted as belonging to the “not-subscribed” class.

A well-performing model should have a high TPR (ideally 1) and a low FPR and FNR (ideally 0).

Precision is the percentage of correctly classified positive instances among all instances classified as positive:

Precision = TP / (TP + FP)

In the above case, precision is the percentage of correctly classified customers of the “subscribed” class among all the customers classified as “subscribed”.
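As a quick check, all of the measures above can be computed in R from the example counts (TP = 3, FN = 8, FP = 2, TN = 87):

```r
# Counts read off the example table: TP = 3, FN = 8, FP = 2, TN = 87.
TP <- 3; FN <- 8; FP <- 2; TN <- 87

(TP + TN) / (TP + TN + FP + FN)   # accuracy:   90/100 = 0.90
TP / (TP + FN)                    # TPR/recall: 3/11  ~ 0.273
FP / (FP + TN)                    # FPR:        2/89  ~ 0.022
FN / (TP + FN)                    # FNR:        8/11  ~ 0.727
TP / (TP + FP)                    # precision:  3/5   = 0.60
```

Note that TPR + FNR = 3/11 + 8/11 = 1, as stated above.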




Confusion Matrix in R

Now let’s see how to create a confusion matrix in R and analyze the performance of a classifier.

Here we will be using the Bank Marketing dataset discussed above.

First, let’s import the libraries and the dataset.
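A minimal sketch of this step, assuming the randomForest and caret packages are installed and bank.csv has been extracted into the working directory (the UCI file is semicolon-separated):

```r
library(randomForest)   # Random Forest model
library(caret)          # confusionMatrix()

# Read the data; factors are needed later for classification.
bank <- read.csv("bank.csv", sep = ";", stringsAsFactors = TRUE)
head(bank)
```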

The bank marketing dataset looks as below


Now let’s explore the structure of the dataset.
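Assuming the data was read into a dataframe named bank, this is typically done with str():

```r
str(bank)   # shows the type and a few sample values for each of the 17 columns
```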

Now let’s create a new dataframe bank_new with only age, job, marital, education, housing, loan, contact, poutcome and y.

Then convert all of these variables except the outcome variable (y) to numeric and store them in the new dataframe bank_new.

Finally, rename the columns as needed.
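The three steps above could be sketched as follows; the column subset follows the names listed in the text, while the renamed column names are an assumption, since the original listing is not shown:

```r
# 1. Keep only the columns named in the text (plus the outcome y).
bank_new <- bank[, c("age", "job", "marital", "education",
                     "housing", "loan", "contact", "poutcome", "y")]

# 2. Convert every predictor to numeric codes; keep the outcome y as-is.
bank_new[, -9] <- lapply(bank_new[, -9], function(col)
  if (is.numeric(col)) col else as.numeric(as.factor(col)))

# 3. Rename the columns (illustrative names).
colnames(bank_new) <- c("Age", "Job", "Marital", "Education",
                        "Housing", "Loan", "Contact", "Poutcome", "y")
```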

Now let’s split the dataset into train and test data with 80% train data and the remaining 20% as test.
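One way to do the 80/20 split (the seed value is an arbitrary choice, kept only for reproducibility):

```r
set.seed(42)                                  # arbitrary seed
n <- nrow(bank_new)
train_index <- sample(n, size = round(0.8 * n))  # 80% of the row indices
train <- bank_new[train_index, ]
test  <- bank_new[-train_index, ]
```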

Now let’s use the Random Forest algorithm to predict the customer classes y.
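A minimal fit might look like this; the object name model is my choice, and y is assumed to be a factor so that randomForest performs classification rather than regression:

```r
model <- randomForest(y ~ ., data = train)
print(model)   # includes the OOB error estimate and a confusion matrix
```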

Random Forest output:

The Random Forest output displays, by default, a confusion matrix for the model’s predictions on the train data. We will discuss the results of a confusion matrix in more detail below.

Now let’s use the above model to predict the classes for test data.
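Assuming the fitted Random Forest object is named model, prediction on the held-out data is a single call:

```r
predicted <- predict(model, newdata = test)
head(predicted)
```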

Now predicted holds the predicted results and test$y holds the actual results. Let’s use them to create a confusion matrix with the confusionMatrix() function from the “caret” package.
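A sketch of the call; setting positive = "yes" is my choice so that the “subscribed” class is treated as positive, matching the precision and recall discussion below (caret’s default positive class is the first factor level, “no”):

```r
library(caret)
confusionMatrix(data = predicted, reference = test$y, positive = "yes")
```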

The results of the above confusion matrix look as below.

As we can see, the accuracy of the above predictions is 0.8994, i.e. 89.9% of the predicted results have been correctly classified.

Now let’s see how to create the same confusion matrix using the table() function.
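With the predicted and test objects from above, this could be as simple as (the dimension labels are my addition, for readability):

```r
conf_tab <- table(Predicted = predicted, Actual = test$y)
conf_tab
```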

The results of the above table look as below

Now let’s use the confusion table to calculate accuracy, precision and recall.
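With “yes” as the positive class, the three measures can be read off the table like this (a sketch; the variable names are mine):

```r
tab <- table(Predicted = predicted, Actual = test$y)

TP <- tab["yes", "yes"]   # predicted yes, actually yes
FP <- tab["yes", "no"]    # predicted yes, actually no
FN <- tab["no", "yes"]    # predicted no, actually yes
TN <- tab["no", "no"]     # predicted no, actually no

accuracy  <- (TP + TN) / sum(tab)
precision <- TP / (TP + FP)
recall    <- TP / (TP + FN)
c(accuracy = accuracy, precision = precision, recall = recall)
```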

So, as mentioned, 89.9% of the instances have been correctly classified by the above Random Forest model.

The above precision means that only 65.38% of the customers predicted as “subscribed” actually belong to the “subscribed” class. Now let’s see the recall results.

The above recall means that only 17% of the “subscribed” customers have been correctly classified as “subscribed”.

From the above results, although the overall accuracy of the model seems good, the precision and recall results show that the model still needs improvement in predicting the positive instances.

Conclusion:

In this article we discussed the confusion matrix and its terminology. We also saw how to create a confusion matrix in R using the confusionMatrix() and table() functions, and analyzed the results using accuracy, recall and precision.

Hope this article helped you get a good understanding of the confusion matrix. Do let me know your feedback about this article below.
