
## What is a Confusion Matrix?

A confusion matrix, also known as an error matrix, is a table layout used to visualize the performance of a classification model on data for which the true values are already known.

A typical confusion matrix has two dimensions, namely Actual class and Predicted class. Each row of the matrix represents the number of instances in a predicted class while each column represents the number of instances in an actual class (or vice versa).

## Confusion matrix for Bank Marketing dataset

In this section I will walk through a confusion matrix built on the Bank Marketing dataset.

The dataset can be downloaded from https://archive.ics.uci.edu/ml/machine-learning-databases/00222/.

Here I have used bank.csv, the sample with 10% of the examples (4521 rows) and 17 variables.

The goal of the Bank Marketing dataset is to predict if the client will subscribe to the term deposit.

The train data contains the following categorical variables:

- job – job type of the client,
- marital – marital status of the client,
- education – education level,
- default – if the credit is in default (yes/no),
- housing – if there is a housing loan (yes/no),
- loan – if the customer currently has a personal loan (yes/no),
- contact – type of contact,
- poutcome – result of the previous marketing campaign contact, and
- y – if the client actually subscribed to the term deposit (yes/no).

Here Attributes (1) through (8) are input variables, and (9) is considered the outcome.

The outcome “y” is either yes (meaning the customer will subscribe to the term deposit) or no (meaning the customer won’t subscribe). For example, the confusion matrix of a Naive Bayes classifier on 100 clients, predicting whether they would subscribe to the term deposit, looks as below:

| Actual \ Predicted | Subscribed | Not subscribed |
| ------------------ | ---------- | -------------- |
| Subscribed         | 3          | 8              |
| Not subscribed     | 2          | 87             |

From the above table it can be seen that of the 11 clients who actually subscribed to the term deposit, the model predicted 3 as subscribed and 8 as not subscribed. Similarly, of the 89 clients who did not subscribe to the term deposit, the model predicted 2 as subscribed and 87 as not subscribed.
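This example matrix is easy to reproduce in R; a minimal sketch using `matrix()`, with the counts taken from the 100-client example above (rows are actual classes, columns are predicted classes):

```r
# 100-client example: rows = actual class, columns = predicted class
cm <- matrix(c(3, 8,
               2, 87),
             nrow = 2, byrow = TRUE,
             dimnames = list(Actual    = c("subscribed", "not subscribed"),
                             Predicted = c("subscribed", "not subscribed")))
cm
sum(cm)  # 100 clients in total
```

The row sums recover the actual class counts (11 subscribed, 89 not subscribed), and the column sums the predicted counts (5 predicted subscribed, 95 predicted not subscribed).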

Now let’s see the basic terminology of a confusion matrix, which can be used to analyze our classifier results.

**Terminology of a confusion matrix**

True positives (TP) are the number of positive instances the classifier correctly identified as positive. For the above bank marketing case this is the number of correct classifications of the “subscribed” class, i.e. potential clients that are willing to subscribe to a term deposit, which is 3 in this case.

False positives (FP) are the number of instances the classifier identified as positive but that in reality are negative. For the above case this is the number of incorrect classifications into the “subscribed” class: potential clients that are not willing to subscribe to a term deposit but that the model predicted as belonging to the “subscribed” class.

In the above case, 2 customers have been predicted as “subscribed” but in reality belong to the “not subscribed” class.

True negatives (TN) are the number of negative instances the classifier correctly identified as negative. For the above case, this is the number of correct classifications of the “not subscribed” class, i.e. potential clients that are not willing to subscribe to a term deposit, which is 87 in this case.

False negatives (FN) are the number of instances classified as negative but that in reality are positive. For the above case, this is the number of incorrect classifications into the “not subscribed” class: potential clients that are willing to subscribe to a term deposit but that the model predicted as “not subscribed”.

In the above case, 8 customers have been predicted as not willing to subscribe to the term deposit but in reality belong to the “subscribed” class.

TP and TN are the correct guesses. A good classifier should have large TP and TN and small (ideally zero) numbers for FP and FN.

Accuracy: the percentage of correctly classified instances among all instances. It is defined as the sum of TP and TN divided by the total number of instances.

```
Accuracy = (TP + TN) / (TP + TN + FP + FN) * 100
```

A good model should have a high accuracy score, but a high accuracy score alone does not guarantee that the model performs well.
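Plugging the counts from the 100-client example into this formula (TP = 3, TN = 87, FP = 2, FN = 8, taken from the text above):

```r
TP <- 3; TN <- 87; FP <- 2; FN <- 8  # counts from the 100-client example
accuracy <- (TP + TN) / (TP + TN + FP + FN) * 100
accuracy  # 90
```

An accuracy of 90% looks good, yet the model found only 3 of the 11 actual subscribers, which is exactly why accuracy alone can be misleading.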

Let’s see the other measures that can be used to better evaluate the performance of a classifier.

True positive rate (TPR), also called recall, shows the percentage of positive instances correctly classified as positive.

```
TPR = TP / (TP + FN) * 100
```

In the above case TPR is the percentage of customers correctly predicted as “subscribed”.

False positive rate (FPR) shows the percentage of negative instances incorrectly classified as positive. The FPR is also called the false alarm rate or the type I error rate.

```
FPR = FP / (FP + TN) * 100
```

In the above case FPR is the percentage of customers of the “not subscribed” class who have been incorrectly classified as “subscribed”.

The false negative rate (FNR) shows what percentage of positives the classifier marked as negatives. It is also known as the miss rate or type II error rate. Note that TPR and FNR sum to 100%.

```
FNR = FN / (TP + FN) * 100
```

In the above case FNR is the percentage of customers who are willing to subscribe to the term deposit but whom the model predicted as belonging to the “not subscribed” class.

A well-performing model should have a high TPR (ideally 100%) and a low FPR and FNR (ideally 0%).
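For the 100-client example (TP = 3, FN = 8, FP = 2, TN = 87, counts from the text above) these three rates can be computed directly:

```r
TP <- 3; FN <- 8; FP <- 2; TN <- 87  # counts from the 100-client example
TPR <- TP / (TP + FN) * 100  # recall: ~27.3% of subscribers found
FPR <- FP / (FP + TN) * 100  # false alarm rate: ~2.2%
FNR <- FN / (TP + FN) * 100  # miss rate: ~72.7%
round(c(TPR = TPR, FPR = FPR, FNR = FNR), 1)
```

As expected, TPR and FNR sum to 100%, and the low TPR shows how weak this classifier is on the positive class despite its 90% accuracy.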

Precision is the percentage of correctly classified positive instances among all instances classified as positive.

```
Precision = TP / (TP + FP) * 100
```

In the above case, precision is the percentage of correctly classified customers of the “subscribed” class among all the customers classified as “subscribed”.
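In the 100-client example the model predicted 5 clients as “subscribed” (3 correctly, 2 incorrectly), so:

```r
TP <- 3; FP <- 2  # counts from the 100-client example
precision <- TP / (TP + FP) * 100
precision  # 60: 3 of the 5 predicted subscribers actually subscribed
```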


## Confusion Matrix in R

Now let’s see how to create a confusion matrix in R and analyze the performance of a classifier.

Here we will be using the Bank Marketing dataset discussed above.

First, we import the libraries and the dataset.

```r
# importing libraries
library(randomForest)

# importing the bank marketing dataset
bank <- read.csv("bank.csv", sep = ";")
```

The bank marketing dataset looks as below

```
> head(bank)
  age         job marital education default balance housing loan  contact day month
1  30  unemployed married   primary      no    1787      no   no cellular  19   oct
2  33    services married secondary      no    4789     yes  yes cellular  11   may
3  35  management  single  tertiary      no    1350     yes   no cellular  16   apr
4  30  management married  tertiary      no    1476     yes  yes  unknown   3   jun
5  59 blue-collar married secondary      no       0     yes   no  unknown   5   may
6  35  management  single  tertiary      no     747      no   no cellular  23   feb
  duration campaign pdays previous poutcome  y
1       79        1    -1        0  unknown no
2      220        1   339        4  failure no
3      185        1   330        1  failure no
4      199        4    -1        0  unknown no
5      226        1    -1        0  unknown no
6      141        2   176        3  failure no
```

Now let’s explore the structure of the dataset.

```r
str(bank)
## 'data.frame':    4521 obs. of  17 variables:
##  $ age      : int  30 33 35 30 59 35 36 39 41 43 ...
##  $ job      : Factor w/ 12 levels "admin.","blue-collar",..: 11 8 5 5 2 5 7 10 3 8 ...
##  $ marital  : Factor w/ 3 levels "divorced","married",..: 2 2 3 2 2 3 2 2 2 2 ...
##  $ education: Factor w/ 4 levels "primary","secondary",..: 1 2 3 3 2 3 3 2 3 1 ...
##  $ default  : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ balance  : int  1787 4789 1350 1476 0 747 307 147 221 -88 ...
##  $ housing  : Factor w/ 2 levels "no","yes": 1 2 2 2 2 1 2 2 2 2 ...
##  $ loan     : Factor w/ 2 levels "no","yes": 1 2 1 2 1 1 1 1 1 2 ...
##  $ contact  : Factor w/ 3 levels "cellular","telephone",..: 1 1 1 3 3 1 1 1 3 1 ...
##  $ day      : int  19 11 16 3 5 23 14 6 14 17 ...
##  $ month    : Factor w/ 12 levels "apr","aug","dec",..: 11 9 1 7 9 4 9 9 9 1 ...
##  $ duration : int  79 220 185 199 226 141 341 151 57 313 ...
##  $ campaign : int  1 1 1 4 1 2 1 2 2 1 ...
##  $ pdays    : int  -1 339 330 -1 -1 176 330 -1 -1 147 ...
##  $ previous : int  0 4 1 0 0 3 2 0 0 2 ...
##  $ poutcome : Factor w/ 4 levels "failure","other",..: 4 1 1 4 4 1 2 4 4 1 ...
##  $ y        : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
```

Now let’s create a new data frame bank_new with only Age, Job, Marital, Education, Housing, Loan, contact, poutcome and y.

We convert all of these variables except the outcome variable (y) to numeric and store the result in the new data frame bank_new.

```r
bank_new <- data.frame(as.numeric(as.factor(bank$age)),
                       as.numeric(as.factor(bank$job)),
                       as.numeric(as.factor(bank$marital)),
                       as.numeric(as.factor(bank$education)),
                       as.numeric(as.factor(bank$housing)),
                       as.numeric(as.factor(bank$loan)),
                       as.numeric(as.factor(bank$contact)),
                       as.numeric(as.factor(bank$poutcome)),
                       bank$y)
```

Renaming the columns as below

```r
colnames(bank_new) <- c("Age", "Job", "Marital", "Education",
                        "Housing", "Loan", "contact", "poutcome", "y")
```

Now let’s split the dataset into train and test data with 80% train data and the remaining 20% as test.

```r
set.seed(2262)
train_ind <- sample(seq_len(nrow(bank_new)), size = floor(0.80 * nrow(bank_new)))
train <- bank_new[train_ind, ]
test <- bank_new[-train_ind, ]
```

Now let’s use the Random Forest algorithm to predict the customer classes y.

```r
# random forest classifier
fitRF <- randomForest(y ~ ., train)
```

Random Forest output:

```
> fitRF

Call:
 randomForest(formula = y ~ ., data = train)
               Type of random forest: classification
                     Number of trees: 500
No. of variables tried at each split: 2

        OOB estimate of  error rate: 11.31%
Confusion matrix:
      no yes class.error
no  3169  25 0.007827176
yes  384  38 0.909952607
```

The Random Forest output displays by default the confusion matrix for the out-of-bag (OOB) predictions on the train data. We will discuss the results of a confusion matrix in more detail below.
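As an aside, the OOB confusion matrix does not have to be read off the printed summary: `randomForest` stores it in the `confusion` component of the fitted object, and the per-tree OOB error rates in `err.rate`. A small self-contained sketch on the built-in `iris` data (the same components exist for `fitRF` above):

```r
library(randomForest)

set.seed(2262)
# illustrative fit on iris; Species plays the role of the outcome y
fit <- randomForest(Species ~ ., data = iris)

fit$confusion                   # OOB confusion matrix with per-class error
fit$err.rate[fit$ntree, "OOB"]  # overall OOB error rate after the last tree
```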

Now let’s use the above model to predict the classes for test data.

```r
# predicting the classes and storing them in predicted
predicted <- predict(fitRF, test)
```

Now predicted holds the predicted results and test$y the actual results. Let’s use them to create a confusion matrix with the confusionMatrix() function from the “caret” library.

```r
# confusion matrix
library(caret)
confusionMatrix(predicted, test$y)
```

The results of the above confusion matrix look as below:

```
> confusionMatrix(predicted, test$y)
Confusion Matrix and Statistics

          Reference
Prediction  no yes
       no  797  82
       yes   9  17

               Accuracy : 0.8994
                 95% CI : (0.878, 0.9183)
    No Information Rate : 0.8906
    P-Value [Acc > NIR] : 0.2137

                  Kappa : 0.2373

 Mcnemar's Test P-Value : 4.432e-14

            Sensitivity : 0.9888
            Specificity : 0.1717
         Pos Pred Value : 0.9067
         Neg Pred Value : 0.6538
             Prevalence : 0.8906
         Detection Rate : 0.8807
   Detection Prevalence : 0.9713
      Balanced Accuracy : 0.5803

       'Positive' Class : no
```

As we can see, the accuracy of the above predictions is 0.8994, i.e. 89.9% of the test instances have been correctly classified.
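The statistics printed above can also be read programmatically: `caret` returns them in the `overall` and `byClass` components of the `confusionMatrix()` result. A sketch that rebuilds the result from the printed counts rather than re-running the model:

```r
library(caret)

# rebuild the test-set confusion matrix from the printed counts
tab <- as.table(matrix(c(797, 82,
                           9, 17),
                       nrow = 2, byrow = TRUE,
                       dimnames = list(Prediction = c("no", "yes"),
                                       Reference  = c("no", "yes"))))
cm <- confusionMatrix(tab)

cm$overall["Accuracy"]     # ~0.8994
cm$byClass["Sensitivity"]  # ~0.9888 (caret takes "no" as the positive class here)
```

Note that caret treats the first factor level, “no”, as the positive class by default, which is why Sensitivity above refers to the “no” class; pass `positive = "yes"` to change this.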

Now lets see how to create another confusion matrix using the table() function.

```r
confusion_table <- table(predicted, test$y)
```

The results of the above table look as below

```
> confusion_table
predicted  no yes
      no  797  82
      yes   9  17
```

Now let’s use the confusion table to calculate accuracy, precision and recall.

```r
n <- sum(confusion_table)      # number of instances
diag <- diag(confusion_table)  # correctly classified instances
accuracy <- sum(diag) / n      # calculate the accuracy
```

```
> accuracy
[1] 0.8994475
```

So, as mentioned 89.9% of the instances have been correctly classified using the above Random Forest Model.

```r
TP <- confusion_table[2, 2]  # predicted "yes", actually "yes"
FP <- confusion_table[2, 1]  # predicted "yes", actually "no"
FN <- confusion_table[1, 2]  # predicted "no", actually "yes"
precision <- TP / (TP + FP)  # calculate the precision
```

```
> precision
[1] 0.6538462
```

The above precision means that only 65.38% of the customers predicted as “subscribed” actually belong to the “subscribed” class. Now let’s see the recall results.

```r
recall <- TP / (TP + FN)  # calculate the recall
```

```
> recall
[1] 0.1717172
```

The above recall means that only 17% of the “subscribed” customers have been correctly classified as “subscribed”.

From the above results, although the overall accuracy of the model seems good, the precision and recall show that the model still needs improvement in predicting the positive instances.

### Conclusion:

In this article we discussed the confusion matrix and its terminology. We also saw how to create a confusion matrix in R using the confusionMatrix() and table() functions and how to analyze the results using accuracy, recall and precision.

Hope this article helped you get a good understanding of the confusion matrix. Do let me know your feedback about this article below.