Evaluating classification models

Kristin Cooper
5 min read · Jun 6, 2021

Classification questions are the heartbeat of data science for business applications. Will this customer purchase my product? Is this transaction fraudulent? Should we interview this job candidate?

Context is key to determining the best way to evaluate a classifier’s performance. Before relying on accuracy score to evaluate all your models, read on!

What’s a classifier?

Classification questions ask what group a given element belongs to. Binary classifications have two possible options: yes or no, true or false, 1 or 0. Multi-class classifications have more than two possible options: Is this customer a Baby Boomer, Gen X, Millennial, or Gen Z? Is this patient healthy, prediabetic, or diabetic?

Source: https://medium.com/analytics-vidhya/ml06-intro-to-multi-class-classification-e61eb7492ffd

In data science, classification modeling is a supervised machine learning activity. The data scientist gives the computer a labeled dataset, and the computer builds a formula based on the patterns it identifies. When given a new unlabeled element, the computer runs it through the formula to predict which class the element is a part of. This formula is called a classifier, or classification model.

For example, if a data scientist wanted to identify fraudulent transactions, they would train an algorithm by giving it a bunch of historical transactions labeled real or fraudulent. The computer would then build a formula that accounts for patterns in the labeled data it was trained on. When presented with a new transaction, the computer would use the formula to predict whether the transaction is fraudulent or not. The formula would weight certain features, like time and amount, more heavily than others to maximize the accuracy of the model's predictions.
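To make that concrete, here is a minimal sketch of the fit-and-predict cycle in scikit-learn. The tiny dataset, its feature values, and the choice of a decision tree are all hypothetical and purely illustrative.

from sklearn.tree import DecisionTreeClassifier

# Hypothetical labeled history: each transaction is [hour of day, amount]
X_history = [[2, 950.00], [14, 32.50], [3, 880.00], [13, 18.75], [15, 60.00]]
y_history = [1, 0, 1, 0, 0]  # 1 = fraudulent, 0 = real

# The "formula" described above is the fitted model
model = DecisionTreeClassifier()
model.fit(X_history, y_history)

# Predict the class of a new, unlabeled transaction
print(model.predict([[3, 900.00]]))  # e.g. array([1]) -> predicted fraudulent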

How do I build a classifier?

Scikit-learn is the industry pick for machine learning in Python. Sklearn has multiple classes for classification modeling, each with different underlying strategies.

A few common classification methods (each sketched in code after the list) include:

  • K-Nearest Neighbors — predicts the class of an observation by evaluating the data points mathematically closest to it. The "k" nearest neighbors are found, then the majority class among them determines the model's vote for the element in question.
Nearest neighbors classifier. Source: https://www.analyticsvidhya.com/blog/2018/03/introduction-k-neighbours-algorithm-clustering/
  • Decision Tree — creates branches based on features that best split the sample space in two, then continues branching and splitting until the dataset has been cleanly partitioned (or until max_depth or min_samples_split parameters are reached, if set).
Decision tree classifier. Source: https://towardsdatascience.com/an-introduction-to-decision-trees-with-python-and-scikit-learn-1a5ba6fc204f
  • Random Forest — creates multiple decision trees on random subsets of the training data, then uses the class predicted by the most trees in the forest to classify new elements. Aggregating many independently built trees averages out individual trees' errors, which tends to improve accuracy over any single tree.
Random forest with 50 trees. Source: https://towardsdatascience.com/an-introduction-to-random-forest-with-python-and-scikit-learn-acf44e514034
  • Support Vector Machine — finds the decision boundary that maximizes the margin between classes, balancing underfitting and overfitting. The slack parameter controls the trade-off between classifying training points correctly and keeping the margin wide.
Support Vector Machine with slack parameter. Source: https://learnopencv.com/svm-using-scikit-learn-in-python/
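Each of these maps to a scikit-learn estimator class. As a quick sketch, here is how they might be instantiated; the parameter values are illustrative, not tuned recommendations.

from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC

knn = KNeighborsClassifier(n_neighbors=5)                          # "k" neighbors that get a vote
tree = DecisionTreeClassifier(max_depth=5, min_samples_split=10)   # optional stopping rules
forest = RandomForestClassifier(n_estimators=50)                   # 50 trees in the forest
svm = SVC(C=1.0)                                                   # C trades off margin width vs. misclassified points (the "slack")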

Best practice for supervised learning is to split your total sample set into two groups.

  • 80% of the sample becomes the "training set" (X_train / y_train), which is given to methods like the ones above to identify patterns and build the formula.
  • The remaining 20% becomes the "testing set" (X_test / y_test), which is used to evaluate how the model does when presented with new data elements (see the sketch after this list).
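scikit-learn's train_test_split handles this split for you. A minimal sketch, assuming your features live in X and your labels in y, and using a random forest purely as an example model:

from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

# Hold out 20% of the sample for testing; train on the remaining 80%
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = RandomForestClassifier(n_estimators=50)
model.fit(X_train, y_train)       # learn patterns from the labeled training set
y_preds = model.predict(X_test)   # predict classes for the held-out test elements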

Is my classifier any good?

Sklearn also provides a handful of metrics to evaluate the model’s performance. We’ll consider a binary classification example to explain some of these metrics.

Confusion Matrix

The confusion matrix is an easily interpretable visual for showing how a classifier is performing. When given the testing set, the grid plots the predicted classes on the x-axis and the actual classes on the y-axis. In its cells you can see counts or percentages of (extracting these counts in code is sketched just below):

  • True Negatives (TN)— the actual class was 0, and the model correctly predicted this
  • True Positives (TP) — the actual class was 1, and the model correctly predicted this
  • False Positives (FP) — the actual class was 0, but the model incorrectly predicted the element was part of the 1 class
  • False Negatives (FN) — the actual class was 1, but the model incorrectly predicted the element was part of the 0 class.
Sample confusion matrix
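If you want the raw counts rather than a plot, scikit-learn's confusion_matrix() returns the same grid as an array. For binary labels coded 0/1, and assuming y_preds holds the model's predictions on X_test, it can be unpacked like this:

from sklearn.metrics import confusion_matrix

# Rows are actual classes, columns are predicted classes
tn, fp, fn, tp = confusion_matrix(y_test, y_preds).ravel()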

Using scikit-learn, you can pass the model, X_test, and y_test into the plot_confusion_matrix() function to visualize this matrix.

from sklearn.metrics import plot_confusion_matrix
plot_confusion_matrix(model, X_test, y_test, cmap='Blues')  # cmap is optional

I recommend updating the color scheme using the optional cmap parameter as shown above.
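One caveat: plot_confusion_matrix was deprecated and later removed in newer scikit-learn releases. If the import above fails on your version, the equivalent call is ConfusionMatrixDisplay.from_estimator():

from sklearn.metrics import ConfusionMatrixDisplay
ConfusionMatrixDisplay.from_estimator(model, X_test, y_test, cmap='Blues')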

Accuracy Score

Accuracy score explains how good the model is at making correct predictions overall: the share of all predictions that were right.

Accuracy = (TP + TN) / (TN + FP + FN + TP)

Using scikit-learn, you can pass y_test and y_preds = model.predict(X_test) to the accuracy_score() function.

from sklearn.metrics import accuracy_score
y_preds = model.predict(X_test)  # the model's predictions for the test set
accuracy_score(y_test, y_preds)
Precision and Recall. Source: https://en.wikipedia.org/wiki/Precision_and_recall

Precision Score

Precision score explains how good a model is at making True Positive predictions and penalizes it for making False Positive predictions: of everything the model flagged as Positive, how much actually was Positive?

Precision = TP / (TP + FP)

Using scikit-learn, you can pass y_test and y_preds = model.predict(X_test) to the precision_score() function.

from sklearn.metrics import precision_score
precision_score(y_test, y_preds)

Recall Score

Recall score explains how good a model is at finding all of the actual Positives and penalizes it for making False Negative predictions: of everything that actually was Positive, how much did the model catch?

Recall = TP / (TP + FN)

Using scikit-learn, you can pass y_test and y_preds = model.predict(X_test) to the recall_score() function.

from sklearn.metrics import recall_score
recall_score(y_test, y_preds)
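If you want precision, recall, and related metrics for every class in one shot, scikit-learn's classification_report() prints them together:

from sklearn.metrics import classification_report
print(classification_report(y_test, y_preds))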

Context is key

Before finalizing a model, consider industry context and how your model will be applied. These factors may change how you evaluate your model’s performance.


While accuracy is a good, consistent starting point for understanding holistic model performance, it can be misleading on its own, especially when the classes are imbalanced. Depending on the purpose and context of your model, you may care a lot more about Precision or Recall.
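As a hypothetical illustration, imagine a test set of 100 transactions where only 5 are fraudulent. A "model" that predicts real for everything still scores 95% accuracy while catching zero fraud:

from sklearn.metrics import accuracy_score, recall_score

# Hypothetical imbalanced test set: 95 real transactions (0), 5 fraudulent (1)
y_actual = [0] * 95 + [1] * 5
y_lazy = [0] * 100  # predicts every transaction is real

print(accuracy_score(y_actual, y_lazy))  # 0.95 -> looks impressive
print(recall_score(y_actual, y_lazy))    # 0.0  -> misses every fraudulent transaction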

For example, I recently created a diabetes diagnosis classifier so that a stakeholder could target investment and outreach for a health & wellness program.

In this scenario, I cared way more about recall than precision. The consequences of a False Negative prediction meant the difference in a person’s quality of life, and potentially even life or death. Conversely, the consequences of a False Positive were just a possible misallocation of funds.

Check out my diabetes diagnosis model on github to see a classifier in action!


Kristin Cooper

Data science student, digital business consultant, & healthcare enthusiast