Confusion Matrix Calculator

Calculate accuracy, precision, recall, F1 score, MCC, and other classification metrics from a confusion matrix

About the Confusion Matrix Calculator

The Confusion Matrix Calculator is an essential tool for data scientists, machine learning engineers, and researchers who need to evaluate the performance of classification algorithms. While many systems output raw accuracy, a confusion matrix provides the granular detail necessary to understand where a model is failing. By organizing predictions into a 2x2 grid of true positives, true negatives, false positives, and false negatives, users can see exactly how many times a model confused one class for another. This tool is particularly valuable when working with imbalanced datasets, where standard accuracy figures can be deceptive.

This calculator computes a comprehensive suite of performance metrics beyond simple accuracy, including precision, recall (sensitivity), specificity, and the F1 score. It also provides more robust statistical indicators like the Matthews Correlation Coefficient (MCC) and the False Positive Rate. Whether you are tuning a logistic regression for medical diagnosis or evaluating a neural network for fraud detection, this tool translates raw frequency counts into actionable insights, helping you decide if your model is ready for deployment or requires further optimization of its decision threshold.

Formula

Accuracy = (TP + TN) / (TP + TN + FP + FN)
Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
F1 = 2 * (Precision * Recall) / (Precision + Recall)

The formula uses four primary inputs: True Positives (TP) are cases where the model correctly predicted the positive class; True Negatives (TN) are correct predictions of the negative class; False Positives (FP) occur when the model incorrectly predicts a positive result (Type I error); and False Negatives (FN) occur when the model misses a positive result (Type II error). Advanced metrics like the Matthews Correlation Coefficient (MCC) and Balanced Accuracy use these same four values to provide deeper insights into model performance across imbalanced datasets.
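If you want to reproduce these formulas in code, the sketch below is a minimal Python helper. The function name and structure are illustrative assumptions, not the calculator's own implementation; the zero checks simply guard against division by zero for degenerate matrices.

```python
def confusion_matrix_metrics(tp: int, tn: int, fp: int, fn: int) -> dict:
    """Compute core classification metrics from 2x2 confusion matrix counts."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}
```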

Worked examples

Example 1: A fraud detection model is tested on 1,000 transactions. It correctly identifies 20 fraudulent ones (TP) and 930 legitimate ones (TN). It incorrectly flags 10 legitimate ones as fraud (FP) and misses 30 fraudulent ones (FN).

1. Accuracy = (20 + 930) / 1000 = 0.95
2. Precision = 20 / (20 + 10) = 0.6667
3. Recall = 20 / (20 + 30) = 0.40
4. F1 = 2 * (0.6667 * 0.40) / (0.6667 + 0.40) = 0.50

Result: Accuracy: 95%, Precision: 66.67%, Recall: 40%, F1 Score: 50%. Even though accuracy is high, the model is actually quite poor at identifying the minority class.
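If scikit-learn is available, you can cross-check Example 1 by expanding the four counts into label arrays. This is an illustrative sketch, not part of the calculator:

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Expand the Example 1 counts (TP=20, TN=930, FP=10, FN=30) into label arrays.
y_true = [1] * 20 + [0] * 930 + [0] * 10 + [1] * 30
y_pred = [1] * 20 + [0] * 930 + [1] * 10 + [0] * 30

print(accuracy_score(y_true, y_pred))   # 0.95
print(precision_score(y_true, y_pred))  # ~0.6667
print(recall_score(y_true, y_pred))     # 0.40
print(f1_score(y_true, y_pred))         # 0.50
```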

Example 2: A marketing model predicts if 200 customers will subscribe. It accurately predicts 70 subscribers (TP) and 90 non-subscribers (TN). It incorrectly predicts 10 people will subscribe (FP) and misses 30 who actually did (FN).

1. Total = 70 + 90 + 10 + 30 = 200
2. Accuracy = (70 + 90) / 200 = 0.80
3. Precision = 70 / (70 + 10) = 0.875
4. Recall = 70 / (70 + 30) = 0.70
5. F1 = 2 * (0.875 * 0.70) / (0.875 + 0.70) = 0.7778

Result: Accuracy: 80%, Precision: 87.5%, Recall: 70%, F1 Score: 77.78%. This model shows balanced performance with a slight bias toward precision.
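The same arithmetic for Example 2, written out as a short illustrative Python check:

```python
tp, tn, fp, fn = 70, 90, 10, 30

accuracy = (tp + tn) / (tp + tn + fp + fn)          # 0.80
precision = tp / (tp + fp)                          # 0.875
recall = tp / (tp + fn)                             # 0.70
f1 = 2 * precision * recall / (precision + recall)  # ~0.7778

print(accuracy, precision, recall, round(f1, 4))
```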

Frequently asked questions

What is a good F1 score for a confusion matrix?

A high F1 score indicates that a model has a good balance between precision and recall. It is especially useful when you have an uneven class distribution, as it penalizes models that favor one metric significantly over the other.

Why is accuracy not enough for model evaluation?

Accuracy is the proportion of total correct predictions: (TP + TN) / Total. It is often misleading when the dataset is imbalanced; for example, if 95% of users don't churn, a model that always predicts 'not churn' is 95% accurate yet useless for finding churners.
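A quick numeric illustration of that churn scenario (the 95/5 split is the hypothetical one from the answer above):

```python
# 95 users who don't churn (0) and 5 who do (1).
y_true = [0] * 95 + [1] * 5
y_pred = [0] * 100  # a "model" that always predicts 'not churn'

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
true_positives = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
print(accuracy)        # 0.95 -- looks good
print(true_positives)  # 0 -- not a single churner was found
```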

What is the difference between precision and recall in a confusion matrix?

Precision measures how many of the positive predictions were actually correct, whereas Recall measures how many of the actual positive cases the model managed to find. High precision avoids 'false alarms,' while high recall avoids 'missed opportunities'.

What does MCC tell you about a classifier?

The Matthews Correlation Coefficient (MCC) is a more robust statistical measure that produces a high score only if the prediction performs well in all four confusion matrix categories (TP, TN, FP, and FN). It ranges from -1 to +1, where +1 indicates a perfect prediction, 0 is no better than random guessing, and -1 indicates total disagreement between prediction and observation.
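For reference, the MCC can be computed from the same four counts. The snippet below is an illustrative sketch (not the calculator's own code) applied to the Example 1 fraud counts:

```python
from math import sqrt

def mcc(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews Correlation Coefficient from 2x2 confusion matrix counts."""
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

print(round(mcc(20, 930, 10, 30), 3))  # ~0.497 for the Example 1 fraud model
```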

Is sensitivity the same thing as recall?

Yes. Recall and sensitivity are the same metric: both measure the ratio of correctly predicted positive observations to all actual positives, TP / (TP + FN).

Related calculators

5 Number Summary Calculator
Calculate the five-number summary (min, Q1, median, Q3, max) and visualize with a box plot
Absolute Uncertainty Calculator
Calculate absolute and relative uncertainty for measurements and experimental data
Average Rating Calculator
Calculate the weighted average star rating from individual vote counts for reviews and feedback
Accuracy Calculator
Calculate accuracy, precision, and error rates for statistical analysis
Adjusted R-Squared Calculator
Calculate adjusted R² to account for the number of predictors in regression models
AIC/BIC Calculator
Compare statistical models using Akaike and Bayesian Information Criteria for model selection
ANOVA Calculator
Perform one-way Analysis of Variance to test if group means differ significantly