Performance Evaluation

Given a classifier trained on a specific set of samples (the training set), it must be evaluated on a separate set (the validation or test set). From this comparison it is possible to extract metrics that assess the classifier and allow different classifiers to be compared with one another. It is essential that the performance metrics are computed on samples not used during the training phase (the validation set), in order to detect issues such as overfitting of the training data or lack of generalization.
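A minimal Python sketch of this evaluation protocol, assuming a hypothetical classifier object exposing fit() and predict() methods (the names are chosen here only for illustration):

def accuracy(y_true, y_pred):
    # Fraction of samples whose predicted label matches the true one.
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)

def evaluate(classifier, x_train, y_train, x_val, y_val):
    # Train only on the training set, never on the validation samples.
    classifier.fit(x_train, y_train)            # hypothetical interface
    acc_train = accuracy(y_train, classifier.predict(x_train))
    acc_val = accuracy(y_val, classifier.predict(x_val))
    # A large gap between the two values is a symptom of overfitting:
    # good memorization of the training set, poor generalization.
    return acc_train, acc_val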

Once the parameters of the classifier are fixed, the contingency table (Confusion Matrix) can be built:

                        True Value
                        p          n
Classification   p'     TP         FP
                 n'     FN         TN


False Positives (FP) are also referred to as False Alarms. False Negatives (FN) are known as misses.
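As an illustration, a minimal Python sketch that accumulates the four entries of the confusion matrix from ground-truth and predicted binary labels (the boolean encoding of the labels is an assumption of this example):

def confusion_matrix(y_true, y_pred):
    # y_true, y_pred: sequences of booleans (True = positive class).
    tp = fp = fn = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred and truth:
            tp += 1        # true positive
        elif pred and not truth:
            fp += 1        # false positive (false alarm)
        elif truth:
            fn += 1        # false negative (miss)
        else:
            tn += 1        # true negative
    return tp, fp, fn, tn

For example, confusion_matrix([True, False, True], [True, True, False]) returns (1, 1, 1, 0).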

From the table, several performance metrics are typically extracted, such as:

- the true positive rate (recall, or detection rate), TP / (TP + FN);
- the false positive rate (false alarm rate), FP / (FP + TN);
- the precision, TP / (TP + FP);
- the overall accuracy, (TP + TN) / (TP + FP + FN + TN).
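A possible computation of these metrics from the four counts, again only a sketch (returning 0.0 for empty denominators is a choice of this example, not a standard):

def metrics(tp, fp, fn, tn):
    total = tp + fp + fn + tn
    return {
        "recall":    tp / (tp + fn) if tp + fn else 0.0,   # true positive rate
        "fpr":       fp / (fp + tn) if fp + tn else 0.0,   # false positive rate
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "accuracy":  (tp + tn) / total if total else 0.0,
    }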

Each classifier has one or more parameters that, when modified, change the ratio between correct recognitions and the number of false positives. Consequently, it becomes challenging to objectively compare two classifiers, as one may exhibit a higher number of correct detections at the same threshold, but also a higher number of false positives. Therefore, to compare the performance of different binary classifiers obtained from various training sessions, it is common to use curves that vary with this internal threshold of the classifier.

The performance curves typically encountered include:

- the ROC (Receiver Operating Characteristic) curve, which plots the true positive rate against the false positive rate as the threshold varies;
- the Precision-Recall curve, which plots precision against recall over the same threshold sweep.
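A minimal sketch of how a ROC curve can be traced by sweeping the decision threshold over the scores produced by a binary classifier (the score/label representation is an assumption of this example; ties between scores are ignored for brevity):

def roc_curve(scores, labels):
    # scores: classifier outputs, higher means "more positive".
    # labels: ground-truth booleans; both classes are assumed to be present.
    positives = sum(1 for label in labels if label)
    negatives = len(labels) - positives
    # Visiting the samples by decreasing score, lowering the threshold admits
    # one more sample at a time into the set classified as positive.
    tp = fp = 0
    points = [(0.0, 0.0)]            # (false positive rate, true positive rate)
    for _, label in sorted(zip(scores, labels), key=lambda sl: -sl[0]):
        if label:
            tp += 1
        else:
            fp += 1
        points.append((fp / negatives, tp / positives))
    return points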

It is important to note that these indices apply to any problem involving the notion of correct or incorrect results. They are therefore applicable not only to classifiers but also, for example, to the matching of feature points (keypoint correspondences) and similar tasks.

Recently, to enable a more streamlined comparison of classifier performance, functions have been proposed that, when applied to ROC curves, yield a single scalar representing a score of classification quality. These functions are typically averages of samples taken from the ROC curve in the regions of practical interest.
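One common instance of such a score is the area under the ROC curve (AUC); the sketch below approximates it by trapezoidal integration of the points returned by the roc_curve() sketch above. Restricting the integration to a sub-interval of the false positive rate gives a partial AUC focused on the region of practical interest.

def auc(points):
    # points: list of (fpr, tpr) pairs sorted by increasing fpr,
    # e.g. the output of roc_curve() above.
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0    # trapezoid between consecutive points
    return area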
