Not every classification error is the same

In this article, I would like to talk about a common mistake new people approaching Machine Learning and classification algorithm often do. In particular, when we evaluate (and thus train) a classification algorithm, people tend to consider every misclassification equally important and equally bad. We are so deep into our mathematical version of the world that we forget about the consequences of classification errors in the real world.

But let’s start from the beginning. Imagine a simple binary classifier. It takes some input x and return a Boolean value telling use if x belongs to a certain class C or not. When we pass through the algorithm a number of elements, we can identify only 4 possible outcomes:

  • T_PTrue Positive – The algorithm correctly classifies x as a member of C.
  • T_NTrue Negative – The algorithm correctly classifies x as a not a member of C.
  • F_PFalse Positive – The algorithm mistakenly classifies x as a member of C. That is, it says that x belongs to C while this is not true.
  • F_NFalse Negative – The algorithm mistakenly classifies x as a not a member of C. That is, it says that x does not belong to C while this is actually true.

T_P and T_N are the “correct” answers. They represent when the algorithm assigns the element to the right class. Intuitively, we want these two outcomes to represent the majority of our classification outcomes.

On the other hand F_P and $F_N$ are the “wrong answers”  (also called Type I and Type II errors). With the same reasoning, we want the number of this outcomes as close to zero as possible.

To represent our first naïve evaluation of the algorithm, we can produce a “score” function encoding the above reasoning.

F = \frac{2 T_P}{2T_P + F_N + F_P}

This score function is 1 when both F_N and F_P are zero, signaling that our classification is perfect (at least in our testing set). It also go to zero when F_N + F_P represents the full set of outcomes, signaling that our classification algorithm classifies exactly everything wrong.

Note1: a binary classification algorithm that does everything wrong is actually an amazing classifier! You just have to consider the opposite of what it says and you get a perfect classifier!

Note2: Note that, by definition, T_P + T_N + F_P +F_N must be equal to the full set of outcomes. That’s also the reason we can ignore T_N.

Now, you may be satisfied with the result. This seems a good score. But, in reality, it is an awful score for many applications. Not only, by using it for evaluation (and thus tuning and training) will produce an algorithm that it is potentially dangerous.

I’ll let you think “why” for a couple of minutes.

Look at the errors. The above score function considers F_P and F_N equally important. It doesn’t matter if an algorithm commits only false positives or only false negatives. For this score, they are performing the same. This is often bad. To convince yourself, here a couple of examples:

  1. Imagine you are writing a machine learning powered algorithm for bridges and infrastructure maintenance.  The goal is to optimize the government resources and increasing the safety of people by identifying “collapsing” bridges in advance, so that you can send a specialized team to the bridge.
    To do that, the algorithm takes as input a bunch of sensor’s inputs (such as vibration frequency, camera images of connections and pillars, and so on) and produce a binary answer to the question: is the bridge collapsing or not?
    Consider now the effect of false positives or false negatives. To have a false positive means that the algorithm classifies as “collapsing” a bridge that it is actually fine. For this reason, you send a squad to the location and the squad discovers that the bridge is fine. At worst, you have wasted a day of work for your team.
    Consider now a false negative. This means that the algorithm classifies a “collapsing” bridge as “safe”. In this case, you do nothing. But the bridge is collapsing so, some weeks later it actually collapses. The result are a lot of damage and probably deaths.
  2. Imagine now you are writing a medical application to diagnose a deadly infectious disease (I don’t know, Ebola or some other plague). The algorithm takes as input a bunch of medical exams and returns the binary answer to the question “is this person infected?”.
    Consider again the effect of false positives and false negatives. A false positive means that the algorithm classifies as “infected” a sane guy. You start more accurate exam, you put the guy in quarantine and continue with your protocol. After a week of additional exams, you discover that this guy is actually fine. The consequences? You probably gave to this guy a rough week, but other than that is a good ending.
    Consider now a false negative. Your algorithm classifies an infected person as “sane”. So you do nothing. The person goes home, infecting other people and reducing his/her chance of survival because of late intervention. I think you will agree that it is a much worse scenario.

There is in fact a precautionary principle that must be taken into account. In the above examples a false positive is always better than a false negative, but this is not the point. If it is better a false negative or false positive depends on the scenario and on the question we are asking to the algorithm (try to reverse the class of example 1 from “the class of collapsing bridges” to “the class of solid bridges”).

The point is we should consider much more the error whose leads to reversible actions. The famous better safe than sorry. In the case of the bridge is indubitably better to check a safe bridge than ignoring a collapsing one. In the virus example, it is indubitably better to be more scrupulous on a sane patient we think may be infected than ignoring a really infected one.

In my opinion there are very few scenarios where both kind of classification errors are equally bad. In real life, there is always a situation that is marginally better than the other.

So, how do we do it? Stop considering the F-score a good enough score. Start thinking about the consequence of decisions based on your algorithm outcome. A better approach, is to use the F_\beta score.

F_\beta = \frac {(1 + \beta^2) \cdot T_P}{(1 + \beta^2) \cdot T_P + \beta^2 \cdot F_N+ F_P}

With \beta < 1 you are increasing the impact of false positives. With \beta > 1 you are increasing the impact of false negatives. Common choices are \beta=0.5 and \beta=2, but, in the end, it is your call.