How hidden variables in statistical models affect social inequality

Machine learning is becoming ubiquitous and, even with a fancy name, it remains one more tool in the statistical modeler’s belt. Every day, we leak billions of data points about ourselves to companies ready to use them for their own purposes. Data-driven modeling gets more common every day, and mathematical models are the rulers of our lives: they decide where we can work, whether we can get a loan, how many years of jail we deserve, and more.

While this is ethically problematic by itself, a deeper yet simple problem is polluting mathematical modeling around the world: hidden variables, that is, unobserved variables that are a common root cause of the data we are sampling. To understand why they are such a big problem, let’s start with an example. Suppose we want to write a program to decide whether a certain candidate would be a good worker for our company. We want to use an automated system because we don’t want human weaknesses and prejudices to affect our hiring process! Right?

So we start collecting data about the candidates with a questionnaire. We fill it with common-sense questions. For simplicity, assume we have just three: “How good was your school curriculum?”, “Have you ever had problems with the law?” and “Have you ever missed a payment to your creditors?”. They seem like good questions. After all, our model is quite clear: being good at school, being a good citizen and paying debts on time are clearly variables correlated with the variable “is a good worker”.

This is the causality diagram for the “job candidate” example. The three sampling variables “School”, “Law Problems” and “Debts” are not independent: the hidden variable “Race” is a common cause of all of them. This has huge implications for the fairness of the model.

However, after some time we discover that the system is hiring mostly middle-class white men. Apparently, being a white man is directly correlated with being a good worker. That makes no sense. What we are seeing is the effect of confounding hidden variables. Even though we never included race as an explicit variable, it can still affect all the other variables we are sampling. On average, race influences the wealth of your family. In turn, this affects the quality of your education, the neighborhood you grew up in and how much you get targeted by the police.

The overall effect is that you are screwing individuals on the basis of the indirect effect of their race on the average outcome of your sampling variables.
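To make this concrete, here is a minimal simulation sketch of the questionnaire example. Everything in it is an assumption made for illustration: the two groups "A" and "B", the size of the disadvantage, the probabilities and the "at least two good answers" hiring rule are invented, not taken from real data. The hiring rule never looks at the group, and actual ability is drawn the same way for everybody, yet the hire rates of the two groups come out very different.

```python
import random


def simulate_candidates(n=10_000, seed=0):
    """Generate hypothetical candidates whose answers depend on a hidden group."""
    rng = random.Random(seed)
    candidates = []
    for _ in range(n):
        group = rng.choice(["A", "B"])  # hidden variable: the model never sees it
        # Assumption: group B faces a structural disadvantage that lowers the
        # chance of a "good" answer on all three questions.
        disadvantage = 0.0 if group == "A" else 0.25
        candidates.append({
            "group": group,
            "good_school": rng.random() > 0.3 + disadvantage,
            "no_law_problems": rng.random() > 0.1 + disadvantage,
            "no_missed_payments": rng.random() > 0.2 + disadvantage,
            # Assumption: real ability does not depend on the group at all.
            "ability": rng.random(),
        })
    return candidates


def hire(candidate):
    """Toy hiring rule: at least two 'good' answers. The group is never used."""
    score = (candidate["good_school"]
             + candidate["no_law_problems"]
             + candidate["no_missed_payments"])
    return score >= 2


def hire_rate(candidates, group):
    """Fraction of candidates in the given group that the toy rule hires."""
    members = [c for c in candidates if c["group"] == group]
    return sum(hire(c) for c in members) / len(members)


def mean_ability(candidates, group):
    """Average hidden ability of the given group."""
    members = [c for c in candidates if c["group"] == group]
    return sum(c["ability"] for c in members) / len(members)


if __name__ == "__main__":
    people = simulate_candidates()
    for g in ("A", "B"):
        print(g, "hire rate:", round(hire_rate(people, g), 3),
              "mean ability:", round(mean_ability(people, g), 3))
```

Removing the group column from the data is clearly not enough: the three answers act as proxies for it.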

But wait, it gets worse.

The second problem with this kind of model is that it is self-validating. If we use this model to select good job candidates, some people will have fewer job opportunities and therefore less money and, in the end, less chance of paying their debts on time. That pushes them and their families into bad neighborhoods with more law problems and worse schools. In short, the model amplifies the same issues it got wrong in the first place and, in doing so, validates itself. It is a self-reinforcing feedback loop, a problem I call retrocausality here.
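The loop described above can be sketched with a toy, fully deterministic simulation. The update rule is my own assumption, chosen only to show the mechanism: each group is hired in proportion to its current share of “good” answers, and being hired more (or less) often than the average pushes that share up (or down) in the next round. The starting rates and the `strength` parameter are invented.

```python
def clamp(x, low=0.0, high=1.0):
    """Keep a rate inside the [low, high] interval."""
    return max(low, min(high, x))


def simulate_feedback(rounds=10, rate_a=0.8, rate_b=0.6, strength=0.1):
    """Return the gap between the two groups' 'good answer' rates, round by round."""
    gaps = []
    for _ in range(rounds):
        gaps.append(rate_a - rate_b)
        # Assumption: the model hires each group in proportion to its
        # current share of good answers.
        hired_a, hired_b = rate_a, rate_b
        baseline = (hired_a + hired_b) / 2
        # Assumption: being hired more often than the baseline improves the
        # next round's answers; being hired less often worsens them.
        rate_a = clamp(rate_a + strength * (hired_a - baseline))
        rate_b = clamp(rate_b + strength * (hired_b - baseline))
    return gaps


if __name__ == "__main__":
    for round_number, gap in enumerate(simulate_feedback()):
        print(round_number, round(gap, 3))
```

In this sketch the initial gap never closes. It grows every round, and each new round of data seems to confirm that the model was right to prefer the first group.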

Sometimes we in the AI community get too caught up in the charm of our models, and we forget the effect such models can have on people. Machine learning is not immune to these problems. It can learn the world’s inequalities and use them to confirm its internal model. And when we apply those machine learning algorithms, we contribute to amplifying such inequalities.

This and many other problems are discussed in the book “Weapons of Math Destruction: How Big Data Increases Inequality and Threatens Democracy” by Cathy O’Neil. While I think the book is overly pessimistic in some parts, it is a good read for looking at statistical modeling from a different angle. It definitely helped me consider the implications of bad mathematical modeling for people’s lives. Sure, the book often focuses too much on the risks of models relative to their benefits. But when we are talking about people’s lives, I think even one innocent victim is one too many.

Update 19th September 2017

Some days ago a friend of mine showed me a recent real-life example of what I described here. I said that machine learning is not immune to discriminatory biases. As an example, let’s look at this tweet (image copy):

This image shows what a machine learning algorithm (word2vec) learns when trained on a Google News corpus. In particular, it shows which adjectives are associated with the word “he” and which with the word “she”. As you can see, we are in stereotype-land!
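For readers who wonder how such an association can be measured, here is a minimal sketch. It compares the cosine similarity of a word vector to the vector for “he” and to the vector for “she”. The three-dimensional vectors and the names `word_x` and `word_y` are made up for illustration; they are not real word2vec output, and this is not necessarily the method used to produce the image in the tweet.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors of the same length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Made-up toy vectors, for illustration only (NOT real word2vec embeddings).
toy_vectors = {
    "he": [1.0, 0.1, 0.3],
    "she": [-1.0, 0.1, 0.3],
    "word_x": [0.6, 0.8, 0.1],   # hypothetical adjective
    "word_y": [-0.5, 0.7, 0.2],  # hypothetical adjective
}


def gender_lean(word, vectors):
    """Positive if the word sits closer to 'he', negative if closer to 'she'."""
    return cosine(vectors[word], vectors["he"]) - cosine(vectors[word], vectors["she"])


if __name__ == "__main__":
    for word in ("word_x", "word_y"):
        print(word, round(gender_lean(word, toy_vectors), 3))
```

With real embeddings, the vectors would come from a trained model instead of a hand-written dictionary, and sorting the vocabulary by this score would give two lists like the ones in the image.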

The point is that the algorithm is trained on an already biased world, and therefore learns to be biased itself. It is just math and algorithms, but it is sexist. If we are not aware of this possibility and we apply such ML algorithms, we may end up amplifying the very inequalities we are trying to avoid by using math and algorithms!