Ranking Systems 01 - What is Skill

A ranking system is a core element of every competitive game. Every game in which players compete against each other needs some way to rank them and compare their relative skill. This holds for video games, but also for sports (e.g., the FIFA ranking for football) and other games (e.g., the FIDE chess ranking). Comparing things is part of human nature.

However, not all ranking systems are alike. While every system needs to measure the player’s skill, if you apply the system used in chess to Magic: The Gathering, you will probably end up with a very bad system. Different games follow different rules and, therefore, a player’s skill must be measured with a system that suits those rules.

Automatic skill rating is a topic I have always been interested in; after all, it is very close to machine learning, but applied to something as intangible and mutable as “skill”. Now it is time to share that knowledge. In this small series we will explore the math and the design choices behind commonly used ranking systems, so that you will be able to avoid the most common mistakes and appreciate the difficulty of the task.

Why Do We Use Ranking Systems?

This is a good question. After all, when playing at home (card games, football in the street, and so on) we never spent much time doing math to evaluate skill. Nevertheless, we measured skill all the time. When I was a kid playing football with my friends, everyone knew who was a strong player and who was a weak one. When we formed teams, we used to spread the strong players across the two teams. We were kids, but we already knew one of the most important rules of game design: very unbalanced games are fun only for one team (and not for long).

Of course, at the time we did not use math. We used intuition and experience (it is easy to see that a kid scoring 10 goals in a single match is on the “strong side”). When we need to rank thousands of players precisely and automatically, however, math becomes a necessity. Still, the goals of skill evaluation are always the same:

To provide the players with a tangible way to measure their own performance (internal motivation).
To pair players against each other so that matches are not too unbalanced (and, therefore, not fun for one party and boring for the other) (external motivation).

The Challenge of Measuring Skill

Now we will start with some definitions. This will get a bit formal, but do not worry: it is only meant to define the scope of the problem from a mathematical point of view. It will not be required in the next articles.

The first question is how to define an abstract concept such as skill. Formally speaking, we can imagine that each player has a hidden variable representing their true skill. Given a player A, we call the true skill of that player \( \hat{s}_A \).

As you can imagine, measuring this value directly is often impossible. For simple sports in which each player competes alone, skill and ranking are often based on the performance itself (e.g., in running, it is easy to see that “the best runner” is the one with the best time). However, in games involving interaction between two or more players, skill is much vaguer. How can you tell exactly how good someone is at chess? You can’t.

Instead, we need to treat skill as a random variable \( S_A \) distributed according to a probability density function \( \theta_{A}(s) \) that represents our guess about the player’s skill. If we are fairly sure about the player’s skill level, then \( \theta_{A}(s) \) will be concentrated around the true skill value; if we are unsure, then \( \theta_{A}(s) \) will be spread across the possible values.

Figure 1. In this example, player A is expected to be less skilled than B (the maximum of A’s bell comes before the maximum of B’s curve); at the same time, the skill level of A is more uncertain (A’s bell is much wider than B’s).
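To make the two curves described in Figure 1 concrete, here is a minimal sketch that models each belief as a Gaussian. The Gaussian shape and all the numbers are assumptions chosen for illustration; they are not read from the figure.

```python
from statistics import NormalDist

# Hypothetical beliefs about two players (illustrative numbers only).
belief_a = NormalDist(mu=20.0, sigma=8.0)  # lower expected skill, wide bell
belief_b = NormalDist(mu=30.0, sigma=2.0)  # higher expected skill, narrow bell


def central_interval(belief, mass=0.95):
    """Return the central interval that contains `mass` of the belief."""
    tail = (1.0 - mass) / 2.0
    return belief.inv_cdf(tail), belief.inv_cdf(1.0 - tail)


for name, belief in (("A", belief_a), ("B", belief_b)):
    low, high = central_interval(belief)
    print(f"{name}: expected skill {belief.mean:.1f}, "
          f"95% of the belief lies in [{low:.1f}, {high:.1f}]")
```

The wider interval for A is the numerical counterpart of the wider bell: we expect A to be weaker than B, but we are much less sure about where A’s skill really sits.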

Measuring skill, then, means having a way to collapse this probability distribution around the true skill value on the basis of the player’s measured performance. Algorithms that do this are called rating systems and, technically speaking, they are nothing more than statistical estimators.

As we will see later, most of the time the “measured performance” is just the outcome of a match against another player. This is intuitively a good way to measure skill: a skill measure should be able to statistically predict the winner of a game, because we assume that, in general, the more skilled player should beat the less skilled one.
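As a rough illustration of how a skill belief can turn into a prediction, the sketch below computes the probability that one player is more skilled than another. It assumes independent Gaussian beliefs and treats “more skilled” as a stand-in for “wins”, which ignores luck within a single match. The function name and the numbers are hypothetical.

```python
from math import hypot
from statistics import NormalDist


def prob_a_more_skilled(mu_a, sigma_a, mu_b, sigma_b):
    """P(S_A > S_B) for independent Gaussian skill beliefs.

    The difference of two independent Gaussians is itself Gaussian, with
    mean mu_a - mu_b and standard deviation sqrt(sigma_a^2 + sigma_b^2).
    """
    difference = NormalDist(mu_a - mu_b, hypot(sigma_a, sigma_b))
    return 1.0 - difference.cdf(0.0)


# Illustrative beliefs: A looks weaker than B, but A's skill is very uncertain.
p = prob_a_more_skilled(mu_a=20.0, sigma_a=8.0, mu_b=30.0, sigma_b=2.0)
print(f"Probability that A is the more skilled player: {p:.2f}")
```

Note that the uncertainty matters as much as the expected value: the wider the beliefs, the closer the prediction drifts toward a coin flip.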

This gives us a simple way to learn skill over time. Suppose we have a two-player zero-sum game in which player A and player B face each other in 10 matches. If our model predicts that player A should win seven times (and thus player B only three), but we observe that player A wins only five times (with player B winning the other five), we may reasonably adjust our skill guesses so that player A’s skill is a bit lower than before (because A performed worse than expected) and player B’s skill is a bit higher than before (because B performed better than expected).
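The idea in the 10-match example can be sketched in a few lines. This is only a toy update rule, not any specific rating system: the function name, the starting skill values, and the `step` parameter (how strongly we react to a surprise) are all assumptions made for illustration.

```python
def nudge_skills(skill_a, skill_b, expected_wins_a, observed_wins_a, step=0.5):
    """Move both skill guesses in proportion to how surprising the results were.

    A negative surprise means A won fewer matches than predicted, so A's
    guess goes down and B's goes up by the same amount.
    """
    surprise = observed_wins_a - expected_wins_a
    return skill_a + step * surprise, skill_b - step * surprise


# The model expected A to win 7 of 10 matches, but A won only 5.
new_a, new_b = nudge_skills(skill_a=1500.0, skill_b=1400.0,
                            expected_wins_a=7, observed_wins_a=5)
print(new_a, new_b)  # 1499.0 1401.0
```

Because whatever A loses is given to B, the sum of the two guesses stays constant. Choosing how large `step` should be is one of the design questions a real rating system has to answer.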

This seems easy. However, there are a lot of unanswered questions:

How can we update the skill prediction?
What if we have more than two players? What if we have teams? How can we infer each team member’s individual skill from a set of team victories?
What if we have randomness?
What if skill changes over time?
What if there are roles? Think about MOBAs: a player may be better as a carry than as a support, so skill is no longer a scalar but a more complex structure.

We will explore all of these challenges in the following articles.

Conclusion

Now that we have explored the basic theory behind ranking and skill estimation, it is time to start exploring existing rating systems. In the next article, we will start with the grandfather of rating systems, the Elo system, and we will explore its motivation, its implementation, and its drawbacks. Stay with me. See you next time!
