
On the procedural generation of a proto-language

As you probably know, last week I was at the DiGRA-FDG conference in Dundee, Scotland. The conference is over, but I feel the urge to add something to a really interesting presentation I attended on the first day, at the workshop on Procedural Content Generation.

Before FDG, I was reading a small book on how the Chinese language has evolved over the centuries. It is extremely interesting, especially if we look at the clear and wonderful evolution from pictures on paleolithic wood artifacts to the modern Mandarin language. This gave me the idea to explore the literature on teleological algorithms for the procedural generation of languages (in case you don’t know, teleological algorithms in PCG are the ones that try to generate a “thing” by simulating the physical and evolutionary processes through which the “thing” is generated in the Real World™).

I was thinking about that just two days before the conference. So you can imagine my surprise when the first presentation I attended was titled “Diegetically Grounded Evolution of Gameworld Languages”.

This short paper describes very effectively a simulation of some aspects of how languages evolve in the real world. It is inspiring and well worth reading! However, I feel that the thing I was interested in was not really covered. So, I’ll try to dump my brain in this article.

What we have.

First of all, I will put a link to the paper here. I will also put a link to the Twitter account of the paper’s author, James Ryan (by the way, he presented so much cool stuff at this conference).

I don’t want to go deep into the details of the paper because I have not had time to study it closely, and I would surely say something terribly wrong. But I can give you the general idea.

In the paper, languages are abstract entities (in fact, they are represented by binary strings with no clear connection to human languages). These binary strings “evolve” as the people who speak them go through their lives: for instance, when they grow up, travel, meet other people who speak other languages, and so on.

I think the paper is extremely on point in simulating how languages evolve from a macroscopic point of view. However, when I read about the old Chinese wood inscriptions, I had another problem in mind: how is a language born? And if we knew the answer, could we use that knowledge to generate a language “from scratch”?

In practice, before starting with the evolution of languages, we need to generate a “proto-language”.

How are languages born?

Figure 1. One of the earliest pieces of evidence of the Chinese language.

Creating this proto-language from scratch is hard because we don’t know how languages are born. There is no theory that can explain how a bunch of super-intelligent monkeys went from “noise” to “language”. There must have been a singular moment in which the first word was spoken, but we don’t know when or how that happened. For historical linguists, this is still a very open research field. The only thing we can say is that the language development process consists of three phases.

From Noise to Words

In the first phase we need to take the huge step from animal cries to proper primordial words. This is probably the most mysterious phase. Looking through the literature on this topic, I found just a bunch of speculative theories with really funny names. For instance, and I’m not kidding, we can find the bow-wow theory (language as imitation of the cries of beasts), pooh-pooh (language as exclamations of primordial feelings like anger, pain and pleasure), ding-dong (language as imitation of the resonance of the objects of the world), yo-he-ho (language as the attempt to express rhythm in collective activities) and ta-ta (language as the imitation of manual gestures with the tongue).

This is funny! But it can also give us some insight into how languages are born, probably through a combination of all the theories above.


After the initial words, a language passes through a process called grammaticalization. This is a never-ending process in which word usage is clustered into commonly accepted rules. As an example, suppose a primitive population in which the word for “wolf” is represented by the sound “X”. When there are two wolves, there is the need to express the concept of plurality. The population could decide to add a prefix or another sound to the word (something like “Y X”), or use a suffix (“X Y”, like in English with the “s”), or repeat the sound twice (“X X”, like in Chinese).

Probably, at the beginning of a language, there is no “winner”, but as soon as one of these three formalisms becomes the dominant one, it becomes a proto-grammar rule for the proto-language.
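
To make the “no winner at first, then one dominant rule” idea concrete, here is a minimal sketch in Python. It is only an illustration under strong assumptions: the three strategies, the population size and the copying rule (a random listener simply adopts the strategy of a random speaker, as in a voter model) are all hypothetical choices, not something taken from the linguistics literature.

```python
import random
from collections import Counter

# Three hypothetical ways of marking the plural of a word "X" with a marker "Y".
STRATEGIES = {
    "prefix": lambda word, marker: f"{marker} {word}",
    "suffix": lambda word, marker: f"{word} {marker}",
    "reduplication": lambda word, marker: f"{word} {word}",
}

def simulate_rule_competition(population_size=50, max_steps=100_000, seed=0):
    """Voter-model style competition between plural strategies.

    Every speaker starts with a random strategy. At each step a random
    listener copies the strategy of a random speaker. Returns the strategy
    everybody ends up sharing (or the most common one if max_steps runs
    out) together with the number of steps taken.
    """
    rng = random.Random(seed)
    names = list(STRATEGIES)
    population = [rng.choice(names) for _ in range(population_size)]
    for step in range(max_steps):
        if len(set(population)) == 1:  # one rule has become the norm
            return population[0], step
        speaker, listener = rng.sample(range(population_size), 2)
        population[listener] = population[speaker]
    return Counter(population).most_common(1)[0][0], max_steps

winner, steps = simulate_rule_competition()
plural = STRATEGIES[winner]("X", "Y")
print(f"'{winner}' became the rule after {steps} steps: two wolves = '{plural}'")
```

Changing the seed changes which strategy wins, which is the point: in this toy model no formalism is better than the others, and the “rule” is just whichever one happens to spread first.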

From sounds to writing

Finally, there is writing. This is the process through which a population learns to represent the words of its own language in a graphical way. This process is a bit clearer and better studied. We go from a direct representation of the object the word refers to (hieroglyphs), to a more abstract and stylized representation (ideograms), to the ability to represent the “sounds” of the word instead of the word itself.

The way writing evolves and interacts with neighboring cultures is worth an article by itself!

Ok, cool. But how can I model this?

I have not had much time (after all, I just came back to Italy two days ago), but I sketched a very basic script for the bow-wow approach that kinda works (with some push). This is the basic idea:

  1. I create a set of symbols. These symbols represent a set of phonemes. They are the basic sound units of the world. I used a subset of the International Phonetic Alphabet.
  2. I put many animals in the world. Every animal has its own set of phonemes representing its cry.
  3. When a man goes hunting and sees an animal, he has to alert his companions. The basic idea here is that the more “similar” the sound is to the cry of the animal, the more effective the alert is.
  4. There are some basic rules to add some challenge. If you want to eat an animal you have to say its name or you die of starvation. If a predator (such as a tiger or a wolf) is near, you have to alert your friends or you’ll be eaten and the language is gone.
  5. After some generations, the successful populations will have “words” for each relevant animal in the world.
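
The steps above can be sketched roughly as follows. This is not the original script, just a minimal reconstruction of the idea under stated assumptions: the phoneme symbols, the animals and their cries, the position-by-position similarity measure and the “keep the best half of the tribes” selection are all hypothetical. The starvation and predator rules are collapsed into a single selection step: tribes whose alerts sound less like the animals’ cries do not survive to the next generation.

```python
import random

# Stand-in symbols for phonemes (an assumption, not the actual IPA subset).
PHONEMES = list("aeioubdgkmnrsw")

# Hypothetical animals and their cries, each a sequence of phonemes.
ANIMAL_CRIES = {
    "wolf": list("auu"),
    "tiger": list("gra"),
    "bird": list("kiw"),
}

def similarity(word, cry):
    """Fraction of positions where the word matches the cry."""
    return sum(w == c for w, c in zip(word, cry)) / len(cry)

def fitness(lexicon):
    """Total alert effectiveness of a tribe's lexicon over all animals."""
    return sum(similarity(lexicon[a], cry) for a, cry in ANIMAL_CRIES.items())

def random_lexicon(rng):
    """A tribe starts with random noises of the same length as each cry."""
    return {a: [rng.choice(PHONEMES) for _ in cry] for a, cry in ANIMAL_CRIES.items()}

def mutate(lexicon, rng):
    """Copy a lexicon and replace one phoneme of one word at random."""
    child = {a: word[:] for a, word in lexicon.items()}
    animal = rng.choice(list(child))
    position = rng.randrange(len(child[animal]))
    child[animal][position] = rng.choice(PHONEMES)
    return child

def evolve(tribes=20, generations=200, seed=0):
    """Keep the better half of the tribes each generation, refill with mutated copies."""
    rng = random.Random(seed)
    population = [random_lexicon(rng) for _ in range(tribes)]
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        survivors = population[: tribes // 2]
        population = survivors + [mutate(tribe, rng) for tribe in survivors]
    return max(population, key=fitness)

best = evolve()
for animal, word in best.items():
    print(animal, "->", "".join(word), "(cry:", "".join(ANIMAL_CRIES[animal]) + ")")
```

Because the surviving tribes are carried over unchanged, the best lexicon never gets worse from one generation to the next, so the words can only drift toward the cries they imitate.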

Reality is way more complex than this. Moreover, the bow-wow theory alone is not enough to explain complex languages. But for a 30-minute script in a hotel room, I think it is a good starting point. :)

I hope to see more experiments in this direction. I would really like to see more teleological algorithms for language generation. Imagine a future in which a game generates a language with a full grammar and vocabulary. A language whose only “native” speakers are the in-game characters! That would be really awesome. :)
