Digital linguists use algorithms to comb through huge collections of text and see what patterns they can detect. Without human guidance, data-driven language models can search through large numbers of Wikipedia articles, news articles and other digital texts.
In its own ingenious way, Andrey Kutuzov's algorithm can determine which words change meaning over time.
Kutuzov, a postdoctoral fellow at the Department of Informatics, University of Oslo, was impressed when his language model showed that it understood something important about the Indian rebel group United Liberation Front of Assam.
“The model knows, in some definition of knowing, that Taliban and Afghanistan are in the same relation as the United Liberation Front of Assam is to India”, Kutuzov says.
It may not sound too impressive, but the model has found this out just by training itself on the texts it has access to.
“This is really a very subtle knowledge. This semantic relationship is not very clear”, Kutuzov says to Titan.uio.no.
He already knew that such language models capture that the words "sister" and "brother" are related to each other in much the same way as "mother" and "father".
“But it was quite surprising that even the information about which armed groups are active in each particular region can be inferred from the model, given that the models are just trained on sequences of words”, Kutuzov says.
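Such analogies are commonly illustrated with simple vector arithmetic. The following is a minimal sketch, not Kutuzov's own code: the two-dimensional vectors are invented for illustration, and the helper names `cosine` and `analogy` are hypothetical. Real models learn vectors with hundreds of dimensions from text.

```python
import numpy as np

# Toy vectors, made up for this example. A trained model would supply these.
vectors = {
    "brother": np.array([1.0, 0.0]),
    "sister":  np.array([1.0, 1.0]),
    "father":  np.array([3.0, 0.0]),
    "mother":  np.array([3.0, 1.0]),
    "tea":     np.array([0.0, 5.0]),
}

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def analogy(a, b, c, vectors):
    """Find d such that a is to b as c is to d, using the offset b - a + c."""
    target = vectors[b] - vectors[a] + vectors[c]
    # Exclude the three query words so the answer is a new word.
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(vectors[w], target))

print(analogy("brother", "sister", "father", vectors))
```

With these toy numbers the offset from "brother" to "sister" is the same as the offset from "father" to "mother", so the sketch picks "mother".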
Words change their meaning
Language changes all the time, and in his doctoral thesis Kutuzov has tried to see if it is possible to automatically detect which words change meaning over time. He must enable the machines to solve such a task on their own.
“Usually it is formulated as a ranking task. From a list of for example 100 words, which ones have changed more than others?”
“If we compare the English language of the 19th century and the English language of the 21st century, the word “cell” – as an example – has definitely experienced a strong semantic change. In the 20th century it acquired the sense of the biological cell, while in the previous century it mostly meant a monastery cell or a prison cell. Now it also has the sense of the mobile phone.”
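The ranking task can be sketched in a few lines. This is an illustration under stated assumptions rather than the method of the thesis: each word is assumed to already have one vector per period in a shared vector space, the numbers are made up, and `rank_by_change` is a hypothetical helper.

```python
import numpy as np

# Hypothetical vectors for the same words in two periods, assumed to be
# directly comparable (same vector space). All values are invented.
vectors_1800s = {
    "cell": np.array([1.0, 0.0]),
    "tea": np.array([0.0, 1.0]),
    "broadcast": np.array([1.0, 1.0]),
}
vectors_2000s = {
    "cell": np.array([0.0, 1.0]),
    "tea": np.array([0.1, 1.0]),
    "broadcast": np.array([1.0, 0.2]),
}

def cosine_distance(a, b):
    """1 minus cosine similarity: 0 means same direction, larger means more change."""
    return 1.0 - float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def rank_by_change(words, old, new):
    """Sort words from most changed to least changed between two periods."""
    return sorted(words, key=lambda w: cosine_distance(old[w], new[w]), reverse=True)

ranking = rank_by_change(["tea", "cell", "broadcast"], vectors_1800s, vectors_2000s)
print(ranking)
```

With these toy vectors, "cell" ends up at the top of the list and "tea" at the bottom.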
This does not mean that the algorithms are able to define the meaning of the words. They have another way of finding such connections.
Meaning is use
Kutuzov follows the tradition of Ludwig Wittgenstein, which holds that the meaning of a word is determined by how it is used. As a language technologist, Kutuzov is not looking to define which uses are right or wrong.
“We just look at the data without saying that something is correct or incorrect. We just observe.”
“If we assume that the meaning of words can be inferred from the way words are used, we have lots of examples of how words are used. Large amounts of written text can be acquired from the internet. We have lots of data”, Kutuzov says.
The first step is to make the algorithm look for words that often appear together or near each other.
“To simplify a little, we just look at which words are to the left and to the right of these words and what the frequencies are.”
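Counting neighbours to the left and right can be sketched as follows. The three sentences, the window size of three words and the function name `cooccurrence_counts` are assumptions made for this example. Real systems work on millions of sentences.

```python
from collections import Counter, defaultdict

def cooccurrence_counts(sentences, window=3):
    """For each word, count the words appearing within `window` positions of it."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, word in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:  # skip the word itself
                    counts[word][tokens[j]] += 1
    return counts

# A tiny invented corpus.
sentences = [
    "a hot cup of tea",
    "a hot cup of coffee",
    "a cold glass of juice",
]
counts = cooccurrence_counts(sentences)
print(counts["tea"].most_common())
```

Here "tea" is counted once each with "hot", "cup" and "of", and never with "glass" or "cold".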
Tea or coffee?
The models will quickly discover that the word "tea", for example, often has "cup" or "hot" near it. The same, of course, is true of the word "coffee".
“We can only say that the word “tea” is more similar to the word “coffee” than to the word “juice”. The models capture that tea is something we drink and that it is normally served hot.”
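One common way to turn such neighbour counts into a similarity score is cosine similarity. The sketch below uses invented counts and a hypothetical helper `cosine_similarity`. It is meant only to show why "tea" comes out closer to "coffee" than to "juice", not to reproduce any particular model.

```python
import math
from collections import Counter

# Invented counts of how often each context word appears near the target word.
contexts = {
    "tea":    Counter({"cup": 8, "hot": 6, "drink": 5, "leaves": 3}),
    "coffee": Counter({"cup": 9, "hot": 5, "drink": 6, "beans": 4}),
    "juice":  Counter({"glass": 7, "cold": 4, "drink": 6, "orange": 5}),
}

def cosine_similarity(a, b):
    """Cosine similarity between two sparse count vectors stored as Counters."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b)

tea_coffee = cosine_similarity(contexts["tea"], contexts["coffee"])
tea_juice = cosine_similarity(contexts["tea"], contexts["juice"])
print(tea_coffee > tea_juice)
```

Because "tea" and "coffee" share "cup", "hot" and "drink", while "tea" and "juice" share only "drink", the first score is clearly higher.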
Maybe they will see the difference between tea and coffee if they also check for words like "leaves" and "beans", but Kutuzov is not really interested in finding a correct description of what tea is. When he has found these frequencies for how often two words occur together, he can compare different eras.
Words like "tea", "coffee" and "cup" have hardly changed that much. But with the aforementioned "cell", the story is quite different.
“The same can be said of the word “broadcast”. It used to mean to broadcast seeds in the ground. Nowadays we broadcast television programmes or Netflix series”, Kutuzov says.
“We can compare changes in word usage and hope that changes in usage mean changes in word meaning. Of course, it’s difficult to establish what is the cause and what is the consequence. Do words change their meaning because of usage or is the usage changing because the meaning is changed?”
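Comparing usage between eras can be illustrated with a deliberately crude measure: how little the sets of neighbouring words overlap. The two mini-corpora, the window of two words and the helpers `context_counts` and `jaccard_distance` are assumptions for this sketch. The thesis itself works with distributional word embeddings, not with this measure.

```python
from collections import Counter

def context_counts(sentences, target, window=2):
    """Count the words found within `window` positions of `target`."""
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, tok in enumerate(tokens):
            if tok == target:
                lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
                counts.update(tokens[j] for j in range(lo, hi) if j != i)
    return counts

def jaccard_distance(a, b):
    """1 minus the share of context words that the two eras have in common."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    if not union:
        return 0.0
    return 1.0 - len(set_a & set_b) / len(union)

# Invented example sentences standing in for two eras.
old_texts = ["the monk prayed in his cell", "the prisoner sat in his cell",
             "she drank a cup of tea"]
new_texts = ["the cell divides under the microscope", "my cell phone rang",
             "he drank a cup of tea"]

change = {
    word: jaccard_distance(context_counts(old_texts, word),
                           context_counts(new_texts, word))
    for word in ("cell", "tea")
}
print(change)
```

In this toy data the neighbours of "cell" are entirely different in the two eras, while those of "tea" are identical. Whether such a shift in usage reflects a change in meaning is exactly the open question raised in the quote.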
Kutuzov has created a web service where you can play with distributional word embeddings for English and Norwegian: WebVectors
Digital humanities and Culturomics
Dictionaries are becoming more and more data-driven. Their authors like to check how often individual words appear in large collections of text and find examples of use for each word. However, Kutuzov's algorithms are not yet ready to provide satisfactory definitions for such use.
“Maybe in the future we will see dictionaries in which entries are being created using some of our language models. That would be nice”, Kutuzov says.
But there is no lack of applications.
“If you are doing machine translation it is easy to argue for the usefulness”, Kutuzov says.
If you do a Google search for a word that is not very common, technology of this kind is probably already in place to help you. If you get few hits, the search engine will also show pages that contain words that often appear with your keyword.
“If you search for “university”, Google will probably return documents containing the words “student” and “professor” if there aren't enough documents with the word “university”.”
In relatively recent research fields such as the digital humanities and Culturomics, studies such as Kutuzov's will be very useful.
“If a word is becoming more frequent, this word is probably becoming more important in culture. Our work gives these people a much more powerful tool to study changes in general culture and society”, Kutuzov says.
Historical data and modern AI research
“The use of computers to track changes in word meaning has been the subject of increasing interest in language technology in the last ten to twelve years. Especially when it comes to how words change meaning over time”, says Professor Lilja Øvrelid.
“The interest in this phenomenon is not limited to the field of language technology, but extends to subjects such as history, linguistics and digital humanities”, Øvrelid says to Titan.uio.no.
She describes Kutuzov's research as a very important contribution to this field of research.
“He has systematically compared a number of factors that can contribute to modelling of change in language based on machine learning. Kutuzov has used the very latest neural machine learning methods and thus connects historical data to modern AI research”, Øvrelid says.
Andrey Kutuzov: Distributional word embeddings in modeling diachronic semantic change, Department of Informatics, Faculty of Mathematics and Natural Sciences, University of Oslo, 2020.