– We have to pave the way for non-commercial research on language technology. It is too important a field to be left to the computer industry alone, says Professor Stephan Oepen, head of the “language technology group” at the University of Oslo.
Oepen believes this is a fight for the democracy of the internet, a fight for net diversity, a fight to preserve language and language technology and – in the end – a fight to be able to analyze the knowledge gathered in the Norwegian and Nordic internet archives – what you might call the “wisdom of the crowd”.
The Common Crawl Foundation is an international non-profit organization that wishes to restore the “democracy of the internet” by making the vast amount of data on the internet available to the public.
Common Crawl is also trying to tear down the technological barriers that prevent non-commercial scientists from using the internet’s raw data.
Oepen and his colleagues now wish to create a Nordic collaboration involving Common Crawl to preserve and study the Nordic languages.
Siri and Google Translate make us laugh
Sometimes we get a really good laugh using Google Translate or the speech recognition service Siri on our iPhones.
But comprehensive research on gigantic quantities of text, advanced statistical models and heavy computation have made these services substantially more intelligent than they used to be.
And they will get even better. Oepen explains that research in language technology is roughly 10 years ahead of the technology we use today.
How to interpret "rett"?
One simple explanation of why computers struggle with language is that a word can have several meanings.
The Norwegian word “rett” can mean a number of things. Translated into English it can mean “court”, “correct”, “dish”, “course”, “entitlement” or “straight”.
The correct translation is something you find by interpreting the context. If, for example, the word is used in a sentence with pancakes or jam, it is probably correctly translated as “dish”. Likewise, if it appears in a sentence with lawyer and procedure, we are probably looking for the word “court”.
A human being who knows both languages sees this automatically. A computer does not – it needs enormous quantities of data and heavy computation to reach the same conclusion.
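The idea of choosing a translation from the surrounding words can be sketched in a few lines of Python. This is a toy illustration only: the cue-word lists and the function name `translate_rett` are assumptions made up for this example, and the context words are given in English for readability. A real system would learn such associations from very large amounts of text rather than from a hand-written table.

```python
# Toy illustration only: the cue words below are hand-picked assumptions,
# whereas a real system learns such associations from large text collections.
CUE_WORDS = {
    "dish": {"pancakes", "jam", "dinner", "menu"},
    "court": {"lawyer", "procedure", "judge", "verdict"},
}


def translate_rett(context_words):
    """Pick the English sense of "rett" whose cue words overlap most with the context."""
    context = {word.lower() for word in context_words}
    scores = {sense: len(cues & context) for sense, cues in CUE_WORDS.items()}
    best = max(scores, key=scores.get)
    # With no overlapping cue words there is no evidence for any sense.
    return best if scores[best] > 0 else None
```

Called with `["pancakes", "jam"]` the function returns `"dish"`, and with `["lawyer", "procedure"]` it returns `"court"`. When the context contains none of the cue words, it returns `None` instead of guessing.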
"Recognize speech" or "wreck a nice beach"
Understanding the spoken word is even more demanding than understanding text. The sentences “It's hard to recognize speech” and “It's hard to wreck a nice beach” may sound deceptively similar when spoken out loud, but their meanings are totally different.
However, what is said around the sentence increases the probability that the computer will understand what the person is saying.
Even so, different dialects and the speaker's personal style can still fool the machine.
Heavy computation and analysis
In both translation and speech recognition (and other services involving language technology), the surrounding words are thoroughly analyzed, both in the sentence itself and in the sentences before and after it.
On top of this come comprehensive probability calculations.
For these calculations to come out right, you need an enormous amount of good-quality text.
The juggernauts of the IT industry, like Google and Facebook, handle gigantic quantities of data on their servers every day. Among other things, this data is used to develop new and better language technology.
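One common way to do this kind of probability calculation is a bigram model, which estimates how likely each word is given the word before it. The sketch below is not taken from Oepen's group; the four-sentence training text, the function names and the simplified add-one style smoothing are all assumptions chosen to keep the example small.

```python
import math
from collections import Counter

# Tiny made-up training text (an assumption for this sketch);
# a real model needs enormous amounts of good-quality text.
CORPUS = [
    "it is hard to recognize speech",
    "computers recognize speech badly",
    "we walked on a nice beach",
    "speech is hard to recognize",
]


def train_bigrams(sentences):
    """Count single words and adjacent word pairs, with <s> marking sentence start."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in sentences:
        words = ["<s>"] + sentence.split()
        unigrams.update(words)
        bigrams.update(zip(words, words[1:]))
    return unigrams, bigrams


def log_probability(sentence, unigrams, bigrams):
    """Sum log P(word | previous word), smoothed so unseen pairs get a small score."""
    vocab_size = len(unigrams)
    words = ["<s>"] + sentence.split()
    total = 0.0
    for prev, word in zip(words, words[1:]):
        total += math.log((bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab_size))
    return total


UNIGRAMS, BIGRAMS = train_bigrams(CORPUS)
```

With this training text, `log_probability("it is hard to recognize speech", UNIGRAMS, BIGRAMS)` is higher than the score for `"it is hard to wreck a nice beach"`, because the word pairs in the first sentence have been seen before. The ranking depends entirely on the text the model was trained on, which is why the amount and quality of data matter so much.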
Hard to find data of their own
– None of this data is accessible to the language technology experts at UiO or to other non-commercial researchers, says Oepen.
Oepen and his colleagues have to develop their own systems that “crawl” around the net and pick out high-quality Norwegian texts. Furthermore, the data has to be organized and stored before it can become part of the research on, for example, a Norwegian language model.
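The step of picking Norwegian texts out of crawled pages can be illustrated with a very crude filter. The article does not say how the UiO group does this, so everything here is an assumption: the short list of common Norwegian words, the 0.2 threshold and the function names are invented for the example, and the check says nothing about text quality. Real pipelines typically use trained language identifiers.

```python
# Hypothetical, hand-picked list of very common Norwegian function words.
NORWEGIAN_STOPWORDS = {
    "og", "det", "er", "som", "på", "en", "til", "av",
    "ikke", "med", "har", "å", "jeg",
}


def looks_norwegian(text, threshold=0.2):
    """Return True if at least `threshold` of the words are common Norwegian words."""
    words = text.lower().split()
    if not words:
        return False
    hits = sum(1 for word in words if word.strip(".,!?") in NORWEGIAN_STOPWORDS)
    return hits / len(words) >= threshold


def select_norwegian(documents, threshold=0.2):
    """Keep only the documents that pass the crude Norwegian check."""
    return [doc for doc in documents if looks_norwegian(doc, threshold)]
```

A filter this simple would also let through a good deal of Danish and would miss short or unusual Norwegian texts, which hints at why building a clean collection takes real effort.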
– We are in a difficult competitive position against juggernauts like Google, which have access to a fantastic amount of language data and serve 50 billion documents (with trillions of words) through their services.
An increasingly international internet
New languages are entering the internet and need translation – for example Chinese, which even has a completely different character set.
Ten years ago approximately half of the information on the internet was in English. Five years ago that number was 30 percent. And the trend is continuing.
The Scandinavian languages are minuscule in comparison.
It is far from certain that developing language technology services in these small languages will be prioritized by the IT industry in the future.
Scientists and other non-commercial contributors need to step up in order to carry the Norwegian language into the digital future. The Nordic countries share common ground here.
– Luckily, a lot of people are invested in safeguarding the small Scandinavian languages; the Nordic Council of Ministers in particular is advocating a groundbreaking research collaboration, says Oepen.
UiO scientists have therefore taken the initiative for a Nordic collaboration with the non-profit Common Crawl Foundation, an organization that fights for internet democracy and for keeping data free and open to the public. The first meetings have already been held.
Plenty of research and heavy calculation behind language models
Oepen confirms that large parts of the Common Crawl data will be made accessible for processing in the Nordic countries. Of course, specific and unique models must be made for each individual language. Not only the words and characters differ; so do grammar, sentence structure, local words and expressions, and so on.
– There is a lot of research and heavy computation behind our language models. The future of language technology is in Big Data, says Oepen.
The language technology experts at UiO are among the 10-15 research groups that use the most computing power within the Norwegian alliance for heavy computation. Norway has long prioritized securing computing power for its scientists.
– We have good access to heavy computing power and top-of-the-range data storage facilities. Additionally, we have a lot of competent researchers. However, the bottleneck is the availability of raw data. It is as essential as crude oil. We are totally dependent on this raw data to take part in international language technology research, he explains.
Translated into English by Espen Haakaas.
The article can be read in Norwegian here: Vil redde nordiske språk i dataalderen