A Brief History Of Machine Translation

This is part 1/2 of a series about machine translation. Read part 2 here (online 9 August 2020).

If you want to avoid one thing when translating a text, it’s to sound like Google Translate, right? Well, here’s the twist: most of us would not even be able to recognize a text that was translated with Google Translate. That’s not specific to Google’s engine, by the way. Their German competitor DeepL delivers fantastic results, too. Every big player, including Microsoft, Baidu, Yandex and many more, has invested in its own machine translation engine. Since 2016, all of them have gradually switched from statistical to neural engines, which has made their output noticeably more human-like (read on to find out why I want to avoid comparing machine-translated output to a ‘human quality standard’). However, neural machine translation still has its flaws – they are just much harder to spot than with earlier techniques. Let’s revisit how machine translation started, and where we stand now.

The Rules-Based Approach

It has always been a dream of mankind to reverse the story of the tower of Babel and to be able to understand every human being on Earth again. When the first rules-based machine translation system was presented to the public in the 1950s, hopes were high that this could be achieved in the near future.

How Rules-Based Translation Works

A rules-based engine is always built for one specific language pair. It consists of two main parts: a set of grammatical rules and a dictionary. The focus on a deep grammatical structure that needs to be filled with words from a dictionary is based on the theory of Universal Grammar as presented by Noam Chomsky. His theory was (and still is) very famous, especially when competing theories came up during the 1950s (such as the theory of generative semantics). While the whole topic of Universal Grammar is as complex as it could be, and has lived through many interpretations and alterations over the last decades, we can say that the rules-based approach follows this theory at its core.

Based on this assumption, the dream of the rules-based approach (as well as of Universal Grammar) was to define the grammatical rules of one language exhaustively, and to fill the resulting slots with words from one or many dictionaries. As great as that sounds, it turned out to be very complicated. Not only can words have different meanings; transferring a grammatical structure from one language into another also led to artificial-sounding sentences at best, and to nonsensical results at worst. For example, the German language allows for very long and complex sentences with several subclauses. If you transfer this structure into English, you will get a very long and messy sentence. The same problem excluded Asian languages from the rules-based approaches – due to their very different grammatical structure, it was simply not possible to get usable results from such engines.
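To make the two building blocks more tangible, here is a minimal sketch of a toy German-to-English engine. It is not taken from any real system: the dictionary entries, the single reordering rule and all function names are made up for illustration. It shows how a plain dictionary lookup keeps the German word order, and how one hand-written grammar rule repairs exactly one pattern.

```python
# Toy German-to-English engine: a bilingual dictionary plus one grammar rule.
# All entries and the rule itself are hypothetical and chosen for this example.
DICTIONARY = {
    "ich": "I",
    "habe": "have",
    "den": "the",
    "hund": "dog",
    "gesehen": "seen",
}

AUXILIARIES = {"have"}
PARTICIPLES = {"seen"}


def translate_word_by_word(sentence: str) -> list[str]:
    """Look up each source word; unknown words are passed through unchanged."""
    return [DICTIONARY.get(word.lower(), word) for word in sentence.split()]


def move_participle_after_auxiliary(words: list[str]) -> list[str]:
    """German puts the participle last; English wants it right after the auxiliary."""
    if len(words) >= 2 and words[-1] in PARTICIPLES:
        for i, word in enumerate(words[:-1]):
            if word in AUXILIARIES:
                return words[: i + 1] + [words[-1]] + words[i + 1 : -1]
    return words


def translate(sentence: str) -> str:
    """Dictionary lookup first, then the reordering rule."""
    return " ".join(move_participle_after_auxiliary(translate_word_by_word(sentence)))


print(" ".join(translate_word_by_word("Ich habe den Hund gesehen")))  # I have the dog seen
print(translate("Ich habe den Hund gesehen"))  # I have seen the dog
```

Every further sentence pattern would need another rule of this kind, which hints at why describing a whole language this way became so laborious.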

A Failed Approach

Hopes faded over the following decades, however, when it became obvious that rules-based machine translation would never be on par with human translation. There were several reasons why the quality of the output stagnated at a level far below anything produced by humans:

  • The more grammatical rules you define, the more complex the system gets. Describing a language exhaustively as a set of rules – which would be the basis for creating a perfect rules-based translation engine – will never be possible. Language structures are flexible and change over time, and new structures pop up to fill formerly unknown gaps.
  • One word can have many different meanings, some words become obsolete, and new ones are invented. Keeping up with these changes is extremely difficult and time-consuming.
  • Rules-based engines need to be adapted to a certain use case by a language professional, which means that they will always have high maintenance costs – and cannot be accessed and used as a universal language translator.

Of course, the engines yielded results that were understandable. As long as you were trying to get the gist out of a text, they were extremely helpful. Nonetheless, the enthusiasm faded over time as the quality level remained static and no improvements could be foreseen.

Better Rely On Statistics

In the 1980s, advances in technology opened up new possibilities for translation. As computers became more widely available, the new dream of reversing the tower of Babel began to take shape. Based on large bilingual datasets, a computer calculated how likely it was that a word or phrase in the source language corresponded to a given word or phrase in the target language. For instance, if ‘hello’ was translated the same way every time in the data, the engine would assign a probability of a hundred percent to that translation.
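The counting idea can be sketched in a few lines. This is a simplified illustration, not how any particular engine was implemented: the aligned pairs are invented, and the step of working out which source phrase belongs to which target phrase (the alignment), which real systems had to learn from the data, is assumed to be already done.

```python
from collections import Counter, defaultdict

# Hypothetical, already aligned (English, German) pairs invented for this example.
aligned_pairs = [
    ("hello", "hallo"),
    ("hello", "hallo"),
    ("hello", "hallo"),
    ("bank", "Bank"),
    ("bank", "Bank"),
    ("bank", "Ufer"),
    ("bank", "Bank"),
]


def estimate_translation_probabilities(pairs):
    """Return {source: {target: probability}} based on relative frequencies."""
    counts = defaultdict(Counter)
    for source, target in pairs:
        counts[source][target] += 1
    table = {}
    for source, target_counts in counts.items():
        total = sum(target_counts.values())
        table[source] = {t: c / total for t, c in target_counts.items()}
    return table


table = estimate_translation_probabilities(aligned_pairs)
print(table["hello"])  # {'hallo': 1.0}
print(table["bank"])   # {'Bank': 0.75, 'Ufer': 0.25}
```

‘hello’ was translated the same way every time, so it ends up at a hundred percent, while the ambiguous ‘bank’ splits its probability between two candidates.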

The Need To Control The Language

In reality, nothing is ever at a hundred percent. The result of the engine’s calculation was an extremely large phrase table with tons and tons of statistical data. Based on this, a language model was compiled that offered the statistically most likely result for the requested source word or sentence.

The grammatical structures the engine produced were often very similar to those of the source sentence, and longer sentences or subclauses usually ended up in a mess. Once the limit of what statistics could achieve was reached, research teams tried to enhance the output with pre-processing and post-processing steps. If the original structure of a German sentence was problematic for the engine, why not reshuffle the German words in advance to mimic the structure of the target language? The process of writing text specifically for the purpose of having it machine-translated is known as writing in ‘controlled language’. It’s a simple concept: the fewer surprises for the engine, the better the output. Especially in domains where creativity is not needed (user manuals, administration guides, and so on), writing in controlled language in order to get better machine translation results has been widely introduced to technical writers over the last decades.

By applying a dictionary in addition to the engine’s calculations, terminology could even be enforced in the final output. That was great for companies that needed to ensure that company names, job titles or product names were always translated in the same manner. Companies that went global were among the biggest user groups of this kind of machine translation; if you want to expand to a new region, translating all the technical manuals for your products is mandatory and quite costly. By applying machine translation, optionally with a human looking over the results (a process called ‘post-editing’), time and costs could be minimized.

How Statistical Engines Work

When the first statistical engines appeared, hopes rose again that the tower of Babel problem would be resolved soon. After all, this new technique promised quite a lot: thanks to the underlying statistics, the context of a word would be considered for translation. That is because the statistics didn’t stop at the word boundary. After every word of a sentence was weighted and analyzed, the system would then do the same with its direct neighbors, and then with the neighbors of those. This way, not only single words gained a statistical value, but also the so-called bigrams and trigrams (and sometimes even higher-order n-grams). In the phrase table, it looked like this:

in the
in the forest
in the city
the city
the city is

Every line was then assigned statistical values for its co-occurrence with similar words, bigrams or trigrams in the target language. For example, for a translation into German, the trigram ‘in the city’ would probably have a high statistical value with ‘in der Stadt’. Whenever the engine now encountered the trigram ‘in the city’ in a previously unseen sentence, it would suggest translating it as ‘in der Stadt’ because this is, statistically speaking, the result with the highest probability.
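The two steps described above, cutting a sentence into overlapping n-grams and picking the most probable match from the phrase table, could look roughly like this. The table entries and their probabilities are invented, and the function names are only chosen for this sketch.

```python
def extract_ngrams(sentence: str, max_n: int = 3) -> list[str]:
    """Return all bigrams up to max_n-grams, ordered by their start position."""
    words = sentence.lower().split()
    ngrams = []
    for start in range(len(words)):
        for n in range(2, max_n + 1):
            if start + n <= len(words):
                ngrams.append(" ".join(words[start : start + n]))
    return ngrams


# Hypothetical phrase table entries with invented probabilities.
PHRASE_TABLE = {
    "in the city": {"in der Stadt": 0.82, "in die Stadt": 0.15, "in der City": 0.03},
    "in the forest": {"im Wald": 0.90, "in dem Wald": 0.10},
}


def best_translation(ngram: str):
    """Pick the candidate with the highest probability, or None if the n-gram is unknown."""
    candidates = PHRASE_TABLE.get(ngram)
    if not candidates:
        return None
    return max(candidates, key=candidates.get)


print(extract_ngrams("in the city is"))
# ['in the', 'in the city', 'the city', 'the city is', 'city is']
print(best_translation("in the city"))  # in der Stadt
```

Note that the lookup knows nothing beyond the n-gram itself: ‘in die Stadt’ would be the right choice in some sentences, but it can never win as long as ‘in der Stadt’ is more frequent in the training data.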

While this system works quite well for shorter sentences with simple grammatical structures and a limited vocabulary, such as user manuals, it runs into problems with unknown words (which are simply copied into the target sentence) and with more complex structures. The engine only ever looks at trigrams (or whatever higher n-gram it was trained on), which means that it cannot connect more words at a time than its longest n-gram covers. The engine also does not know what it did in the past. Imagine translating a sentence together with three colleagues, where each of you gets just three words to translate and you simply write your results down one after the other. That’s basically how a statistical engine builds a target sentence. No wonder those sentences often sounded awkward, had one verb too many or lacked one completely, or translated less frequent words in a very funny way. The lower the frequency, the lower the statistical value for the matches – at some point, the engine just blindly guesses which of the matches it saw in the training data could possibly fit.
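The colleague analogy can be turned into a small sketch. It deliberately follows the simplified picture given above rather than the decoder of a real engine: the sentence is cut into chunks of three words, each chunk is translated in isolation, and unknown words are copied over. The table and its entries are hypothetical.

```python
# Hypothetical table: each known source chunk maps to its most probable target.
BEST_MATCH = {
    "the city is": "die Stadt ist",
    "in the forest": "im Wald",
    "the": "der",
    "city": "Stadt",
    "is": "ist",
    "in": "in",
    "forest": "Wald",
}


def translate_chunk(words: list[str]) -> str:
    """Translate one chunk in isolation; fall back to single words, then to the source word."""
    phrase = " ".join(words)
    if phrase in BEST_MATCH:
        return BEST_MATCH[phrase]
    return " ".join(BEST_MATCH.get(word, word) for word in words)


def translate_in_chunks(sentence: str, chunk_size: int = 3) -> str:
    """Cut the sentence into fixed-size chunks and translate each one without any memory."""
    words = sentence.lower().split()
    chunks = [words[i : i + chunk_size] for i in range(0, len(words), chunk_size)]
    return " ".join(translate_chunk(chunk) for chunk in chunks)


print(translate_in_chunks("the city is in the forest"))      # die Stadt ist im Wald
print(translate_in_chunks("the old city is in the forest"))  # der old Stadt ist in der Wald
```

A single unknown word (‘old’) is enough to shift the chunk borders: the word is copied as it is, the known trigrams no longer match, and the word-by-word fallback picks the wrong articles because no chunk can see its neighbors.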

Continue with part 2 on 9 August 2020!

