Since I started working as a machine translation specialist, one of the most complex and interesting questions that impacted my daily work was this one: How can machine translation achieve human quality? This article is not a technical description of the numerous options you have to measure human quality, like BLEU score or other evaluation methods. No, in this post, I want to discuss a much more complicated question: What is human quality? Spoiler Alert: Human quality should be called Schrödinger’s quality instead, because it always has different states that are only distinguishable once they are in the past. I will present three reasons for this behavior.
When Google upgraded their translation engine to neural machine translation back in 2016, they claimed on their developer blog that they were able to reach human parity with this new technique. Microsoft later claimed the same for their translation system. And really – the first reactions to this were extremely enthusiastic, leading to a revival of the question of when human translators would become obsolete. Yes, it’s just a revival. We went through this phase of enthusiasm twice before, once with rule-based and once with statistical machine translation engines. Both times, the enthusiasm faded when the flaws of the systems became apparent. As an industry professional, I had quite some doubts about Google’s claim in the first place. Once we ran our own tests and evaluations on neural machine translation, I knew that what they claimed was not reflected in our results.
Are The Google Developers Wrong?
No, they are not wrong. Come on, a multi-million-dollar company’s developer team, consisting of highly specialized and trained people, versus the results of a small-scale study conducted by one millennial? Of course they are not wrong. Asking this question is basically part of the problem, because what we realized during our tests was that neural machine translation produces a Schrödinger’s quality. We will not know whether it’s human quality (cat is alive) until it’s not human quality anymore (cat is dead). The reason is as simple as it is unsatisfying: human quality is fluid.
Everybody talks about human quality as if it were a constant that we could just apply to a sentence or text and see whether it fits. We tend to forget that human quality is subjective. Try to answer the question for yourself: What is human quality to you? Sooner or later, you will come to the conclusion that something can be considered ‘human quality’ if you as a human being feel comfortable believing that another human being was the creator of said artifact. This applies to all kinds of things: songs, art, translation, and, as the latest trend, text-to-speech engines. It means that neural machine translation is in fact of human quality as long as somebody reads the results and can imagine that a human has translated this text.
One reason why neural machine translation is considered to be much closer to human quality than any machine translation system before it lies in our brains: As long as the form of a sentence sounds human to us, we will be much more forgiving of semantic errors than the other way round (see this paper). Statistical machine translation often produced correct semantics, but cringeworthy grammatical structures – so our brains started the emergency protocol to warn us about this non-human threat. Neural machine translation produces (nearly) flawless grammar and seems to be correct, as long as you don’t look into the semantics too much. Our brain accepts the grammatical structure as something that a human would produce, and greenlights the whole sentence. That is the main reason for the neural machine translation hype that we saw in 2016 and 2017 – ‘quick tests’ conducted by untrained people basically just confirmed this phenomenon. They saw the nice grammatical structures, and failed to recognize the errors and flaws. Now that we know that there are indeed differences between a human translation and a machine translation, let’s look into the reasons why machine translation can still be considered ‘human quality’.
Reason #1: Context Matters Most
It’s easy to see why something that Google considers to be human quality may not suffice for us in the localization industry. Google analyzed the results of their engine at the sentence level. If you just read one sentence and have to decide whether it was translated by a human or not, you will have more ‘human’ results than if you have to assess the same sentences with context and their surroundings. For example – the sentence ‘Hey, what’s up?’ is a perfect sentence, so if I ask you whether it was created by a human, you will most probably say yes. However, if the next sentence is ‘The queen entered the room and sat down on her throne’, it is suddenly quite unlikely that either the queen or one of the people in the room would utter this sentence, so maybe it was a bot that was not aware who the queen is… In linguistic terms, this sentence is correct in a syntactic and semantic sense, but the register is wrong. Register, style, and text coherence can only be evaluated with enough context – and their evaluation is very tricky because even more factors are involved, all of them highly subjective!
Long story short – that is why two people looking at the same sentence can evaluate it as either human quality or below human quality. Whether you consider something to meet human quality is not related to your own intelligence. It is, however, influenced by the circumstances in which you assess the object in question. If you’re familiar with the weaknesses of neural machine translation, you will most likely identify translations as being machine-translated in many more cases than someone who still has that outdated picture of the old Google Translate in mind. If you are on the lookout for translation errors, you will identify more of them than if you’re just trying to enjoy a blog post, or find the shoes of your dreams in your favorite online store (which happens to have a Google Translate widget you never noticed).
Reason #2: Humans Are Flawless…?
This is my favorite reason. When it comes to human quality, I have had a lot of conversations about making errors. Ask yourself: How many mistakes are allowed for a translation, a text, a tweet, a book, to be considered ‘human quality’? Errors are the ultimate evil – it starts in school or even earlier, when children learn that they will be punished for making mistakes (by a lower grade, by having to redo their homework, etc.). Errors are wrong, so naturally, if something is of ‘human quality’, it should be error-free… because that’s what we’re pursuing.
In reality, no one is free of errors. And I am not even talking about your character or your behavior – I am talking about what you say and write. There are errors all around us, in newspapers, books, letters, text messages… and guess where you’ll find the biggest pool of errors? Right. On the internet. You will find so many typos, misused words, missing words, wrong grammatical structures… that you cannot possibly count them. And still, no one would evaluate a tweet with a missing word as below human quality… right? It’s paradoxical that when we’re talking about human quality, we always imply that something needs to be flawless, while in reality, human quality is quite a messy thing.
That is also the reason why machine translation in many, many cases is indeed at a human quality level – because we’re much more forgiving when we’re on the internet and not constantly looking out for errors. We may not even be thinking about machine translation when we read something that is horribly wrong. I was surprised when I once talked to someone, and they mentioned that they don’t like the poor way in which many Chinese shops on Amazon translate their product texts into English, and that the shop owners should consider taking a language course again. It had never crossed that person’s mind that these texts were translated automatically! This example also shows that we’re surrounded by far more automatically generated text than we are aware of. Here we are at reason number one again: Context matters most. If we’re not aware that we’re reading automatically translated material, our tolerance for what we consider to be ‘human quality’ is much larger than we think.
Reason #3: Next Step: Singularity?
There is one other aspect that should not be underestimated when it comes to the evaluation of machines that perform ‘human’ tasks: The fear of being less capable than a machine. The nice thing about statistical machine translation was that it was so clearly distinguishable from human translations. No human would ever produce the kind of gibberish that Google Translate did! That’s the difference from reason number two, where we saw that our error tolerance is quite high as long as errors are out of our focus: As soon as we consciously evaluate the quality of something, we become much more picky. This is partly because we’re afraid that if we do not find any errors, we might have overlooked something and will look stupid to our peers. Neural machine translation has some qualities that are ‘dangerous’ in that sense and that qualify it as a direct competitor of humans:
- It produces great grammatical structures, even in complex sentences. While statistical machine translation struggled with subordinate clauses or chained sentences, neural engines have an enormous potential to represent even complex structures in the target language. That is also true for language pairs that show extreme differences in their grammatical structures, like English and Japanese.
- It can guess words and understand abstract rules of language, without the programmers explicitly teaching them to the engine. Similar to (though not in the same way as) how a toddler learns abstract grammatical rules, the engine can derive rules from its data and apply them to previously unknown words. It could for example guess the past participle of a new verb (e.g., by adding -ed to the verb stem).
- It is creative. This is probably the characteristic that hurts us humans the most. A neural engine can even be creative with its translations and will regularly invent new words. This is of course a consequence of the previous point and happens if the derived rule is applied to words and structures where a human would not apply it. We once had an engine that produced the verb flipcharten in German, which is understandable, but not established (…yet?).
Neural machine translation still sucks at understanding humorous text or metaphors, by the way. The humans can keep those domains (at least for now).
The question of whether something is human quality can never be answered once and for all. It will always depend on the context in which it is asked, on the people present, and on the goal of that conversation. In this blog post, I discussed three factors that influence how ‘human quality’ will be understood:
- The scope of the evaluation.
- The understanding of how imperfect human beings are.
- The fear of being less valuable than a machine.
All three influence how close we consider machine translation and human translation to be, and how tolerant we are when it comes to translation errors or anomalies. As all three may change depending on day, time and context, there will never be a definite answer for an individual as to whether a machine translation delivers ‘human quality’. The same sentence will always be both below and on par with human quality, depending on your point of view.
Different Topic, Same Narrative
While the discourse around machine translation has been going on for several years, I can see the same happening with text-to-speech engines nowadays. We were all used to those robotic voices that clearly did not belong to human beings. They sounded awful, and listening to them reading a text for a longer period of time was horrible and exhausting. Text-to-speech will probably undergo the same transformation as machine translation – powered by neural networks, the quality will come much closer to what we’re used to considering the human domain. The arguments are the same. The evaluation process is the same. So no matter whether you’re evaluating machine translation, text-to-speech engines, or any other fancy new tool that is powered by a neural network and doing a job that was previously reserved for human beings: Don’t forget that the human quality you’re trying to measure or evaluate will always be fluid and subjective.