When it comes to measuring quality, we become surprisingly unsuspicious once a metric comes into play. As soon as someone hands you numbers or a chart, there is a good chance that you will trust those numbers – especially if they support what you already believe. It is always important to know where those numbers come from and what exactly they measure. Especially in the field of (neural) machine translation, trusting numbers blindly can have severe consequences.
There are several measures out there, but there is one which you will always encounter: the BLEU score. BLEU, which is short for bilingual evaluation understudy, is an algorithm that has been widely used for decades to measure the quality of machine-translated output. In order to understand BLEU scores, let us quickly look into the algorithm.
How BLEU scores work
BLEU basically counts words or sequences of words (so-called n-grams) in the machine translation output and compares them to one or more reference translations. The machine translation Hello, how you are? would therefore score very high on single words when compared against the reference translation Hello, how are you?, because every word appears with exactly the same frequency as in the reference translation (= one time). Comparing sequences of words, however, reveals that there is an error in the machine translation, which will lower the score significantly:
hello, how
how you
you are
Of the three n-grams above, only one appears in the reference translation. Please note that this is a simplified example – the actual calculations are a tad more complicated! Still, you see where this is going: the closer a machine translation is to a reference translation, the higher the BLEU score. The score will always be between 0 and 1 or, if percentages are given, between 0 and 100. A score of 100, or 1, would imply that the machine translation is an exact match with a reference translation.
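The counting idea above can be sketched in a few lines of Python. This is a toy version, not the real BLEU algorithm – it computes the precision for a single n-gram order, treats punctuation as separate tokens, and leaves out BLEU's brevity penalty and the geometric mean over several n-gram orders; the function names are my own:

```python
from collections import Counter

def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def ngram_precision(candidate, reference, n):
    """Fraction of candidate n-grams that also occur in the reference,
    counting each reference n-gram at most as often as it appears there."""
    cand = Counter(ngrams(candidate.lower().split(), n))
    ref = Counter(ngrams(reference.lower().split(), n))
    matches = sum(min(count, ref[g]) for g, count in cand.items())
    total = sum(cand.values())
    return matches / total if total else 0.0

reference = "hello , how are you ?"
candidate = "hello , how you are ?"

print(ngram_precision(candidate, reference, 1))  # 1.0 – every single word matches
print(ngram_precision(candidate, reference, 2))  # 0.4 – bigrams expose the word-order error
```

The single-word score is perfect, yet only two of the five bigrams survive the comparison – exactly the effect described above.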
The Limits Of BLEU Scores
As I said above – BLEU scores have been around for decades and were originally used in research. This also means that they were previously applied to rule-based or statistical machine translation output. Both techniques tend to stick closely to a reference translation, because they rely either on hard-coded rules and dictionaries, or on n-grams that were extracted from a reference corpus and then ranked according to their probability of appearing in a given new sentence. With both techniques, measuring the distance between reference translations and the machine-translated result makes perfect sense.
With the rise of neural machine translation, the prerequisites change. The strength of neural machine translation is its ability to translate quite independently of any reference translations. Instead of strictly following the structure and wording of a source sentence, neural machine translation treats the sentence as a whole and recreates it in the target language. Similar to (but still not as good as) a human translator, the result can stay close to the source or be interpreted more freely – and still perfectly represent the meaning of the source. Check out my post on neural machine translation for more details! There are basically endless possibilities to translate a sentence, and unlike statistical or rule-based machine translation, neural models have a larger repertoire from which they can draw. Look at the following sentences:
Reference: The cat is in the basket.
Machine Translation 1: The the the the the the.
Machine Translation 2: It's over there.
Machine translation 1 is a famous example of the limits of BLEU scores. Depending on the version of BLEU used, the result would be at least 0.33, or 33 % – with BLEU's 'modified' unigram precision, only two of the six occurrences of the count, because the appears only twice in the reference, while an unclipped word precision would even yield a perfect score – although the result is ridiculous and certainly not usable. Machine translation 2, however, would score even lower, because none of the words it uses are found in the reference translation. I could imagine that it is still much more usable, because it seems to refer to what was said before. A human translator might have decided that repeating all the words would sound clumsy and adapted the translation to be more fluent. Relying exclusively on BLEU scores, however, I would have to ask the translator to change their translation because it seems to be incorrect.
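The clipping behind the 0.33 figure can be shown in a small sketch (punctuation is ignored for simplicity, and the function name is my own – this is an illustration, not the official BLEU implementation):

```python
from collections import Counter

def unigram_precision(candidate, reference, clip=True):
    """Share of candidate words found in the reference. With clip=True,
    each reference word may only be matched as often as it occurs there
    (BLEU's 'modified' precision); with clip=False, repeats are rewarded."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    if clip:
        matches = sum(min(count, ref[word]) for word, count in cand.items())
    else:
        matches = sum(count for word, count in cand.items() if word in ref)
    return matches / sum(cand.values())

reference = "the cat is in the basket"
mt1 = "the the the the the the"

print(unigram_precision(mt1, reference, clip=False))  # 1.0 – every token occurs in the reference
print(unigram_precision(mt1, reference, clip=True))   # 0.33… – 'the' is clipped to its two reference occurrences
```

Without clipping, the nonsense sentence gets a perfect word-level score; with clipping it still gets a third of the available credit – which is why purely numerical scores need a human sanity check.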
There are other measures out there which are not as widely used as BLEU scores, but are still well known in the localization industry. Among those are the translation error rate (TER), the NIST metric (which is based on BLEU scores), and editing distance. All have a similar goal, but use different algorithms to achieve it. The funny thing is: most of my criticism of BLEU scores applies to those scores, too! While I do not want to say that numerical measures in general are not useful – because they are – I do want to emphasize that blindly trusting numbers in the realm of quality evaluation is dangerous and can lead you, your company, or your client into very delicate situations that could have been avoided.
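To give a flavor of that family of measures: editing distance (and, at its core, TER) counts how many edit operations separate a translation from a reference. A minimal word-level Levenshtein sketch – my own illustration, not any tool's actual implementation – looks like this:

```python
def edit_distance(a, b):
    """Minimum number of single-word insertions, deletions, and
    substitutions needed to turn sentence a into sentence b
    (word-level Levenshtein distance)."""
    a, b = a.split(), b.split()
    # dp[j] = distance between the current prefix of a and b[:j]
    dp = list(range(len(b) + 1))
    for i, word_a in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, word_b in enumerate(b, 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,                 # delete word_a
                        dp[j - 1] + 1,             # insert word_b
                        prev + (word_a != word_b)) # substitute (free if equal)
            prev = cur
    return dp[-1]

print(edit_distance("hello how you are", "hello how are you"))  # 2 – two words must change
```

Like BLEU, this rewards surface similarity to one particular reference: a perfectly fluent free translation still racks up a high edit count, which is exactly the criticism above.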
Context Matters Most
So, the first problem for BLEU scores nowadays is that neural machine translation is widely used – which disrupts the original environment in which BLEU scores were developed and tested. Neural machine translation behaves differently than any machine translation system before it – applying the same measures without testing how this influences the scores is problematic, especially outside of research facilities, where the scores are usually only applied and not questioned in any way.
The second problem with BLEU scores – and any other measurement, really – is that they never take into account the context of a sentence. As I explained in part II of this series, one major flaw of neural machine translation is not on the sentence level – where it often performs quite well – but on the text level, namely consistency and coherence. To my knowledge, there are no widespread measures with which you can reliably measure text coherence. This means that even for a text where your BLEU scores are quite high, the effort that goes into post-editing might be tremendous. On the other hand, a lower BLEU score does not mean that post-editing is not possible at all. There is a nice paper that sums it all up, titled Are We Estimating or Guesstimating Translation Quality?, by Shuo Sun. Nowadays, it is mostly guesstimates when it comes to applying measures. The most reliable way to evaluate whether machine-translated output is good, and viable for post-editing, is to have a trained human look over the results (just skimming a text will not work either).
Using Translation Quality Frameworks
If you have someone looking over the machine-translated output and you need a reliable answer from them, the next step is to optimize the questions you ask. Imagine you are asked to evaluate a text, and the only instruction you have is: Please evaluate the text. On what basis? What for? How detailed should I be? I would immediately have so many questions, and I would not be able to fulfill the request. The potential for misunderstanding is also immensely high – also see part I of this mini-series. So, how can you optimize your evaluation request?
The first thing would be to link it to an existing quality framework. The one that is widely used in the localization industry is the TAUS Dynamic Quality Framework (and check out TAUS' blog post on other metrics as well). Using this framework or any of its derivatives is a good starting point. If you ask someone to evaluate machine-translated output, try to ask questions that point to the error categories mentioned in the quality framework. If your company has its own framework, all the better – these individual frameworks may be adjusted to fit the needs of your specific clients much better. If you specifically ask about wording errors, fluency errors, or any other category you think is important, you will minimize the probability of receiving an answer based on the evaluator's gut feeling. Of course, it is nice to know whether the evaluator thinks the quality is 'good enough' for post-editing, but without any further details, this evaluation is strictly bound to that specific post-editor. If you want to engage other post-editors in the long run, having some details on the major flaws of the output in question will help you get an objective impression of what you and other post-editors can expect.