Welcome to Frauke's journey.

How To Measure Machine Translation Quality

When it comes to measuring quality, we are surprisingly uncritical as soon as a metric comes into play. As soon as someone hands you numbers or a chart, there is a good chance that you will trust them – especially if they support what you already believe. It is always important to know where those numbers come from and what exactly they measure. In the field of (neural) machine translation in particular, trusting numbers blindly can have severe consequences.

Two Different Perspectives On Post-Editing

Back when I was a machine translation specialist, it was part of my job to make sure that the machine translation output we used met a certain level of quality. I was positioned between the company's Sales and Production departments, because that quality mattered to both: as the content was usually post-edited, I had to check whether the post-editors would actually be able to work with the output. And as machine translation and post-editing (MTPE) was a cheaper product than good old translation, the Sales team wanted to know how far they could lower our rates.

Measuring Machine Translation Quality: “Is It Good Enough?”

No, this is not yet another article with motivational mantras about you being good enough. You are! Trust me. This is a blog post about quality assurance. Before I became Head of Technology, I was a Machine Translation Specialist. In that role, I was confronted with this question about raw or post-edited machine translation output on a daily, nay, hourly basis. I often struggled to answer it, and I imagine I am not the only one. So here are my thoughts – maybe they will help you the next time someone asks you exactly this question.

The Human Quality Paradox

Since I started working as a machine translation specialist, one of the most complex and interesting questions in my daily work has been this one: How can machine translation achieve human quality? This article is not a technical description of the numerous options you have to measure machine translation quality, such as the BLEU score or other evaluation methods. No, in this post, I want to discuss a much more complicated question: What is human quality? Spoiler alert: Human quality should instead be called Schrödinger’s quality, because it always has different states that can only be told apart once they are in the past. I will present three reasons for this behavior.