When it comes to measuring quality, we are surprisingly unsuspicious once a metric comes into play. As soon as someone hands you numbers or a chart, there is a good chance that you will trust those numbers – especially if they support what you already believe. It is always important to know where those numbers come from and what exactly they measure. Especially in the field of (neural) machine translation, trusting numbers blindly can have severe consequences.
Back in the day when I was a machine translation specialist, it was part of my job to make sure that the machine translation output we used met a certain quality level. I was positioned between the Sales and Production departments of the company, because that quality level was important for both: as the content usually got post-edited, I had to check whether the post-editors would actually be able to work with the output. And as machine translation and post-editing (MTPE) was a cheaper product than good old translation, the Sales guys wanted to know how far they could lower our rates.
No, this is not yet another article with motivating mantras about you being good enough. You are! Trust me. This is a blog post about quality assurance. Before I became Head of Technology, my position was Machine Translation Specialist. As such, I was confronted with this question on a daily, nay, hourly basis regarding raw or post-edited machine translation output. I often struggled to answer it, and I imagine that I am not the only one. So here are my thoughts – maybe they will help you the next time someone asks you exactly this question.
Mini Series on the history of machine translation! Find out what it means when someone says that something sounds like Google Translate, how Google Sings Songs and what neural machine translation is in part 2 of the series.
Mini Series on the history of machine translation! Find out how machine translation started and how statistical engines work in part 1 of the series.
Since I started working as a machine translation specialist, one of the most complex and interesting questions affecting my daily work has been this one: How can machine translation achieve human quality? This article is not a technical description of the numerous options you have for measuring human quality, like the BLEU score or other evaluation methods. No, in this post, I want to discuss a much more complicated question: What is human quality? Spoiler alert: Human quality should be called Schrödinger’s quality instead, because it always has different states that are only distinguishable once they are in the past. I will present three reasons for this behavior.