A longer version of this blog post is available at https://kv-emptypages.blogspot.com/2019/04/understanding-mt-quality-what-really.html
This is the second post in our series on machine translation quality. The first focused on BLEU, a score that is often improperly used to make quality-based decisions for which it is clearly not the best metric.
The use of machine translation (MT) in the translation industry has historically focused heavily on localization use cases, with the primary intention of improving efficiency, that is, speeding up turnaround and reducing per-word cost. Indeed, machine translation post-editing (MTPE) has been instrumental in helping localization workflows achieve higher levels of productivity.
Many users in the localization industry select their MT technology based on two primary criteria: automated quality scores and comparative evaluations of candidate systems.
The most common way to assess the quality of MT system output is a string-matching score such as BLEU. As we pointed out previously, equating a string-match score with the future translation quality of a system in a new domain is unwise and quite likely to lead to disappointment. BLEU and other string-matching scores offer the most value to research teams building and testing MT systems.
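Since BLEU comes up throughout this series, here is a minimal sketch of what such a string-matching score actually computes, using the open-source sacrebleu library; the sentences are invented for illustration:

```python
# A minimal illustration of what BLEU measures: n-gram string overlap
# between a system output and a reference translation.
# pip install sacrebleu
import sacrebleu

# Invented example: an adequate translation that differs in wording
hypotheses = ["The contract will be signed by Friday."]
references = [["The agreement is to be signed by Friday."]]

bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(bleu.score)  # one number; it says nothing about adequacy in your domain
```

Note that the hypothesis above is a perfectly usable translation, yet it earns a mediocre score simply because its surface wording differs from the reference, which is exactly why a string-match score is a poor proxy for usefulness in a new domain.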
Many users rely solely on the results of comparative evaluations – often performed via questionable protocols and processes, using test data that is undisclosed or poorly defined – to select which MT systems to adopt. Most frequently, such analyses produce a score table like the one shown below, which might lead users to believe they are using the “best-of-breed” MT solution since they selected the “top” vendor in each column.
English to French      English to Chinese     English to Dutch
Vendor A – 46.5        Vendor C – 36.9        Vendor B – 39.5
Vendor B – 43.2        Vendor A – 34.5        Vendor C – 37.7
Vendor C – 42.5        Vendor B – 32.7        Vendor A – 32.5
While this approach looks logical at one level, it often introduces errors and undermines efficiency because of methodological inconsistencies in how the different MT systems are tested and compared.
The first post in this blog series exposed many of the fallacies of automated metrics based on string-matching algorithms (like BLEU and LEPOR). These are not reliable MT quality assessment techniques, because they only reflect precision- and recall-like text-match statistics on a single test set, usually on material unrelated to the enterprise domain of interest. The issues discussed there challenge the notion that single-point scores can really tell you enough about long-term MT quality implications.
The enterprise value equation is much more complex and goes far beyond linguistic quality and Natural Language Processing (NLP) scores. To truly reflect business value and impact, evaluation of MT technology must also factor in non-linguistic attributes of the overall solution.
To effectively link MT output to business value, we need to understand that although linguistic precision is an important factor, it often has a lower priority in high-value business use cases. This view will hopefully take hold as the purpose and use of MT is better understood in the context of larger business impact, beyond localization.
But what would more dynamic and informed approaches look like? MT evaluation certainly cannot be static, since systems must evolve as requirements change. Ideally, we would replace the single-point score with a framework that still yields one easy measure telling us everything we need to know about an MT system; today, that is unfortunately not yet feasible.
While single-point scores do provide a quick, rough sense of an MT system’s performance, it is more useful to focus testing efforts on the requirements of the specific enterprise use case. The same holds for automated metrics: scores based on news-domain test sets should be viewed with care, since they are unlikely to be representative of performance on specialized enterprise content.
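As a sketch of what more domain-aware testing can look like, the snippet below scores the same system on a generic news test set and on an in-domain test set. The file names are hypothetical placeholders, and sacrebleu is assumed to be installed:

```python
# Compare a system's BLEU on generic news text vs. in-domain content.
# File names are hypothetical; each file holds one segment per line.
import sacrebleu

def score_testset(hyp_path: str, ref_path: str) -> float:
    with open(hyp_path, encoding="utf-8") as h, open(ref_path, encoding="utf-8") as r:
        hyps = [line.strip() for line in h]
        refs = [line.strip() for line in r]
    return sacrebleu.corpus_bleu(hyps, [refs]).score

news = score_testset("mt_output.news.txt", "reference.news.txt")
domain = score_testset("mt_output.legal.txt", "reference.legal.txt")
print(f"news: {news:.1f}  in-domain: {domain:.1f}")  # the two often diverge sharply
```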
When rating different MT systems, it is essential to score the key requirements of the enterprise use case, not just generic linguistic quality.
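One simple way to operationalize such a rating is a weighted scorecard. The sketch below is purely illustrative; the criteria, weights, and ratings are assumptions chosen to show the mechanics, not a fixed standard:

```python
# A toy weighted scorecard for comparing MT systems on enterprise criteria.
# Criteria, weights, and ratings (0-10) are illustrative assumptions.
WEIGHTS = {
    "output_quality": 0.30,
    "adaptability": 0.25,    # can the system be tuned to your domain?
    "data_security": 0.20,
    "integration": 0.15,     # APIs, workflow connectors
    "speed_and_cost": 0.10,
}

def weighted_score(ratings: dict[str, float]) -> float:
    """Combine per-criterion ratings into one weighted total."""
    return sum(WEIGHTS[c] * ratings[c] for c in WEIGHTS)

vendor_a = {"output_quality": 8, "adaptability": 4, "data_security": 7,
            "integration": 6, "speed_and_cost": 7}
print(f"Vendor A: {weighted_score(vendor_a):.2f} / 10")
```

The point of the exercise is that the weights force an explicit statement of what matters for the use case before any vendor scores are compared.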
Ultimately, the most meaningful measures of MT success are directly linked to business outcomes and use cases. The definition of success varies by use case, but most often, linguistic accuracy as an expression of translation quality is secondary to other outcome measures.
The integrity of the overall solution likely has much more impact than MT output quality in the traditional sense: output quality can vary by as much as 15% in either direction without affecting the real business outcome. In fact, there are reports from an eCommerce use case where improvements in output quality actually reduced conversion rates on the post-edited sections, because that content was perceived as advertising-driven and thus less authentic and trustworthy.
Beyond localization, the use cases where these broader measures matter most include:
Global enterprise communication and collaboration
Customer service and support
Social media analysis
In upcoming posts in this series, we will continue to explore MT quality assessment from a broad enterprise-needs perspective. More informed practices will result in better outcomes and significantly improved MT deployments that support the core business mission of solving high-volume multilingual challenges more effectively.
If you'd like to find out more, please register for our upcoming webinar on April 25.