95% – AI’s Magic Number? The Human-Machine Collaboration to go Beyond this limit

24 Jan 2019

By Mihai Vlad / December 6, 2018

Hardly a day goes by without a new, exciting application for Artificial Intelligence (AI) making the headlines. Take detecting and diagnosing eye problems: AI recently reached a new milestone, correctly identifying different types of eye disease for further treatment 94.5% of the time. This breakthrough puts these algorithms “on a par with an expert performance at diagnosing OCT (optical coherence tomography) scans,” and demonstrates how far machine learning has advanced.

Last year, Microsoft reported that the performance of its AI on speech recognition had reached 94.9% accuracy. This figure is derived by comparing the software’s output with a manual transcription performed by a human. Analysts compare the two transcripts to calculate the word error rate (5.1% in this case, the complement of the accuracy figure), which defines the performance of the system.
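To make the word error rate concrete, here is a minimal sketch of how such a figure is commonly computed: the word-level edit distance between the human transcript and the software's output, divided by the number of words in the human transcript. The function name and the sample sentences are illustrative assumptions, not Microsoft's evaluation code, and treating accuracy as one minus the word error rate is a simplification.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length.

    `reference` is the human transcript, `hypothesis` the software output.
    Both names are illustrative; this is a sketch, not a vendor's scorer.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must not be empty")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1       # reference word missing from output
            insertion = curr[j - 1] + 1  # extra word in the output
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# One dropped word out of six reference words.
wer = word_error_rate("the cat sat on the mat", "the cat sat on mat")
print(f"word error rate: {wer:.3f}")
```

Note that the word error rate can exceed 100% when the output contains many inserted words, which is one reason it is usually reported directly rather than as an accuracy.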

And in another AI field, Machine Translation, SDL announced this year that 95% of their system’s output was labeled as “human-level” by professional Russian-English translators. 95% is impressive in itself, but to achieve that on such complex subject matter is truly outstanding. For SDL, mastering human-level machine translation for such a complex language combination meant overcoming one of the toughest linguistic challenges the community has faced for some time. Although Russian to English first inspired the science and research behind machine translation more than 50 years ago, it has always been hard to crack, largely because Russian and English are very different in terms of grammar, inflection and word order.

What’s unique about this particular SDL AI challenge is that there are many valid results. A true “objective error rate” is almost impossible to compute. Anyone who speaks more than one language will know that there are many correct ways to translate a sentence.

Chris Manning highlights in this Stanford NLP lecture that human language is a highly compressed communication channel, using very few messages (or words) to communicate a lot of different meanings. The reason it works so well is that the recipient is responsible for reconstructing the context of the communication, using their own accumulated bank of world knowledge.

A machine translation system has to transfer this meaning from one language to another without altering it. That’s why the machine translation problem is considered “AI-complete”.

What’s the best way to test?

Returning to the question of how to evaluate such an AI system: the most accurate way is a blind test in which specialists score translated phrases without knowing whether they were translated by a human or by a machine. Scores are given on a Likert scale, with “completely wrong” at one end and “perfect translation” (i.e. human-level) at the other. That is the methodology SDL used for their assessment.
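As a rough illustration of how a headline figure could be derived from such a test, the sketch below computes the share of segments that evaluators placed at the top of the scale. The 5-point scale, the function name and the sample ratings are assumptions made for the example; the article does not specify SDL's actual scale or aggregation.

```python
from collections import Counter


def human_level_share(ratings, top_score=5):
    """Fraction of rated segments that received the top score.

    Assumes a hypothetical 1..5 Likert scale where `top_score` means
    "perfect translation"; the real scale used by SDL is not stated.
    """
    if not ratings:
        raise ValueError("at least one rating is required")
    counts = Counter(ratings)
    return counts[top_score] / len(ratings)


# Illustrative data only: 19 segments rated 5 and one rated 3.
sample_ratings = [5] * 19 + [3]
print(f"human-level share: {human_level_share(sample_ratings):.0%}")
```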

So how should we interpret this 95%? Does it imply that we have completely solved these AI challenges?

Not quite…

SDL accurately highlights that these are generic systems, and when applied to specialized content (like pharmaceutical labels or financial contracts), the performance changes. Domain specificity is a key challenge in any AI application using Machine Learning.

Humans vs Machines

There is also another aspect. In all of the examples listed above, AI performance is compared with human performance.

Performance (machines) vs. performance (humans)

However, AI is not a zero-sum game, so we should reconsider the way we compare the performance of machines. Should the comparison instead be:

Performance (humans + machines) vs. performance (humans)

In a 2016 research study on computer vision, an AI system was able to correctly identify cancerous cells from lymph node images with 92.5% accuracy. The accuracy of a human pathologist was 96.6%. However, when the AI and human outputs were combined, the accuracy jumped to 99.5%!

In mathematical terms, it’s broken down as follows:

Performance (humans) = 96.6%

Performance (machines) = 92.5%

Performance (humans + machines) = 99.5%

In other words:

Performance (humans + machines) > performance (humans)
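The size of that gain is easier to see in terms of error rates than accuracies. The short sketch below, using only the three figures quoted above, computes how much of the human error rate disappears when the machine is added; the function names are illustrative.

```python
def error_rate(accuracy_pct: float) -> float:
    """Error rate in percentage points for a given accuracy percentage."""
    return 100.0 - accuracy_pct


def relative_error_reduction(baseline_accuracy_pct: float,
                             combined_accuracy_pct: float) -> float:
    """Share of the baseline error removed by the combined system."""
    baseline_error = error_rate(baseline_accuracy_pct)
    combined_error = error_rate(combined_accuracy_pct)
    return (baseline_error - combined_error) / baseline_error


# Figures from the study as quoted in this article.
human, machine, combined = 96.6, 92.5, 99.5

print(f"human error:    {error_rate(human):.1f}%")
print(f"machine error:  {error_rate(machine):.1f}%")
print(f"combined error: {error_rate(combined):.1f}%")
print(f"reduction vs. human alone: {relative_error_reduction(human, combined):.0%}")
```

In other words, the human error rate falls from 3.4% to 0.5%, a reduction of roughly 85%, even though the machine on its own makes more errors than the human.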

When it comes to professional language translation, SDL has a similar view. Instead of comparing machines with humans, it’s vital to focus on the combined value (productivity, speed and, of course, precision) unlocked by teaming the powerful capabilities of AI with human expertise.

To see SDL’s Machine Translation in action, request a demo.
