
How accurate is lip reading – AI vs. Professionals

Artificial Intelligence (AI) technologies are again making a big splash in the news, this time with AI sinking its virtual teeth into lip reading. Researchers at the University of Oxford in the UK, in collaboration with Google's DeepMind, have developed a system they claim can lip read more accurately than humans.

Lip reading has been one of the most prominent areas of research for the past decade, with the main focus on overcoming the shortcomings of audio recognition in noisy environments.

Most recently, however, the focus has shifted firmly to speech recognition algorithms and how such systems could help people who are deaf or hard of hearing gain better access to television through accurate real-time subtitling.

I won’t lie to you: when I first saw the news headline ‘AI has beaten humans at lip reading’, I had to choke back a laugh.

It would be impossible for machine lip readers to be more accurate than professional lip readers given the huge variation in dialects, accents and human characteristics.

Just last week I received a call from BBC2’s Daily Politics show, asking me to lip read David Cameron. Clearly, television still requires ‘real humans’ to assist in accurate lip reading, especially during live broadcasts.

Putting my personal opinions aside for the moment, I thought it only fair to investigate the AI lip reading phenomenon before coming to a conclusion.


An artificial intelligence system that can lip read

First, there was LipNet, which was trained to lip read by watching thousands of hours of video and matching text to the movement of each speaker’s mouth. However, learning from specifically selected videos has its limits.

In those videos, every speaker’s face was well lit and facing forward, and they spoke only in a standardised sentence structure. Obviously, that is not an accurate representation of TV programmes or of how people speak in real-world situations.

Then another team, at Oxford’s Department of Engineering Science, collaborated with Google’s DeepMind and took it a step further. They had the AI interpret over 100,000 video clips from BBC television, which represented a much wider variation in lighting and head positions.

The highly praised AI system, affectionately referred to as WAS (Watch, Attend and Spell), was trained using 5,000 hours of TV programmes containing 118,000 sentences and a vocabulary of 17,500 words.

Right now, the system can operate only on full sentences in pre-recorded video.
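The name gives a hint of how such a system is put together. Below is a minimal, illustrative sketch (in PyTorch) of the general ‘watch, attend and spell’ idea: a video encoder ‘watches’ mouth-region frames, an attention module ‘attends’ over the encoded sequence, and a character-level decoder ‘spells’ out the sentence. All layer sizes, the vocabulary and the module names here are my own assumptions for illustration; this is not the researchers’ published code.

```python
# Minimal sketch of a "watch, attend and spell" style lip reader.
# Sizes, vocabulary and module names are illustrative assumptions only.

import torch
import torch.nn as nn

VOCAB = list(" abcdefghijklmnopqrstuvwxyz'")   # assumed character set
VOCAB_SIZE = len(VOCAB) + 1                    # +1 for an end-of-sentence token


class LipReaderSketch(nn.Module):
    def __init__(self, hidden=256):
        super().__init__()
        # "Watch": spatio-temporal convolution over greyscale mouth crops,
        # then a recurrent encoder that summarises each video time step.
        self.conv = nn.Sequential(
            nn.Conv3d(1, 32, kernel_size=(3, 5, 5), stride=(1, 2, 2), padding=(1, 2, 2)),
            nn.ReLU(),
            nn.AdaptiveAvgPool3d((None, 4, 4)),   # keep the time axis, shrink space
        )
        self.encoder = nn.LSTM(32 * 4 * 4, hidden, batch_first=True)
        # "Attend": attention from the decoder state over every encoded frame.
        self.attn = nn.MultiheadAttention(hidden, num_heads=4, batch_first=True)
        # "Spell": a character-level decoder that emits one letter at a time.
        self.embed = nn.Embedding(VOCAB_SIZE, hidden)
        self.decoder = nn.LSTM(hidden * 2, hidden, batch_first=True)
        self.out = nn.Linear(hidden, VOCAB_SIZE)

    def forward(self, frames, prev_chars):
        # frames: (batch, 1, time, height, width); prev_chars: (batch, text_len)
        feats = self.conv(frames)                          # (batch, 32, time, 4, 4)
        b, c, t, h, w = feats.shape
        feats = feats.permute(0, 2, 1, 3, 4).reshape(b, t, c * h * w)
        enc, _ = self.encoder(feats)                       # (batch, time, hidden)

        emb = self.embed(prev_chars)                       # (batch, text_len, hidden)
        ctx, _ = self.attn(emb, enc, enc)                  # attend over the video
        dec, _ = self.decoder(torch.cat([emb, ctx], dim=-1))
        return self.out(dec)                               # per-character scores


model = LipReaderSketch()
frames = torch.randn(2, 1, 75, 64, 64)               # two fake 3-second mouth clips
prev_chars = torch.randint(0, VOCAB_SIZE, (2, 20))    # shifted target characters
logits = model(frames, prev_chars)                    # (2, 20, VOCAB_SIZE)
```

The point of the sketch is simply that the system predicts text character by character from the video, which is why, as noted above, it currently works on full sentences of pre-recorded footage rather than live speech.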

Joon Son Chung, a doctoral student at Oxford, explains that there is still a lot of work to be done on improving the system’s accuracy before ‘WAS’ can one day run in real time.

What professionals in the hard of hearing world say about AI

Jesal Vishnuram, technology manager for the charity Action on Hearing Loss, said:

AI lip reading would be able to enhance the accuracy and speed of speech-to-text especially in noisy environments, and we encourage further research in this area and look forward to new advances being made.

What do I think?

Well, I would like to see more evidence of AI accurately lip reading people from various cultural backgrounds, and an explanation of how the researchers might teach the machine the meaning behind words when it has no emotional frame of reference.

Because the AI was trained predominantly on BBC video clips, I doubt it can accurately interpret the variety of ways people from different backgrounds pronounce words.

That being said, I do believe that in future AI lip reading technology could potentially help support and even improve professional lip readers’ accuracy.

However, these lip reading AIs are a long way from being able to think and understand as humans do.

In principle, these ‘deep mind’ neural net approaches – where the machine does all the learning by using feedback from whether its own ‘guesses’ are correct – are powerful. As long as they keep learning and get plenty of input from connected real speech, and good input (auditory, visual – or written as seems to be the case in this demo) they should be able to lip read – possibly better than many people can. Voice recognition software has come a long way using these processes, and adding lip movements to the acoustic signal should work as well – if not better – than auditory alone.

However, so far, these models are all ‘demo models’ … not really suitable for a test drive in open country, more to marvel at in the showroom of the laboratory. How long it will be before everyday viable machine lip reading comes about will, I think, depend on the financial prospects to develop the model. That may come from consumer products, but I think it’s more likely to come from military and surveillance institutions where there may be a need to get information from noisy signals… and where it looks like funding (in the USA at least) will be more reliable.

Where human lip reading experts will continue to be of use is where they know more about the speech patterns, language and culture of the people they see speaking than may be available to the machine. Humans also pick up on other information ‘in the frame’ (for instance, what is going on around the speaker/s) – which the machine will not be looking for.

– Ruth Campbell, Emeritus Professor, University College London
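To make Professor Campbell’s point about ‘feedback from its own guesses’ concrete, here is a tiny, self-contained toy example (again my own sketch, not anything from the researchers): a stand-in model guesses a word for each fake video clip, the guess is compared with the correct word, and the resulting error is fed back to adjust the model.

```python
# Toy illustration of learning from feedback on the model's own guesses.
# The "lip reader" here is a deliberately tiny, made-up stand-in.

import torch
import torch.nn as nn

torch.manual_seed(0)

# Stand-in model: maps a 10-number summary of a clip to one of three words.
model = nn.Sequential(nn.Linear(10, 16), nn.ReLU(), nn.Linear(16, 3))
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()

clips = torch.randn(100, 10)           # fake video features
words = torch.randint(0, 3, (100,))    # the "correct" word for each clip

for step in range(200):
    guesses = model(clips)             # the machine's own guesses
    loss = loss_fn(guesses, words)     # how wrong were they?
    optimizer.zero_grad()
    loss.backward()                    # feed the error back...
    optimizer.step()                   # ...and nudge the model

print(f"final training loss: {loss.item():.3f}")
```

The same loop, scaled up to thousands of hours of subtitled BBC footage, is essentially what the Oxford and DeepMind teams relied on.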

A lip reader doesn’t just read lips

While AI lip readers can somewhat accurately predict what people are saying by extracting information from their lip movements, there’s one key element they are not capable of learning: Empathy.

To truly understand what’s being said and convey an accurate message, a person needs to focus on how something is being said, rather than purely focusing on what is being said.

Being able to lip read is a challenging skill that takes dedicated time and an in-depth understanding of human emotions and body language. When you read lips, only about 30% of speech can actually be seen on the lips. The rest is inferred from context, movements of the jaw, cheeks, neck, and the expression of the eyes.

The best lip readers I know use lip reading as their primary means of communication and will only lip read the languages they have a fluent command of.

While AI lip reading certainly has a place in the world of television, it still has a way to go.

Lip reading is a skill that requires mental agility and a good knowledge of the language being spoken.

121 Captions’ lip readers have highly developed skills and a lifetime of experience reading lips. If you need to get the inside scoop on live broadcast events, contact us.
