According to Wired, the "AI simultaneous interpretation fraud" controversy surrounding the well-known speech recognition company iFlytek continues to simmer, drawing fresh attention to AI real-time translation technology.
Figure: Real-time translation was initially limited to Pixel Buds, but it can now be used with any headset that supports Google Assistant
Not long ago, Google quietly updated the Pixel Buds support page to read: "All headphones and Android phones optimized for Google Assistant can now use Google Translate." Previously, the feature had been limited to Pixel Buds paired with a Pixel phone. Although Google did not announce the change with any fanfare, this small adjustment is worth noting.
To understand why, it helps to recall the history of Google's headphones. Google launched its wireless Pixel Buds last year, having hyped the product as a revolutionary tool for real-time translation. Just tap the Pixel Buds and say "Help me," and the Google Translate app opens on your phone; Pixel phones now support the feature as well.
You then speak a sentence, and Google Translate renders it in the target language on your phone, transcribes it, and reads it aloud. In theory, Google's new technology could even make interpreters worry about their jobs. The on-stage demonstration of real-time translation was a huge success, but once the product began shipping, doubts set in: the quality of the translations did not meet public expectations.
The technology website Tech Insider tested the real-time translation feature in ten different languages. It successfully handled basic questions such as "Where is the nearest hospital," but once sentences became more complicated or the speaker had an accent, the translations went wrong. The reviewers concluded that real-time translation felt somewhat overhyped, and that Google Assistant had to strain to understand what was said to it.
Daniel Gleeson, a senior consumer technology analyst, said: "Mastering natural language is very difficult. For Google, it would be a huge achievement, and the day they achieve it they will proudly shout about it." That, some might say, is exactly why the update to the Pixel Buds support page was kept so quiet.
Google's problem is not the translation process itself. In fact, the company has steadily improved its translation applications in recent years. In 2016, Google converted Google Translate into an artificial intelligence (AI) system driven by deep learning. Before that, the tool translated each word separately and applied linguistic rules to keep sentences grammatically correct, producing the stilted, fragmented translations many users will remember. A neural network, by contrast, considers the sentence as a whole and guesses the correct output based on the large body of text it was trained on. Through machine learning, these systems can take a sentence's context into account and deliver more accurate translations.
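To make the contrast concrete, here is a minimal sketch in Python. The word-by-word function mimics the old rule-based approach; the whole-sentence path uses a public pretrained translation model from the Hugging Face transformers library, which stands in, purely for illustration, for a neural system like Google's (it is not Google's actual model).

```python
# Contrast: word-by-word lookup vs. whole-sentence neural translation.
# Illustrative sketch only; the neural model is a public pretrained model,
# not Google's production Translate system.
from transformers import pipeline

# Old approach: translate each word in isolation with a bilingual dictionary,
# leaving grammar to hand-written rules (omitted here).
EN_FR = {"where": "où", "is": "est", "the": "le",
         "nearest": "plus proche", "hospital": "hôpital"}

def word_by_word(sentence: str) -> str:
    # Each token is looked up independently; no sentence-level context.
    return " ".join(EN_FR.get(tok, tok) for tok in sentence.lower().split())

# Neural approach: the model encodes the whole sentence before decoding,
# so word order and agreement emerge from learned context, not rules.
translator = pipeline("translation_en_to_fr")  # downloads a model on first use

sentence = "Where is the nearest hospital"
print(word_by_word(sentence))                        # fragmented, word-for-word output
print(translator(sentence)[0]["translation_text"])   # fluent, whole-sentence output
```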
Integrating machine learning fell to the Google Brain team, Google's department dedicated to deep learning research and development. Google Brain has also applied neural networks to another tool that is key to real-time translation but seems far more error-prone: speech recognition. After hours of voice training, Google Assistant uses machine learning to recognize patterns and, ultimately, to correctly identify the content it is asked to translate.
So if Google has, to some extent, successfully applied neural networks to text-to-text translation, why can't Google Assistant use the same technology to perform speech recognition accurately? According to Matic Horvat, a natural language processing researcher at the University of Cambridge, it all comes down to the data set used to train the neural network.
Horvat said: "These systems adapt to the training data they are given. When you present them with something they have never heard before, the quality of the speech recognition drops. For example, if your training data set is conversational speech, then recognition will not perform well in a busy environment."
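Horvat's point about mismatched training data is easy to demonstrate. The sketch below, again illustrative only, mixes synthetic background noise into a clean recording and transcribes both versions with a public pretrained recognizer; the model and the file name speech.wav are stand-ins, not Google's system.

```python
# Demonstrate training/test mismatch: a recognizer trained largely on clean
# speech degrades when the input is noisy. Illustrative only; the model is
# a public ASR model, not Google's, and 'speech.wav' is a placeholder.
import numpy as np
import soundfile as sf
from transformers import pipeline

asr = pipeline("automatic-speech-recognition")  # downloads a default model

# Load a mono 16 kHz recording (placeholder path).
audio, rate = sf.read("speech.wav")

# Mix in Gaussian noise to simulate a busy environment.
noisy = (audio + np.random.normal(0.0, 0.05, size=audio.shape)).astype(np.float32)

clean_text = asr({"raw": audio.astype(np.float32), "sampling_rate": rate})["text"]
noisy_text = asr({"raw": noisy, "sampling_rate": rate})["text"]

print("clean:", clean_text)  # typically close to what was said
print("noisy:", noisy_text)  # errors climb as the noise level grows
```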
Interference is the nemesis of any computer scientist working to improve speech recognition. Last year, Google backed the London start-up Trint through its 150-million-euro Digital News Innovation Fund; Trint is a leader in automated voice transcription, even though its algorithm differs from Google's. Yet even Trint's algorithm is no better at handling basic interference.
In fact, Trint's website devotes considerable space to explaining how to record speech in a quiet environment. The company claims an error rate of 5 to 10 percent, but makes clear that this applies to recordings made in quiet conditions. Trint CEO Jeff Kofman said: "The biggest challenge is explaining to our users that we can only be as good as the audio they give us. Where there is echo, noise, or even stress, the algorithm will make mistakes."
The challenge posed by live speech means that training is the most costly and time-consuming part of building a neural network. And supporting real-time translation on only a limited set of devices, as Google did with Pixel Buds, does nothing to help the system learn: the more speech it processes, the more data feeds the algorithm, and the better the machine becomes at recognizing unfamiliar speech patterns.
For Gleeson, the senior consumer technology analyst, this is one of the reasons Google has expanded the feature to more hardware. He said: "One of the hardest problems in speech recognition is collecting enough data on specific accents, expressions, and idioms, all of which are highly regional. Offering this feature only on Pixel would never expose Google to that regional data, and it could never process enough of it."
Accumulating data has a downside, however. The best-performing neural networks are those with the most data, but because that data has to be crunched by processors, the computational load grows with the volume of information. Processors of the required power are far from being integrated into mobile devices, so truly real-time voice processing remains out of reach today. In fact, every time Google Assistant is used, the voice recording is sent to a data center for external processing and then returned to the user's phone. None of the computation happens locally, because today's phones cannot hold the enormous models that neural networks need to process speech.
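In outline, that round trip looks like the hypothetical client below. The endpoint URL and response format are invented for illustration and do not correspond to any real Google API.

```python
# Hypothetical sketch of the cloud round trip: the phone records audio,
# ships it to a remote data center for recognition, and waits for the text
# to come back. The endpoint and response schema are invented; this is not
# a real Google API.
import requests

def transcribe_remotely(wav_path: str) -> str:
    with open(wav_path, "rb") as f:
        audio_bytes = f.read()
    # The heavy neural-network inference happens server-side, not on the phone.
    resp = requests.post(
        "https://speech.example.com/v1/recognize",  # placeholder endpoint
        data=audio_bytes,
        headers={"Content-Type": "audio/wav"},
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json()["transcript"]  # assumed response field

# Every utterance pays network latency in both directions, which is why this
# architecture stops short of true real-time recognition.
print(transcribe_remotely("speech.wav"))  # placeholder file
```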
Horvat said that although Google Assistant completes this process fairly quickly, it is still a long way from real-time speech recognition. One of the company's current challenges is how to bring neural network processing onto the phone itself, to make features such as real-time translation more seamless. Developers are already working on small, efficient chips for neural network processing that can be integrated into mobile phones. Earlier this month, for example, Huawei announced an AI chip that the company claims can train neural network algorithms in minutes.
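The kind of on-device inference such chips are built for already has a software analogue. The sketch below runs a compact model locally with TensorFlow Lite; this is an illustrative choice, not a confirmed description of how Google Assistant works, and model.tflite is a placeholder.

```python
# Sketch of local, on-device neural-network inference with TensorFlow Lite.
# Illustrative only: 'model.tflite' is a placeholder file, and this is not
# how Google Assistant is confirmed to work.
import numpy as np
import tensorflow as tf

# Load a compact, pre-converted model small enough for a phone's memory.
interpreter = tf.lite.Interpreter(model_path="model.tflite")
interpreter.allocate_tensors()

input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()

# Dummy input matching the model's expected shape, standing in for audio features.
dummy = np.zeros(input_details[0]["shape"], dtype=input_details[0]["dtype"])
interpreter.set_tensor(input_details[0]["index"], dummy)

# Inference runs entirely on the device: no network round trip, no data center.
interpreter.invoke()
result = interpreter.get_tensor(output_details[0]["index"])
print(result.shape)
```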
Google does have its own chip, the Edge TPU, but it is designed for enterprise users rather than smartphones. For Horvat, this is Google's Achilles' heel: as a software company, Google has little control over manufacturers and cannot guarantee that every Android device will support local neural network processing, a position completely different from Apple's.
In the near future, Google may be forced to take smaller steps to improve its speech recognition technology. Although real-time translation has drawn plenty of criticism, for Neil Shah, an industry analyst and Head of IoT, Mobile and Ecosystem Research at Counterpoint, expanding it will strengthen Google's hand in the competition: "Google has two billion Android users. As more and more of them begin using the latest voice interactions on Android phones, it can scale faster than its competitors and train on an enormous stream of input data."
Gleeson agreed. Even if reviews of real-time translation have carried a tone of mild mockery, Google's move should ultimately bring significant improvements. Like all AI products, this tool needs to learn, and its journey to market is not yet complete. Gleeson said: "People may say Google's real-time translation is not what was promised, but this is the only way for it to get there." Interpreters, then, need not worry about losing their jobs just yet.