In-depth analysis of the development logic behind AIoT

The integration of AI and IoT has been a very hot field in recent years, and both the capital markets and entrepreneurs have shown great enthusiasm for it.

Market opportunities for human-computer interaction in the AIoT field

Since 2017, the term "AIoT" has appeared ever more frequently and has become a buzzword in the Internet of Things industry. "AIoT" stands for "AI + IoT", the integration of artificial intelligence technology with the Internet of Things in practical applications. More and more people who combine AI and IoT see AIoT as the best channel for the intelligent upgrade of major traditional industries, and its rise has become an inevitable trend in the development of the Internet of Things.

In markets built on IoT technology, more and more scenarios involve contact with people (smart homes, autonomous driving, smart healthcare, smart offices, and so on). And wherever people are involved, the need for human-computer interaction inevitably arises. Human-computer interaction refers to the process by which a person and a computer exchange information, in a certain dialogue language and interaction style, to complete a specific task. Its scope is very wide, ranging from a light switch to an aircraft dashboard or a power-plant control room. With the explosion of smart terminal devices, users have raised new requirements for interaction between humans and machines, which has gradually stimulated the AIoT human-computer interaction market.

AIoT development path

Taking the smart home market as an example, industry data put China's smart home market at 180 billion yuan in 2018, growing to a projected 357.6 billion yuan in 2020, and analysts predict that the global smart home market will exceed 500 billion yuan in 2021. In such a rapidly expanding AIoT market, the demand for and prospects of human-computer interaction are beyond doubt.

The digitalization of human life has been under way for about thirty years. In that time we have moved from the analog era to the PC Internet era and then to the mobile Internet era, and we are now evolving toward the Internet of Things era. In terms of interaction, machines are increasingly "accommodating" people: from the keyboard and mouse of the PC era, to the touch screens, NFC, and various MEMS sensors of the mobile era, and on to the booming Internet of Things era, the barrier to interaction methods such as voice and image keeps falling, drawing in more and more users. At the same time, another profound change deserves attention: driven at least in part by this evolution in interaction methods, data of entirely new dimensions is constantly being created and digitized, such as work documents and entertainment programs in the PC era; user habits, location, credit, and currency in the smartphone era; and all kinds of possible new data in the Internet of Things era.

In the Internet of Things era, interaction methods are developing in the direction of ontology interaction. "Ontology interaction" refers to interaction that starts from the human body itself, using the basic ways people interact with one another: voice, vision, movement, touch, and even smell and taste. For example, voice can control home appliances, or an air conditioner can use infrared sensing to decide whether it should cool, combining voice and infrared to control the temperature (when no one is detected in the room, the air conditioner does not react even if "cooling" is mentioned in a TV program).

New data is the nourishment of AI, and this wealth of new-dimensional data is creating infinite possibilities for AIoT.

As for the development path of AIoT, industry professionals generally believe that it will pass through three stages: stand-alone intelligence, interconnected intelligence, and active intelligence.

Stand-alone intelligence means that a smart device waits for the user to initiate an interaction request, and devices are not connected to one another in the process. In this situation, the stand-alone system must accurately perceive, recognize, and understand the user's various instructions, such as voice and gestures, and make correct decisions, execute them, and give feedback. The AIoT industry is currently at this stage. Take home appliances as an example: in the past, appliances were in their "feature phone" era. Like the button-operated mobile phones of old, a button press lowered the temperature or kept your food refrigerated; now appliances have achieved stand-alone intelligence, meaning that voice or a mobile app can remotely adjust the temperature, turn on a fan, and so on.

Smart products that cannot interconnect are merely islands of data and services, far from meeting people's needs. To continuously upgrade and optimize the intelligent scene experience, the first thing to break is the island effect of single-product intelligence. An interconnected smart scene is, in essence, a matrix of interconnected products, so the model of "one brain (cloud or central control), many terminals (sensors)" becomes inevitable. For example, when a user tells the bedroom air conditioner to close the living-room curtains, and the air conditioner is connected to the smart-speaker central control in the living room, the two can negotiate and decide together, after which the speaker closes the living-room curtains. Or, when the user says "sleep mode" to the bedroom air conditioner at night, not only does the air conditioner adjust itself to a temperature suitable for sleep, but the TV, speakers, curtains, and lights in the living room also turn off automatically. This is a typical interconnected-intelligence scene realized through a cloud brain and multiple sensors.
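To make the "one brain, many terminals" model concrete, here is a minimal sketch in Python of a central controller that maps a recognized scene command to actions across several devices. The device names, actions, and the `dispatch` stub are assumptions for illustration only; a real central control would publish these commands over a protocol such as MQTT.

```python
# Minimal sketch of a "one brain, many terminals" scene dispatcher.
# Device names and actions are illustrative assumptions.

SCENES = {
    "sleep mode": [
        ("bedroom_ac", "set_temperature", 26),    # a temperature suited to sleep
        ("living_room_tv", "power", "off"),
        ("living_room_speaker", "power", "off"),
        ("living_room_curtains", "close", None),
        ("living_room_lights", "power", "off"),
    ],
}

def dispatch(device: str, action: str, value=None) -> None:
    # Stand-in for publishing over MQTT or a vendor cloud API.
    suffix = f"({value})" if value is not None else "()"
    print(f"-> {device}: {action}{suffix}")

def handle_scene(utterance: str) -> None:
    """Route one recognized voice command to every device in the scene."""
    for device, action, value in SCENES.get(utterance.strip().lower(), []):
        dispatch(device, action, value)

handle_scene("sleep mode")
```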

Active intelligence means that the intelligent system is on standby at all times, drawing on user behavior preferences, user profiles, environmental information, and more. It has self-learning, self-adapting, and self-improving capabilities, and can proactively provide services suited to the user without waiting to be asked, just like a personal secretary. Imagine such a scene: in the early morning, as the light changes, the curtains slowly open on their own, the speaker plays soothing wake-up music, and the fresh-air system and air conditioner start working. When you begin to wash up, the personal assistant at the washstand automatically reports today's weather and offers dressing suggestions. After you finish, breakfast and coffee are ready. When you walk out of the house, the appliances automatically power down, and they switch back on when you return home.

The realization of AIoT places demands on edge computing capabilities

Edge computing refers to an open platform at the edge of the network, close to the source of things or data, that integrates core capabilities of networking, computing, storage, and applications, and provides intelligent services nearby to meet industry digitalization's key requirements for agile connection, real-time business, data optimization, application intelligence, and security and privacy protection. There is a vivid analogy in the industry: edge computing is like the nerve endings of the human body, which handle simple stimuli by themselves and feed characteristic information back to the cloud brain. As AIoT lands, in the scenario of intelligently connecting all things, devices will interconnect and form a new ecology of data interaction and sharing. In this process, the terminal not only needs more efficient computing power; in most scenarios it must also have local, autonomous decision-making and response capabilities. Take the smart speaker as an example: it needs not only local wake-up but also far-field noise reduction, and for reasons of real-time response and data availability, this computation must happen on the device side rather than in the cloud.
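A back-of-envelope calculation, with assumed but typical figures, shows why the computation belongs on the device: streaming a raw microphone array upstream is costly, and the cloud round trip alone can blow a wake-word latency budget. All numbers below are illustrative assumptions.

```python
# Illustrative figures: a 6-mic array at 16 kHz / 16-bit versus an
# assumed ~100 ms wake-word latency budget. All numbers are assumptions.
mics, sample_rate_hz, bits = 6, 16_000, 16
raw_uplink_bps = mics * sample_rate_hz * bits
print(f"raw uplink: {raw_uplink_bps / 1e6:.2f} Mbit/s, continuously")  # 1.54 Mbit/s

cloud_round_trip_ms = 150   # assumed typical WAN round trip
local_inference_ms = 10     # assumed on-device wake-word inference
budget_ms = 100
print("cloud meets budget:", cloud_round_trip_ms <= budget_ms)   # False
print("local meets budget:", local_inference_ms <= budget_ms)    # True
```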

As the most important landing scenario for AIoT human-computer interaction, the smart home industry is attracting more and more entrants: technology giants such as Apple, Google, and Amazon; traditional appliance makers such as Haier and Samsung; and Internet upstarts such as Xiaomi and JD.com. Under the concept of interconnected intelligence, in the coming AIoT era every device will need certain perception (such as preprocessing), inference, and decision-making functions. Each device side therefore needs a degree of independent computing capability that does not rely on the cloud, that is, the edge computing described above.

In the smart home scenario, interacting with terminal devices through natural speech has become the industry mainstream. Because of the particularities of the home scene, home terminal devices must accurately distinguish and extract genuine user commands (rather than stray keywords that family members utter in conversation), along with information such as sound-source location and voiceprint. Voice interaction in the smart home therefore places higher demands on edge computing, which manifest in the following aspects:

Far-field noise reduction and wake-up

The sound field in a home environment is complex: TV audio, multi-person conversation, children playing, spatial reverberation, and the noise of cooking, washing machines, and other equipment. These sounds, which easily interfere with normal interaction between user and device, are likely to be present at the same time, so the device must process and suppress the various interferences to make the real user's voice stand out. In this process the device needs additional information for auxiliary judgment. One essential function of voice interaction in the home scene is a microphone array recording multiple channels simultaneously; by analyzing the acoustic scene, the device localizes sounds in space more accurately and greatly improves voice quality. Another important function is using voiceprint information to identify the real user, so that his or her voice can be distinguished more clearly from the interference of multiple speakers. All of this must be implemented on the device side and requires substantial computing power.
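To make the microphone-array idea concrete, here is a minimal delay-and-sum beamformer in Python with NumPy. It is a sketch under simplifying assumptions (a uniform linear array, a known direction of arrival, far-field plane waves, delays rounded to whole samples); production devices use adaptive beamformers and run them on a DSP or NPU.

```python
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s in air

def delay_and_sum(channels: np.ndarray, mic_spacing_m: float,
                  doa_deg: float, sample_rate: int) -> np.ndarray:
    """Steer a uniform linear array toward doa_deg and sum the channels.

    channels: (n_mics, n_samples) array of synchronized recordings.
    """
    n_mics, _ = channels.shape
    # Arrival delay of each mic relative to mic 0 for a far-field plane wave.
    delays_s = (np.arange(n_mics) * mic_spacing_m *
                np.cos(np.deg2rad(doa_deg)) / SPEED_OF_SOUND)
    delays_smp = np.round(delays_s * sample_rate).astype(int)
    # Advance each channel by its delay so the target direction adds
    # coherently while off-axis noise adds incoherently and is attenuated
    # (edge wrap-around from np.roll is ignored for brevity).
    aligned = np.stack([np.roll(ch, -d) for ch, d in zip(channels, delays_smp)])
    return aligned.mean(axis=0)

# Example: 4 mics spaced 5 cm apart, a voice arriving from 60 degrees.
fs = 16_000
recordings = np.random.randn(4, fs)        # stand-in for one second of audio
enhanced = delay_and_sum(recordings, 0.05, 60.0, fs)
```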

Local recognition

Local recognition for human-computer interaction in the home is inseparable from edge computing, which is reflected concretely in two aspects:

High-frequency words. Statistics from actual use show that in a given scenario users rely on only a limited set of keyword commands. For in-car systems, for instance, "previous/next track" may be the most frequent; for air conditioners, "on/off". These frequently used commands are called high-frequency words. High-frequency words can be processed locally, without incurring cloud latency, giving users the best experience.

Networking rate. For smart home products, especially home appliances, the rate at which devices actually get connected to the network is a problem. Letting users perceive the power of voice AI even without an Internet connection, and cultivating user habits in the process, is another important role of edge computing at the current stage; the sketch below illustrates both points.
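As a minimal illustration, the Python sketch below matches high-frequency commands against a local table and falls back to the cloud only for long-tail requests, and only when the device is online. The command table and the `cloud_nlu` stub are assumptions for illustration.

```python
# Sketch: handle high-frequency commands locally; fall back to the cloud
# only for long-tail requests, and only when a network is available.
# The command table and the cloud stub are illustrative assumptions.

LOCAL_COMMANDS = {
    "turn on": "power_on",
    "turn off": "power_off",
    "previous track": "prev_track",
    "next track": "next_track",
}

def cloud_nlu(text: str) -> str:
    # Stand-in for a real cloud call (adds latency, needs connectivity).
    return f"cloud intent for {text!r}"

def recognize(utterance: str, online: bool) -> str:
    text = utterance.strip().lower()
    if text in LOCAL_COMMANDS:       # high-frequency word: zero network latency
        return LOCAL_COMMANDS[text]
    if online:                       # long-tail request: use the cloud
        return cloud_nlu(text)
    return "unsupported_offline"     # graceful degradation without a network

print(recognize("Turn on", online=False))           # -> power_on
print(recognize("what's the weather", online=True))
```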

Balance of local/cloud efficiency

In natural-language interaction in the home, if all computation is placed in the cloud, the acoustic computation alone puts great pressure on cloud computing: on the one hand it substantially increases the cost of the cloud platform, and on the other hand it introduces computation delays that damage the user experience. Natural voice interaction divides into two parts, acoustics and natural language processing (NLP); seen from another dimension, these are the "business-independent" part (speech-to-text/acoustic computation) and the "business-related" part (NLP). The business-related part undoubtedly has to be solved in the cloud: if the user asks about the weather or wants to listen to music, then understanding the user's sentence and fetching the weather information must go through the Internet. But for converting the user's speech to text, as with commands such as "turn on the air conditioner" or "raise the temperature", some or even most of the computation can be done locally. In that case, what the device uploads to the cloud is no longer the compressed audio itself but a leaner intermediate result, or even the text itself. With leaner data, cloud computation is simpler and the response is swifter.
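A sketch of this split, with both stages stubbed out: speech-to-text runs on the device, and only the short recognized text (not the much larger audio stream) travels to the cloud for business understanding. The function bodies are placeholders, not a real ASR engine or service API.

```python
# Sketch of the local/cloud split: speech-to-text on the device,
# business understanding in the cloud. Both stages are placeholder stubs.

def local_asr(pcm_audio: bytes) -> str:
    """Business-independent stage: heavy acoustic compute stays on-device."""
    # Placeholder: a real device would run a local acoustic model here.
    return "turn on the air conditioner"

def cloud_nlp(text: str) -> dict:
    """Business-related stage: sentence understanding and service access."""
    # Placeholder for an HTTPS call; only a few dozen bytes of text cross
    # the network instead of a continuous compressed audio stream.
    return {"intent": "ac_power", "value": "on", "source_text": text}

def handle_audio(pcm_audio: bytes) -> dict:
    text = local_asr(pcm_audio)
    return cloud_nlp(text)

print(handle_audio(b"\x00" * 32_000))  # one second of fake 16 kHz / 16-bit audio
```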

Multi-modal demand

So-called multimodal interaction combines several ontology interaction means, fusing multiple senses such as text, voice, vision, movement, and environment. Humans themselves are a typical example of multimodal interaction: in human-to-human communication, expressions, gestures, hugs, touch, and even smell all play irreplaceable roles in the exchange of information. Clearly, human-computer interaction in the smart home cannot stop at a single voice modality; it requires multiple modalities working in parallel. For example, if a smart speaker sees that no one is at home, it need not respond to a wake word accidentally emitted by the TV, and can even put itself to sleep; if a robot senses its owner looking at it, it may proactively greet the owner and ask whether help is needed. Multimodal processing inevitably requires joint analysis and computation over several types of sensor data, including not only one-dimensional voice data but also two-dimensional data such as camera images and thermal images. Processing such data is inseparable from local AI capability, which creates a strong demand for edge computing.
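A minimal sketch of the speaker example above: the wake-word decision is gated by a presence signal from a camera or infrared sensor, so TV audio alone cannot trigger the device. Both detector outputs are stand-ins for real on-device models.

```python
# Sketch: multimodal gating of the wake word. Audio proposes, vision/IR
# confirms, so the wake decision fires only when someone is actually present.
from dataclasses import dataclass

@dataclass
class SensorFrame:
    audio_says_wake_word: bool   # output of a local keyword spotter
    person_present: bool         # output of a camera/IR presence detector

def should_wake(frame: SensorFrame) -> bool:
    # Fuse the two modalities with a simple logical gate.
    return frame.audio_says_wake_word and frame.person_present

# The TV utters the wake word while nobody is home: stay asleep.
print(should_wake(SensorFrame(audio_says_wake_word=True, person_present=False)))  # False
# The owner says it in the room: wake up.
print(should_wake(SensorFrame(audio_says_wake_word=True, person_present=True)))   # True
```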

AI chip demand brought by AIoT

AI algorithms place higher requirements on the parallel computing capability and memory bandwidth of device-side chips. Traditional GPU-based chips can run inference algorithms on the terminal, but their high power consumption and poor cost-effectiveness cannot be ignored. In the AIoT context, endowing IoT devices with AI capability means, on the one hand, completing AI computation (edge computing) while keeping power consumption and cost low; on the other hand, IoT devices, unlike mobile phones, come in ever-changing forms with fragmented demands, and their needs for AI computing power differ, so it is difficult for a single universal chip architecture to span all device forms. Only by starting from IoT scenarios and designing customized chip architectures can performance be greatly improved while power consumption and cost are reduced, meeting both the demand for AI computing power and the demands of diverse device forms.