Applications, Algorithms, and Chips: A "Trinity" Analysis of Speech Recognition

The artificial intelligence industry chain consists of a basic layer, a technical layer, and an application layer, and intelligent speech recognition can be viewed through the same three layers. Driven by the accumulation of large amounts of data, the development of deep neural network models, and the iterative optimization of algorithms, the accuracy of speech recognition has improved continuously in recent years. In October 2016, Microsoft announced that the error rate of its English speech recognition system had dropped to 5.9%, comparable to human performance. At this stage, in ideal (quiet, near-field) environments, the speech recognition systems of many companies have crossed the threshold of practical usability and are widely used in various fields.
This article starts from the commercialization of speech recognition and then discusses the algorithms and the hardware computing power that drive its development, analyzing the current state, trends, and difficulties of speech recognition from these three perspectives (the "trinity" of application, algorithm, and chip).
First, applications
Intelligent voice technology is one of the most mature artificial intelligence application technologies, and voice interaction is naturally intuitive, so it has a huge market. According to the China Speech Industry Alliance's "White Paper on the Development of China's Intelligent Speech Industry in 2015", the global intelligent speech industry was expected to exceed 10 billion US dollars for the first time in 2017, reaching 10.5 billion US dollars, and China's intelligent speech industry was also expected to exceed 10 billion yuan for the first time in 2017, with a five-year compound annual growth rate of more than 60%.
The tech giants are building their own intelligent voice ecosystems: IBM, Microsoft, and Google abroad, and Baidu and iFlytek in China.
Companies such as IBM, Microsoft, and Baidu use combinations of models to continuously improve speech recognition performance. Microsoft achieved recognition accuracy surpassing humans by combining acoustic models based on six different deep neural networks with language models based on four different deep neural networks. iFlytek has achieved practical-level recognition performance with a speech recognition framework based on a deep full-sequence convolutional neural network. Intelligent voice startups such as Yunzhisheng, Jietong Huasheng, and AISpeech are also constantly refining their recognition engines and bringing their own technologies to industry.
Driven by the giants and innovators, voice recognition has been rapidly developing in smart homes, smart vehicles, voice assistants, and robots.
1. Smart home
In the smart home, and especially the smart speaker market, Amazon and Google hold dominant positions, each with its own strengths.
Amazon’s Echo has sold nearly 10 million units, igniting the online smart speaker market. Compared with traditional speakers, the Echo adds functions such as voice-activated music playback, online information queries, and intelligent control of home appliances. However, in intelligent question answering the Echo performed only moderately. Google used this as a breakthrough point, releasing Google Home and taking 23.8% of the smart speaker market share from Amazon. In September 2017, Amazon released several second-generation Echo products. Compared with the previous generation, the sound quality is significantly improved, and the Echo Plus has stronger home-control functions that can automatically discover and control nearby smart home devices.
In China's market for voice-controlled household appliances, such as voice-controlled televisions, air conditioners, and lighting, iFlytek, Yunzhisheng, and ChipIntelli have made in-depth arrangements.
iFlytek has released a smart speaker jointly with JD.com and launched the InFocus TV Assistant in 2016, creating entry-level applications in the smart home sector. Yunzhisheng provides IoT-oriented artificial intelligence technology and, through cooperation with Gree and other companies, has integrated its voice recognition technology into terminal home appliance products; it has also released the "Pandora" voice central-control solution, which can significantly shorten the cycle of making products intelligent. ChipIntelli combines its hardware (the terminal intelligent speech recognition chip CI1006) with its algorithms (a deep learning speech recognition engine) to provide complete offline and online speech recognition solutions, and has laid out broadly across Internet-of-Things applications.
2. Smart car
With the development of intelligent connected vehicles, the penetration rate of in-vehicle networking is expected to exceed 50% in the future. However, because of factors such as safety, in-vehicle intelligence differs greatly from mobile-phone intelligence, and simply copying the mobile-phone experience does not suit in-vehicle usage scenarios. Because voice interaction is naturally hands-free, it is considered the main entry point for human-vehicle interaction in the future.
Baidu launched the intelligent driving assistant CoDriver based on its artificial intelligence ecosystem. iFlytek, in cooperation with Chery and other car manufacturers, launched the Flying Fish car assistant to advance connected-car adoption. Sogou and NavInfo jointly launched Flying Song Navigation. Yunzhisheng and AISpeech have launched a number of voice-controlled in-vehicle products for applications such as navigation and head-up displays. Mobvoi has entered the smart-car market with its own "magic mirror" smart rearview product.
The commercialization of voice recognition requires coordinated support in content, algorithms, and other areas, but a good user experience is the first requirement of a commercial application, and the recognition algorithm is the core factor in improving that experience. The following discusses speech recognition technology from three aspects: the development path of speech recognition algorithms, the current state of those algorithms, and frontier research.
Second, algorithms
For a speech recognition system, the first step is to detect whether there is speech input, that is, voice activity detection (VAD). In low-power designs, VAD adopts an always-on working mechanism, in contrast to the rest of the recognition system, and wakes up the subsequent recognition pipeline when it detects speech. The overall recognition pipeline is shown in Figure 2 and mainly includes feature extraction, recognition modeling and model training, and decoding.
1. VAD (voice activity detection)
VAD determines when there is voice input and when there is silence. Subsequent speech recognition is performed only on the valid segments captured by the VAD, which reduces both false recognitions caused by noise and the system's power consumption. In a near-field environment, the speech signal attenuates little and the signal-to-noise ratio (SNR) is relatively high, so simple features (such as the zero-crossing rate and signal energy) are sufficient for activity detection. In a far-field environment, however, the speech signal travels a long distance and attenuates severely, so the SNR of the data collected by the microphone is very low and the simple methods become ineffective. Using a deep neural network (DNN) for activity detection is a common approach in deep-learning speech recognition systems, where activity detection is treated as a classification problem. MIT's intelligent speech recognition chip uses a simplified DNN for VAD and performs well even under relatively heavy noise. In more complex far-field environments, however, VAD remains a focus of future research.
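As a rough illustration of the simple near-field approach mentioned above, the following sketch (plain NumPy; the frame length and threshold are hypothetical, untuned values) computes short-time energy and the zero-crossing rate per frame and marks a frame as speech when its energy exceeds a threshold:

```python
import numpy as np

def frame_features(signal, frame_len=400):
    """Short-time energy and zero-crossing rate per frame.
    frame_len=400 is illustrative (25 ms at 16 kHz)."""
    feats = []
    for start in range(0, len(signal) - frame_len + 1, frame_len):
        frame = signal[start:start + frame_len]
        energy = float(np.mean(frame ** 2))                      # short-time energy
        zcr = float(np.mean(np.abs(np.diff(np.sign(frame)))) / 2)  # zero-crossing rate
        feats.append((energy, zcr))
    return feats

def simple_vad(signal, energy_thresh=1e-3):
    """Toy near-field VAD: a frame counts as speech when its energy
    exceeds a fixed threshold (threshold value is illustrative only)."""
    return [energy > energy_thresh for energy, _ in frame_features(signal)]
```

A DNN-based VAD, as used in far-field systems, would instead feed such per-frame features (or spectra) into a small classifier that outputs speech/non-speech posteriors.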
2. Feature extraction
The Mel-frequency cepstral coefficient (MFCC) is the most commonly used speech feature; the mel frequency scale is derived from characteristics of human hearing. MFCC extraction mainly consists of pre-emphasis, framing, windowing, the fast Fourier transform (FFT), a mel filter bank, and the discrete cosine transform, with the FFT and the mel filter bank being the most important parts. However, recent studies have shown that mel filter banks are not necessarily optimal for speech recognition. Deep neural network models such as the restricted Boltzmann machine (RBM), the convolutional neural network (CNN), and the CNN-LSTM-DNN (CLDNN) have been used as learned filters in place of mel filter banks to extract speech features automatically, with good results.
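To make the stages listed above concrete, here is a minimal NumPy sketch of the MFCC pipeline (pre-emphasis, framing, windowing, FFT, mel filter bank, DCT). The frame sizes, filter counts, and coefficient counts are typical illustrative choices, not values taken from this article:

```python
import numpy as np
from scipy.fftpack import dct

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sr):
    """Triangular mel filters spanning 0 .. sr/2."""
    mel_points = np.linspace(hz_to_mel(0), hz_to_mel(sr / 2), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_points) / sr).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        left, center, right = bins[i - 1], bins[i], bins[i + 1]
        for k in range(left, center):
            fbank[i - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):
            fbank[i - 1, k] = (right - k) / max(right - center, 1)
    return fbank

def mfcc(signal, sr=16000, frame_len=400, hop=160, n_fft=512,
         n_filters=26, n_ceps=13, pre_emph=0.97):
    # 1) pre-emphasis
    sig = np.append(signal[0], signal[1:] - pre_emph * signal[:-1])
    # 2) framing and 3) Hamming windowing
    frames = np.array([sig[s:s + frame_len] * np.hamming(frame_len)
                       for s in range(0, len(sig) - frame_len + 1, hop)])
    # 4) FFT -> power spectrum
    power = (np.abs(np.fft.rfft(frames, n_fft)) ** 2) / n_fft
    # 5) mel filter bank + log compression
    log_mel = np.log(power @ mel_filterbank(n_filters, n_fft, sr).T + 1e-10)
    # 6) DCT, keep the lowest cepstral coefficients
    return dct(log_mel, type=2, axis=1, norm='ortho')[:, :n_ceps]
```

Learned front-ends such as CLDNN filters replace steps 5 and 6 with trainable layers operating on the raw or lightly processed signal.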
It has been demonstrated that CLDNN-based feature extraction has a clear performance advantage over log-mel filter banks. The CLDNN-based process can be summarized in three steps: convolution along the time axis, pooling, and feeding the pooled signal into the CLDNN.
In the field of far-field speech recognition, microphone array beamforming is still the dominant method due to problems such as strong noise and reverberation.
In addition, at the present stage, the beamforming method based on deep learning has also achieved numerous research results in automatic feature extraction.
3. Recognition modeling
Speech recognition is essentially the process of converting an audio sequence into a text sequence: given a speech input, find the most probable text sequence. By Bayes' rule, the speech recognition problem can be decomposed into the conditional probability of the speech given a text sequence and the prior probability of that text sequence. The model of the conditional probability is the acoustic model; the model of the prior probability of the text sequence is the language model.
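Written out, this decomposition is the standard formulation, where O denotes the observed acoustic sequence and W a candidate word sequence:

```latex
\hat{W} = \arg\max_{W} P(W \mid O)
        = \arg\max_{W} \frac{P(O \mid W)\,P(W)}{P(O)}
        = \arg\max_{W} \underbrace{P(O \mid W)}_{\text{acoustic model}} \;\, \underbrace{P(W)}_{\text{language model}}
```

The denominator P(O) does not depend on W and can be dropped from the maximization.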
3.1 Acoustic model
The acoustic model converts speech into an acoustic representation, that is, it gives the probability of the observed speech given an acoustic symbol. The most direct choice of acoustic symbol is the word or phrase, but it is difficult to train a good word-level model when training data are insufficient. Words are produced as continuous sequences of phoneme pronunciations, and phonemes are both clearly defined and limited in number. Therefore, in speech recognition the acoustic model is usually factored into a model from the speech sequence to a pronunciation (phoneme) sequence, plus a pronunciation dictionary that maps pronunciation sequences to output text sequences.
It should be noted that, because of the continuity of the human vocal organs' movement and language-specific co-articulation habits, the pronunciation of a phoneme is affected by the phonemes before and after it. To distinguish phonemes in different contexts, triphones, which take the preceding and following phoneme of each phoneme into account, are commonly used as the modeling unit.
In addition, in the acoustic model a triphone can be decomposed into smaller units called states; typically one triphone corresponds to three states. This, however, causes an exponential increase in the number of model parameters. The common solution is to cluster the triphone states with a decision tree and then use the clustering results (tied states) as the classification targets.
At this point, speech recognition reduces to classifying each frame into a state. The most commonly used acoustic modeling framework is the hidden Markov model (HMM): the state is a hidden variable, the speech is the observation, and the transitions between states satisfy the Markov assumption. The state transition probabilities effectively give state durations a geometric distribution, and the observation probability of each hidden state is classically modeled with a Gaussian mixture model (GMM). With the development of deep learning, models such as the deep neural network (DNN), convolutional neural network (CNN), and recurrent neural network (RNN) have been applied to modeling the observation probabilities and have achieved very good results. The following sections give the principle of each model, the problems it solves and its limitations, and show how those limitations drove the development of subsequent modeling methods.
1) Gaussian Mixture Model (GMM)
The observation probability density function is modeled by a Gaussian mixture. Training iteratively optimizes the mixture weights and the mean and variance of each Gaussian component. GMMs train quickly, and GMM acoustic models are small enough to be easily embedded in terminal devices. For a long time, the GMM-HMM hybrid model was the best-performing speech recognition model. However, a GMM cannot use contextual information, and its modeling capability is limited.
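As a minimal illustration of the GMM observation model described above (diagonal covariances assumed for simplicity; the parameters are hypothetical and would in practice be estimated during HMM-GMM training), the log-likelihood of a feature vector under a mixture is:

```python
import numpy as np
from scipy.stats import multivariate_normal

def gmm_log_likelihood(x, weights, means, variances):
    """Log p(x) under a diagonal-covariance Gaussian mixture.
    weights: (K,), means/variances: (K, D) -- illustrative parameters."""
    component_ll = [
        np.log(w) + multivariate_normal.logpdf(x, mean=m, cov=np.diag(v))
        for w, m, v in zip(weights, means, variances)
    ]
    # log-sum-exp over mixture components for numerical stability
    m = np.max(component_ll)
    return m + np.log(np.sum(np.exp(np.array(component_ll) - m)))

# Example with a hypothetical 2-component, 3-dimensional mixture
x = np.zeros(3)
print(gmm_log_likelihood(x, weights=[0.6, 0.4],
                         means=np.zeros((2, 3)), variances=np.ones((2, 3))))
```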
2) Deep Neural Network (DNN)
The DNN was the earliest neural network used for acoustic modeling; it overcomes the limited representational efficiency of Gaussian mixture models. In speech recognition, the DNN-HMM hybrid model greatly improved the recognition rate, and thanks to its relatively modest training cost and high recognition rate it is still a common acoustic model in industry. It should be noted that, because this modeling method requires inputs of fixed length, the DNN model uses a fixed-length sliding window to extract features.
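The fixed-length sliding window mentioned above can be sketched as follows: context frames are spliced around each centre frame and fed to a feed-forward network that predicts HMM-state posteriors. The layer sizes, context width, and number of tied states are illustrative placeholders, not a production DNN-HMM recipe:

```python
import numpy as np
import torch
import torch.nn as nn

def stack_context(features, left=5, right=5):
    """Splice each frame with +/- context frames (fixed-length window)."""
    T, D = features.shape
    padded = np.pad(features, ((left, right), (0, 0)), mode='edge')
    return np.stack([padded[t:t + left + right + 1].reshape(-1) for t in range(T)])

class DnnAcousticModel(nn.Module):
    """Feed-forward DNN mapping a spliced feature window to tied-state posteriors."""
    def __init__(self, input_dim, num_states, hidden=1024, layers=4):
        super().__init__()
        dims = [input_dim] + [hidden] * layers
        blocks = []
        for i in range(layers):
            blocks += [nn.Linear(dims[i], dims[i + 1]), nn.ReLU()]
        self.body = nn.Sequential(*blocks)
        self.out = nn.Linear(hidden, num_states)

    def forward(self, x):
        return self.out(self.body(x))   # logits over HMM states

# Usage: 13-dim MFCCs with +/-5 frames of context -> 143-dim input per frame
feats = np.random.randn(200, 13).astype(np.float32)
window = torch.from_numpy(stack_context(feats))
logits = DnnAcousticModel(input_dim=window.shape[1], num_states=3000)(window)
```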
3) Recurrent neural network (RNN)/convolutional neural network (CNN) model
For different phonemes and speaking rates, the optimal length of the context window differs. RNNs and CNNs, which can effectively use variable-length context information, can therefore achieve better recognition performance; in particular, CNN/RNN models are more robust to speaking-rate variation than DNNs.
For RNN-based modeling, the models used in speech recognition include the multi-layer long short-term memory network (LSTM), the highway LSTM, the residual LSTM, the bidirectional LSTM, and the latency-controlled bidirectional LSTM.
The LSTM, built around gating, can exploit both long- and short-range temporal information and achieves very good performance in speech recognition. Recognition performance can be further improved by adding layers, but simply stacking more LSTM layers makes training difficult and causes vanishing gradients.
The highway LSTM adds a gated direct connection between memory cells in adjacent LSTM layers, providing a direct, unattenuated path for information flow between layers and thereby alleviating the vanishing-gradient problem.
The residual LSTM provides shortcut connections between LSTM layers and likewise addresses the vanishing-gradient problem.
The bidirectional LSTM can use both past and future context, so its recognition performance is better than that of the unidirectional LSTM. However, because it uses future information, a recognition system based on a bidirectional LSTM must observe a complete utterance before it can produce output, which makes it unsuitable for real-time speech recognition.
The latency-controlled bidirectional LSTM limits the amount of future context seen by the backward LSTM, achieving a compromise between accuracy and real-time behaviour, and can be applied to real-time speech recognition systems.
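A minimal PyTorch sketch of an LSTM acoustic model illustrating the trade-off discussed above: setting bidirectional=True uses future context and tends to improve accuracy but requires the whole utterance before emitting output, while the unidirectional variant is streamable. The hyperparameters are illustrative:

```python
import torch
import torch.nn as nn

class LstmAcousticModel(nn.Module):
    """Multi-layer LSTM over frame features; bidirectional=True exploits
    future context (better accuracy, not streamable), False is streamable."""
    def __init__(self, feat_dim=40, hidden=512, layers=3,
                 num_states=3000, bidirectional=False):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden, num_layers=layers,
                            batch_first=True, bidirectional=bidirectional)
        out_dim = hidden * (2 if bidirectional else 1)
        self.proj = nn.Linear(out_dim, num_states)

    def forward(self, x):               # x: (batch, time, feat_dim)
        h, _ = self.lstm(x)
        return self.proj(h)             # per-frame state logits

# Offline (bidirectional) vs. streaming (unidirectional) variants
offline = LstmAcousticModel(bidirectional=True)
streaming = LstmAcousticModel(bidirectional=False)
logits = streaming(torch.randn(1, 200, 40))
```

A latency-controlled bidirectional model would additionally process the utterance in chunks so that the backward pass only sees a bounded amount of future context.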
On the CNN side, modeling approaches include the time-delay neural network (TDNN), CNN-DNN, CNN-LSTM-DNN (CLDNN), CNN-DNN-LSTM (CDL), deep CNN, layer-wise context expansion with attention (LACE) CNN, and dilated CNN.
The TDNN was the earliest CNN-style model used for speech recognition; it convolves along both the frequency axis and the time axis, so it can use variable-length context information. TDNNs are used for speech recognition in two ways. Used alone, a TDNN is difficult to apply to large-vocabulary continuous speech recognition (LVCSR), because variable-length utterances and variable-length context are two different things: LVCSR requires handling variable-length utterances, while a TDNN alone can only handle variable-length context. In the second case, the TDNN-HMM hybrid model, the HMM handles the variable-length utterance problem, so the combined model can handle LVCSR effectively.
The CNN-DNN adds one or two convolutional layers before the DNN to improve robustness to differences between speakers' vocal tracts; compared with a pure DNN, it gives a modest performance improvement (around 5%).
In CLDNN and CDL, the CNN only handles variation along the frequency axis, while the LSTM is used to exploit variable-length context information.
In the deep CNN, "deep" means on the order of a hundred or more layers. The spectrogram can be viewed as an image with specific patterns; by using relatively small convolution kernels and more layers, long-range correlations along the time and frequency axes can be exploited. Deep CNN modeling performance is comparable to that of the bidirectional LSTM, but the deep CNN has no latency problem, so with computational cost under control it can be applied to real-time systems.
Layer-wise context expansion with attention (LACE) CNNs and dilated CNNs were proposed because deep CNN computation is relatively expensive. Both reduce the amount of computation: the entire utterance is treated as a single input map so that intermediate results can be reused, and the stride of each layer of the LACE CNN or dilated CNN is designed so that the convolution kernels cover the required context at lower cost.
The application environments of speech recognition are often complicated, and combining models that can each handle different situations is the common modeling approach in industry and academia, because each single model has its limitations: the HMM can handle variable-length utterances, the CNN can handle channel variability, and the RNN/CNN can handle variable-length context. In acoustic modeling, hybrid models are therefore the mainstream method, since they combine the advantages of each model.
3.2 Language Model
The most common language model in speech recognition is the N-gram model. In recent years, deep neural network modeling has also been applied to language models, for example CNN- and RNN-based language models.
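As a concrete illustration of the N-gram idea, here is a toy bigram model with add-one smoothing. The corpus and smoothing choice are illustrative only; real systems use larger orders and techniques such as Kneser-Ney smoothing:

```python
from collections import Counter

class BigramLM:
    """Toy bigram language model with add-one (Laplace) smoothing."""
    def __init__(self, sentences):
        self.unigrams = Counter()
        self.bigrams = Counter()
        for sent in sentences:
            tokens = ["<s>"] + sent.split() + ["</s>"]
            self.unigrams.update(tokens)
            self.bigrams.update(zip(tokens, tokens[1:]))
        self.vocab_size = len(self.unigrams)

    def prob(self, prev, word):
        # P(word | prev) with add-one smoothing
        return (self.bigrams[(prev, word)] + 1) / (self.unigrams[prev] + self.vocab_size)

    def sentence_prob(self, sentence):
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        p = 1.0
        for prev, word in zip(tokens, tokens[1:]):
            p *= self.prob(prev, word)
        return p

lm = BigramLM(["turn on the light", "turn off the light", "play some music"])
print(lm.sentence_prob("turn on the light"))
```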
4. End-to-end speech recognition system
In DNN-HMM or CNN/RNN-HMM models, the DNN/CNN/RNN and the HMM are optimized separately. However, speech recognition is essentially a sequence recognition problem, and if all components of the model can be optimized jointly, better recognition accuracy is likely to be obtained. This can also be seen from the mathematical formulation of speech recognition (the expression obtained after applying Bayes' rule), so end-to-end processing has also been introduced into speech recognition systems.
4.1 The CTC criterion
The core idea of connectionist temporal classification (CTC) is to introduce a blank label and then perform sequence-to-sequence mapping using the forward-backward algorithm. CTC-based models can be divided into character-based CTC, CTC based on other output units, and word-based CTC. Because CTC directly predicts characters, words, and so on rather than phonemes, it can remove the need for expert pronunciation dictionaries in speech recognition. However, for non-word-based CTC a language model and a decoder are still needed, so character-based CTC and CTC over other output units are not purely end-to-end systems. In contrast, the word-based CTC model is a purely end-to-end speech recognition system.
Using the word-based CTC criterion, a model with 100,000 words as output targets, trained on 125,000 hours of speech to map speech sequences directly to word sequences, has been shown to surpass phoneme-based models. However, the word-based CTC model is difficult to train and converges slowly.
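The CTC criterion (blank label plus forward-backward marginalisation over alignments) is available as a standard loss in PyTorch. The sketch below shows character-based CTC training for one batch; the vocabulary, encoder, and tensor shapes are illustrative placeholders, not the systems cited above:

```python
import torch
import torch.nn as nn

# Illustrative character vocabulary; index 0 is the CTC blank label
vocab = ["<blank>", " ", "a", "b", "c"]          # ... rest of the alphabet
num_classes = len(vocab)

# Any per-frame encoder works here; a small LSTM stands in for it
encoder = nn.LSTM(input_size=40, hidden_size=128, batch_first=True)
classifier = nn.Linear(128, num_classes)
ctc_loss = nn.CTCLoss(blank=0, zero_infinity=True)

feats = torch.randn(2, 150, 40)                  # (batch, time, feature)
h, _ = encoder(feats)
log_probs = classifier(h).log_softmax(dim=-1)    # (batch, time, classes)

# CTCLoss expects (time, batch, classes) plus sequence lengths
targets = torch.tensor([2, 3, 2, 4, 2])          # concatenated label indices
input_lengths = torch.tensor([150, 150])
target_lengths = torch.tensor([3, 2])
loss = ctc_loss(log_probs.transpose(0, 1), targets, input_lengths, target_lengths)
loss.backward()
```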
4.2 Attention-based model
Compared with the CTC criterion, the attention-based model does not require the frame-independence assumption, which is one of its great advantages, so it may achieve better recognition performance. However, attention-based models are harder to train than CTC: the attention mechanism does not naturally enforce a monotonic left-to-right alignment, and convergence is slower. By using the CTC objective as an auxiliary cost function, attention training and CTC training can be combined in a multi-task learning framework. This training strategy greatly improves the convergence of the attention-based model and alleviates the alignment problem.
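The multi-task strategy described above amounts to a weighted sum of the two objectives. The sketch below assumes a model that exposes both a CTC branch and an attention-decoder branch (hypothetical interfaces), with lambda as the interpolation weight:

```python
import torch

def joint_ctc_attention_loss(ctc_loss_value, attention_loss_value, lam=0.3):
    """Multi-task objective: L = lam * L_CTC + (1 - lam) * L_attention.
    lam is a tunable interpolation weight (an illustrative value is used here)."""
    return lam * ctc_loss_value + (1.0 - lam) * attention_loss_value

# Hypothetical usage with per-batch branch losses computed elsewhere:
# loss = joint_ctc_attention_loss(ctc_branch_loss, attention_branch_loss, lam=0.3)
# loss.backward()
```

The CTC term encourages monotonic alignment early in training, which is what speeds up convergence of the attention branch.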
Deep learning has played a key role in the development of speech recognition, and acoustic models have followed a path from DNN to LSTM to end-to-end modeling. One of the biggest advantages of deep learning is feature representation: under noise, reverberation, and similar conditions, deep learning can treat noise and reverberation as new features and achieve satisfactory recognition performance by learning from noisy and reverberant data. At present, end-to-end modeling is a key research direction for acoustic modeling, but it has not yet shown a clear performance advantage over other methods. How to improve training speed and performance within the end-to-end framework, and how to solve its convergence problems, are important research directions for acoustic modeling.
5. Decoding
Given the trained acoustic model, combined with a dictionary and a language model, the process of recognizing an input sequence of speech frames is decoding. Traditional decoding compiles the acoustic model, the dictionary, and the language model into a single network; decoding then selects one or more optimal paths in this dynamic network space as the recognition result (the optimal output character sequence), based on the maximum posterior probability. The most commonly used search method is the Viterbi algorithm. For end-to-end speech recognition systems, the simplest decoding method is beam search.
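For the end-to-end case, a greatly simplified beam search over per-frame label posteriors might look like the following sketch. It deliberately ignores language-model fusion and proper CTC prefix merging, which real decoders need; the beam width and inputs are illustrative:

```python
import numpy as np

def beam_search(log_probs, beam_width=4):
    """Simplified beam search over per-frame log-posteriors.
    log_probs: (T, C) array; returns the highest-scoring label sequence.
    Real CTC decoders additionally merge prefixes that collapse to the same
    string and can fuse a language-model score at each step."""
    beams = [((), 0.0)]                           # (label sequence, log score)
    for frame in log_probs:
        candidates = []
        for seq, score in beams:
            for label, lp in enumerate(frame):
                candidates.append((seq + (label,), score + lp))
        candidates.sort(key=lambda item: item[1], reverse=True)
        beams = candidates[:beam_width]           # keep the best hypotheses
    return beams[0]

best_seq, best_score = beam_search(np.log(np.random.dirichlet(np.ones(5), size=20)))
```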
6. Solutions for far-field complex environments
At present, in quiet near-field environments, speech recognition achieves very good results. However, in environments with heavy noise, multiple speakers, or strong accents, and especially in far-field environments, many problems remain to be solved. Commonly used remedies include speech model adaptation, speech enhancement and separation, and recognition model optimization.
6.1 Speech Enhancement and Separation
In far-field environments, the voice input signal attenuates severely. To enhance the speech signal, microphone-array beamforming is commonly used; for example, Google Home adopts a two-microphone design, while Amazon Echo adopts a 6+1 microphone array. In recent years, deep learning has also been applied to speech enhancement and separation. The core idea is to turn enhancement and separation into a supervised learning problem, that is, the problem of predicting the underlying sound source from the input. Some studies have used a DNN instead of beamforming for speech enhancement and achieved good results in certain scenarios, but in environments with heavy background noise there is still much room for improvement.
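One common way to cast enhancement as supervised learning, in the spirit described above, is to predict a time-frequency mask that is applied to the noisy spectrogram. The sketch below shows the inference side with a placeholder network; the STFT parameters and the model are illustrative assumptions, not a specific published system:

```python
import torch
import torch.nn as nn

class MaskEstimator(nn.Module):
    """Predicts a ratio mask in [0, 1] for each time-frequency bin of the
    noisy magnitude spectrogram; trained on clean/noisy pairs."""
    def __init__(self, n_bins=257, hidden=256):
        super().__init__()
        self.rnn = nn.LSTM(n_bins, hidden, batch_first=True, bidirectional=True)
        self.out = nn.Sequential(nn.Linear(2 * hidden, n_bins), nn.Sigmoid())

    def forward(self, noisy_mag):                 # (batch, time, freq)
        h, _ = self.rnn(noisy_mag)
        return self.out(h)

def enhance(noisy_wave, model, n_fft=512, hop=128):
    window = torch.hann_window(n_fft)
    spec = torch.stft(noisy_wave, n_fft, hop_length=hop, window=window,
                      return_complex=True)        # (freq, time)
    mag = spec.abs().transpose(0, 1).unsqueeze(0) # (1, time, freq)
    mask = model(mag).squeeze(0).transpose(0, 1)  # back to (freq, time)
    return torch.istft(spec * mask, n_fft, hop_length=hop, window=window)

clean_estimate = enhance(torch.randn(16000), MaskEstimator())
```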
When several people are speaking, performing recognition without first separating the input signal gives poor results. Beamforming is a good solution when the speakers are far apart, but when they are close together its separation performance also degrades. To avoid the scene dependence of beamforming, traditional methods mostly attempt to solve the problem with a single channel; common algorithms include computational auditory scene analysis, non-negative matrix factorization, and deep clustering. However, these methods work well only when the interfering signal (anything other than the target source) has characteristics clearly different from the target source signal; otherwise their separation performance is mediocre. In 2016, Dr. Dong Yu proposed permutation invariant training, which elegantly addresses this problem and achieves good results.
6.2 Speech Model Adaptation
A large and rich set of data (which can provide more information) is the most straightforward way to enhance the model's generalization ability.
For reasons of cost and training time, only limited adaptation data are generally available. In this case, adding a Kullback-Leibler divergence regularization term to the training objective is a very effective way to handle model adaptation.
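One way to express the KL-regularised adaptation objective mentioned above is sketched below: the adapted model's output distribution is pulled toward the speaker-independent model's outputs, with rho controlling how strongly. The interfaces and the weight value are illustrative assumptions, not the exact formulation of any particular paper:

```python
import torch
import torch.nn.functional as F

def kld_adaptation_loss(adapted_logits, targets, si_logits, rho=0.5):
    """Cross-entropy on the adaptation data plus a KL term that keeps the
    adapted model close to the speaker-independent (SI) model.
    rho in [0, 1]: larger values trust the SI model more."""
    ce = F.cross_entropy(adapted_logits, targets)
    kld = F.kl_div(F.log_softmax(adapted_logits, dim=-1),
                   F.softmax(si_logits, dim=-1),
                   reduction='batchmean')
    return (1.0 - rho) * ce + rho * kld
```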
Besides adding a regularization term, characterizing speaker characteristics with a very small number of parameters is another adaptation approach. It includes singular value decomposition (SVD) bottleneck adaptation, which decomposes a full-rank matrix into two low-rank matrices to reduce the number of trainable parameters, and subspace methods. Subspace methods include:
1. Adding auxiliary features such as i-vectors, speaker codes, and noise estimates to the input space and to each layer of the deep network;
2. Cluster adaptive training (CAT);
3. Factorized hidden layers (FHL). Compared with CAT, FHL requires only a small amount of adaptation data, because the FHL bases are rank-1 matrices while the CAT bases are full-rank matrices; with the same number of bases, CAT therefore requires more training data.
Real-time performance is one of the most important concerns in speech recognition applications, since it directly affects the user experience. The real-time performance of speech recognition can be improved by reducing computation time and by improving the capability of the recognition hardware.
7. Reducing computation time
SVD: based on singular value decomposition, the full-rank matrix is decomposed into two low-rank matrices, reducing the parameters of the deep model without degrading recognition performance (see the sketch after this list);
Model compression: using vector quantization or very-low-bit quantization algorithms;
Changing the model structure: mainly for LSTMs, adding a linear projection layer inside the LSTM to reduce the output dimension of the original LSTM and thus the computation time;
Using cross-frame correlation to reduce how often the deep network is evaluated: for DNNs or CNNs, this can be done with a frame-skipping strategy in which the acoustic score is computed only every few frames and, at decoding time, the score is copied to the frames for which no acoustic score was evaluated.
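As referenced in the first item of this list, here is a sketch of SVD-based compression of a trained weight matrix: the full-rank matrix is factorised and truncated to rank k, replacing one large layer with two smaller ones. The matrix size and rank are illustrative, and in practice the factorised model is fine-tuned afterwards:

```python
import numpy as np

def svd_compress(weight, rank):
    """Approximate an (m x n) weight matrix by two low-rank factors
    A (m x k) and B (k x n), reducing parameters from m*n to k*(m+n)."""
    u, s, vt = np.linalg.svd(weight, full_matrices=False)
    a = u[:, :rank] * s[:rank]          # absorb singular values into A
    b = vt[:rank, :]
    return a, b

w = np.random.randn(2048, 2048)
a, b = svd_compress(w, rank=256)
print(w.size, a.size + b.size)                         # 4,194,304 -> 1,048,576 parameters
print(np.linalg.norm(w - a @ b) / np.linalg.norm(w))   # relative approximation error
```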
In addition, improving the computing power of the recognition-stage hardware and developing dedicated speech recognition chips are of great significance for improving the real-time performance of speech recognition. This is discussed below.
Third, chips
The continuous accumulation of high-quality big data and advances in deep learning algorithms are the keys to the continuous improvement of speech recognition performance, and the core processing chips of the basic layer are the key elements supporting massive training data, complex deep network modeling methods, and real-time inference. Speech recognition includes two parts: training and recognition (recognizing input speech given a trained model).
In the training stage, because the amounts of data and computation are so large, a traditional CPU or any single processor can hardly complete a model training task alone. (In its early stage, the Google Brain speech recognition project used 16,000 CPUs and 75 days to train a deep neural network model with 156M parameters.) The reason is that a CPU has only a small number of arithmetic/logic units and executes instructions serially one by one, so its computing power is insufficient. Developing chips with high computing power has therefore become a trend in speech recognition, and in artificial intelligence hardware generally.
Unlike CPUs, GPUs have a large number of computing units and are therefore particularly suitable for massively parallel computing. In addition, FPGAs, TPUs, and other ASICs beyond the traditional architecture are also widely used for large-scale parallel computing. In essence, these chips represent different trade-offs between computational performance and flexibility/generality, as shown in Figure 3: CPUs and GPUs are general-purpose processors, DSPs are application-specific processors (ASPs), TPUs are ASICs, and FPGAs are configurable hardware.
In addition, driven by the requirements of real-time operation, low power consumption, and high computing power, dedicated speech recognition AI chips that handle the large number of matrix operations in the recognition stage and accelerate the computation will be the mainstream of the terminal voice recognition chip market in the future.
1. Cloud scenarios
Because of the large amounts of computation and training data, and the large amount of parallelism involved, the model training part of speech recognition is currently carried out almost entirely in the cloud. For cloud training, Nvidia's GPUs dominate the market, and multi-GPU parallel architectures are the common infrastructure solution for such training. In addition, Google uses TPUs for both training and recognition in its artificial intelligence ecosystem.
At the current stage, the recognition part of most speech recognition companies' systems also runs in the cloud, as with Google Home, Amazon Echo, and, in China, services from iFlytek and China Telecom. For cloud recognition, although GPUs are used, they are not the optimal solution; instead the respective advantages of the CPU, GPU, and FPGA are combined in heterogeneous computing solutions (CPU + GPU + FPGA/ASIC).
2. Terminal scenarios
In applications such as smart homes there are high requirements for real-time performance, stability, and privacy. Because of concerns about cloud data-processing capacity, network latency, and data security, edge computing, which pushes computation down into the terminal hardware, has developed rapidly. Offline speech recognition is a form of edge intelligence based on edge computing, and we believe offline and online recognition will coexist as development routes for speech recognition. For terminal offline recognition, the trained model must be stored on the chip; given a speech input, the engine invokes the model to complete recognition. The two key factors for terminal voice recognition are real-time performance and cost: real-time performance affects the user experience, and cost determines how widely voice recognition can be applied.
Because deep neural networks have clear performance advantages in speech recognition, they are currently the mainstream modeling method. However, neural network models generally have very many parameters, and recognition involves a large number of matrix computations. An ordinary DSP or CPU needs too much time to process them and cannot meet the real-time requirements of speech recognition, while the price of GPUs and FPGAs is a major obstacle to their large-scale use in terminal voice recognition. Considering that terminal application scenarios are relatively fixed and demand high computational performance, developing dedicated speech recognition chips is the development trend for terminal speech recognition hardware.
ChipIntelli: established in Chengdu in November 2015. In June 2016 it introduced CI1006, billed as the world's first speech recognition chip based on artificial intelligence. The chip integrates neural-network acceleration hardware and can perform single-chip, offline, local, large-vocabulary recognition, with a recognition rate significantly higher than traditional terminal speech recognition solutions. In addition, ChipIntelli can provide an inexpensive single-microphone far-field speech recognition module whose practical recognition performance is comparable to dual-microphone modules that use Conexant noise-reduction chips, significantly reducing the cost of a far-field speech recognition module. ChipIntelli has thus achieved a significant technological and first-mover advantage in terminal speech recognition chips.
MIT project: the chip presented in the paper MIT published at ISSCC 2017. The chip supports a DNN computing architecture, performs high-performance data-parallel computation, and can achieve single-chip offline recognition of thousands of words.
Yunzhisheng: Yunzhisheng is committed to building a "cloud-chip" voice ecosystem and service system. It recently obtained an investment of 300 million yuan and will devote part of the funds to developing its terminal voice recognition chip "UniOne". According to reports, the chip will have a built-in DNN processing unit and be compatible with multi-microphone arrays.
Over the past few decades, and especially in recent years, speech recognition technology has made continuous breakthroughs. However, in many scenarios speech recognition is still far from perfect, and solving speech recognition in far-field, complex environments remains a research hotspot. Moreover, speech recognition is normally aimed at a specific task, with a dedicated model trained for it, so model portability is poor.
In dialogue, humans use prior knowledge very efficiently, but current speech recognition systems still cannot make effective use of such knowledge, so many problems remain to be solved. Encouragingly, with the continuing accumulation of high-quality data, continuing technological breakthroughs, and increasing hardware computing power, voice recognition is developing rapidly in the direction we expect.
The original title is "Applications, Algorithms, Chips: a 'Trinity' Analysis of Speech Recognition". The author is Huang Songyan of Chen Xie Capital, and the article first appeared on the firm's WeChat public account. Huang Songyan holds a Ph.D. in artificial intelligence from Zhejiang University, was formerly a senior algorithm engineer at Huawei, and has conducted in-depth research on deep learning and its applications.
