Speech Recognition (Wikipedia)
Speech recognition is a sub-field of computational linguistics.
Speech recognition develops methodologies and technologies that enable the recognition and translation of spoken language into text by computers.
Speech recognition is also known as automatic speech recognition (ASR), computer speech recognition, or speech-to-text (STT).
"Speaker dependent" speech recognition system analyzes the person specific voice and use it to fine-tune the recognition of that person's speech.
"Speaker independent" speech recognition system doesn't need the person specific voice to fine-tune the recognition of that person's speech.
Most recently, speech recognition has been benefited from advances in deep learning and big data.
Model, Method, and Algorithms
Acoustic modeling and language modeling are important parts of modern statistically-based speech recognition algorithms.
Hidden Markov models
Hidden Markov models (HMMs) are statistical models that output a sequence of symbols or quantities.
A speech signal can be viewed as a piecewise stationary signal or a short-time stationary signal (on the order of 10 milliseconds).
HMMs can only be applied to stationary processes.
On a short time scale (about 10 milliseconds), speech can be thought of as a Markov model.
HMMs can be trained automatically. HMMs are simple and computationally feasible.
In speech recognition, the HMM outputs a sequence of n-dimensional real-valued vectors.
The HMM outputs one element of the sequence every 10 milliseconds.
The output vectors consist of cepstral coefficients.
Cepstral coefficients are obtained in three steps (a minimal sketch follows the list):
- taking a Fourier transform of a short time window of speech
- decorrelating the spectrum using a cosine transform
- taking the most significant coefficients from the result of the cosine transform.
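The following is a minimal NumPy/SciPy sketch of these three steps; the Hamming window, the log compression, and the choice of 13 retained coefficients are illustrative assumptions rather than details from the article.

```python
import numpy as np
from scipy.fftpack import dct

def cepstral_coefficients(frame, num_coeffs=13):
    """Compute cepstral-style coefficients for one short window of speech."""
    # Step 1: Fourier transform of the short time window (magnitude spectrum).
    spectrum = np.abs(np.fft.rfft(frame * np.hamming(len(frame))))
    # Log compression is conventional before the cosine transform.
    log_spectrum = np.log(spectrum + 1e-10)
    # Step 2: decorrelate the spectrum with a cosine transform (DCT-II).
    cepstrum = dct(log_spectrum, type=2, norm='ortho')
    # Step 3: keep only the most significant (lowest-order) coefficients.
    return cepstrum[:num_coeffs]

# Example: one 10 ms frame at 16 kHz is 160 samples.
frame = np.random.randn(160)
print(cepstral_coefficients(frame).shape)   # (13,)
```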
Each HMM state has a statistical distribution.
The statistical distribution is a mixture of diagonal-covariance Gaussians.
The statistical distribution gives a likelihood for each observed vector.
Each word or each phoneme will have a different output distribution.
A hidden Markov model for a sequence of words or phonemes is built by concatenating the individual hidden Markov models for the separate words and phonemes.
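Below is a minimal sketch of the per-state emission likelihood described above, a mixture of diagonal-covariance Gaussians evaluated on one observed vector; the mixture weights, means, and variances are placeholder values.

```python
import numpy as np

def diag_gmm_likelihood(x, weights, means, variances):
    """Likelihood of observation x under a diagonal-covariance Gaussian mixture.

    x:         (D,) observed feature vector (e.g. cepstral coefficients)
    weights:   (K,) mixture weights summing to 1
    means:     (K, D) component means
    variances: (K, D) per-dimension variances (diagonal covariance)
    """
    diff = x - means                                   # (K, D)
    exponent = -0.5 * np.sum(diff ** 2 / variances, axis=1)
    norm = np.prod(np.sqrt(2 * np.pi * variances), axis=1)
    return float(np.sum(weights * np.exp(exponent) / norm))

# Toy example: 2 components, 3-dimensional features.
x = np.array([0.1, -0.2, 0.3])
w = np.array([0.6, 0.4])
mu = np.zeros((2, 3))
var = np.ones((2, 3))
print(diag_gmm_likelihood(x, w, mu, var))
```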
Modern ASR systems use various combinations of standard techniques to improve results.
A typical large-vocabulary system needs context dependency for the phonemes.
Context dependency for the phonemes means that phonemes with different left and right contexts have different realizations as HMM states.
A typical large-vocabulary system uses cepstral normalization to normalize for different speakers and recording conditions.
A typical large-vocabulary system uses vocal tract length normalization (VTLN) for male-female normalization.
A typical large-vocabulary system uses maximum likelihood linear regression (MLLR) for more general speaker adaptation.
A typical large-vocabulary system uses delta and delta-delta coefficients, combined with heteroscedastic linear discriminant analysis (HLDA), as features to capture speech dynamics.
A typical large-vocabulary system might instead skip the delta and delta-delta coefficients and use splicing and an LDA-based projection followed by heteroscedastic linear discriminant analysis (HLDA) or a global semi-tied covariance transform (also known as maximum likelihood linear transform, MLLT).
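A minimal sketch of the delta and delta-delta (acceleration) coefficients mentioned above, computed with the standard regression formula over a window of +/- N frames; the window size N=2 and the feature dimensions are illustrative assumptions.

```python
import numpy as np

def deltas(features, N=2):
    """First-order delta coefficients for a (T, D) feature matrix."""
    T = features.shape[0]
    padded = np.pad(features, ((N, N), (0, 0)), mode='edge')
    denom = 2 * sum(n * n for n in range(1, N + 1))
    out = np.zeros_like(features)
    for t in range(T):
        acc = np.zeros(features.shape[1])
        for n in range(1, N + 1):
            acc += n * (padded[t + N + n] - padded[t + N - n])
        out[t] = acc / denom
    return out

# Static cepstra -> append deltas and delta-deltas to capture speech dynamics.
cepstra = np.random.randn(100, 13)          # 100 frames, 13 coefficients
d = deltas(cepstra)
dd = deltas(d)
features = np.hstack([cepstra, d, dd])      # (100, 39)
print(features.shape)
```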
Many systems use discriminative training techniques.
Discriminative training techniques dispense with a purely statistical approach to HMM parameter estimation and instead optimize some classification-related measure of the training data.
Classification-related measures include maximum mutual information (MMI), minimum classification error (MCE), and minimum phone error (MPE).
Decoding the speech is the term for the process by which the system takes a new utterance and computes the most likely source sentence.
Decoding the speech uses the Viterbi algorithm to find the best path.
The Viterbi algorithm can dynamically create a combination of HMMs during decoding.
The finite state transducer (FST) approach combines the HMMs statically, ahead of time.
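A minimal sketch of Viterbi decoding over a tiny, hypothetical HMM; real decoders combine many word- or phone-level HMMs, but the dynamic program is the same.

```python
import numpy as np

def viterbi(log_init, log_trans, log_emit):
    """Most likely state sequence for one utterance.

    log_init:  (S,)    log initial state probabilities
    log_trans: (S, S)  log transition probabilities
    log_emit:  (T, S)  per-frame log emission scores
    """
    T, S = log_emit.shape
    score = log_init + log_emit[0]            # best score ending in each state
    back = np.zeros((T, S), dtype=int)        # backpointers
    for t in range(1, T):
        cand = score[:, None] + log_trans     # (S_prev, S_next)
        back[t] = np.argmax(cand, axis=0)
        score = np.max(cand, axis=0) + log_emit[t]
    # Trace back the best path.
    path = [int(np.argmax(score))]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]

# Toy 3-state left-to-right HMM over 5 frames.
S, T = 3, 5
log_init = np.log([1.0, 1e-9, 1e-9])
log_trans = np.log(np.full((S, S), 1e-9) + np.eye(S) * 0.5 + np.eye(S, k=1) * 0.5)
log_emit = np.log(np.random.rand(T, S))
print(viterbi(log_init, log_trans, log_emit))
```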
A possible improvement to decoding the speech is to keep a set of good candidates instead of only the single best candidate.
These good candidates are refined by a better re-scoring function.
The best candidate is picked according to the refined score.
A set of good candidates can be kept as a list (N-best list) or as a subset of the models (lattice).
Re-scoring is usually done by trying to minimize the Bayes risk.
One approach is that, instead of taking the source sentence with maximal probability, the system takes the sentence that minimizes the expectation of a given loss function over all possible transcriptions.
One criterion is taking the sentence that minimizes the average distance to other possible sentences, weighted by their estimated probability.
Levenshtein distance is usually used as the loss function because it can be adapted to different tasks.
The set of possible transcriptions is pruned to maintain tractability.
Lattices can be re-scored as weighted finite state transducers with edit distances, under certain assumptions.
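A minimal sketch of the Levenshtein (edit) distance used as the loss function in this kind of re-scoring, computed here over word sequences; the example sentences are illustrative.

```python
def levenshtein(ref, hyp):
    """Minimum number of substitutions, insertions, and deletions
    needed to turn the word list `ref` into the word list `hyp`."""
    # dp[i][j] = distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i                      # deletions
    for j in range(len(hyp) + 1):
        dp[0][j] = j                      # insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + sub)  # substitution / match
    return dp[-1][-1]

print(levenshtein("the apple is red".split(), "the apple red".split()))  # 1
```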
Dynamic Time Warping (DTW)-based speech recognition
Dynamic time warping has now largely been replaced by the HMM-based approach.
DTW measures similarity between two sequences that may vary in time or speed.
DTW finds an optimal match between two given sequences and warps sequences non-linearly to match each other.
DTW has been applied to video, audio, graphics, or any data that can be turned into a linear representation.
Automatic speech recognition uses DTW to cope with different speaking speeds.
An alignment method based on DTW is often used in the context of hidden Markov models.
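A minimal sketch of DTW between two feature sequences of different lengths; the Euclidean frame distance and the toy sequences are illustrative choices.

```python
import numpy as np

def dtw_distance(a, b):
    """DTW cost between feature sequences a (T1, D) and b (T2, D)."""
    T1, T2 = len(a), len(b)
    cost = np.full((T1 + 1, T2 + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, T1 + 1):
        for j in range(1, T2 + 1):
            d = np.linalg.norm(a[i - 1] - b[j - 1])   # local frame distance
            # Extend the cheapest of the three allowed warping moves.
            cost[i, j] = d + min(cost[i - 1, j],      # a stretches
                                 cost[i, j - 1],      # b stretches
                                 cost[i - 1, j - 1])  # both advance
    return cost[T1, T2]

# The same word spoken at two speeds: one template, one slower utterance.
template = np.random.randn(40, 13)
utterance = np.repeat(template, 2, axis=0)            # twice as slow
print(dtw_distance(template, utterance))
```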
Neural Networks
Since the 1980s, neural networks have been used in many aspects of speech recognition, such as phoneme classification, isolated word recognition, audiovisual speech recognition, audiovisual speaker recognition, and speaker adaptation.
Neural networks make no assumptions about feature statistical properties, in contrast to HMMs.
Neural networks allow discriminative training in a natural and efficient manner.
Deep Feedforward and Recurrent Neural Networks
A deep feedforward neural network (DNN) is an artificial neural network with multiple hidden layers between the input and output layers.
Deep feedforward neural networks are effective at classifying short-time units (individual phonemes or isolated words).
Deep feedforward neural networks are not successful at continuous recognition tasks because of their limited ability to model temporal dependencies.
Deep feedforward neural networks can be used as a pre-processing step for HMM-based recognition.
LSTM recurrent neural networks (RNNs) and time delay neural networks (TDNNs) are able to identify latent temporal dependencies and thus perform the task of speech recognition.
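A minimal PyTorch sketch of an LSTM acoustic model that maps a sequence of feature frames to per-frame phoneme scores; the feature dimension, hidden size, and number of phoneme classes are assumptions for illustration.

```python
import torch
import torch.nn as nn

class LSTMAcousticModel(nn.Module):
    def __init__(self, feat_dim=39, hidden=256, num_phones=40):
        super().__init__()
        # The recurrent layer captures the temporal dependencies that a
        # plain feedforward DNN cannot model.
        self.lstm = nn.LSTM(feat_dim, hidden, num_layers=2, batch_first=True)
        self.proj = nn.Linear(hidden, num_phones)

    def forward(self, frames):               # frames: (batch, time, feat_dim)
        out, _ = self.lstm(frames)
        return self.proj(out)                 # (batch, time, num_phones)

model = LSTMAcousticModel()
scores = model(torch.randn(2, 100, 39))       # 2 utterances, 100 frames each
print(scores.shape)                            # torch.Size([2, 100, 40])
```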
End-to-end Automatic Speech Recognition
Traditional phonetic-based (i.e., all HMM-based) approaches require separate components, such as the pronunciation, acoustic, and language models.
End-to-end models jointly learn all the components of the speech recognizer.
The first end-to-end ASR systems were the Connectionist Temporal Classification (CTC)-based systems introduced by Google DeepMind in 2014.
CTC-based systems consist of recurrent neural networks and a CTC layer (the RNN-CTC model).
The RNN-CTC model learns the pronunciation and acoustic model together.
RNN-CTC is incapable of learning the language model because of conditional independence assumptions similar to those of an HMM.
RNN-CTC models map speech acoustics to English characters.
RNN-CTC models make many common spelling mistakes and rely on a separate language model to clean up the transcripts.
RNN-CTC models were later scaled up on extremely large datasets and to two different languages, Mandarin Chinese and English.
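A minimal PyTorch sketch of how an RNN-CTC model can be trained with the built-in CTC loss; the character inventory, network sizes, and target transcript are illustrative and not DeepMind's original configuration.

```python
import torch
import torch.nn as nn

num_chars = 29                      # e.g. 26 letters + space + apostrophe + blank
rnn = nn.LSTM(input_size=39, hidden_size=128, batch_first=True)
to_chars = nn.Linear(128, num_chars)
ctc = nn.CTCLoss(blank=0)           # index 0 reserved for the CTC blank symbol

frames = torch.randn(1, 100, 39)    # one utterance: 100 feature frames
out, _ = rnn(frames)
log_probs = to_chars(out).log_softmax(-1).transpose(0, 1)  # (time, batch, chars)

target = torch.tensor([[8, 5, 12, 12, 15]])                # "hello" with a=1..z=26
loss = ctc(log_probs, target,
           torch.tensor([100]),     # input (frame) length
           torch.tensor([5]))       # target (character) length
loss.backward()                     # gradients flow through the RNN and projection
print(float(loss))
```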
The first end-to-end sentence-level lip reading model (LipNet) was developed at the University of Oxford.
LipNet uses spatiotemporal convolutions coupled with an RNN-CTC architecture, surpassing human-level performance on a restricted-grammar dataset.
Attention-based models are an alternative approach to end-to-end ASR.
The first attention-based ASR model was introduced by Google Brain in 2016.
The first attention-based ASR model is named the "Listen, Attend and Spell" (LAS) model.
"Listen" stands for listening to the acoustic signal, "Attend" stands for paying attention to different parts of the signal, and "Spell" stands for spelling out the transcript one character at a time.
The LAS model does not make conditional-independence assumptions and learns the pronunciation, acoustic, and language models all together.
Latent Sequence Decompositions (LSD) is one of various extensions of the LAS model.
LSD was proposed by CMU, MIT, and Google Brain.
LSD directly models sub-word units, which are more natural than English characters.
"Watch, Listen, Attend and Spell" (WLAS) is another variation of the LAS model.
WLAS surpasses human performance on a lip reading dataset.
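A minimal sketch of the attention step in an LAS-style model: at each output step the decoder attends to different parts of the encoded acoustic signal before spelling the next character. The dimensions are illustrative and this is not the original LAS implementation.

```python
import torch
import torch.nn.functional as F

def attend(decoder_state, encoder_states):
    """decoder_state: (hidden,), encoder_states: (time, hidden)."""
    # Dot-product attention scores over the encoded signal.
    scores = encoder_states @ decoder_state          # (time,)
    weights = F.softmax(scores, dim=0)               # where to "listen"
    context = weights @ encoder_states               # (hidden,) summary
    return context, weights

encoder_states = torch.randn(50, 128)    # "Listen": 50 encoded acoustic frames
decoder_state = torch.randn(128)         # "Spell": state while emitting a character
context, weights = attend(decoder_state, encoder_states)
print(context.shape, float(weights.sum()))   # torch.Size([128]), weights sum to 1
```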
Performance
The performance of speech recognition systems is usually evaluated in terms of accuracy and speed.
Accuracy is rated with word error rate (WER), single word error rate (SWER) or command success rate (CSR).
Speed is measured with the real time factor.
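A minimal sketch of the two metrics under their usual definitions: WER is the word-level edit distance (substitutions + insertions + deletions) divided by the number of reference words, and the real time factor divides processing time by audio duration. The numbers below are illustrative.

```python
def word_error_rate(subs, ins, dels, num_reference_words):
    """WER = (S + I + D) / N, computed from an alignment such as the
    Levenshtein sketch earlier in these notes."""
    return (subs + ins + dels) / num_reference_words

def real_time_factor(processing_seconds, audio_seconds):
    """RTF < 1 means the system runs faster than real time."""
    return processing_seconds / audio_seconds

# "the apple is red" -> "the apple was read": 2 substitutions out of 4 words.
print(word_error_rate(subs=2, ins=0, dels=0, num_reference_words=4))  # 0.5
# 30 s of audio decoded in 12 s runs faster than real time.
print(real_time_factor(12.0, 30.0))                                   # 0.4
```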
Accuracy of speech recognition may vary with the following factors:
- Vocabulary size and confusability
- Speaker dependence vs. independence
- Isolated, discontinuous, or continuous speech
- Task and language constraints
- Read vs. spontaneous speech
- Adverse conditions
Error rates increase as the vocabulary size grows.
A speaker-dependent system is intended for use by a single speaker, whereas a speaker-independent system is intended for use by any speaker. It is more difficult for a speaker-independent system to achieve the same level of accuracy as a speaker-dependent system.
Isolated speech is also known as isolated-word speech.
In isolated speech, a pause is required between saying each word.
In discontinuous speech, a silence is used between saying each full sentence.
In continuous speech, full sentences are naturally spoken.
The difficulty of recognizing isolated, discontinuous, and continuous speech increases in that order: isolated < discontinuous < continuous.
Task constraints: e.g. a querying application may dismiss the hypothesis "The apple is red" (it states a fact rather than a query).
Language constraints
- a semantic constraint rejects "The apple is angry."
- a syntactic (grammar) constraint rejects "Red is apple the."
Read speech is a person reading previously prepared text.
Spontaneous speech is a person speaking without any previously prepared text.
Adverse conditions involve environmental noise (e.g. noise in a car or a factory), acoustical distortions (e.g. echoes, room acoustics).
Acoustical signals are structured into a hierarchy of units, including phonemes, words, phrases, and sentences.
Speech recognition is a multi-level pattern recognition task.
Each level of the pattern provides additional constraints; e.g. known word pronunciations or legal word sequences can compensate for errors or uncertainties at a lower level.
There are four steps for neural network approaches:
- Digitize the speech
- Compute features of spectral domain of the speech
- Classify features into phonetic-based categories
- Match the neural network output scores to the best word
A frame is a basic unit of spectral features.
A frame is a short section of the acoustic signal.
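A minimal sketch of splitting a digitized signal into frames; the 25 ms window with a 10 ms hop and the 16 kHz sample rate are common conventions assumed here, not values from the text.

```python
import numpy as np

def frame_signal(signal, sample_rate=16000, frame_ms=25, hop_ms=10):
    """Return a (num_frames, frame_len) matrix of short signal sections."""
    frame_len = int(sample_rate * frame_ms / 1000)    # samples per frame
    hop = int(sample_rate * hop_ms / 1000)            # samples between frame starts
    num_frames = 1 + max(0, (len(signal) - frame_len) // hop)
    return np.stack([signal[i * hop: i * hop + frame_len]
                     for i in range(num_frames)])

signal = np.random.randn(16000)                       # one second of "speech"
frames = frame_signal(signal)
print(frames.shape)                                   # (98, 400)
```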
Available Software
There are three freely available resources: Carnegie Mellon University's Sphinx toolkit, the HTK toolkit, and the Kaldi toolkit.
[0] https://en.wikipedia.org/wiki/Speech_recognition#Practical_speech_recognition
[1] http://www.speech.cs.cmu.edu/comp.speech/Section6/Q6.1.html