Recognizing large-vocabulary continuous speech is necessary for a computer to provide users with the most convenient means of communication. Recently, the hidden Markov model (HMM) has become the predominant approach to speech recognition. The performance of an HMM-based speech recognizer depends on all components of speech processing. In this study, we present an acoustic model that improves the output probability modeling capability of HMM-based acoustic modeling, and a lexicon model that effectively represents variations in word pronunciation. To evaluate recognition performance, the proposed methods were tested on a large-vocabulary Korean continuous speech recognition system with a vocabulary of 3,064 words.
First, we study a method for estimating robust output probability distributions from only a small amount of training data. In the HMM-based approach, the maximum likelihood estimates of the parameters converge to the true values as the amount of training data tends to infinity. When the training data are limited, some parameters are inadequately trained, and classification based on such poorly trained models leads to serious errors. The proposed HMM-based method improves output probability modeling. The basic idea is that the proposed HMVQM uses a state-dependent VQ codebook, so that each state represents a partition of a specific acoustic space. This approach reduces model size and improves model accuracy. The proposed HMVQM was compared with discrete-HMM-based continuous speech recognition in speaker-independent mode. The experimental results indicate that the proposed method reduced the word error rate by 57.9% and the sentence error rate by 60.6%.
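The state-dependent codebook idea can be sketched as follows. This is a minimal illustration, not the paper's implementation: the class name, codebook shapes, and probabilities are all hypothetical, and it assumes each state quantizes a feature vector with its own small codebook and reads the output probability from a discrete distribution over that state's codewords.

```python
import numpy as np

class StateVQOutput:
    """Hypothetical sketch: one HMM state with its own VQ codebook."""

    def __init__(self, codebook, codeword_probs):
        # codebook: (K, D) codewords covering this state's acoustic partition
        # codeword_probs: (K,) discrete output distribution over the codewords
        self.codebook = np.asarray(codebook, dtype=float)
        self.codeword_probs = np.asarray(codeword_probs, dtype=float)

    def output_prob(self, x):
        # Quantize the feature vector against this state's codebook only,
        # then return the probability of the winning codeword.
        dists = np.linalg.norm(self.codebook - x, axis=1)
        return self.codeword_probs[np.argmin(dists)]

# Usage: a state whose partition is centered near the origin (toy numbers).
state = StateVQOutput(codebook=[[0.0, 0.0], [1.0, 1.0]],
                      codeword_probs=[0.7, 0.3])
p = state.output_prob(np.array([0.1, -0.1]))  # nearest codeword is [0, 0]
```

Because each state's codebook only has to cover its own partition of the acoustic space, the codebooks can stay small, which is one way the reduction in model size described above could arise.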
Second, we study a method for deriving a stochastic representation of a word baseform from sample utterances. Most large-vocabulary speech recognizers employ subwords as basic recognition units. This implies that, in order to perform word (or sentence) recognition, a lexicon defining the composition of words in terms of the basic units must be made available to the recognizer. The lexicon is commonly created using expert knowledge or a standard pronunciation dictionary. These approaches have some problems; for example, speakers' pronunciation variations across many different dialects must be represented by one or more lexical entries. Moreover, the standard pronunciation of a word often differs from its actual realization, especially in continuous speech. We describe a stochastic lexicon model that allows for pronunciation variations in speech recognition. In this lexicon model, the baseform of a word is represented by a hidden Markov model with probability distributions over subword units. The lexicon is trained automatically from sample sentence utterances, and the stochastic baseforms are further optimized jointly with the subword models and the recognizer. The proposed stochastic lexicon was compared with a conventional lexicon containing a single baseform per word. In these experiments, the use of the stochastic lexicon reduced the word error rate by 53.6% and the sentence error rate by 32.9%.
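A stochastic baseform of this kind can be sketched with the forward algorithm. This is an illustrative sketch under stated assumptions, not the system described above: it assumes a left-to-right baseform HMM whose states emit subword units, and all names, matrices, and unit labels are invented for the example.

```python
import numpy as np

def baseform_prob(pronunciation, trans, emit, unit_index):
    """Forward probability that the baseform HMM generated `pronunciation`.

    pronunciation: list of subword-unit labels (one pronunciation variant)
    trans: (S, S) left-to-right state transition matrix
    emit:  (S, U) per-state probability distribution over subword units
    unit_index: maps a subword-unit label to its column in `emit`
    """
    S = trans.shape[0]
    alpha = np.zeros(S)
    alpha[0] = emit[0, unit_index[pronunciation[0]]]  # start in state 0
    for unit in pronunciation[1:]:
        # Propagate through the transitions, then weight by how likely
        # each state is to emit the observed subword unit.
        alpha = (alpha @ trans) * emit[:, unit_index[unit]]
    return alpha[-1]  # require ending in the final state

# Usage: a toy two-state baseform whose states prefer units "a" and "c".
trans = np.array([[0.5, 0.5],
                  [0.0, 1.0]])
emit = np.array([[0.9, 0.1, 0.0],   # state 0: mostly "a", sometimes "b"
                 [0.0, 0.2, 0.8]])  # state 1: mostly "c", sometimes "b"
idx = {"a": 0, "b": 1, "c": 2}
p = baseform_prob(["a", "c"], trans, emit, idx)
```

Scoring every pronunciation variant through one such model, instead of listing each variant as a separate dictionary entry, is what lets a single stochastic baseform absorb dialectal variation.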