Acoustic and phonetic contexts play a central role in speech recognition: achieving the highest possible recognition performance requires efficient use of all available contextual information. However, current hidden Markov model (HMM) technology approaches the problem primarily from a top-down perspective by modeling phonetic context. In this dissertation, we present several methods for incorporating acoustic contextual information into HMM-based speech recognition, and we evaluate the proposed methods on three kinds of speech databases.
First, we propose a variable information rate (VIR) model that assigns a different information rate to each basic portion of the sampled speech waveform. As a special case of the VIR model, we use context-dependent state weights as scaling factors that reflect the informational importance of each portion of the signal. The discriminating power of the individual states is evaluated from the acoustic context. Context-dependent state weights are intended to reduce the influence of non-characteristic feature vectors on the observation probability of an HMM state and to raise the influence of typical ones. The additional parameters are estimated with the generalized probabilistic descent (GPD) training algorithm. The proposed method does not increase the complexity of the recognizer and can be implemented with only minor modification of the conventional recognition algorithm. In speaker-independent speech recognition experiments, the proposed method performs considerably better than the conventional method, which treats all speech segments as equally important.
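The decoding change described above amounts to scaling each state's log observation likelihood by its context-dependent weight inside an otherwise standard Viterbi search. The following is a minimal sketch of that idea; the per-frame, per-state weight table `weights` is a hypothetical stand-in for the GPD-trained weights, and the model dimensions are illustrative.

```python
import numpy as np

def viterbi_weighted(log_emit, log_trans, log_init, weights):
    """Viterbi decoding where each state's log emission score is scaled
    by a context-dependent weight (uniform weights recover standard Viterbi).

    log_emit:  (T, N) log observation likelihoods
    log_trans: (N, N) log transition probabilities, rows = previous state
    log_init:  (N,)   log initial-state probabilities
    weights:   (T, N) context-dependent state weights, >= 0 (hypothetical)
    """
    T, N = log_emit.shape
    delta = log_init + weights[0] * log_emit[0]
    back = np.zeros((T, N), dtype=int)
    for t in range(1, T):
        scores = delta[:, None] + log_trans        # (prev, cur) path scores
        back[t] = scores.argmax(axis=0)            # best predecessor per state
        delta = scores.max(axis=0) + weights[t] * log_emit[t]
    # backtrace the best state sequence
    path = [int(delta.argmax())]
    for t in range(T - 1, 1 - 1, -1):
        if t > 0:
            path.append(int(back[t][path[-1]]))
    return path[::-1], float(delta.max())
```

Because the weights multiply log-likelihoods rather than probabilities, the modification leaves the dynamic-programming recursion, and hence the recognizer's complexity, unchanged.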
Second, we propose a new approach that uses multi-layer perceptrons (MLPs) to estimate context-dependent state weights. Because the MLP architecture can easily accommodate contextual inputs, we exploit it to obtain state weights that reflect a wider acoustic context. In this approach, MLP outputs serve as state-dependent weights on the HMM log state-likelihoods. The MLP is trained in two steps. In the first step, context-dependent state weights obtained with explicit context classification are used as the desired outputs, and the MLP is trained with the error back-propagation (EBP) algorithm. In the second step, the MLP parameters are adapted by discriminative training to further improve the discriminability of competing HMM states. The proposed method considerably reduces the error rate compared with the conventional HMM on three speech recognition tasks.
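The first training step above can be sketched as a small regression network trained by back-propagation to reproduce target state weights from a window of acoustic frames. Everything below is an illustrative assumption rather than the dissertation's exact architecture: the tanh hidden layer, the exponential output used to keep weights positive, and all dimensions are chosen only to make the sketch concrete.

```python
import numpy as np

rng = np.random.default_rng(0)

class WeightMLP:
    """Tiny MLP mapping a window of acoustic frames to per-state weights.
    Hypothetical sketch: tanh hidden layer, exp() output so that the
    predicted state weights stay positive."""

    def __init__(self, n_in, n_hidden, n_states, lr=0.05):
        self.W1 = rng.normal(0.0, 0.1, (n_in, n_hidden))
        self.W2 = rng.normal(0.0, 0.1, (n_hidden, n_states))
        self.lr = lr

    def forward(self, x):
        self.h = np.tanh(x @ self.W1)
        self.y = np.exp(self.h @ self.W2)   # positive state weights
        return self.y

    def train_step(self, x, target):
        """One EBP step on the squared error between predicted and target
        weights (step one of the two-step training in the text)."""
        y = self.forward(x)
        err = y - target
        gz = err * y                          # gradient through exp output
        gW2 = np.outer(self.h, gz)
        gh = gz @ self.W2.T * (1.0 - self.h ** 2)   # through tanh
        gW1 = np.outer(x, gh)
        self.W2 -= self.lr * gW2
        self.W1 -= self.lr * gW1
        return float(0.5 * np.sum(err ** 2))
```

The second step, discriminative adaptation against competing HMM states, would replace the squared-error target with a misclassification-based loss while reusing the same back-propagation machinery.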
Third, we propose a novel method for incorporating acoustic contextual information into HMM-based speech recognizers. Conventional HMMs hardly exploit acoustic context beyond higher-order derivative features, so the possible correlation between successive acoustic vectors is overlooked. We investigate the effects of contextual inputs on HMM-based recognition and, under simplifying assumptions, capture these effects in contextual information parameters. These parameters are shown to measure both the degree of correlation among the input features and the boundary uncertainty between HMM states. The parameter estimation and recognition algorithms can be implemented without extensive modification or increased complexity. Experimental results show that the recognizer with contextual information performs much better than the conventional HMM recognizer.
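To make the notion of a correlation-sensitive parameter concrete, the toy function below down-weights a frame when it is highly correlated with its predecessor (carrying little new information) and up-weights it otherwise. Both the cosine-similarity formula and the sensitivity constant `rho` are illustrative assumptions, not the dissertation's actual estimator.

```python
import numpy as np

def context_parameter(frames, t, rho=0.5):
    """Toy contextual-information parameter for frame t.

    Returns a value near 1 - rho when frame t duplicates its predecessor
    (high correlation, low new information) and near 1.0 when the two
    frames are uncorrelated. `rho` is a hypothetical sensitivity constant.
    """
    prev, cur = frames[t - 1], frames[t]
    denom = np.linalg.norm(prev) * np.linalg.norm(cur) + 1e-12
    c = np.dot(prev, cur) / denom          # cosine similarity of neighbors
    return 1.0 - rho * abs(c)
```

A parameter of this kind can multiply the log observation score of a state, which is why the recognition algorithm needs no structural change.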
Finally, we propose a VIR analysis in which the amount of information within a basic period of the speech signal determines the number of features to be extracted, and we formulate HMMs that incorporate this analysis. The information rate parameters, which determine the number of acoustic vectors extracted within each period, depend on both the HMM state and the neighboring feature vectors. The VIR analysis is incorporated into conventional HMMs in two ways. In the first approach, variable information rates are applied only during model selection, so that the Viterbi path computation within an HMM is unaffected. In the second, the parameters are incorporated directly into the computation of the state observation probabilities. The information rate parameters are estimated under the minimum classification error criterion. The HMM recognizers with VIR analysis achieve a 10-47% reduction in word error rate on two continuous speech recognition tasks.
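The extraction side of the VIR analysis can be sketched as re-sampling each fixed analysis period at its own rate. In the sketch below, `rates` is a hypothetical per-period table standing in for the state- and context-dependent information rate parameters, and the even-spacing selection rule is an illustrative choice.

```python
import numpy as np

def variable_rate_extract(frames, rates, period=4):
    """Keep a variable number of feature vectors from each analysis period.

    frames: (T, D) base feature vectors computed at the highest rate
    rates:  one integer per period (1 <= rate <= period); how many
            vectors to retain from each `period`-frame block
            (hypothetical stand-in for the information rate parameters)
    """
    out = []
    for k, r in enumerate(rates):
        block = frames[k * period:(k + 1) * period]
        # select r evenly spaced vectors from the block
        idx = np.linspace(0, len(block) - 1, r).round().astype(int)
        out.append(block[idx])
    return np.vstack(out)
```

For example, a rate of 1 collapses a stationary period to a single vector, while a rate equal to `period` preserves a transient period in full; it is this uneven allocation of vectors that the minimum-classification-error training tunes.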