The hidden Markov model (HMM) has become increasingly popular for speech recognition. Although the HMM models the stationary and sequential characteristics of speech signals well, it has some drawbacks. One of its most frequently criticized aspects is its weak ability to discriminate between competing classes. In this dissertation, we present various methods for improving discrimination based on continuous-density HMMs. To evaluate the performance of the proposed methods, we use two sets of speech material: one for speaker-independent continuous speech recognition and the other for speaker-independent isolated word recognition.
First, a discriminative modeling algorithm based on the continuous-density HMM is studied. The proposed algorithm assigns a different number of mixtures to each HMM state according to its acoustic variability, which is measured by the change in entropy as the number of mixtures is increased. In determining the number of mixtures, a competitive method that takes into account the information of different classes is employed. To obtain more reliable segmentation information, a training algorithm that alternates between incrementing the number of mixtures and segmental k-means training is proposed. The proposed algorithm reduces the error rate considerably compared with a conventional HMM that uses a fixed number of mixtures in all states.
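The competitive allocation step above can be sketched as a greedy loop: every state starts with one mixture, and each remaining mixture in the budget goes to the state whose entropy drops the most when its mixture count is incremented. This is an illustrative simplification; `entropy_gain` stands in for entropy changes that would be measured on real training data, and the function name is hypothetical.

```python
def allocate_mixtures(entropy_gain, total_mixtures):
    """Greedy, competitive allocation of a mixture budget across HMM states.

    entropy_gain[s][k] = assumed entropy reduction for state s when going
    from k+1 to k+2 mixtures (precomputed from training data in practice).
    """
    n_states = len(entropy_gain)
    counts = [1] * n_states  # every state starts with one mixture
    for _ in range(total_mixtures - n_states):
        # competitive step: the state with the largest remaining gain wins
        best = max(
            range(n_states),
            key=lambda s: (entropy_gain[s][counts[s] - 1]
                           if counts[s] - 1 < len(entropy_gain[s])
                           else float("-inf")),
        )
        counts[best] += 1
    return counts

# toy example: state 0 is acoustically more variable than state 1
gains = [[0.9, 0.5, 0.2], [0.3, 0.1, 0.05]]
print(allocate_mixtures(gains, 5))  # -> [3, 2]
```

The acoustically variable state absorbs more of the budget, which is the intended effect of the entropy criterion.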
Second, a new approach that combines multilayer perceptrons (MLPs) with HMMs is proposed. The MLP outputs serve as state-dependent weightings of the HMM likelihoods. The MLPs are trained for phoneme classification using segmentation information obtained from the Viterbi alignment of the HMM. Two independent MLPs for different parameter sets are trained with inputs spanning multiple context frames; the phoneme classification rate is considerably enhanced when their outputs are multiplied together. A relation between the MLP outputs and the state-dependent weightings is devised to make effective use of the MLP outputs. To improve discrimination between competing classes, the state-dependent weightings are discriminatively trained through hybrid MLP/state-weighted HMM training. The proposed algorithm is shown to be effective in improving discrimination between competing models.
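One plausible form of the combination above, shown purely as a sketch (the dissertation's exact relation between MLP outputs and state weightings is not reproduced here), scales the HMM state log-likelihood by the log of the product of the two MLPs' phoneme posteriors, with an assumed exponent `gamma` controlling the weighting strength.

```python
import math

def weighted_state_log_likelihood(log_lik, mlp1_post, mlp2_post, gamma=1.0):
    """Hedged sketch of an MLP-weighted HMM state score.

    log_lik   : HMM log-likelihood of the frame in this state
    mlp?_post : posterior of the state's phoneme from each of the two
                MLPs (trained on different parameter sets)
    gamma     : assumed scaling exponent on the weighting term
    """
    combined = mlp1_post * mlp2_post  # product of the two MLP outputs
    return log_lik + gamma * math.log(combined)

# a frame where both MLPs agree on the phoneme boosts the state score less
# than it penalizes a frame where they disagree
print(round(weighted_state_log_likelihood(-4.0, 0.8, 0.9), 3))
```

Confident, agreeing MLP posteriors leave the score nearly unchanged, while low posteriors push the competing state down, which is the discriminative effect the weighting is meant to capture.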
Third, an MLP/HMM hybrid model is proposed in which the input feature vectors are transformed by MLPs to produce prediction-error vectors. These vectors are taken as observations for the HMM, whose observation density function is represented by Gaussian mixtures to account for the variances of the prediction-error signals. Maximum-likelihood (MLE) training of the hybrid model is performed first, followed by a training algorithm based on the generalized probabilistic descent (GPD) method to improve the discrimination of the hybrid model. The discriminative training minimizes the objective function over all training sentences with respect to the weights of the MLP predictors as well as the HMM parameters. With Gaussian mixture modeling of the prediction-error signals, the word error rate is reduced by 27% and 32% for the testing and training data, respectively. Furthermore, training with the discriminative criterion significantly reduces confusion among different models.
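The two ingredients of this hybrid can be sketched under heavy simplification: a toy linear predictor stands in for the MLP (the real model predicts each frame with a trained MLP), and a sigmoid-smoothed misclassification measure stands in for the GPD objective. Both function names and the `alpha` smoothing parameter are illustrative assumptions.

```python
import math

def prediction_error(frame, prev_frame, weights):
    """Residual between the observed frame and a toy linear prediction
    from the previous frame; the dissertation uses an MLP predictor."""
    predicted = [w * x for w, x in zip(weights, prev_frame)]
    return [o - p for o, p in zip(frame, predicted)]

def gpd_loss(score_correct, score_competitor, alpha=1.0):
    """Sigmoid-smoothed misclassification measure in the GPD style:
    the loss exceeds 0.5 exactly when a competitor outscores the
    correct model, and its gradient drives the parameter updates."""
    d = score_competitor - score_correct
    return 1.0 / (1.0 + math.exp(-alpha * d))

err = prediction_error([1.0, 2.0], [0.9, 1.8], [1.0, 1.0])
print(err)  # small residual, modeled by Gaussian mixtures in the HMM
print(gpd_loss(-10.0, -12.0) < 0.5)  # correct model wins -> low loss
```

Minimizing the smoothed loss over all training sentences, with respect to both the predictor weights and the HMM parameters, is what separates this from plain MLE training.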
Finally, a discriminative training algorithm is proposed for the stochastic segment model (SSM). In addition to the samples taken at phoneme boundaries, the differences between adjacent samples within a phonetic segment are used as additional features, and they are shown to contribute remarkably to recognition performance. A hybrid architecture of the SSM and HMM is also proposed: based on the segmentation information from the HMM, the likelihood score of the SSM is obtained. After MLE training of the SSM, discriminative training based on the GPD method is performed using the N-best candidate sentences obtained from the HMM-based network search procedure, improving the discrimination of the SSM. We further improve recognition performance by combining the scores of the HMM and SSM, on the conjecture that the two systems may convey different information for recognizing an unknown utterance. With discriminative training, the recognition rate increases significantly compared with the MLE-trained SSM.
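The segment features and the score combination above can be sketched as follows, under stated assumptions: the segment is uniformly resampled to a fixed number of points (shown here on scalar features for brevity), adjacent differences are appended, and the HMM and SSM scores are combined by linear interpolation with an assumed weight `lam`; the exact combination rule is a design choice, not the dissertation's verbatim formula.

```python
def segment_features(frames, n_samples=3):
    """Fixed-length resampling of a phonetic segment plus the
    differences between adjacent samples as extra features.

    frames: per-frame feature values for one segment (scalars here;
    the real system uses feature vectors).
    """
    # uniformly resample the variable-length segment to n_samples points
    idx = [round(i * (len(frames) - 1) / (n_samples - 1))
           for i in range(n_samples)]
    samples = [frames[i] for i in idx]
    deltas = [b - a for a, b in zip(samples, samples[1:])]  # adjacent diffs
    return samples + deltas

def combined_score(hmm_score, ssm_score, lam=0.5):
    """Assumed linear interpolation of the two systems' scores."""
    return lam * hmm_score + (1.0 - lam) * ssm_score

print(segment_features([1.0, 2.0, 4.0, 8.0, 9.0]))  # samples, then deltas
```

The delta terms encode within-segment dynamics that the boundary samples alone miss, which is why they help recognition; the score combination exploits the conjecture that the HMM and SSM err on different utterances.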
The HMM has seen increasing use as a method for speech recognition. Although the HMM models speech signals well, it is weak at discriminating between competing classes. This dissertation presents several methods for increasing discrimination in continuous-density HMMs. To evaluate the proposed methods, recognition experiments were conducted on speaker-independent continuous speech and isolated words.
First, a method for discriminative modeling in continuous-density HMMs was studied. The proposed algorithm assigns a different number of mixtures to each HMM state to account for the variability of the speech signal. A competitive method among the states was used to determine the number of mixtures. In addition, to obtain more reliable segmentation information, the process of increasing the number of mixtures and the process of training with them were performed alternately. Using the proposed method, the recognition rate was improved considerably over the conventional HMM.
Second, a new method of combining HMMs and MLPs was proposed. The MLP outputs were used as weightings on the HMM state likelihoods. The MLPs were first trained for phoneme classification, and a suitable relation between the MLP outputs and the state-dependent weightings was presented. To increase discrimination between competing classes, the state-dependent weightings were discriminatively trained through training of the combined MLP/state-weighted HMM. The proposed method greatly improved discrimination between competing models.
Third, a combined MLP/HMM model in which MLPs produce prediction errors of the input feature vectors was proposed. The prediction errors are used as observations for the HMM, and a Gaussian-mixture probability density function was used to account for their distribution. Training based on the GPD method was performed to increase the discrimination of the combined model. Modeling the prediction errors brought a large improvement in the recognition rate, and discriminative training greatly reduced confusion among the models.
Finally, a discriminative training method for the SSM was presented. For this purpose, a combined SSM/HMM architecture was proposed. Through HMM-based dynamic programming, the N candidate sentences with the highest likelihoods and their segmentation information are obtained, and discriminative training by the GPD method was performed using them. Furthermore, on the grounds that the SSM and HMM can provide different information for recognition, their likelihood scores were combined, improving the recognition rate. Discriminative training of the SSM yielded a significant improvement in recognition rate over the MLE-trained model.