In this thesis, we investigate the effect of speaking-rate variability in speech signals on the performance of automatic speech recognition (ASR). To reduce this variability, we propose two methods, one in the feature space and one in the hidden Markov model (HMM) space.
First, we propose a feature extraction method in which each speech analysis frame has a different time resolution depending on the speech region it belongs to. The proposed method provides higher resolution for frames in transient regions than for those in steady regions. This is achieved by combining a time-scale modification (TSM) technique with variable frame rate (VFR) analysis: TSM increases the resolution of the speech signal in transient regions, while VFR analysis reduces the resolution of steady regions by discarding steady frames. Speech recognition experiments on a Korean connected-digit task showed that the proposed method reduced the word error rate by 14.1% compared to the conventional feature extraction method.
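The frame-discarding side of this idea can be illustrated with a minimal VFR sketch: a frame is retained only when it differs sufficiently from the last retained frame, so steady regions contribute few frames and transient regions contribute many. The distance measure and threshold here are hypothetical illustrations, not the tuned settings of the thesis.

```python
import numpy as np

def vfr_select(frames, threshold=0.5):
    """Variable frame rate selection (sketch): keep a frame when its
    Euclidean distance from the last retained frame exceeds `threshold`
    (transient region); otherwise discard it (steady region).
    The threshold value is illustrative only."""
    kept = [0]            # always keep the first frame
    last = frames[0]
    for i in range(1, len(frames)):
        if np.linalg.norm(frames[i] - last) > threshold:
            kept.append(i)
            last = frames[i]
    return kept

# Toy example: three near-identical (steady) frames, then a jump (transient)
feats = np.array([[0.0], [0.01], [0.02], [1.0], [1.01]])
print(vfr_select(feats, threshold=0.5))  # → [0, 3]
```

In practice the per-frame vectors would be spectral features (e.g. cepstra), and TSM would first have stretched the transient regions so that more frames fall there before selection.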
Second, we propose a speaking-rate normalization method to reduce the acoustic variability caused by differences in speaking rate between speakers. The method finds, for each utterance, the optimal speaking rate at which the best word accuracy is obtained. The maximum a posteriori (MAP) criterion in the HMM space is used to search for this optimal speaking rate, and the utterance is then modified to that rate using TSM. Employing the proposed speaking-rate normalization reduced the word error rate of the connected-digit recognition system by 10.14%.
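The search described above can be sketched as a loop over candidate rate factors: each candidate is applied to the utterance via TSM, the modified utterance is scored against the HMMs, and the best-scoring rate is kept. The `tsm` and `hmm_score` callables below are hypothetical stand-ins for the thesis's TSM routine and MAP scoring, not its actual implementation.

```python
def best_rate(utterance, candidate_rates, tsm, hmm_score):
    """Return the candidate rate factor whose time-scaled utterance
    receives the highest model score (sketch of the MAP-style search)."""
    # Score every candidate, then pick the rate with the maximum score.
    scored = [(hmm_score(tsm(utterance, r)), r) for r in candidate_rates]
    return max(scored)[1]

# Toy stand-ins: "TSM" just scales a number, and the "score" peaks
# when the scaled value reaches 12.0 (i.e. at rate 1.2 for input 10.0).
rates = [0.8, 1.0, 1.2, 1.4]
print(best_rate(10.0, rates,
                tsm=lambda u, r: u * r,
                hmm_score=lambda x: -abs(x - 12.0)))  # → 1.2
```

In the real system the scored object would be the feature sequence of the rate-modified speech, and the score its posterior under the recognizer's HMMs.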