Bibliographic information
Formant enhancement of voiced sound for HMM-based speech synthesis = HMM 기반 음성 합성기를 위한 유성음 포만트 강화 알고리즘
Title / Author: Formant enhancement of voiced sound for HMM-based speech synthesis = HMM 기반 음성 합성기를 위한 유성음 포만트 강화 알고리즘 / Sunghee Jung.
Publication: [Daejeon : Korea Advanced Institute of Science and Technology (KAIST), 2015].

Holdings

Registration No.: 8027688
Location / Call No.: Academic Cultural Complex (Munhwagwan), Preservation Stacks / MEE 15092
Status: Available (not for loan)

Abstract

Speech synthesis has attracted growing interest because of its wide range of applications, such as car navigation systems, e-book readers, and communication aids for people with impairments. The two main approaches are unit selection and the HMM-based speech synthesis system (HTS). HTS is growing in popularity because of its compactness and its adaptability to various speaking styles and speakers. In the HTS approach, hidden Markov models are, in theory, generated for all possible combinations of phones and context parameters. In practice, however, the number of combinations grows exponentially with the amount of contextual information, and it is practically impossible to collect enough training data to estimate robust HMMs for every combination. To cope with this problem and obtain robust HMMs, parameter tying must be introduced in the training step. There are two main approaches to parameter tying. One is to generate HMM parameters for each monophone and share them among the triphones that have that monophone at the center; the HMMs are then re-estimated with the Baum-Welch algorithm. The other is to employ a classification and regression tree (CART): a regression tree consisting of numerous acoustic and linguistic questions specific to the target language is constructed to cluster the center states of similar units. The latter is less preferred because it requires constructing regression trees for the target language, and this rule-based approach is considered to run against the philosophy of HTS, which is to generate synthesized speech with a data-driven statistical approach [3]. In addition, the regression tree approach requires very detailed labeling of about 53 contextual elements for the training and synthesis sentences [1]. Hence, the former approach is the default setting of the widely used HTS toolkit.
However, with this approach, formant information is inevitably degraded by a significant amount. It is shown in [2] that the bandwidth and energy of formants are closely related to the clarity and naturalness of speech. To overcome this undesirable spectral smoothing, several formant enhancement algorithms have been reported. The one currently adopted by the HTS toolkit multiplies the cepstral coefficients by a fixed constant. Another widely studied post-processing approach subtracts a smoothed version of the spectrum and adds back a second, less heavily smoothed version of the original. The shortcoming these two approaches share is that the post-processing filter adapts neither to the frame nor to the characteristics of the phone. They ignore the fact that the trend and amount of spectral smoothing vary with the characteristics of each unit. In particular, semivowels such as `r`, `l`, `w`, and `y` vary significantly depending on the neighboring phones, so their formants are degraded more than those of other phones. For this reason, the goal of this study is to post-process HMM-based synthesized speech adaptively. To that end, the proposed algorithm is composed of three main parts. First, parameters representing the formant peaks are extracted from the spectral envelope of the training data and of the synthesized speech. The formant parameters are extracted directly from the spectral envelope rather than from spectral coefficients mainly because the algorithm is a post-processing one, applied after training and synthesis are complete. That is, a post-processing algorithm should be applicable without requiring any specific type of spectral coefficient other than the one used in the training and synthesis phases.
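As an illustration of the first part, the following Python sketch picks local maxima of a log-magnitude envelope and reports the location, energy, and bandwidth of each. It is not the thesis's implementation: the function name `formant_peaks`, the uniform frequency grid from 0 Hz to the Nyquist frequency, and the use of a 3 dB drop to measure bandwidth are all assumptions made for this example.

```python
import math


def formant_peaks(envelope_db, sample_rate, max_peaks=4, drop_db=3.0):
    """Return (location_hz, energy_db, bandwidth_hz) for envelope peaks.

    envelope_db is assumed to be a log-magnitude spectral envelope sampled
    on a uniform grid from 0 Hz to the Nyquist frequency (an assumption of
    this sketch, not something the abstract specifies).
    """
    n = len(envelope_db)
    hz_per_bin = (sample_rate / 2.0) / (n - 1)
    peaks = []
    for k in range(1, n - 1):
        # A peak is a bin strictly above its left neighbour and not below
        # its right neighbour.
        if envelope_db[k - 1] < envelope_db[k] >= envelope_db[k + 1]:
            level = envelope_db[k] - drop_db
            # Walk outwards to the first bins at or below the drop level.
            lo = k
            while lo > 0 and envelope_db[lo] > level:
                lo -= 1
            hi = k
            while hi < n - 1 and envelope_db[hi] > level:
                hi += 1
            bandwidth_hz = (hi - lo) * hz_per_bin
            peaks.append((k * hz_per_bin, envelope_db[k], bandwidth_hz))
    # Keep the lowest-frequency peaks only.
    return peaks[:max_peaks]
```

The bandwidth here is measured between the first bins that fall to or below the drop level, so it is only as fine as the bin spacing; interpolating between bins would give a less coarse estimate.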
Since the proposed method requires only the spectral envelope to extract the formant parameters, it can be applied to any type of spectral coefficients, LPC-based or cepstrum-based, used in the preceding training and synthesis steps. In particular, mel-generalized cepstral coefficients (MGC), which can act as LPC-based coefficients, cepstrum-based coefficients, or a combination of the two depending on the control parameters, are now widely used for HTS. The formant parameters extracted from the spectral envelope are the location, energy, and bandwidth of the formant peaks. The second part of the proposed algorithm constructs a codebook for each voiced phone from the extracted formant parameters. Twenty codebooks are generated, for 16 vowels and 4 semivowels, and each codebook has 32 codewords. K-means clustering was adopted to generate the codebooks because of its low complexity. To match each synthesized speech frame to a codeword, the Euclidean distance is calculated, and the codeword with the minimum Euclidean distance is used to calculate the post-filter coefficients. The third part constructs the post-processing filter from the codebook parameters. After a series of experiments, the Hamming window was found to give the best result among a number of filter types. The height and bandwidth of the Hamming filter are calculated from the codebook parameters. In addition, the relationship with neighboring frames is taken into account to avoid perceptible discontinuities in the energy of the refined formants: since the formants account for most of the energy in a frame, abrupt changes in the formant peaks of neighboring frames result in discontinuities in the speech sound. A subjective listening test was performed to evaluate the proposed algorithm against the post-processing algorithm adopted in the HTS toolkit.
Nineteen listeners participated in the test, which used 10 sentences randomly chosen from the CMU-ARCTIC database [24]. In the test, 79% of the listeners preferred the proposed algorithm, 5% reported that they found the two indistinguishable, and 15% preferred the baseline post-processing algorithm. It can therefore be seen that the proposed algorithm improved the quality of the synthesized speech by refining the formants of the synthesized speech frames.
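The codeword matching and Hamming-shaped enhancement described in the abstract can be sketched roughly as follows in Python. This is a hedged illustration only: the names `nearest_codeword`, `hamming_window`, and `enhance_peak`, the (location, energy, bandwidth) layout of a codeword, and the idea of adding the window to a log-magnitude envelope are assumptions of this example, since the abstract does not state how the filter height and bandwidth are derived from the codebook.

```python
import math


def nearest_codeword(frame_params, codebook):
    """Return the codeword with minimum Euclidean distance to frame_params."""
    return min(codebook, key=lambda cw: math.dist(cw, frame_params))


def hamming_window(length):
    """Symmetric Hamming window; a single-point window is just [1.0]."""
    if length == 1:
        return [1.0]
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * i / (length - 1))
            for i in range(length)]


def enhance_peak(envelope_db, center_bin, width_bins, gain_db):
    """Add a Hamming-shaped boost of height gain_db centred on center_bin.

    Returns a new list; the input envelope is left unchanged.
    """
    half = width_bins // 2
    window = hamming_window(2 * half + 1)
    out = list(envelope_db)
    for offset, w in enumerate(window, start=-half):
        k = center_bin + offset
        if 0 <= k < len(out):  # ignore window samples outside the envelope
            out[k] += gain_db * w
    return out
```

One possible use, again under assumptions, is to take the gain as the difference between the matched codeword's energy and the frame's own peak energy, for example `gain_db = max(0.0, codeword[1] - frame[1])`. Because location and bandwidth in Hz would dominate a raw Euclidean distance, the features would likely need to be normalized before matching.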

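This thesis introduces a new attempt to improve the quality of HMM-based speech synthesis by adaptively correcting formants according to phone and context information. Unlike existing approaches, which enhance formants by post-processing the HMM units of different phones and contexts identically, the proposed method defines and extracts parameters that reflect these characteristics in advance, builds codebooks from them, and uses the codebooks to derive post-filter parameters that differ for each phone and context. The parameters stored in the codebooks are the bandwidth, normalized energy, and location extracted from the formant peaks. For each synthesized speech frame, the codeword is chosen from the codebook of the corresponding phone as the one with the smallest Euclidean distance to the frame's parameters. The gain and bandwidth of a Hamming filter are determined from this codeword and applied to each formant peak. The proposed method has three advantages. First, whereas existing post-processing methods could enhance formant energy only by a fixed constant, the proposed method can apply adaptive gains across a wide range of smoothed energies. Second, the speech becomes less monotonous: because the filter gain is not fixed but varies with context information and phone characteristics, the energy varies more within a sentence. Third, the buzzy sound is weakened. This is because, when STRAIGHT generates the output speech from the given synthesis parameters, the energy of each frame is normalized by the energy of the whole sentence; since the speech produced by the proposed algorithm has more energy in the formant regions, the high-frequency energy is heard as relatively lower. To evaluate the proposed algorithm, a subjective listening test was conducted against the post-processing algorithm used in HTS. In the test, 79% of the subjects preferred the speech produced by the proposed post-processing algorithm, 5% answered that the difference between the two was subtle, and 15% preferred the speech processed with the post-processing algorithm built into HTS. It can therefore be seen that the proposed post-processing algorithm improved the quality of HMM-based speech synthesis.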

Additional bibliographic information

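Call No.: MEE 15092
Physical description: vii, 45 : illustrations ; 30 cm
Language: English
General notes: Author's name in Korean: 정성희
Advisor's name in English: Min Soo Hahn
Advisor's name in Korean: 한민수
Including Appendix
Thesis: Master's thesis, Korea Advanced Institute of Science and Technology (KAIST) : Department of Electrical Engineering
Bibliography note: References : p.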