Speech is a basic and natural means of human communication, and its use in man-machine interfaces is therefore increasing rapidly. A text-to-speech (TTS) system transforms arbitrary input text into a speech signal by synthesizing and concatenating speech units. It provides a convenient interface and is used in a variety of applications.
The sinusoidal model, one of the speech synthesis methods used in TTS systems, is known to allow flexible modification of speech characteristics and to produce high-quality synthetic speech. Conventional sinusoidal methods use the pitch onset time measured in the speech analysis phase to synthesize speech units, but inaccuracy in this measurement causes spectral distortion in the synthesized units. Moreover, discontinuities appear at the boundaries of concatenated speech units.
In this thesis, a novel method for unit synthesis and concatenation is proposed to solve these two problems, namely intra-unit distortion and inter-unit discontinuity. The proposed method takes the center of each synthesis frame as the pitch onset time; that is, the system phase is assumed to equal the vocal tract phase, so the procedure for computing the pitch onset time is unnecessary. Speech units are concatenated by using phase succession of the sinusoids and by interpolating the sinusoid amplitudes over several frames near the concatenation point.
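The two ideas above can be illustrated with a minimal sketch. It is not the implementation used in the thesis: the function names, the constant per-sinusoid frequencies, and the linear amplitude ramp are all assumptions made for illustration. `synthesize_frame` measures phase from the frame center, so no onset time is needed, and `synthesize_join` continues each sinusoid's phase from the end of the left unit while interpolating amplitudes across several frames.

```python
import math


def synthesize_frame(amps, freqs_hz, system_phases, frame_len, fs):
    """Synthesize one frame, treating the frame center as the pitch onset time.

    Hypothetical helper: each sinusoid's phase is measured from the frame
    center, so at the center it equals the given system phase and no
    separately estimated onset time is required.
    """
    center = frame_len // 2
    frame = [0.0] * frame_len
    for amp, f, phi in zip(amps, freqs_hz, system_phases):
        step = 2.0 * math.pi * f / fs  # phase advance per sample
        for n in range(frame_len):
            frame[n] += amp * math.cos(step * (n - center) + phi)
    return frame


def synthesize_join(amps_a, amps_b, freqs_hz, start_phases,
                    frame_len, n_frames, fs):
    """Synthesize n_frames frames across a unit boundary.

    Hypothetical helper. amps_a and amps_b are the sinusoid amplitudes of
    the left and right units at the join; start_phases are the phases
    reached at the end of the left unit. Frequencies are held constant,
    which is a simplifying assumption of this sketch.
    """
    total = frame_len * n_frames
    out = [0.0] * total
    end_phases = []
    for a0, a1, f, ph in zip(amps_a, amps_b, freqs_hz, start_phases):
        step = 2.0 * math.pi * f / fs
        for n in range(total):
            # Linear amplitude interpolation over the whole join region.
            w = n / (total - 1) if total > 1 else 1.0
            amp = (1.0 - w) * a0 + w * a1
            # Phase succession: continue from the left unit's final phase.
            out[n] += amp * math.cos(ph + step * n)
        # Phase to hand on to whatever is synthesized next.
        end_phases.append((ph + step * total) % (2.0 * math.pi))
    return out, end_phases
```

For example, `synthesize_join([1.0], [0.5], [100.0], [0.0], 80, 3, 8000.0)` produces 240 samples whose first sample continues the left unit's phase exactly while the amplitude ramps from 1.0 down to 0.5.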
To evaluate the proposed method, intelligibility and naturalness tests were carried out. Experimental results showed that speech samples synthesized by the proposed method were better than those produced by the conventional methods in both tests.