KAIST Library

Main bibliographic information
Coherence-based quantitative analysis of reverberation effect on English automatic speech recognition error = 잔향이 영어 음성인식 오류에 끼치는 영향의 코히런스 기반 정량적 분석
Title / Author	Coherence-based quantitative analysis of reverberation effect on English automatic speech recognition error = 잔향이 영어 음성인식 오류에 끼치는 영향의 코히런스 기반 정량적 분석 / Hyeonuk Nam.
Publication	[Daejeon : Korea Advanced Institute of Science and Technology, 2020].

Holdings information
Registration number	8035934
Location / Call number	Academic Cultural Complex (Cultural Complex), preservation stacks / MME 20030
Status	Available (not for loan)

Abstract

Automatic speech recognition (ASR) is one of the core techniques for human-machine interaction, yet it is too vulnerable to external noise for real-life use. Reverberation in particular is convolutive in nature: it reduces speech clarity, which hinders ASR, and it is very difficult to remove from speech recorded in reverberant environments. Therefore, improving the robustness of ASR to reverberation is essential to applying ASR in various environments. In this research, as a preliminary study toward optimizing ASR performance on reverberated speech, the effect of reverberation on ASR error is quantitatively analyzed using coherence. The ASR setting used in this research is single-channel machine-listening ASR in English. Room impulse responses obtained under various reverberant conditions are convolved with clean speech from an English-language corpus to simulate reverberated speech. Coherence is used to measure the similarity between the reverberated speech spectrogram and the corresponding clean speech spectrogram at each time frame and frequency bin. A variable named mean phoneme coherence (MPC) is introduced to quantify the spectral contamination of a phoneme in reverberated speech. The MPC of a phoneme is obtained by averaging the coherence values over the time frames and frequency bins within the time interval in which that phoneme is articulated. Spectral contamination of a phoneme is small when its MPC is close to one and severe when its MPC is close to zero. By applying ASR to reverberated speech and comparing the MPC distributions of each phoneme in correctly and wrongly recognized words, it is shown that MPC values are statistically higher when phonemes belong to correctly recognized words than when they belong to wrongly recognized words. This result quantitatively verifies that severe spectral contamination by reverberation leads to more ASR errors.
By comparing the MPC distributions of phoneme groups, it is shown that, as spectral contamination increases, stops increase the ASR error rate the least while fricatives increase it the most. In addition, sequential interaction between phonemes is analyzed by grouping phonemes into voiced consonants, unvoiced consonants, and vowels. As spectral contamination increases, voiced consonants increase the ASR error rate less when preceded by consonants. In contrast, vowels and unvoiced consonants increase the ASR error rate more when one precedes the other. With these methods, the physical interactions between phonemes and reverberation-induced spectral contamination, and their effect on English ASR error, are quantitatively analyzed based on coherence.
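The MPC computation described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the thesis's implementation: the abstract does not give the STFT settings or say how coherence is estimated per time-frequency bin, so the sketch assumes a Hann-windowed STFT and a moving average over neighbouring frames for the cross- and auto-spectra. The function names (`stft`, `smooth_frames`, `coherence_map`, `mean_phoneme_coherence`), the white-noise stand-in for clean speech, the synthetic decaying-noise room impulse response, and the phoneme interval are all hypothetical.

```python
import numpy as np

def stft(x, n_fft=256, hop=128):
    """Hann-windowed short-time Fourier transform, shape (frames, bins)."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack([x[i * hop:i * hop + n_fft] * window
                       for i in range(n_frames)])
    return np.fft.rfft(frames, axis=1)

def smooth_frames(a, width):
    """Moving average over `width` neighbouring time frames (axis 0)."""
    kernel = np.ones(width) / width
    return np.apply_along_axis(
        lambda col: np.convolve(col, kernel, mode="same"), 0, a)

def coherence_map(clean, reverb, n_fft=256, hop=128, width=5):
    """Magnitude-squared coherence per time frame and frequency bin.

    Assumption: spectra are smoothed over `width` frames; without some
    averaging the coherence of any two signals is identically one.
    """
    X, Y = stft(clean, n_fft, hop), stft(reverb, n_fft, hop)
    sxy = smooth_frames(X * np.conj(Y), width)
    sxx = smooth_frames(np.abs(X) ** 2, width)
    syy = smooth_frames(np.abs(Y) ** 2, width)
    return np.abs(sxy) ** 2 / (sxx * syy + 1e-12)

def mean_phoneme_coherence(coh, start_s, end_s, fs, hop=128):
    """Average coherence over all bins of the frames in [start_s, end_s)."""
    first = int(start_s * fs / hop)
    last = max(first + 1, int(np.ceil(end_s * fs / hop)))
    return float(coh[first:last].mean())

# Synthetic stand-ins (assumptions): white noise as "clean speech" and an
# exponentially decaying noise tail as the room impulse response.
fs = 16000
rng = np.random.default_rng(0)
clean = rng.standard_normal(fs)
t = np.arange(int(0.3 * fs)) / fs
rir = rng.standard_normal(len(t)) * np.exp(-t / 0.05)
rir[0] = 1.0                                   # direct path
reverb = np.convolve(clean, rir)[:len(clean)]  # simulated reverberated speech

# Hypothetical phoneme interval from 0.20 s to 0.35 s.
mpc_clean = mean_phoneme_coherence(coherence_map(clean, clean), 0.20, 0.35, fs)
mpc_reverb = mean_phoneme_coherence(coherence_map(clean, reverb), 0.20, 0.35, fs)
print(f"MPC clean vs clean:  {mpc_clean:.3f}")
print(f"MPC clean vs reverb: {mpc_reverb:.3f}")
```

In real use the interval boundaries would come from a phoneme-level alignment of the corpus, and the smoothing width trades time resolution against the bias of the coherence estimate.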

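Speech recognition technology plays a key role in human-machine interaction, but it is too vulnerable to the various noises mixed into the speech signal to be used in everyday life. Reverberation in particular is a noise in which the target speech signal is convolved with the characteristics of the room; it lowers speech clarity and greatly hinders speech recognition. Making speech recognition robust to reverberation is therefore essential for applying it widely in everyday life. As a preliminary study toward optimizing the recognition rate of reverberant speech signals, this research uses coherence to quantitatively analyze the physical principle by which reverberation causes speech recognition errors. The speech recognition setting used is single-channel machine-listening English speech recognition. Room impulse responses obtained in various reverberant environments were convolved with reverberation-free clean English speech signals to obtain reverberant speech signals. By computing the coherence between the original clean speech signal and the reverberant speech signal, the similarity between the two signals was obtained at each time and frequency. From this, a variable called "mean phoneme coherence (MPC)" is proposed, obtained by averaging the coherence over all frequencies and times within the time interval corresponding to each phoneme of the speech signal. MPC quantifies how much the spectrogram of each phoneme in the reverberant speech signal has been contaminated by reverberation. A phoneme whose MPC is close to 1 has little spectrogram contamination from reverberation, and one whose MPC is close to 0 has heavy contamination. When the reverberant speech signals are passed through speech recognition, the output is compared with the original sentences, and the MPC distributions in correctly and wrongly recognized words are compared phoneme by phoneme, the MPC values of a phoneme are statistically higher when it belongs to correctly recognized words than when it belongs to wrongly recognized words. This confirms quantitatively that speech recognition errors are more likely when spectrogram contamination by reverberation is heavy. Analyzing phonemes by group confirmed quantitatively that, as spectrogram contamination increases (that is, as MPC decreases), stops raise the recognition error rate the least and fricatives raise it the most. In addition, phonemes were divided into voiced consonants, unvoiced consonants, and vowels, and the differences in MPC distribution when each group is pronounced after each other group were analyzed. As a result, it was confirmed quantitatively that the rise in recognition error rate with increasing spectrogram contamination is lowest when a voiced consonant follows a consonant and highest when an unvoiced consonant and a vowel are adjacent to each other. In this way, this research used coherence to quantitatively analyze the relationship between reverberation-induced spectral contamination and the English speech recognition error rate of words containing particular phonemes.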
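The comparison of MPC distributions between correctly and wrongly recognized words can be illustrated with a small sketch. The abstract does not name the statistical test used, so this example makes no claim about the thesis's method: it simply reports the two medians and the empirical probability that an MPC value drawn from correctly recognized words exceeds one drawn from wrongly recognized words. The function names and the MPC values below are hypothetical, made up purely for illustration.

```python
import numpy as np

def prob_superiority(correct, wrong):
    """Empirical P(MPC_correct > MPC_wrong), counting ties as one half."""
    c = np.asarray(correct, dtype=float)[:, None]
    w = np.asarray(wrong, dtype=float)[None, :]
    return float(np.mean((c > w) + 0.5 * (c == w)))

def summarize(correct, wrong):
    """Medians of both groups and the probability of superiority."""
    return {
        "median_correct": float(np.median(correct)),
        "median_wrong": float(np.median(wrong)),
        "p_correct_higher": prob_superiority(correct, wrong),
    }

# Hypothetical MPC values for a single phoneme (illustrative only,
# not data from the thesis).
mpc_in_correct_words = [0.82, 0.75, 0.91, 0.68, 0.88]
mpc_in_wrong_words = [0.55, 0.71, 0.47, 0.62]

summary = summarize(mpc_in_correct_words, mpc_in_wrong_words)
for name, value in summary.items():
    print(f"{name}: {value:.3f}")
```

A value of `p_correct_higher` near 0.5 would mean the two distributions are hard to tell apart for that phoneme, while values near 1 correspond to the pattern the abstract reports, with higher MPC in correctly recognized words.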

Other bibliographic information
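Call number	MME 20030
Physical description	vii, 45 p. : illustrations ; 30 cm
Language	English
General note	Author's name in Korean : 남현욱 Advisor's name in English : Yong-Hwa Park Advisor's name in Korean : 박용화
Thesis	Thesis (Master's) - Korea Advanced Institute of Science and Technology : Department of Mechanical Engineering,
Bibliography note	References : p. 42-43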
