In cognitive science, the top-down selective attention (TDSA) mechanism of humans has been studied for decades and is known to be controlled by "objects" in our mind via feedback processes. This cognitive process enhances the perceptual saliency of the response to the object of interest and filters out irrelevant responses. Engineering models using TDSA have been proposed for out-of-vocabulary rejection and isolated word recognition. In this work, we apply the TDSA mechanism to the N-best rescoring framework to provide attentional information about confusing words within competing hypotheses. The TDSA mechanism adapts a test input feature toward each of several confusing words, and the attentional information required to rescore the hypotheses is then derived from the probability of the adapted feature and the amount of feature deformation.

Recently, numerous neural network models with attention have been developed and successfully applied to diverse tasks. The sequence-to-sequence learning framework with attention has become especially popular for sequence labeling tasks such as neural machine translation, image caption generation, and speech recognition. Whereas previous attention works predict a soft window over the input sequence corresponding to each output target, our approach adapts a test input feature "directly", using a gradient to maximize the probability of the feature given the target words. Our system therefore provides the most probable feature for the target words without the need to train extra attention networks.

We propose N-best rescoring and utterance verification systems that integrate attentional information for locally confusing words, extracted from alternative hypotheses, into a conventional speech recognition system. The attentional information is derived by adapting a test input feature for the word of interest, motivated by the top-down selective attention mechanism of the brain.
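The gradient-based feature adaptation described above can be sketched as follows. This is a minimal illustration, not the paper's actual recognizer: a toy linear-softmax "acoustic model" (`W`, `b`) stands in for the real model, and the deformation-based stopping threshold `max_deform` is an assumed stand-in for the stopping criterion used in the work.

```python
import numpy as np

def tdsa_adapt(x, target, W, b, lr=0.1, max_steps=50, max_deform=1.0):
    """Adapt input feature x by gradient ascent on log p(target | x)
    under a toy softmax model with logits = W @ x + b.

    Returns the adapted feature, its log-probability for the target,
    and the amount of feature deformation ||x_adapt - x||, the two
    quantities from which the attentional information is derived.
    """
    x_adapt = x.copy()
    for _ in range(max_steps):
        logits = W @ x_adapt + b
        p = np.exp(logits - logits.max())
        p /= p.sum()
        # For a linear-softmax model, d log p(target|x) / dx = W[target] - p @ W
        grad = W[target] - p @ W
        x_adapt += lr * grad
        # Stop when the feature has deformed too far from the original
        # (a stand-in for the stopping criterion that avoids over-fitting).
        if np.linalg.norm(x_adapt - x) > max_deform:
            break
    logits = W @ x_adapt + b
    p = np.exp(logits - logits.max())
    p /= p.sum()
    return x_adapt, np.log(p[target]), np.linalg.norm(x_adapt - x)
```

Running this for each confusing word in the competing hypotheses yields, per word, an adapted-feature probability and a deformation amount that can feed the confidence measure.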
To rescore the competing hypotheses, we define a new confidence measure that combines the conventional posterior probability with the attentional information for the confusing words. In addition, a neural network is designed to provide different weights within the confidence measure for each utterance, and the network is optimized to minimize the word error rate. Tests on the WSJ and Aurora4 speech recognition tasks were conducted, and our best rescoring results achieve word error rates of 3.83% and 11.09%, a relative reduction of 5.20% and 2.55% over the baselines, respectively.
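The rescoring step can be sketched as below. The linear combination rule and the fixed 3-vector of weights are illustrative assumptions; in the paper the weights are produced per utterance by a trained network, and the exact form of the confidence measure is not reproduced here.

```python
def rescored_confidence(log_posterior, attn_logprob, deformation, weights):
    """Combine the conventional posterior with the attentional
    information (adapted-feature log-probability and amount of
    deformation). A large deformation suggests the target word is
    a poor fit, so it is penalized.
    """
    w1, w2, w3 = weights
    return w1 * log_posterior + w2 * attn_logprob - w3 * deformation

def rescore_nbest(hypotheses, weights):
    """Pick the hypothesis with the highest combined confidence.

    Each hypothesis is a tuple:
    (text, log_posterior, attn_logprob, deformation).
    """
    return max(
        hypotheses,
        key=lambda h: rescored_confidence(h[1], h[2], h[3], weights),
    )
```

For example, a hypothesis with a slightly lower posterior can still win if its adapted feature is more probable and required less deformation.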
In cognitive science, the top-down selective attention (TDSA) mechanism of humans has been studied for decades. It is known to be controlled by "objects" in our mind via feedback processes. This cognitive process enhances the response to the object of interest and filters out irrelevant parts. Engineering models using TDSA have been proposed for problems such as out-of-vocabulary rejection and isolated word recognition. In this paper, the TDSA mechanism is used to provide attentional information about confusing words from competing hypotheses. The TDSA mechanism adapts the input feature by maximizing the log-likelihood of each confusing word, and a stopping criterion is applied to avoid over-fitting during the iterative TDSA process. The attentional information is then integrated into a conventional ASR system through a confidence measure. In addition, a neural network is designed to output different rescoring weights for each utterance within the proposed confidence measure, and the network is optimized to minimize the word error rate. Tests on the WSJ and Aurora4 speech recognition tasks were conducted, and the best results achieve word error rates of 3.83% and 11.09%, a relative reduction of 5.20% and 2.55% over the baselines, respectively.