This paper describes a novel approach for Visual Question Answering.
Our proposed network solves an open-ended problem with candidate answer recommendation, which is generated solely from the given question.
Then, we combine the score from our question-aware prediction module and the score from candidate answer recommendation module to determine the final composite score.
Our approach uses the bag-of-words (BOW) framework to understand questions, instead of a complex and neural-network-based module; therefore, an additional dataset to pre-train the language model is not required.
Moreover, we show that the BOW framework is capable of extracting the keywords from the question.
Although our proposed approach does not achieve the state-of-the-art performance overall, our approach performs the best for certain types of questions with a small amount of training data.
이 논문은 영상기반 질의응답에 대한 새로운 접근을 제시하였다. 제시된 방법은 질문으로부터 답변을 추천하고, 질문인지 예측 모듈과 답변들의 추천 점수를 합하여 주관식 문제를 해결한다. 논문의 접근법은 복잡한 인공 신경망 알고리즘을 사용하지 않고, 비교적 단순한 방법인 Bag-of-word를 사용하기 때문에 큰 데이터를 이용한 추가적인 학습이 필요하지 않다. 또한, BOW를 사용하여 질문에서 중요한 키워드를 찾아낼 수 있음을 보였다..