KAIST Library (Korea Advanced Institute of Science and Technology)

Main bibliographic information
Data augmentation for natural language processing = 자연언어처리를 위한 데이터 증강방법
Title / Author	Data augmentation for natural language processing = 자연언어처리를 위한 데이터 증강방법 / Seanie Lee.
Publication	[Daejeon : Korea Advanced Institute of Science and Technology, 2022].

Holdings information

Registration number

8039062

Location / Call number

Academic Cultural Complex (Library), 2nd floor, Theses

MAI 22013

Item status

Available (not for loan)

Abstract

Deep neural networks have achieved remarkable performance on various natural language processing tasks, such as text classification, machine translation, and question answering. Although pretraining a model on large unlabeled corpora and finetuning it on labeled data is a sample-efficient method, it still requires a large amount of annotated data. Data augmentation is known to be one of the most effective methods for tackling the problem of scarce labeled data. However, it is challenging to construct a well-defined data augmentation for NLP that preserves the semantics of the original data while adding diversity. In this thesis, we propose three data augmentation methods for question answering and conditional text generation tasks. First, we leverage probabilistic generative models regularized with information maximization to sample diverse and consistent question-answer pairs. Second, we propose adversarial perturbation to generate negative examples for text generation, and we train a text generation model to push negative examples away from the given source sentences. Last, we propose a stochastic word embedding perturbation to regularize a QA model for domain generalization. With stochastic word embedding perturbation, we can transform the original question and context without any semantic drift.

Other bibliographic information
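Call number	MAI 22013
Physical description	iv, 47 p. : illustrations ; 30 cm
Language	English
General note	Author's name in Korean: 이신의. Advisor's name in English: Sung Ju Hwang. Advisor's name in Korean: 황성주. Co-advisor's name in English: Juho Lee. Co-advisor's name in Korean: 이주호.
Thesis	Thesis (Master's) - Korea Advanced Institute of Science and Technology : Kim Jaechul Graduate School of AI,
Bibliography note	References : p. 36-45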
