Korea Advanced Institute of Science and Technology (KAIST) Library

Main bibliographic information
Learning from demonstrations under transition dynamic mismatch = 상태 전이 함수의 변화에 강인한 시연 학습
Title / Author	Learning from demonstrations under transition dynamic mismatch = 상태 전이 함수의 변화에 강인한 시연 학습 / Taesu Kim.
Publication	[Daejeon : Korea Advanced Institute of Science and Technology, 2022].

Holdings

Registration number	8039906
Location / Call number	Academic Cultural Complex (Library), 2nd floor, Theses / MCS 22050
Status	Available (not for loan)

Abstract

Demonstrations relieve the difficulties of Reinforcement Learning (RL). Learning from demonstrations (LfD) is the problem of seeking optimal policies without true reward signals, which are necessary in RL. Demonstrations also help to speed up learning a new task in RL. Practical challenges arise when handling demonstrations: (1) when the environments of the agent and the demonstrator (in particular, their transition dynamics functions) differ, and (2) when the demonstrations are suboptimal or too few. Prior work, Indirect Imitation Learning (I2L), overcomes different dynamics by matching state-only distributions instead of state-action distributions; however, its performance is limited to that of the demonstrator. On the other hand, Trajectory-ranked Reward Extrapolation (TREX) outperforms the demonstrator by inferring a high-quality reward function from ranked demonstrations, but the learned reward model inevitably performs poorly under dynamics mismatch. Likewise, behavioral priors learned from diverse demonstrations can accelerate RL, but they are not useful in a new environment with different dynamics. First, in this thesis, we propose a novel algorithm that handles both challenges. It learns a reward function from ranked demonstrations while accounting for domain mismatch by means of the I2L algorithm. Additionally, for environments with no dynamics mismatch, I2L in the proposed method is replaced with Adversarial Inverse Reinforcement Learning (AIRL). This variant benefits from a data augmentation effect when demonstrations are few. In experiments on continuous physical locomotion tasks, the proposed method outperforms the I2L and TREX baselines by up to 330%. Our method is shown to be robust to transition dynamics mismatch between the agent and the demonstrator, and it obtains good policies from suboptimal demonstrations. The method with AIRL also outperforms the baselines when there is no dynamics mismatch.
Second, we propose a method for accelerating RL that incorporates past observations collected under dynamics different from those of the new task.

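Demonstration data alleviates the difficulties of reinforcement learning. Learning from demonstrations is the problem of finding an optimal policy without reward signals, which are an essential element of reinforcement learning. Demonstration data also helps to solve a given task quickly in reinforcement learning. Problems arise when handling demonstration data in two cases: (1) when the environments of the agent and the demonstrator (in particular, the state transition functions) differ, and (2) when the accessible demonstrations are suboptimal or too few in number. Indirect Imitation Learning (I2L), a prior technique, solves the problem of differing environments by matching state distributions, but the performance of this model is limited to that of the demonstrator. Meanwhile, Trajectory-ranked Reward Extrapolation (TREX) infers a high-quality reward function from the ranking of the demonstrations and thereby outperforms the demonstrator. The learned reward model, however, inevitably performs poorly when the environments differ. Likewise, a behavioral prior model learned from diverse demonstrations can accelerate reinforcement learning but is not useful in a new environment with a different state transition function. In this thesis, we first propose a new learning-from-demonstrations algorithm that handles both of the problems above. It learns a reward function from ranked demonstrations while accounting for domain mismatch through the indirect imitation learning algorithm. In environments with no mismatch, indirect imitation learning is replaced with Adversarial Inverse Reinforcement Learning (AIRL). This gains the benefit of a data augmentation effect when demonstrations are few. In several experiments, the proposed method outperforms prior techniques by up to 330%. Our method is robust to state transition function mismatch between the agent and the demonstrator and finds good policies from suboptimal demonstrations. Our method also outperforms prior work when there is no mismatch. Second, we propose a method that uses state data from demonstrations to accelerate reinforcement learning even in environments with a different state transition function.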
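
The ranking-based reward learning that the abstract attributes to TREX can be illustrated with a small sketch. It assumes a linear, state-only reward r(s) = w · s and a Bradley-Terry style pairwise loss over trajectory returns; the function names, the linear reward form, the plain gradient descent loop, and the toy data are all illustrative assumptions and are not taken from the thesis.

```python
import numpy as np


def trajectory_return(weights, states):
    """Sum of per-state linear rewards r(s) = w . s over one trajectory."""
    return float(np.sum(states @ weights))


def pairwise_ranking_loss(weights, worse, better):
    """Cross-entropy loss -log P(better is preferred over worse)."""
    r_worse = trajectory_return(weights, worse)
    r_better = trajectory_return(weights, better)
    # Stable form of log(exp(r_worse) + exp(r_better)) - r_better.
    return float(np.logaddexp(r_worse, r_better) - r_better)


def ranking_loss_gradient(weights, worse, better):
    """Analytic gradient of the pairwise loss for the linear reward."""
    phi_worse = worse.sum(axis=0)    # summed state features of the worse trajectory
    phi_better = better.sum(axis=0)  # summed state features of the better trajectory
    r_worse = phi_worse @ weights
    r_better = phi_better @ weights
    # Softmax probability that the model prefers the worse trajectory.
    p_worse = np.exp(r_worse - np.logaddexp(r_worse, r_better))
    return p_worse * (phi_worse - phi_better)


def fit_reward(ranked, dim, lr=0.05, steps=200):
    """Fit reward weights; `ranked` lists trajectories from worst to best."""
    weights = np.zeros(dim)
    pairs = [(i, j) for i in range(len(ranked)) for j in range(i + 1, len(ranked))]
    for _ in range(steps):
        grad = np.zeros(dim)
        for i, j in pairs:
            grad += ranking_loss_gradient(weights, ranked[i], ranked[j])
        weights -= lr * grad / len(pairs)
    return weights


# Toy data (assumption): one state feature, e.g. forward velocity,
# which is higher in better-ranked trajectories.
ranked_demos = [np.full((10, 1), v) for v in (0.1, 0.5, 1.0)]
learned = fit_reward(ranked_demos, dim=1)
print("learned reward weight:", learned)
```

Because the reward here depends only on states, the same sketch applies to state-only demonstrations; it does not model the dynamics-mismatch handling (I2L or AIRL) that the thesis combines with ranking.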

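Other bibliographic information

Call number	MCS 22050
Physical description	iii, 18 p. : illustrations ; 30 cm
Language	English
General note	Author's name in Korean: 김태수; advisor's name in English: Tae-Kyun Kim; advisor's name in Korean: 김태균
Dissertation note	Thesis (Master's) - Korea Advanced Institute of Science and Technology : School of Computing,
Bibliography note	References: p. 16-18
Subject	Learning from Demonstrations; Imitation Learning; Reward Learning; Reinforcement Learning; 시연 학습; 모방 학습; 보상 학습; 강화 학습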
