KAIST (Korea Advanced Institute of Science and Technology) Library

Main bibliographic information
Models and algorithms for inverse reinforcement learning = 역강화학습을 위한 모델과 알고리즘
Title / Author	Models and algorithms for inverse reinforcement learning = 역강화학습을 위한 모델과 알고리즘 / Jae-Deug Choi.
Publication	[Daejeon : Korea Advanced Institute of Science and Technology, 2013].

Holdings

Accession number	8025526
Location / Call number	Academic Cultural Complex (Munhwagwan), preservation stacks / DCS 13025
Status	Available (not for loan)

Abstract

Reinforcement learning (RL) is the problem of how an agent can learn to behave optimally. The reward function plays an important role in determining optimality because it specifies how much reward or punishment is given in every situation. To formalize an RL problem, we therefore need an appropriate reward function that describes the objective of the problem or the preferences of the agent. In practice, however, specifying such a function is difficult: it is often hand-tuned iteratively by domain experts until RL algorithms produce a satisfactory strategy. A systematic way to determine the reward function is thus highly desirable to avoid this labor-intensive process. The main focus of this thesis is inverse reinforcement learning (IRL), which aims to infer, from the domain expert's behavior data, the reward function that she is optimizing. Since IRL provides a framework for exploring the principles behind behavior, it can be used in various research areas, such as examining human and animal behavior, building intelligent agents that imitate a demonstrator, and developing econometric models of decision making. A number of IRL algorithms have appeared in the literature during the last decade, but several challenges remain:
(1) The IRL problem is inherently ill-posed: infinitely many reward functions make the expert's behavior optimal.
(2) The expert is generally assumed to behave optimally, but she may choose sub-optimal actions.
(3) The behavior data is typically assumed to be generated by a single expert with a single reward function. In practice, it is often gathered from a number of experts to obtain enough data.
(4) When dealing with large problems, we assume that pre-defined features are given and find the reward function as a linear function of those features. However, it is difficult to specify features that compactly represent the reward structure.
(5) Although the expert is generally assumed to be able to perceive the current situation completely, she may have limited sensory capability.

In this thesis, we develop novel models and algorithms for finding the reward function from the expert's behavior data while addressing several limitations of previous approaches. We first propose a Bayesian framework for IRL to address its inherent ill-posedness. We then extend this framework to relax three assumptions: an optimal expert, behavior data from a single expert, and pre-defined features. Finally, we provide a general framework for IRL that can deal with the expert's limited sensory capability.
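Challenge (1), the ill-posedness of IRL, can be made concrete on a toy problem. The following is a minimal sketch, not code from the thesis: the three-state chain, the discount factor `GAMMA`, and the helper names `next_state`, `q_values`, and `optimal_actions` are all hypothetical. It runs value iteration on a discounted, infinite-horizon MDP with a state-based reward and compares the sets of optimal actions under a reward `R`, a positively scaled copy `2R`, a shifted copy `R + 1`, and the all-zero reward.

```python
# Toy illustration of IRL's ill-posedness (hypothetical MDP, not from the thesis).
GAMMA = 0.9                 # assumed discount factor
N_STATES, N_ACTIONS = 3, 2  # assumed 3-state chain; action 0 = left, 1 = right


def next_state(s: int, a: int) -> int:
    """Deterministic move left or right, clipped at the ends of the chain."""
    return max(0, s - 1) if a == 0 else min(N_STATES - 1, s + 1)


def q_values(reward: list[float], iters: int = 500) -> list[list[float]]:
    """Value iteration on Q(s, a) = R(s) + GAMMA * max_a' Q(s', a')."""
    q = [[0.0] * N_ACTIONS for _ in range(N_STATES)]
    for _ in range(iters):
        # The right-hand side is built from the previous q before rebinding.
        q = [[reward[s] + GAMMA * max(q[next_state(s, a)]) for a in range(N_ACTIONS)]
             for s in range(N_STATES)]
    return q


def optimal_actions(reward: list[float], tol: float = 1e-9) -> list[frozenset[int]]:
    """For each state, the set of actions whose Q-value is within tol of the best."""
    q = q_values(reward)
    return [frozenset(a for a in range(N_ACTIONS) if q[s][a] >= max(q[s]) - tol)
            for s in range(N_STATES)]


if __name__ == "__main__":
    base = [0.0, 0.0, 1.0]             # hypothetical reward: the right end is rewarding
    scaled = [2 * r for r in base]     # positive scaling
    shifted = [r + 1.0 for r in base]  # constant shift
    zero = [0.0] * N_STATES            # degenerate reward

    # Scaling and shifting leave the optimal behavior ("always move right") unchanged.
    print(optimal_actions(base) == optimal_actions(scaled) == optimal_actions(shifted))  # True
    # Under the zero reward every action is optimal, so the same behavior is optimal too.
    print(optimal_actions(zero))
```

Because the "always move right" behavior is optimal under all four rewards, observing that behavior alone cannot tell them apart; this is the kind of ambiguity a Bayesian treatment addresses by placing a prior and a likelihood over candidate rewards.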

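Reinforcement learning is the problem of learning an agent's optimal behavior for achieving a goal. The reward function evaluates the agent's behavior in light of its goals or preferences and therefore plays an important role in determining optimal behavior. To formalize a reinforcement learning problem, one must thus determine a reward function that expresses the agent's goals or preferences. Determining this function is difficult, however: typically, a domain expert repeatedly modifies the reward function until the desired policy is obtained. Because this process takes considerable time and effort, a more systematic method is needed. This research concerns inverse reinforcement learning, which infers, from behavior data produced by a domain expert, the reward function that the expert is implicitly optimizing. Inverse reinforcement learning uncovers the principles behind a decision maker's behavior and can be applied in a variety of research areas, such as the study of human and animal behavior, the construction of intelligent agents that imitate given behavior, and the development of economic models for decision making. Although there has been much research on inverse reinforcement learning over the past decade or so, the following problems remain. (1) The inverse reinforcement learning problem is inherently ill-posed: infinitely many reward functions make the given expert's behavior optimal. (2) The expert generating the behavior data is assumed to always act optimally, but experts commonly take sub-optimal actions from time to time. (3) The behavior data is assumed to be generated by a single expert with a fixed reward function, yet to obtain a sufficiently large amount of data, it is realistic for multiple experts to take part in generating it. (4) When dealing with large problems, many studies assume that the reward function is a linear function of feature functions given in advance, but finding feature functions that represent the reward function linearly and compactly is difficult. (5) The expert is assumed to perceive the current situation exactly, but sensor limitations can make this difficult. This research proposes new models and algorithms for inverse reinforcement learning that address the limitations of previous studies described above. First, to address the inherent ill-posedness of the problem, we propose a Bayesian framework for inverse reinforcement learning. We then extend it to overcome three assumptions: that the expert is optimal, that the behavior data comes from a single expert, and that the feature functions are predefined. Finally, we propose a framework that addresses problems caused by the expert's limited sensors.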
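The Bayesian view of IRL mentioned in the abstract can also be sketched on a toy problem. This is a minimal sketch under stated assumptions, not the thesis's model: it assumes a Boltzmann (softmax) expert, P(a | s, R) ∝ exp(α·Q*(s, a; R)), a likelihood commonly used in Bayesian IRL that gives sub-optimal actions nonzero probability; a hypothetical three-state chain; a uniform prior over a small finite set of candidate rewards; and hypothetical names (`ALPHA`, `log_likelihood`, `posterior`). Practical methods typically work over continuous reward spaces, for example with sampling or MAP optimization; enumerating three candidates here only shows how the posterior is formed.

```python
# Toy Bayesian IRL posterior (hypothetical MDP and candidates, not from the thesis).
import math

GAMMA, ALPHA = 0.9, 2.0     # assumed discount factor and Boltzmann confidence
N_STATES, N_ACTIONS = 3, 2  # assumed 3-state chain; action 0 = left, 1 = right


def next_state(s: int, a: int) -> int:
    """Deterministic move left or right, clipped at the ends of the chain."""
    return max(0, s - 1) if a == 0 else min(N_STATES - 1, s + 1)


def q_values(reward: list[float], iters: int = 500) -> list[list[float]]:
    """Value iteration on Q(s, a) = R(s) + GAMMA * max_a' Q(s', a')."""
    q = [[0.0] * N_ACTIONS for _ in range(N_STATES)]
    for _ in range(iters):
        q = [[reward[s] + GAMMA * max(q[next_state(s, a)]) for a in range(N_ACTIONS)]
             for s in range(N_STATES)]
    return q


def log_likelihood(demos: list[tuple[int, int]], reward: list[float]) -> float:
    """log P(demos | R), with P(a | s, R) proportional to exp(ALPHA * Q*(s, a; R))."""
    q = q_values(reward)
    total = 0.0
    for s, a in demos:
        m = max(q[s])  # subtract the max for a numerically stable log-sum-exp
        log_z = ALPHA * m + math.log(sum(math.exp(ALPHA * (x - m)) for x in q[s]))
        total += ALPHA * q[s][a] - log_z
    return total


def posterior(demos: list[tuple[int, int]], candidates: list[list[float]]) -> list[float]:
    """Posterior over a finite set of candidate rewards under a uniform prior."""
    log_post = [log_likelihood(demos, r) for r in candidates]  # uniform prior cancels
    m = max(log_post)
    weights = [math.exp(lp - m) for lp in log_post]
    total = sum(weights)
    return [w / total for w in weights]


if __name__ == "__main__":
    # Three "always move right" trajectories plus one sub-optimal left move in state 0.
    demos = [(0, 1), (1, 1), (2, 1)] * 3 + [(0, 0)]
    candidates = [[0.0, 0.0, 1.0],  # right end rewarding
                  [1.0, 0.0, 0.0],  # left end rewarding
                  [0.0, 0.0, 0.0]]  # zero reward
    print([round(p, 3) for p in posterior(demos, candidates)])  # [0.983, 0.0, 0.017]
```

The single left move lowers, but does not eliminate, the posterior mass on the right-end reward, which is how a softmax likelihood tolerates an occasionally sub-optimal expert. The zero reward assigns probability 1/2 to every action, so it loses mass as consistent demonstrations accumulate, even though it makes every behavior optimal.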

Other bibliographic information
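Call number	DCS 13025
Physical description	viii, 102 p. : ill. ; 30 cm
Language	English
General note	Author's name in Korean: 최재득; Advisor's name in English: Kee-Eung Kim; Advisor's name in Korean: 김기응; Journal publication: "Inverse Reinforcement Learning in Partially Observable Environments", Journal of Machine Learning Research, 12, pp. 663-702 (2011)
Thesis note	Thesis (Ph.D.) - Korea Advanced Institute of Science and Technology : Department of Computer Science
Bibliography note	References: p. 93-98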
