한국과학기술원 도서관

서지주요정보
Sequential decision-making with rotting rewards and infinitely many actions = 감소하는 보상 및 무한히 많은 액션에서의 순차적 의사 결정
서명 / 저자	Sequential decision-making with rotting rewards and infinitely many actions = 감소하는 보상 및 무한히 많은 액션에서의 순차적 의사 결정 / Jung-hun Kim.
발행사항	[대전 : 한국과학기술원, 2023].
Online Access	원문보기 원문인쇄

소장정보

등록번호

8041482

소장위치/청구기호

학술문화관(도서관)2층 학위논문

DIE 23011

휴대폰 전송

도서상태

이용가능(대출불가)

사유안내

반납예정일

리뷰정보

초록정보

In this thesis, we study the infinitely-many armed bandit problem in rotting rewards where the mean reward of an arm may decrease at each arm pull and, otherwise, it remains unchanged. We first study a simple model where initial mean rewards are generated from a uniform distribution and there is a rotting rate constraint with maximum rotting rate $\varrho=o(1)$. We first provide a regret lower bound of this problem. Then we propose an efficient algorithm using UCB and a threshold for detecting sub-optimal arms achieving a near-optimal regret bound. We then study a more generalized model where initial mean rewards follow a power function class of distributions with exponent parameter $\beta > 0$. Also, for rotting rewards, we study two cases, one under which the cumulative amount of rotting is $V_T$ and the other under which the number of rotting instances is $S_T$ over a time horizon of $T$ time steps. We first provide regret lower bounds for both slow rotting with $V_T=o(T)$ and abrupt rotting with $S_T=o(T)$ scenarios. Then we propose an adaptive window-UCB algorithm for controlling the bias-variance trade-off from the rotting rewards along with a generalized threshold value for detecting suboptimal arms. The proposed algorithm achieves near-optimal regret bounds for both scenarios under some conditions.

본 학위논문에서는 보상이 감소하는 환경 및 무한히 많은 액션의 밴딧 문제를 연구합니다. 여기서 선택된 액션의 평균 보상은 감소할 수 있지만, 그렇지 않으면 변하지 않습니다. 첫 번째로, 초기 평균 보상이 균등 분포에서 생성되고 최대 감소 비율이 $\varrho=o(1)$로서 감소 비율에 제약 조건이 있는 단순한 모델을 연구합니다. 먼저, 이 문제의 후회 하한 값을 제시합니다. 그런 다음 UCB와 임계 값을 사용하여 최선이 아닌 액션을 탐지하는 효율적이고 거의 최적의 후회 경계를 달성하는 알고리즘을 제안합니다. 그 다음으로, 초기 평균 보상이 지수 매개 변수 $\beta( > 0)$를 가진 거듭제곱 함수 분포 클래스를 따르는 보다 일반화된 모델을 연구합니다. 또한, 감소하는 보상에 대해 두 가지 경우를 고려합니다. 하나는 시간 단계 $T$ 동안 누적 감소양이 $V_T$이고 다른 하나는 감소 횟수가 $S_T$인 경우입니다. 우리는 천천히 감소하는 경우인 $V_T=o(T)$와 갑작스럽게 감소하는 경우인 $S_T=o(T)$, 모두에 대한 후회 하한 값을 먼저 제시합니다. 그런 다음 감소하는 보상에서 편향-분산 균형을 조절하기 위한 적응형 창-UCB 알고리즘과 최선이 아닌 액션을 탐지하기 위한 일반화된 임계 값을 제안합니다. 제안된 알고리즘은 일부 조건 하에서 두 경우 모두에 대해 거의 최적의 후회 경계를 달성합니다.

서지기타정보

서지기타정보
청구기호	{DIE 23011
형태사항	ii, 70 p. : 삽도 ; 30 cm
언어	영어
일반주기	저자명의 한글표기 : 김정훈 지도교수의 영문표기 : Se-Young Yun 지도교수의 한글표기 : 윤세영 수록잡지명 : "Rotting Infinitely Many-Armed Bandits". Proceedings of the 39th International Conference on Machine Learning, PMLR, PMLR 162, 11229-11254(2022) Including appendix
학위논문	학위논문(박사) - 한국과학기술원 : 산업및시스템공학과,
서지주기	References : p. 66-68
주제	Sequential decision making Bandit algorithms Rotting rewards Infinitely many arms 순차적 의사 결정 밴딧 알고리즘 감소하는 보상 무한히 많은 액션

QR CODE

책소개

전체보기

나의 도서관정보

메뉴

소장정보

리뷰정보

초록정보

서지기타정보

책소개

목차

이 주제의 인기대출도서