Bibliographic Information
Title / Author: Hierarchical successor representation learning for multi-goal path planning = 위계적 승계 표상 학습을 통한 멀티골 과제에서의 최적 경로 설정 / Nayeong Jeong.
Publication: [Daejeon : Korea Advanced Institute of Science and Technology (KAIST), 2025].

Holdings

Registration number: 8044006
Location / Call number: Academic Cultural Complex (Library), 2F, Theses / MBCS 25001
Status: Available (not for loan)

Abstract

Developing reinforcement learning algorithms capable of rapid adaptation and generalization across diverse tasks, as humans and animals do, remains a key challenge. We examined a new approach that could enable this flexibility in multi-goal scenarios, which pose particular difficulties due to the policy dependency of the successor representation (SR) model. In this study, the hierarchical successor representation (HSR) model addresses multi-goal tasks by using option-level predictive maps based on the subgoal configuration. In particular, it calculates option-level distances and values from a single unbiased SR map to derive an optimal option-level trajectory and construct the corresponding option-level SR map. The model facilitates rapid learning of optimal paths in multi-goal tasks by leveraging the option-level representations for two-level navigation. Our method achieves significantly higher total rewards and fewer steps than previous approaches, selecting subgoals on the way to the target state while avoiding obstacles. This study highlights the potential of combining hierarchical learning with scalable SR maps to improve task generalization in multi-goal environments, contributing to the development of human-like reinforcement learning mechanisms.
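As a concrete illustration of the closed-form SR map the abstract refers to, here is a minimal Python sketch, assuming a tabular environment with a known transition matrix under a uniform random walk; the function names and the distance heuristic are illustrative assumptions, not the thesis implementation.

```python
import numpy as np

def successor_representation(T, gamma=0.9):
    """Closed-form SR for a fixed policy: M = (I - gamma * T)^(-1),
    where T is the state-to-state transition matrix under that policy."""
    n = T.shape[0]
    return np.linalg.inv(np.eye(n) - gamma * T)

def option_distance(M, s, g, gamma=0.9):
    """Heuristic step distance from s to g read off the SR: along a
    near-deterministic path, M[s, g] ~ gamma**d * M[g, g], so
    d ~ log(M[s, g] / M[g, g]) / log(gamma). (An illustrative assumption,
    not necessarily the thesis' exact definition.)"""
    return np.log(M[s, g] / M[g, g]) / np.log(gamma)

# Toy 3-state corridor 0 - 1 - 2 under a uniform random walk.
T = np.array([[0.5, 0.5, 0.0],
              [0.5, 0.0, 0.5],
              [0.0, 0.5, 0.5]])
M = successor_representation(T)
print(option_distance(M, 0, 2))  # roughly the number of steps from 0 to 2
```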


Additional Bibliographic Information
Call number: MBCS 25001
Physical description: v, 36 p. : illustrations ; 30 cm
Language: English
General note: Author's name in Korean: 정나영
Advisor's name in English: Lee, Sang Wan
Advisor's name in Korean: 이상완
Degree: Thesis (Master's) - Korea Advanced Institute of Science and Technology : Department of Brain and Cognitive Sciences, 2025
Bibliographic note: References: p. 32-34
Subjects: Reinforcement Learning
Successor Representation
Predictive Map
Hierarchical Reinforcement Learning
Multi-goal Task
List of Figures

Model diagram for the SR-MB model

Model diagram for the DR model

Task environment

Examples of 4-room maze and 9-room maze without puddles. Visualization of example 4-room mazes (top row) and 9-room mazes (bottom row) with increasing numbers of subgoals. From left to right, the mazes contain 1 to 7 subgoals, where orange represents subgoal positions, blue indicates the starting state, and red marks the main goal state.

Examples of 4-room maze and 9-room maze with puddles. Visualization of example 4-room mazes (top row) and 9-room mazes (bottom row) including puddle states, with increasing numbers of subgoals. From left to right, the mazes contain 1 to 7 subgoals, where orange represents subgoal positions, light sky blue illustrates puddle states, blue indicates the starting state, and red marks the main goal state.

All discount factors across the three models are set to 0.9. QS is the learning rate for the transition probability matrix.

Example subgoal reward values for random seeds 0 to 9. Values are rounded to the second decimal place for clarity. In Experiment 1, subgoal reward values from random seeds 0 to 9 were used. In Experiment 2, subgoal reward values from random seeds 0 to 99 were used.
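Seeded reward values like those in the table above could be reproduced with a sampler along these lines; a minimal sketch, assuming rewards are drawn uniformly from an interval such as [1, 3] (the distribution and the rounding convention are assumptions based on the captions):

```python
import numpy as np

def sample_subgoal_rewards(seed, n_subgoals, low=1.0, high=3.0):
    """Draw one reward per subgoal from [low, high], reproducibly per seed;
    values rounded to two decimals, as in the table above (display only)."""
    rng = np.random.default_rng(seed)
    return np.round(rng.uniform(low, high, size=n_subgoals), 2)

# Experiment 1 style: seeds 0..9, e.g. three subgoals each.
rewards = {seed: sample_subgoal_rewards(seed, 3) for seed in range(10)}
```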

The HSR model diagram for path planning: task environment, option path, value, and distance are determined by the subgoal configuration.

Self-Attention module for calculation of option trajectory value. Example of the Self-Attention module architecture when the task involves three subgoals. Option distance, value, and reward information construct the query and initial key matrices. The matrix multiplication, attention process, and key matrix reconstruction are repeated three times.

Example first-step query, key, and value matrices. Examples of the first-step query, key, and value matrices when the task involves three subgoals.
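A minimal sketch of the repeated attention pass the two captions above describe, assuming standard scaled dot-product attention; the feature encoding of option distance, value, and reward, and the key-reconstruction rule, are assumptions rather than the thesis' exact architecture:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_step(Q, K, V):
    """One scaled dot-product attention pass: softmax(Q K^T / sqrt(d)) V."""
    d = Q.shape[-1]
    return softmax(Q @ K.T / np.sqrt(d)) @ V

# Toy setup for three subgoals: rows encode per-option distance,
# value, and reward features (the exact encoding is an assumption).
rng = np.random.default_rng(0)
Q = rng.normal(size=(3, 4))   # query built from option distance/value/reward
K = rng.normal(size=(3, 4))   # initial key matrix
V = rng.normal(size=(3, 4))   # value matrix

# Repeat attention + key reconstruction once per subgoal, as in the figure.
for _ in range(3):
    out = attention_step(Q, K, V)
    K = out  # illustrative key reconstruction; the thesis' rule may differ
```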

Option-level SR map with three subgoals in the 4-room maze. Example of the option-level SR map for the case of three subgoals and one main goal in the 4-room maze. The SR represents the deterministic policy over option states 0 → 2 → 1 → 4.
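Since the option-level policy in this figure is deterministic (0 → 2 → 1 → 4), the corresponding option-level SR map has a closed form; a minimal sketch, where the absorbing-goal convention and the self-loops on unused states are assumptions:

```python
import numpy as np

# Deterministic option-level policy 0 -> 2 -> 1 -> 4 over five option states
# (main goal 4 treated as absorbing). P is the induced transition matrix.
n, gamma = 5, 0.9
P = np.zeros((n, n))
P[0, 2] = P[2, 1] = P[1, 4] = 1.0
P[3, 3] = P[4, 4] = 1.0  # unused state 3 and terminal state 4 self-loop

M_option = np.linalg.inv(np.eye(n) - gamma * P)  # option-level SR map
```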

Two-level navigation module. For each option-level navigation step, state-level execution happens according to the stored option path.
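A minimal sketch of the two-level execution scheme described above, assuming each option stores a precomputed state-level path (the names and dictionary layout are illustrative, not the thesis code):

```python
def two_level_navigate(option_path, option_to_states):
    """Two-level execution: iterate the option-level path and, for each
    option, replay its stored state-level path."""
    trajectory = []
    for option in option_path:
        trajectory.extend(option_to_states[option])  # state-level execution
    return trajectory

# e.g. option path 0 -> 2 -> 1 -> 4 with precomputed state paths per option
paths = {0: [10, 11], 2: [12, 13, 14], 1: [15], 4: [16, 17]}
print(two_level_navigate([0, 2, 1, 4], paths))
```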

Example of resultant paths of each model in the 4-room maze without puddles. Resultant paths of the (a) HSR, (b) DR, and (c) SR-MB in the 4-room maze without puddles, with 7 subgoals whose reward values were sampled from the interval [1, 3].

Average total rewards and steps for varying number of subgoals in the 4-room maze without puddles

Average total rewards and steps for varying intervals of subgoal rewards in the 4-room maze without puddles. (a) Bar plots of the average total rewards and (b) steps for the sampling interval of subgoal rewards across the three models. Average total rewards: HSR (10.41 ± 1.53, 19.41 ± 3.80, 32.06 ± 5.45, 46.16 ± 6.77, 60.92 ± 7.59), DR (10.21 ± 0.70, 12.95 ± 2.44, 15.95 ± 4.59, 18.95 ± 6.80, 21.95 ±

Model efficiency comparison in the 4-room maze without puddles. (a) Scatter plot of steps vs. total rewards for the three models in the 4-room maze without puddles. (b) Box plot comparing efficiency scores (total rewards divided by steps) across the three models. Median: HSR (0.8110), DR (0.6354), SR-MB (0.0103). IQR: HSR (0.1850), DR (0.1494), SR-MB (0.0060). Statistical significance was observed
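The efficiency score in the box plots is defined in the caption as total rewards divided by steps; a minimal sketch of computing it together with the reported median and IQR (the toy episode numbers below are made up, not the thesis data):

```python
import numpy as np

def efficiency_scores(total_rewards, steps):
    """Per-episode efficiency as defined in the caption: rewards / steps."""
    return np.asarray(total_rewards) / np.asarray(steps)

def median_iqr(scores):
    """Median and interquartile range, matching the reported statistics."""
    q1, med, q3 = np.percentile(scores, [25, 50, 75])
    return med, q3 - q1

scores = efficiency_scores([12.0, 9.5, 14.2], [15, 13, 17])
print(median_iqr(scores))
```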

Example of resultant paths of each model in the 9-room maze without puddles. Resultant paths of the (a) HSR, (b) DR, and (c) SR-MB in the 9-room maze without puddles, with 7 subgoals whose reward values were sampled from the interval [1, 3].

Average total rewards and steps for varying numbers of subgoals in the 9-room maze without puddles.

Average total rewards and steps for varying intervals of subgoal rewards in the 9-room maze without puddles. Bar plots showing the (a) average total rewards and (b) steps for the sampling intervals of subgoal rewards across the three models. Average total rewards: HSR (10.36 ± 1.65, 19.46 ± 3.87, 31.79 ± 5.35, 45.79 ± 6.23, 60.45 ± 7.18), DR (10.32 ± 0.78, 13.12 ± 2.49, 16.27 ± 4.53, 19.35 ± 6.59, 22.47 ± 8

Model efficiency comparison in the 9-room maze without puddles. (a) Scatter plot of steps vs. total rewards for the three models in the 9-room maze without puddles. (b) Box plot comparing efficiency scores (total rewards divided by steps) across the three models. Median: HSR (0.8038), DR (0.6331), SR-MB (0.0111). IQR: HSR (0.1942), DR (0.1517), SR-MB (0.0066). Statistical significance was observed

Example of resultant paths of each model in the 4-room maze with puddles. Resultant paths of the (a) HSR, (b) DR, and (c) SR-MB in the 4-room maze with puddles, with 7 subgoals whose reward values were sampled from the interval [1, 3].

SR-MB model's behavior in front of the puddles. The SR-MB agent's freezing behavior after encountering the puddle states.

SR-MB model's negative-peak behavior.

Total rewards per task index and average steps for varying numbers of subgoals in the 4-room maze with puddles.

Average total rewards and steps for varying intervals of subgoal rewards in the 4-room maze with puddles.

Model efficiency comparison in the 4-room maze with puddles.

Example of resultant paths of each model in the 9-room maze with puddles. Resultant paths of the (a) HSR, (b) DR, and (c) SR-MB in the 9-room maze with puddles, with 7 subgoals whose reward values were sampled from the interval [1, 3].

Total rewards per task index and average steps for varying numbers of subgoals in the 9-room maze with puddles. (a) Line graph of total rewards per task index. The number of subgoals increases from 1 to 7 per 100 tasks (1 to 700 tasks). (b) Bar plot of the average steps for the number of subgoals across the three models. Average steps: HSR (21.26 ± 2.26, 22.12 ± 2.72, 22.58 ± 2.67, 23.44 ± 2.9

Average total rewards and steps for varying intervals of subgoal rewards in the 9-room maze with puddles.

Model efficiency comparison in the 9-room maze with puddles