KAIST Library

Main bibliographic information
Mixture of Experts 모델 서빙에 효율적인 단일 입력 단위 expert 병렬 실행을 위한 스케줄링 기법 = Fine-grained expert parallelism-based scheduling for Mixture-of-Experts model serving
Title / Author	Mixture of Experts 모델 서빙에 효율적인 단일 입력 단위 expert 병렬 실행을 위한 스케줄링 기법 = Fine-grained expert parallelism-based scheduling for Mixture-of-Experts model serving / Sunghwan Shim (심성환).
Publication	[Daejeon : KAIST (Korea Advanced Institute of Science and Technology), 2023].

Holdings
Registration no.	8040840
Location	Academic Cultural Complex (Library), 2nd floor, Theses
Call number	MCS 23025
Status	Available (not for loan)

Abstract

Mixture of Experts (MoE) has proven effective in reducing the high computational cost of neural network scaling. An MoE model incurs significantly less computation than a dense counterpart of the same size because only a subset of the network's weights is activated for each incoming input. However, MoE models still cannot avoid the high device memory requirements that come with model size scaling. To address this, expert parallelism is used: it reduces the memory requirement per device by placing the experts of each MoE layer on different devices. Expert parallelism, however, greatly increases the execution time per request in the online deep learning model serving scenario, where inference requests must be handled in real time. To address this, this thesis proposes single-input granularity expert parallelism and two scheduling ideas that use it effectively: interleaving the tasks of multiple inputs at layer granularity, and prioritizing expert execution tasks received from other GPUs. We demonstrate that the proposed single-input granularity expert parallelism and the two scheduling ideas reduce tail latency by 54.2% on average while maintaining throughput similar to that of the original expert parallelism.

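The Mixture of Experts (MoE) model was proposed to address the high computational cost caused by the growing size of deep learning models. Because only some of the layers are used to execute each input, an MoE model requires significantly less computation than a dense model of the same size. MoE models nevertheless cannot avoid the high GPU memory requirements that come with increasing size, so expert parallelism is used, in which the experts of an MoE layer are distributed across multiple GPUs for execution. However, expert parallelism greatly increases the execution time per request in the online deep learning model serving scenario, where inference requests arrive in real time. To solve this, this thesis proposes single-input granularity expert parallelism and two scheduling ideas that use it effectively: interleaved execution of inputs at layer granularity, and prioritized handling of expert execution tasks delegated by other GPUs. Through experiments on serving workloads of varying intensity, we confirmed that the proposed single-input granularity expert parallelism and scheduling techniques reduce tail latency by 54.2% on average while maintaining throughput similar to that of the existing expert parallelism on all workloads.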
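
To make the terms in the abstract concrete, the following is a minimal illustrative sketch of expert parallelism, not the implementation from the thesis. It assumes a round-robin placement of experts on GPUs and a simple top-k gate; the function names (place_experts, top_k_experts, route) and the example gate scores are hypothetical and chosen only for illustration.

```python
# Illustrative sketch only: round-robin placement and top-k gating are
# assumptions, not the scheme described in the thesis.

def place_experts(num_experts, num_gpus):
    """Expert parallelism: each expert lives on exactly one GPU (e % num_gpus)."""
    return {e: e % num_gpus for e in range(num_experts)}


def top_k_experts(gate_scores, k):
    """Indices of the k highest gate scores (ties go to the lower index)."""
    ranked = sorted(range(len(gate_scores)), key=lambda e: (-gate_scores[e], e))
    return ranked[:k]


def route(inputs_gate_scores, placement, k=1):
    """Group (input_id, expert_id) work items by the GPU that owns the expert."""
    per_gpu = {gpu: [] for gpu in sorted(set(placement.values()))}
    for input_id, scores in enumerate(inputs_gate_scores):
        # Only k of the experts are activated for this input.
        for expert_id in top_k_experts(scores, k):
            per_gpu[placement[expert_id]].append((input_id, expert_id))
    return per_gpu


if __name__ == "__main__":
    placement = place_experts(num_experts=4, num_gpus=2)
    scores = [
        [0.1, 0.7, 0.1, 0.1],  # input 0 prefers expert 1
        [0.6, 0.1, 0.2, 0.1],  # input 1 prefers expert 0
        [0.1, 0.1, 0.1, 0.7],  # input 2 prefers expert 3
    ]
    print(route(scores, placement, k=1))
```

In this sketch each input activates only k experts, and a work item may land on a GPU other than the one that received the request. Such remotely owned work corresponds to the expert execution tasks from other GPUs that the abstract proposes to prioritize.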

Additional bibliographic information
Call number	MCS 23025
Physical description	ii, 22 p. : illustrations ; 30 cm
Language	Korean
General note	Author's name in English: Sunghwan Shim. Advisor's name in Korean: 강지훈. Advisor's name in English: Jeehoon Kang.
Thesis note	Thesis (Master's) - KAIST : School of Computing,
Bibliography note	References: p. 20-22
Subject	Mixture of Experts; Expert Parallelism; Deep Learning Model Serving; Throughput; Tail Latency
