Korea Advanced Institute of Science and Technology (KAIST) Library

Main bibliographic information
Energy-efficient embedded media application processor = 에너지 효율적인 임베디드 미디어 어플리케이션 프로세서
Title / Author	Energy-efficient embedded media application processor = 에너지 효율적인 임베디드 미디어 어플리케이션 프로세서 / Hyo-Eun Kim.
Publication	[Daejeon : Korea Advanced Institute of Science and Technology, 2013].

Holdings
Registration no.	8024650
Location / Call number	Academic Cultural Complex (학술문화관), preservation stacks / DEE 13032
Status	Available (not for loan)

Abstract

The application processor (AP) is the main chip in today’s handheld devices such as smartphones, tablet PCs, and portable media players. Unlike a PC-based system, an embedded environment imposes three main limitations: battery (power), resource (area), and bandwidth. Because an AP is a heterogeneous many-core platform that integrates various functional IPs on a single silicon die, power dissipation, implementation area, and memory bandwidth must be carefully considered in its design. Among the heterogeneous functional IPs on an AP, this work focuses on programmable accelerators for multimedia applications, such as the GPU, ISP, and DSP, which handle the intensive multimedia workloads. Multimedia applications can be classified into two types: 2-dimensional (2D) image analysis applications such as image processing or computer vision, and 3D image synthesis applications such as 3D graphics. A system that accommodates both 2D image analysis and 3D image synthesis can process more complex multimedia content, such as augmented reality (AR), 3D display, and 3D reconstruction, on the same hardware platform. Media content processing includes both data-intensive memory operations and compute-intensive non-memory operations. To support various media applications on a single mobile platform, both kinds of operation must be supported in an energy-efficient way. In this dissertation, a heterogeneous multimedia processor (media application processor; MAP) is presented for media content processing on mobile devices. It includes reconfigurable hardware components, namely a data transceiver with reconfigurable output drivers, a multi-purpose micro-operation cache, and mode-configurable parallel processing cores, for general-purpose media content processing in a battery-limited embedded environment.
The data transceiver and the multi-purpose micro-operation cache support data-intensive memory operations, while the mode-configurable parallel processing cores support compute-intensive non-memory operations. [For data-intensive memory operations] The transceiver unit enables high-speed data communication between the implemented MAP and an external memory, achieving an 8× to 16× bandwidth improvement. In particular, the output driver in the transceiver reconfigures its driving strength according to physical channel loss, reducing average power consumption by 32%. The micro-operation cache supports not only data buffering, which exploits spatial coherency between on-chip data elements, but also various simple arithmetic operations frequently used in media processing, relieving the workload of the on-chip processing units. It includes a two-level memory hierarchy for energy-efficient memory operations, and every supported micro-operation can be executed on the fly with a single instruction. Furthermore, an additional delay reduction scheme (adaptive texture block selection) is proposed to reduce external memory access latency in image synthesis applications; as a result, up to 80% energy reduction is achieved. [For compute-intensive non-memory operations] In the implemented MAP, two types of parallel processing cores, with fixed-point and floating-point data paths, support compute-intensive ALU operations. The mode-configurable fixed-point parallel processing cores improve operational flexibility by dynamically switching between image analysis and image synthesis modes, and the homogeneous floating-point parallel stream cores accelerate geometry operations, which require high precision. The fixed-point parallel processing cores are the basic and most frequently used programmable core cluster in the MAP, exploiting instruction-, data-, and task- (thread-) level parallelism.
The MAP is therefore normally built on a many-core platform with N cores (N-wide MIMD; multiple instruction, multiple data), each of which has M processing elements (M-wide SIMD; single instruction, multiple data). Since the parallel processing cores generally consume more than 50% of total power, a many-core power management technique such as dynamic voltage frequency scaling (DVFS) is required for an energy-efficient MAP implementation. This becomes more important in today’s deep sub-micron CMOS processes (65nm, 45nm, 32nm, and beyond), because leakage power increasingly dominates dynamic power as process technology scales. In this dissertation, a many-core power management technique is also proposed to improve system efficiency in terms of energy or energy-delay product (EDP), depending on the target application. The proposed technique is based on the coordinate descent algorithm, an optimization technique used in machine learning and pattern classification. Compared with finding the optimal configuration, it converges about 10^6× faster (in tens of microseconds) with near-optimal errors of 7.3% in energy and 9.6% in EDP. It is implemented with 38.9k synthesized logic gates and consumes only 6.3mW, so real-time many-core power management is possible with negligible overheads in implementation area (2-4%) and power (2-3%). The fixed-point and floating-point parallel processing cores improve processing throughput by fully utilizing this computational flexibility, so the entire MAP achieves a 2.5× higher frame rate than the state-of-the-art media processor in augmented reality (AR), which requires image analysis and synthesis operations at the same time. The proposed MAP is fabricated in a 130nm low-power CMOS process within a 4mm × 4mm die. It includes 1.46M synthesized logic gates and a custom-designed transceiver, and it consumes 275mW at a 200MHz operating frequency in full operation.
The proposed coordinate descent based many-core power management technique is verified in a 65nm low-power CMOS process (about 30% leakage power) and is also evaluated for more advanced CMOS technologies in which leakage accounts for more than 50% of power dissipation.
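
The coordinate descent idea behind the power management technique can be illustrated with a small sketch. This is not the dissertation's hardware algorithm: the operating points in `LEVELS`, the constants `C_EFF` and `K_LEAK`, and the cost model (dynamic energy proportional to V² per cycle, leakage proportional to V over the frame time, EDP taken as energy × delay) are all assumptions chosen for illustration. The sketch adjusts one core's voltage/frequency level at a time while holding the others fixed, and compares the result against a brute-force search.

```python
from itertools import product

# Hypothetical (voltage [V], frequency [MHz]) operating points, lowest to highest.
LEVELS = [(0.8, 50.0), (0.9, 100.0), (1.0, 150.0), (1.2, 200.0)]
C_EFF = 1.0e-3   # assumed dynamic-energy factor per cycle per V^2 (arbitrary units)
K_LEAK = 2.0e-2  # assumed leakage-power factor per volt per core (arbitrary units)


def cost(config, workloads, use_edp=False):
    """Energy (or energy-delay product) of one frame for a per-core level assignment."""
    times = [w / LEVELS[k][1] for k, w in zip(config, workloads)]
    delay = max(times)  # the frame finishes when the slowest core finishes
    dynamic = sum(C_EFF * LEVELS[k][0] ** 2 * w for k, w in zip(config, workloads))
    leakage = sum(K_LEAK * LEVELS[k][0] * delay for k in config)  # every core leaks all frame
    energy = dynamic + leakage
    return energy * delay if use_edp else energy


def coordinate_descent(workloads, use_edp=False, max_sweeps=20):
    """Optimize one core's level at a time while holding all other cores fixed."""
    config = [len(LEVELS) - 1] * len(workloads)  # start from the fastest setting
    best = cost(config, workloads, use_edp)
    for _ in range(max_sweeps):
        improved = False
        for core in range(len(workloads)):
            for level in range(len(LEVELS)):
                trial = config[:core] + [level] + config[core + 1:]
                c = cost(trial, workloads, use_edp)
                if c < best:  # accept only strict improvements
                    config, best, improved = trial, c, True
        if not improved:
            break  # a full sweep changed nothing: local optimum reached
    return config, best


def exhaustive(workloads, use_edp=False):
    """Brute-force reference; evaluates len(LEVELS) ** len(workloads) configurations."""
    best_cfg = min(product(range(len(LEVELS)), repeat=len(workloads)),
                   key=lambda cfg: cost(cfg, workloads, use_edp))
    return list(best_cfg), cost(best_cfg, workloads, use_edp)
```

Calling `coordinate_descent([4.0, 8.0, 2.0, 6.0], use_edp=True)` returns a per-core level assignment and its cost; `exhaustive` gives the reference optimum for the same toy model, which is only practical for a handful of cores. A real controller would also have to respect a frame deadline, which this sketch omits.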

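With the recent spread of smart mobile devices such as smartphones and tablet PCs, it has become common to carry out a wide range of tasks on mobile devices as on a PC, and the design of the application processor (AP), the core chip of a mobile device, has become increasingly important. Unlike the PC environment, the mobile environment imposes various design constraints such as power consumption, silicon area, and memory bandwidth, so these constraints must be carefully considered when designing a heterogeneous many-core processor (the typical structure of an application processor) that integrates multiple hardware intellectual property (IP) blocks on a single silicon chip. Among the many heterogeneous IPs integrated on an application processor, this work addresses the design of a programmable accelerator dedicated to multimedia operations (media application processor; MAP). The various multimedia accelerators integrated on an application processor, such as the graphics processing unit (GPU), image signal processor (ISP), and digital signal processor (DSP), must process vast amounts of image data in real time, so they require efficient hardware design based on the characteristics of the operations to be performed. Multimedia applications can be divided into 2D image analysis applications and 3D image synthesis applications. 2D image analysis applications are those that perform various analysis operations on 2D image data, such as image processing or computer vision; 3D image synthesis applications are those that perform numerous mathematical and logic operations to synthesize onto the screen objects that exist in a virtual 3D space, such as 3D graphics. Hardware that can accelerate both 2D image analysis and 3D image synthesis applications allows more complex multimedia operations, such as augmented reality, 3D display, and 3D reconstruction, to be performed on the same hardware, so heterogeneous many-core processor design is a prerequisite for building a multi-function application processor. In general, multimedia operations can be divided into memory operations that must handle vast amounts of data (data-intensive memory operations) and non-memory operations that require parallel computation (compute-intensive non-memory operations). Therefore, to accelerate various kinds of multimedia applications on a single hardware platform in a constrained mobile environment, both memory and non-memory operations must be supported in an energy-efficient way. This work addresses the design of a heterogeneous many-core processor that can support multiple kinds of multimedia content on a single mobile device. The proposed media application processor includes multiple hardware IPs designed with operational flexibility in mind. First, a data transceiver is integrated to improve communication bandwidth with off-chip memory; it includes a variable output driver that can adjust its driving power according to the environment, reducing average power consumption. Next, a micro-operation cache is included to support the memory operations of multimedia applications energy-efficiently; it supports the various micro-operations frequently performed in 2D image analysis and 3D image synthesis applications.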
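Finally, a reconfigurable parallel processing core that can change its operation mode according to the image analysis or synthesis application is proposed, so that various kinds of media content can be processed energy-efficiently. In particular, because the parallel cores account for about 50% or more of the power consumption of a multimedia processor, a power management technique for the parallel cores is needed. Therefore, a power management technique is proposed that can find, in real time, an energy-efficient core operating condition (supply voltage and operating frequency) even in today's semiconductor processes with severe leakage power, and dynamic voltage frequency scaling (DVFS) control hardware was designed separately so that it can be easily applied to energy-efficient media application processors implemented in the future. [Design for efficient memory operations] The proposed off-chip data transceiver circuit enables high-speed data transfer between the implemented processor and external memory, improving memory bandwidth by 8 to 16 times over existing mobile multimedia processors. In particular, the variable output driver, which can change its driving power according to the environment, saves 32% of the power needed for data transmission and reception. The proposed micro-operation cache maximizes temporal/spatial coherency among on-chip data and is also designed to support a number of micro-operations frequently used in multimedia processing (image filtering, min-max computation), relieving the workload of the cores. It includes a two-level memory hierarchy to handle memory operations energy-efficiently, and every supported micro-operation can be driven by a single instruction. In addition, to efficiently support texturing, a major cause of performance degradation in 3D image synthesis applications, an adaptive texture block selection method that improves external memory access efficiency is proposed, yielding up to 80% energy reduction. [Design for efficient non-memory operations] The proposed media application processor includes parallel processing cores with fixed-point arithmetic units and parallel processing cores with floating-point arithmetic units to accelerate vast amounts of parallel computation. The fixed-point parallel processing cores are designed so that the hardware structure can be changed between 2D image analysis and 3D image synthesis applications, improving operational flexibility. The floating-point parallel processing cores were separately designed and integrated to efficiently support geometry operations in 3D space, which require high precision, and thereby improve the quality of the synthesized image. Because the fixed-point parallel processing cores are critical hardware that must pursue high performance and energy efficiency by fully exploiting instruction-, data-, and task-level parallelism, multimedia acceleration processors are generally designed on a many-core platform that integrates multiple parallel processing cores. That is, each of the N cores for task parallelism contains M processing elements (PEs) for data parallelism, and each PE is designed on a Very Long Instruction Word (VLIW) architecture for instruction parallelism. Because the fixed-point parallel processing cores generally require 50% or more of the power consumed by a multimedia processor, a many-core power management technique is needed that can raise efficiency in terms of energy or energy-delay product (EDP).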
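The many-core power management technique proposed in this work is based on the coordinate descent algorithm, one of the optimization algorithms used in machine learning, and incurs average errors of 7.3% in energy and 9.6% in EDP relative to cores operating at the optimal condition. Finding the optimal core operating condition is algorithmically simple, but actually finding it takes a long computation time of tens of seconds or more, and because this time grows exponentially with the number of cores, it cannot be used as a real-time power management technique for many-core processors. The proposed coordinate descent based technique incurs an average error of less than 10% but needs only about 1/10^6 of the computation time of the algorithm that finds the optimal condition; moreover, its computation time grows linearly with the number of cores, making it suitable as a real-time power management technique for many-core processors. Furthermore, considering the average implementation area of previously proposed mobile multimedia processors, it can be implemented with only about 2-4% additional hardware resources (38.9×10^3 synthesized logic gates) and 6.3mW of additional power consumption, so it is easy to integrate into today's many-core multimedia processors. The two types of parallel processing cores and the micro-operation cache described above increase the processor's computational flexibility, and for augmented reality applications, which require both 2D image analysis and 3D image synthesis, they achieve up to 2.5 times higher performance (frame rate) than a previously proposed dedicated augmented reality processor. The proposed media application processor is implemented in a 130nm low-power CMOS process with a silicon area of 4mm×4mm. It includes a total of 1.46×10^6 synthesized logic gates and an analog data transceiver, and consumes 275mW at an operating frequency of 200MHz. The proposed many-core power management technique was implemented and verified in a 65nm CMOS process with about 30% leakage power, and, to assess its usefulness in newer processes below 65nm, it also includes verification assuming 30-50% and more than 50% leakage power.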
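
The exponential-versus-linear scaling described above can be made concrete with a rough count of candidate configurations. The symbols K (voltage/frequency levels available to each core) and S (number of coordinate descent sweeps) and the example values are assumptions for illustration only; N is the number of cores, as in the abstract. The count is linear in N only as long as S stays bounded.

```latex
\begin{align*}
\text{exhaustive search:} \quad & K^{N} \text{ evaluations} \\
\text{coordinate descent:} \quad & S \cdot N \cdot K \text{ evaluations} \\
\text{e.g. } K = 4,\ N = 16,\ S = 3: \quad & 4^{16} \approx 4.3 \times 10^{9}
  \quad \text{vs.} \quad 3 \cdot 16 \cdot 4 = 192
\end{align*}
```

These figures are not the dissertation's measurements; they only show why a per-core search can remain tractable as the core count grows.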

Other bibliographic information

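Call number	DEE 13032
Physical description	viii, 133 p. : ill. ; 30 cm
Language	English
General note	Author's name in Korean : 김효은 ; Advisor's name in English : Lee-Sup Kim ; Advisor's name in Korean : 김이섭 ; Published in : "A Reconfigurable Heterogeneous Multimedia Processor for IC-Stacking on Si-Interposer". Transactions on Circuits and Systems for Video Technology, v.22, no.4, pp.589-604 (2012)
Thesis	Thesis (Ph.D.) - Korea Advanced Institute of Science and Technology : Department of Electrical Engineering,
Bibliography note	References : p. 119-122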
