As demand for 3D graphics grows, hardware accelerators dedicated to 3D graphics algorithms have become popular. Commodity graphics processors now render hundreds of millions of pixels per second, but further improvement is still required to render photo-realistic scenes at real-time rates. The factor limiting the growth of 3D graphics rendering performance is memory bandwidth, because the 3D graphics pipeline exhibits little temporal locality, which makes caches ineffective.
In this paper, an approach to increase the efficiency of memory bandwidth is proposed. In DRAM-based memory, precharge time and row-activation time are much longer than the data cycle time. Address scheduling groups addresses that fall in the same row, reducing DRAM turn-around cycles. Scheduling policies adequate for graphics systems are suggested and analyzed in this paper.
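The row-grouping idea can be sketched with a minimal scheduler model. The queue-scan policy and the timing constants below are illustrative assumptions, not values taken from the paper:

```python
# Minimal model of DRAM address scheduling: among pending requests,
# prefer one that hits the currently open row of its bank, avoiding
# the precharge + activate penalty. Timing numbers are illustrative.
T_DATA, T_PRECHARGE, T_ACTIVATE = 1, 3, 3

def schedule(pending, open_rows):
    """Pick the next request; prefer a row hit in any bank."""
    for i, (bank, row, col) in enumerate(pending):
        if open_rows.get(bank) == row:          # row hit: data cycle only
            return pending.pop(i), T_DATA
    bank, row, col = pending.pop(0)             # row miss: pay full penalty
    return (bank, row, col), T_PRECHARGE + T_ACTIVATE + T_DATA

def run(requests):
    """Total cycles to drain a request list (bank, row, col)."""
    pending, open_rows, cycles = list(requests), {}, 0
    while pending:
        (bank, row, col), cost = schedule(pending, open_rows)
        open_rows[bank] = row
        cycles += cost
    return cycles

# Two requests to row 5 are served back to back, paying one
# activation instead of two: 7 + 1 + 7 = 15 cycles instead of 21.
reqs = [(0, 5, 0), (0, 9, 0), (0, 5, 1)]
print(run(reqs))  # 15
```

A first-come-first-served controller would pay the full row-miss penalty three times on this pattern; reordering to chunk same-row addresses is what raises the effective bandwidth.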
An out-of-order fragment processing architecture with out-of-order memory access is also proposed. In a graphics system, the fragment whose data is ready first can be processed first, because dependencies between fragments are rare. Out-of-order fragment processing does not force a fragment whose data has all been fetched to wait until earlier fragments have been processed. This scheme requires no reorder buffer and reduces the number of in-flight fragments that must be stored to avoid pipeline stalls.
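The benefit can be illustrated with a toy completion-time model (a sketch under the assumption of one fragment processed per cycle; the arrival times are made up for illustration):

```python
# In-order: each fragment must wait behind all earlier fragments,
# so one slow memory fetch stalls everything after it.
def inorder_finish(ready):
    """ready[i]: cycle when fragment i's data arrives."""
    t, out = 0, []
    for r in ready:
        t = max(t, r) + 1      # wait for data, then one cycle to process
        out.append(t)
    return out

# Out-of-order: a fragment is processed as soon as its data arrives
# and the processor is free; no reorder buffer is kept, since
# fragments rarely depend on one another.
def ooo_finish(ready):
    t, out = 0, [0] * len(ready)
    for i in sorted(range(len(ready)), key=lambda i: ready[i]):
        t = max(t, ready[i]) + 1
        out[i] = t
    return out

ready = [10, 2, 3, 4]          # fragment 0's data is slow to arrive
print(inorder_finish(ready))   # [11, 12, 13, 14]: all stall behind 0
print(ooo_finish(ready))       # [11, 3, 4, 5]: only fragment 0 waits
```

In the in-order case, fragments 1 to 3 sit in flight behind fragment 0 even though their data is ready, which is exactly the buffering the proposed scheme avoids.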
Additionally, a dedicated texel fetch unit architecture is suggested in this paper. Pixels that are close in screen space also have very high locality in texture space owing to the mip-map structure. Texel size can be kept close to pixel size by selecting an adequate level of detail, and some texture filtering algorithms need many texels adjacent to the footprint center. Therefore, the same texel may be needed by several neighboring pixels. Fetching redundant texel requests at once reduces cache accesses and makes a one-port cache operate like a multi-port cache. The texel fetch unit combines arriving requests with pending requests if their addresses are equal. The texel cache must not stall requests after a miss; this is made possible by inserting a request queue between the tag array and the cache, which also gives a cache-prefetching effect.
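The request-combining behavior can be sketched as follows. The class and its queue structure are a simplified assumption for illustration, not the paper's exact microarchitecture:

```python
# Sketch of request combining in the texel fetch unit: a request whose
# texel address matches a pending request is merged instead of being
# issued again, so one cache port serves several pixels per access.
from collections import OrderedDict

class TexelFetchUnit:
    def __init__(self):
        self.pending = OrderedDict()   # texel address -> requesting pixels
        self.cache_accesses = 0

    def request(self, pixel_id, address):
        if address in self.pending:        # combine with a pending request
            self.pending[address].append(pixel_id)
        else:                              # new request enters the queue
            self.pending[address] = [pixel_id]

    def serve_one(self):
        """One cache access satisfies every combined requester."""
        address, pixels = self.pending.popitem(last=False)
        self.cache_accesses += 1
        return address, pixels

unit = TexelFetchUnit()
for pixel, addr in [(0, 0x40), (1, 0x40), (2, 0x44), (3, 0x40)]:
    unit.request(pixel, addr)
addr, pixels = unit.serve_one()
# addr == 0x40, pixels == [0, 1, 3]: three pixels, one cache access
```

Because requests wait in this queue rather than at the cache port, a miss does not block later requests, which is the non-stalling, prefetch-like behavior described above.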
An architecture using these schemes is proposed in this paper. Four pixel processors take charge of four pixel pipelines. Each pixel processor is responsible for its own pixel region and renders only the pixels in that region, interleaved at a fine granularity for load balancing. The frame buffer is divided into four memory modules in the same way, so that each pixel processor accesses one memory module. Similarly, there are eight texel fetch units. Every memory access from the pixel processors and texel fetch units is sent to the memory controllers, which select the best operations to maximize effective memory bandwidth by looking up the SDRAM state.
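As a concrete (hypothetical) example of such fine-grained interleaving, a 2x2 tiling assigns each of the four processors one position in every 2x2 pixel block; the paper's exact pattern is not reproduced here:

```python
# Hypothetical 2x2 interleaving of pixels over four pixel processors.
# Each processor, and the frame-buffer module behind it, owns one
# position in every 2x2 block of the screen, so any small region of
# work is spread across all four processors for load balancing.
def pixel_processor(x, y):
    return (x & 1) | ((y & 1) << 1)

# Any 2x2 block of the screen touches all four processors:
block = {pixel_processor(x, y) for x in (10, 11) for y in (20, 21)}
print(block)  # {0, 1, 2, 3}
```

Fine granularity matters here: large contiguous regions per processor would leave some processors idle when primitives cluster in one part of the screen.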
Simulation shows that the effective DDR SDRAM memory bandwidth reaches 85% of peak bandwidth, yielding a fill rate of 640 Mpixels/sec. The dedicated texel fetch unit saves memory bandwidth and delivers 5.4 texels per cycle to each pixel processor on average, making 3 Gtexels/sec available.