The discrete cosine transform (DCT) is one of the most widely used transforms in image processing. In particular, data compression techniques such as JPEG and MPEG require fast DCT execution for real-time processing. For this purpose, DCT-specific processors are often preferred, and several have been reported. However, the trend is to use a single chip for an entire application. It is therefore very useful for a general-purpose DSP processor to be able to compute the DCT quickly.
To this end, this paper proposes a new butterfly structure. The structure is obtained by a simple decomposition of the original structure; after decomposition, it takes the form of repetitions of an elementary structure. Efficient hardware implementation methods that automatically generate the addresses between two stages are also proposed. A hardware block that executes both the elementary block and the address-generation part is then described. In addition, some special instructions are proposed to drive this block efficiently.
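To make the notions of a butterfly stage and of stage-wise address generation concrete, the following minimal sketch shows the textbook first stage of a fast DCT-II. It is not the structure proposed in this paper. The function names `dct2_naive`, `butterfly_addresses`, and `butterfly_stage` are hypothetical and chosen only for illustration, and the transform is assumed to be unnormalized. Each butterfly pairs sample `i` with sample `N-1-i`, so the operand addresses follow from the stage length alone.

```python
import math


def dct2_naive(x):
    """Unnormalized DCT-II by direct O(N^2) summation (reference only)."""
    n = len(x)
    return [
        sum(x[i] * math.cos(math.pi * (2 * i + 1) * k / (2 * n)) for i in range(n))
        for k in range(n)
    ]


def butterfly_addresses(n):
    """Yield the (upper, lower) operand addresses of each butterfly in one stage.

    Assumption: the textbook pairing i <-> n-1-i, not the paper's scheme.
    """
    for i in range(n // 2):
        yield i, n - 1 - i


def butterfly_stage(x):
    """Apply one stage of add/subtract butterflies to the sequence x."""
    sums, diffs = [], []
    for a, b in butterfly_addresses(len(x)):
        sums.append(x[a] + x[b])   # feeds the even-indexed coefficients
        diffs.append(x[a] - x[b])  # feeds the odd-indexed coefficients
    return sums, diffs


if __name__ == "__main__":
    x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
    sums, diffs = butterfly_stage(x)
    full = dct2_naive(x)
    half = dct2_naive(sums)
    # The even-indexed N-point coefficients equal the N/2-point DCT of the sums.
    print(all(math.isclose(full[2 * k], half[k], abs_tol=1e-9) for k in range(4)))
```

Under these assumptions, the even-indexed coefficients of the N-point transform coincide with the N/2-point transform of the butterfly sums, which is why the same elementary operation can be repeated on progressively shorter sequences.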