runtime, on a Windows XP platform (Linux version of
BrookGPU has an approx. 20% worse performance).
The graphics board used was a GeForce 7800 GTX
(G70 engine) connected to the host machine through
a PCI-Express slot on a nForce4 chipset motherboard.
The driver version was 76.67 (updated in June 2005).
Table 1 shows the times required for computing the
FFT on the GPU for several problem sizes. Data up-
load and download times are isolated from compu-
tation times, and penalize GPU use as a coprocessor
to the CPU for computing FFT. But for some appli-
cations, like medical imaging on precaptured data or
direct video manipulation on GPU, they are not re-
quired.
Computation times include the time consumed by
stream copies between kernels, but those times cannot
be broken down. The performance obtained is simi-
larly independent of the organization of the data. The
time to perform 1D FFT of size N
2
, doubles that of
performing N 1D FFTs of size N because twice the
stages must be performed. The difference between 1D
N
2
and 2D N × N is due to the transpositions.
These times, for the problems of bigger sizes are
3.4 times faster (58 vs 196 ms for 1024x1024 complex
size) than the FFTW library run on an AMD 3500+
CPU with 512KB of L1 cache.
6.1 Conclusion
Our implementation has a remarkable performance
and within the range of the fastest implementation
on GPU. Additionally a comprehensive framework in
which FFT variants and the stream model meet has
been discussed. The proposal and development of
new implementations can now be undertaken along
the lines suggested in this document.
For the future, we shall implement 2D transforms
that collapse the indices in two dimensions at a time.
We think these methods could be especially valuable
when working on GPUs that allow multiple outputs
per kernel.
ACKNOWLEDGEMENTS
This work has been partially supported by “Programa
Nacional de Dise
˜
no y Producci
´
on Industrial” (Project
DPI 2003-09726) of the “Ministerio de Ciencia y Tec-
nolog
´
ıa” of the spanish government, and by “Euro-
pean Regional Development Fund” (ERDF).
Jos
´
e G. Marichal-Hern
´
andez has awarded a grant
by “Becas de investigaci
´
on para doctorandos. Conve-
nio ULL–CajaCanarias 2005”.
The authors would like to thank Terry Mahoney for
a critical reading of the original version of the paper.
REFERENCES
(n.a.). General purpose computation using graphics hard-
ware. Developer’s forum. http://www.gpgpu.com/.
Buck, I. (2004). Brook specification v.0.2. Tech. Rep.
CSTR 2003-04 10/31/03 12/5/03, Stanford University.
Buck, I., Foley, T., Horn, D., Sugerman, J., Fatahalian, K.,
Houston, M., and Hanrahan, P. (2004). Brook for
GPUs: stream computing on graphics hardware. ACM
Trans. Graph., 23(3):777–786.
Cooley, J. W. and Tukey, J. W. (1965). An algorithm for the
machine calculation of complex fourier series. Math-
ematics of Computation, 19:297–301.
Dally, W. J., Hanrahan, P., Erez, M., Knight, T. J., and alter
(2003). Merrimac: Supercomputing with streams. In
SC’03, Phoenix, Arizona.
Frigo, M. and Johnson, S. (2005). The design and im-
plementation of FFTW3. In Proc. of the IEEE, vol-
ume 93, pages 216– 231. http://www.fftw.org.
Jansen, T., von Rymon-Lipinski, B., Hanssen, N., and
Keeve, E. (2004). Fourier volume rendering on the
GPU using a Split-Stream-FFT. In Proc. of the
VMV’04, pages 395–403. IOS Press BV.
Loan, C. V. (1992). Computational frameworks for the fast
Fourier transform. Society for Industrial and Applied
Mathematics, Philadelphia, PA, USA.
Moreland, K. and Angel, E. (2003). The FFT on a GPU. In
Proc. of the ACM SIGGRAPH, pages 112–119. Euro-
graphics Association.
Nussbaumer, H. J. (1982). Fast Fourier Transform and Con-
volution Algorithms. Springer-Verlag, second edition.
Pease, M. C. (1968). An adaptation of the fast fourier trans-
form for parallel processing. J. ACM, 15(2):252–264.
P
¨
uschel, M. and et al., J. M. F. M. (2005). SPIRAL: Code
generation for DSP transforms. Proc. of the IEEE,
93(2).
Schiwietz, T. and Westermann, R. (2004). GPU-PIV. In
Proc. of the VMV’04, pages 151–158. IOS Press BV.
Stockham, T. (1966). High speed convolution and correla-
tion. In AFIPS Proceedings, volume 28, pages 229–
233. Spring Joint Computer Conference.
Swarztrauber, P. N. (1987). Multiprocessor FFTs. Parallel
computing, 5(1–2):197–210.
Viola, I., Kanitsar, A., and Gr
¨
oller, M. E. (2004). Gpu-
based frequency domain volume rendering. In Proc.
of SCCG 2004, pages 49–58.
Wloka, M. M. (2003). Implementing a GPU-efficient FFT.
SIGGRAPH course slides.
VISAPP 2006 - IMAGE FORMATION AND PROCESSING
86