ffbench

Fast Fourier Transform Benchmark

by John Walker
April 1989
Last update: November 28th, 2016

Introduction

Ffbench executes a specified number of passes (default 20) through a loop in which each iteration performs a fast Fourier transform of a square matrix (default size 256×256) of complex numbers (default precision double), followed by the inverse transform. After all loop iterations are performed the results are checked against known correct values.

This benchmark is intended for use on C implementations which define “int” as 32 bits or longer and permit allocation and direct addressing of arrays larger than one megabyte.

If CAPOUT is defined, the result after all iterations is written as a CelLab pattern file. This is intended for debugging in case horribly wrong results are obtained on a given machine.

Archival timings are run with the program's configuration variables set to: Float = double, Asize = 256, and CAPOUT not defined. Passes should be adjusted so the benchmark runs about five minutes and the measured timing normalised to Passes = 20.

Benchmark Results for Various Systems

Representative timings are given below. All have been normalised as if run for 20 iterations.

Time (seconds)	Computer, Compiler, and Notes
2393.93	Sun 3/260, SunOS 3.4, C, “-f68881 -O”. (John Walker).
1928	Macintosh IIx, MPW C 3.0, “-mc68020 -mc68881 -elems881 -m”. (Hugh Hoover).
1636.1	Sun 4/110, “cc -O3 -lm”. (Michael McClary). The suspicion is that this is software floating point.
1556.7	Macintosh II, A/UX, “cc -O -lm” (Michael McClary).
1388.8	Sun 386i/250, SunOS 4.0.1 C “-O /usr/lib/trig.il”. (James Carrington).
1331.93	Sun 3/60, SunOS 4.0.1, C, “-O4 -f68881 /usr/lib/libm.il” (Bob Elman).
1204.0	Apollo Domain DN4000, C, “-cpu 3000 -opt 4”. (Sam Crupi).
1174.66	Compaq 386/25, SCO Xenix 386 C. (Peter Shieh).
1068	Compaq 386/25, SCO Xenix 386, Metaware High C. (Robert Wenig).
1064.0	Sun 3/80, SunOS 4.0.3 Beta C “-O3 -f68881 /usr/lib/libm.il”. (James Carrington).
1061.4	Compaq 386/25, SCO Xenix, High C 1.4. (James Carrington).
1059.79	Compaq 386/25, 387/25, High C 1.4, DOS\|Extender 2.2, 387 inline code generation. (Nathan Bender).
777.14	Compaq 386/25, IIT 3C87-25 (387 Compatible), High C 1.5, DOS\|Extender 2.2, 387 inline code generation. (Nathan Bender).
751	Compaq DeskPro 386/33, High C 1.5 + DOS\|Extender, 387 code generation. (James Carrington).
431.44	Compaq 386/25, Weitek 3167-25, DOS 3.31, High C 1.4, DOS\|Extender, Weitek code generation. (Nathan Bender).
344.9	Compaq 486/25, Metaware High C 1.6, Phar Lap DOS\|Extender, in-line floating point. (Nathan Bender).
324.2	Data General Motorola 88000, 16 Mhz, Gnu C.
323.1	Sun 4/280, C, “-O4”. (Eric Hill).
254	Compaq SystemPro 486/33, High C 1.5 + DOS\|Extender, 387 code generation. (James Carrington).
242.8	Silicon Graphics Personal IRIS, MIPS R2000A, 12.5 Mhz, “-O3” (highest level optimisation). (Mike Zentner).
233.0	Sun SPARCStation 1, C, “-O4”, SunOS 4.0.3. (Nathan Bender).
187.30	DEC PMAX 3100, MIPS 2000 chip. (Robert Wenig).
120.46	Sun SparcStation 2, C, “-O4”, SunOS 4.1.1. (John Walker).
120.21	DEC 3MAX, MIPS 3000, “-O4”.
98.0	Intel i860 experimental environment, OS/2, data caching disabled. (Kern Sibbald).
34.9	Silicon Graphics Indigo², MIPS R4400, 175 Mhz, IRIX 5.2, “-O”.
32.4	Pentium 133, Windows NT, Microsoft Visual C++ 4.0.
17.25	Silicon Graphics Indigo², MIPS R4400, 175 Mhz, IRIX 6.5, “-O3”.
14.10	Dell Dimension XPS R100, Pentium II 400 MHz, Windows 98, Microsoft Visual C 5.0.
10.7	Hewlett-Packard Kayak XU 450Mhz Pentium II, Microsoft Visual C++ 6.0, Windows NT 4.0sp3. (Nathan Bender).
5.09	Sun Ultra 2, UltraSPARC V9, 300 MHz, gcc “-O3”.
3.29	Raspberry Pi 3, ARMv8 Cortex-A53, 1.2 GHz, Raspbian, GCC 4.9.2 “-O3”.
0.846	Dell Inspiron 9100, Pentium 4, 3.4 GHz, gcc “-O3”.

All brand and product names are trademarks or registered trademarks of their respective companies. Results of this benchmark may or may not be representative of the performance of listed systems for other programs and workloads. Lawyers burn spontaneously in an atmosphere of fluorine.

Original `ffbench` Release Announcement

The following text accompanied the initial distribution of ffbench on April 24th, 1989. The text has been slightly edited to remove anachronisms. Clearly we've come, as Voyager has gone, a long way in the the succeeding years.

As Voyager bears down on Neptune, so do the Nineteen Nineties, the Gilded Age of Computing, thunder down the tracks toward our products and company. To celebrate, I'm rolling out a new floating point benchmark, not to replace the venerable FBENCH (now nine years old), but to provide a different perspective on the floating point performance to be had from a machine.

The original FBENCH, derived from a program that performs geometric ray tracing for lens design (as opposed to ray tracing as done for photorealistic rendering), relies heavily on trigonometric functions and thus provides a good metric for trig function performance. Its ability to evaluate trig functions in-line with series approximations permits comparison of four-function floating point performance from compiler generated code with library or hardware implemented floating point. FBENCH has proved, over time, to provide a much better estimate of the relative performance of AutoCAD generation time on a given machine than it has any right to do.

FBENCH is representative of much code that “does geometry”—it consists of many calculation statements, numerous conditionals, and frequent function calls. As such, it does not accurately reflect the performance on bulk number crunching to be had from pipeline, or MIDA/MIPC (multiple instruction dispatch architecture / multiple instruction per cycle) lite-parallel architectures. As Autodesk products come to incorporate more matrix operations and other heavy number-crunching algorithms there is a growing need to characterise machine performance on such tasks as it can differ substantially from relative figures of merit obtained on algorithms such as FBENCH.

I am now rolling out FFBENCH.C, a contender for this new role. While it differs from FBENCH in many ways, its raison d'être is the same—it's a benchmark we control ourselves, which we can easily run on machines we encounter, whose results we can learn, over time, to interpret in the light of their correlation with the performance of our several products on actual machines. Like FBENCH at its introduction, FFBENCH is at present immature and untested. Only as we gain familiarity with its results will it become a useful tool. Perhaps it will prove unrepresentative and be replaced. Regardless, there is a need for a benchmark to characterise loop-intensive floating point performance.

Like FBENCH, FFBENCH makes no attempt to compete with time-proven benchmarks such as Whetstone, LINPACK, or Dhrystone. Its value is that it is portable, ours, runnable under controlled conditions on short notice, does something real (and hence isn't subject to benchmark-loading compiler optimisations), and checks for the right answers (an aspect of the old FBENCH that has embarrassed several vendors).

So what is it? FFBENCH is a C language program which initialises a 256 by 256 array of double precision complex numbers to known values then executes twenty iterations through a loop, each iteration computing the two dimensional Fourier transform of the data in the matrix, then applying the inverse transform to recover the original data. The results of each loop iteration are input to the next. Upon exit from the the loop, values in the matrix are compared against the original pattern stored there—any discrepancies are reported as errors.

The Fourier transform and its inverse are computed with an algorithm derived from that given in “Numerical Recipes in C” by Press et al. This is an N-dimensional generalisation of the fast Fourier transform, At the heart of an FFT algorithm are three nested FOR loops with problem code only in the innermost loop. To the extent these loops can be collapsed into vector operations, subscript calculations replaced by progressive indexing, and expressions within the loop compiled into operations executed in parallel, the execution time will dramatically be reduced.

As FBENCH reflected the contemporary community standards of memory capacity and CPU performance at the time of its creation, so does FFBENCH embody the unvoiced premises of the Gilded Age of Computing—which is to say that it's a memory and CPU hawg. To be precise, FFBENCH requires more than one megabyte of memory, dynamically allocates a buffer more than one megabyte in length and addresses it as a single array of doubles, and performs on the order of 360 million floating point operations in the course of its execution. In addition, the code presumes that the C “int” type is 32 bits or more. In short, we're talking workstations here. If you want to run it on lesser machines, you'll have to tweak the code and prepare to be patient. Well…yes, and I had to be patient to get the FBENCH numbers for the Commodore 128.

In these heady early days, we aren't being as hard nosed as we've been with FBENCH results. I welcome your reports of the execution time of this benchmark on your weapon of choice. On Unix-like machines, reports of User time from the “time” program are fine, and on other machines hand-timed run reports will serve. We'll tighten up the numbers as they become more important.

FBENCH: Trigonometry Function Benchmark

Floating Point Benchmarks

Fourmilab Home Page

by John Walker
November 28th, 2016