Floating Point Benchmark: Chapel Language Added (Fourmilog: None Dare Call It Reason)

« ISBNiser 1.3 Update Released | Main | Floating Point Benchmark: Ruby Language Updated »

Thursday, October 26, 2017

Floating Point Benchmark: Chapel Language Added

I have posted an update to my trigonometry-intense floating point benchmark which adds the Chapel language.

Chapel (Cascade High Productivity Language) is a programming language developed by Cray, Inc. with the goal of integrating parallel computing into a language without cumbersome function calls or awkward syntax. The language implements both task based and data based parallelism: in the first, the programmer explicitly defines the tasks to be run in parallel, while in the second an operation is performed on a collection of data and the compiler and runtime system decides how to partition it among the computing resources available. Both symmetric multiprocessing with shared memory (as on contemporary “multi-core” microprocessors) and parallel architectures with local memory per processor and message passing are supported.

Apart from its parallel processing capabilities, Chapel is a conventional object oriented imperative programming language. Programmers familiar with C++, Java, and other such languages will quickly become accustomed to its syntax and structure.

Because this is the first parallel processing language in which the floating point benchmark has been implemented, I wanted to test its performance in both serial and parallel processing modes. Since the benchmark does not process large arrays of data, I used task parallelism to implement two kinds of parallel processing.

The first is “parallel trace”, enabled by compiling with:
chpl --fast fbench.chpl --set partrace=true
The ray tracing process propagates light of four different wavelengths through the lens assembly and then uses the object distance and axis slope angle of the rays to compute various aberrations. When partrace is set to true, the computation of these rays is performed in parallel, with four tasks running in a “cobegin” structure. When all of the tasks are complete, their results, stored in shared memory passed to the tasks by reference, is used to compute the aberrations.

The second option is “parallel iteration”, enabled by compiling with:
chpl --fast fbench.chpl --set pariter=n
where n is the number of tasks among which the specified iteration count will be divided. On a multi-core machine, this should usually be set to the number of processor cores available, which you can determine on most Linux systems with:
cat /proc/cpuinfo | grep processor | wc -l
(If the number of tasks does not evenly divide the number of iterations, the extra iterations are assigned to one of the tasks.) The parallel iteration model might be seen as cheating, but in a number of applications, such as ray tracing for computer generated image rendering (as opposed to the ray tracing we do in the benchmark for optical design), a large number of computations are done which are independent of one another (for example, every pixel in a generated image is independent of every other), and the job can be parallelised by a simple “farm” algorithm which spreads the work over as many processors as are available. The parallel iteration model allows testing this approach with the floating point benchmark.

If the benchmark is compiled without specifying partrace or pariter, it will run the task serially as in conventional language implementations. The number of iterations is specified on the command line when running the benchmark as:
./fbench --iterations=n
where n is the number to be run.

After preliminary timing runs to determine the number of iterations, I ran the serial benchmark for 250,000,000 iterations, with run times in seconds of:

	user	real	sys
	301.00	235.73	170.46
	299.24	234.26	169.27
	297.93	233.67	169.40
	301.02	236.05	171.08
	298.59	234.45	170.30
Mean	299.56	234.83	170.10

The mean user time comes to 1.1982 microseconds per iteration.

Now, to one accustomed to running this benchmark, these times were distinctly odd if not downright weird. You just don't see real time less than user time, and what's with that huge system time? Well, it turns out that even though I didn't enable any of the explicit parallelisation in the code, it was actually using two threads. (I haven't dug into the generated C code to figure out how it was using them.) The first clue was when I looked at the running program with top and saw:

    PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
    20   0  167900   2152   2012 S 199.7  0.0   0:12.54 fbench

Yup, almost 200% CPU utilisation. I then ran top -H to show threads and saw:

    PR  NI    VIRT    RES    SHR S %CPU %MEM     TIME+ COMMAND
    20   0  167900   2152   2012 R 99.9  0.0   1:43.28 fbench
    20   0  167900   2152   2012 R 99.7  0.0   1:43.27 fbench

so indeed we had two threads. You can control the number of threads with the environment variable CHPL_RT_NUM_THREADS_PER_LOCALE, so I set:
export CHPL_RT_NUM_THREADS_PER_LOCALE=1
and re-ran the benchmark, verifying with top that it was now using only one thread. I got the following times:

	user	real	sys
	235.46	235.47	0.00
	236.52	236.55	0.02
	235.06	235.07	0.00
	235.17	235.20	0.02
	236.20	236.21	0.00
Mean	235.68

Now that's more like what we're used to seeing! User and real times are essentially identical, with negligible system time. Note that the user time in the single threaded run was essentially identical to the real time when it was running with two threads. So, all that was accomplished by using two threads was burning up more time on two cores and wasting a lot of time in system calls creating, synchronising, and destroying them. With one thread, the mean user time per iteration was 0.9427 microseconds per iteration.

I then ran the C benchmark for 166,051,660 iterations, yielding run times of (296.89, 296.37, 296.29, 296.76, 296.37) seconds, with mean 296.536, for 1.7858 microseconds per iteration.

Comparing these times gives a ratio of 0.5279 for Chapel to C. In other words, the Chapel program ran about twice as fast as C.

Now, let's explore the various kinds of explicit parallelism. First, we'll enable parallel trace by compiling with “--set partrace=true”. The results are…disastrous. I ran a comparison test with 10,000,000 iterations and measured the timings for CHPL_RT_NUM_THREADS_PER_LOCALE set to the number of threads in the table below:

threads	real	user	sys
1	16.92	16.91	0.00
2	30.74	41.68	18.16
4	43.15	68.23	90.65
5	64.29	112.38	358.88

The amount of computation done in the parallel threads is just not large enough to recover the cost of creating and disposing of the threads. The thread overhead dwarfs the gain from parallelisation, and all we manage to do is keep the CPU cores (the machine on which I'm testing has eight) busy with system and user time overhead, which increases so rapidly the real runtime degrades as we throw more cores at the problem. Interestingly, the partrace=true program, when restricted to one thread, still ran much slower than the serial version of the program, which ran in 9.49 seconds on one thread.

Next, we'll move on to parallel iteration, which models a “farm” algorithm division of processing a large data set. Here, we simply partition the number of iterations of the benchmark and process them in separate threads with Chapel's “coforall” mechanism. Running 250,000,000 iterations with 8 threads (“--set pariter=8”) yields timings of:

	user	real	sys
	342.27	48.95	39.84
	339.50	48.10	39.76
	343.01	49.34	42.19
	342.08	48.78	39.90
	338.83	47.70	37.30
Mean	341.14	48.57	39.79

Now we're cooking! Going to 8 threads working on the iterations cut the total real runtime from 235.68 seconds for the serial implementation to just 48.57 seconds—almost five times faster. Note that we paid a price for this in additional user computation time across all of the threads: 341.14 seconds as opposed to 235.68, and we incurred around 40 seconds of system overhead, which was negligible in the serial program, but the bottom line was that we “got the answer out” much more quickly, even though the machine was working harder to get to the finish line.

To see just how much performance I could get from parallelism, I moved testing to the main Fourmilab server, Pallas. This machine has 64 CPU cores and runs at about the same speed as the laptop on which I was developing. To confirm this, I ran the C fbench compiled and static linked on the laptop with timings of (301.43, 300.92). This is essentially the same speed as the laptop running the same binary.

Next, I built the Chapel benchmark with pariter=32 and ran it with the following settings of CHPL_RT_NUM_THREADS_PER_LOCALE.

threads	real	user	sys
1	459.76	458.86	0.08
16	33.08	523.78	0.12
32	17.17	530.21	0.34
64	25.35	816.64	0.43

Finally, here are timings for a pariter=64 build.

threads	real	user	sys
32	17.12	528.46	0.29
64	14.00	824.79	0.66

By using 64 threads and cores, we are now running 16.8 times faster than the single thread, non-parallel version of the program.

Chapel is an open source software project which runs on a wide variety of computing platforms. Even without its parallel capabilities, it outperforms current releases of GCC for a scientific computation task like the floating point benchmark, and for algorithms which can be readily parallelised, it can deliver a large performance increases on multi-core computer systems without awkward or configuration-dependent programming. If you're looking at a computationally-intense project where parallel computing may make a difference, it's well worth investigating.

The relative performance of the various language implementations (with C taken as 1) is as follows. All language implementations of the benchmark listed below produced identical results to the last (11th) decimal place.

Language	Relative Time	Details
C	1	GCC 3.2.3 `-O3`, Linux
JavaScript	0.372 0.424 1.334 1.378 1.386 1.495	Mozilla Firefox 55.0.2, Linux Safari 11.0, MacOS X Brave 0.18.36, Linux Google Chrome 61.0.3163.91, Linux Chromium 60.0.3112.113, Linux Node.js v6.11.3, Linux
Chapel	0.528 0.0314	Chapel 1.16.0, `-fast`, Linux Parallel, 64 threads
Visual Basic .NET	0.866	All optimisations, Windows XP
FORTRAN	1.008	GNU Fortran (g77) 3.2.3 `-O3`, Linux
Pascal	1.027 1.077	Free Pascal 2.2.0 `-O3`, Linux GNU Pascal 2.1 (GCC 2.95.2) `-O3`, Linux
Swift	1.054	Swift 3.0.1, `-O`, Linux
Rust	1.077	Rust 0.13.0, `--release`, Linux
Java	1.121	Sun JDK 1.5.0_04-b05, Linux
Visual Basic 6	1.132	All optimisations, Windows XP
Haskell	1.223	GHC 7.4.1`-O2 -funbox-strict-fields`, Linux
Scala	1.263	Scala 2.12.3, OpenJDK 9, Linux
Ada	1.401	GNAT/GCC 3.4.4 `-O3`, Linux
Go	1.481	Go version go1.1.1 linux/amd64, Linux
Simula	2.099	GNU Cim 5.1, GCC 4.8.1 -O2, Linux
Lua	2.515 22.7	LuaJIT 2.0.3, Linux Lua 5.2.3, Linux
Python	2.633 30.0	PyPy 2.2.1 (Python 2.7.3), Linux Python 2.7.6, Linux
Erlang	3.663 9.335	Erlang/OTP 17, emulator 6.0, HiPE [native, {hipe, [o3]}] Byte code (BEAM), Linux
ALGOL 60	3.951	MARST 2.7, GCC 4.8.1 -O3, Linux
PL/I	5.667	Iron Spring PL/I 0.9.9b beta, Linux
Lisp	7.41 19.8	GNU Common Lisp 2.6.7, Compiled, Linux GNU Common Lisp 2.6.7, Interpreted
Smalltalk	7.59	GNU Smalltalk 2.3.5, Linux
Forth	9.92	Gforth 0.7.0, Linux
Prolog	11.72 5.747	SWI-Prolog 7.6.0-rc2, Linux GNU Prolog 1.4.4, Linux, (limited iterations)
COBOL	12.5 46.3	Micro Focus Visual COBOL 2010, Windows 7 Fixed decimal instead of `computational-2`
Algol 68	15.2	Algol 68 Genie 2.4.1 -O3, Linux
Perl	23.6	Perl v5.8.0, Linux
Ruby	26.1	Ruby 1.8.3, Linux
QBasic	148.3	MS-DOS QBasic 1.1, Windows XP Console
Mathematica	391.6	Mathematica 10.3.1.0, Raspberry Pi 3, Raspbian

Posted at October 26, 2017 21:41

Fourmilog: None Dare Call It Reason

John Walker's Fourmilab Change Log

Thursday, October 26, 2017

Floating Point Benchmark: Chapel Language Added