Sunday, November 19, 2017

Univac Document Archive: 1107 SLEUTH II and PROCS Manuals Added

I have added the following documents to the Univac 1107 section of the Univac Document Archive. These are PDFs of scanned paper documents in my collection. These documents are fifty years old and may appear wonky to contemporary eyes: text is sometimes misaligned on the page, multiple fonts are intermixed like a ransom note, and sample code sometimes appears as handwriting on coding forms. These are not artefacts of scanning—it's how the documents actually appeared. Recall that only around 38 Univac 1107s were sold, so documents describing it were produced in small numbers and didn't, in the eyes of Univac, merit the expense of the high production values of contemporary IBM manuals.

The Univac 1107 was originally supplied with a machine-specific assembler called SLEUTH (later renamed SLEUTH I). Computer Sciences Corporation subsequently developed an optimising Fortran compiler, a batch operating system initially called the Monitor System and later EXEC II, and a new assembler, SLEUTH II. SLEUTH II was a “meta-assembler”: within the constraints of a maximum word length of 36 bits, it could assemble code for any machine. The 1107 instruction set was defined by procedure definitions, but by writing new definitions, code for other machines could be easily generated. I used descendants of SLEUTH II to assemble code for a variety of minicomputer and microprocessor systems, including early versions of my Marinchip Systems software. SLEUTH II had a powerful procedure and function definition facility which was, rare for languages of the day, fully recursive. (I recall writing Ackermann's Function as a SLEUTH II FUNC just because I could.) A companion manual explained these procedures (PROCS) and functions in greater detail.
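
For those who haven't encountered it, Ackermann's function is a standard torture test for recursion: it is defined entirely in terms of nested calls to itself. Here is the usual definition, sketched in C++ purely as an illustration of the function (this is, of course, not SLEUTH II code):

    #include <cstdio>

    // Ackermann's function: a classic doubly-recursive definition.
    unsigned long ackermann(unsigned long m, unsigned long n) {
        if (m == 0) return n + 1;
        if (n == 0) return ackermann(m - 1, 1);
        return ackermann(m - 1, ackermann(m, n - 1));
    }

    int main() {
        // Keep the arguments small: the value (and the recursion depth)
        // grows explosively.
        std::printf("A(2, 3) = %lu\n", ackermann(2, 3));   // 9
        std::printf("A(3, 3) = %lu\n", ackermann(3, 3));   // 61
        return 0;
    }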

Posted at 15:16 Permalink

Friday, November 10, 2017

New: Univac Document Archive

I have just posted a new section on Univac Memories, the Univac Document Archive, which contains PDF scans of hardware and software manuals, sales brochures, and related documents for the Univac 1100 series from the 1107 through the 1100/80.

This collection includes some classics, among them the original 1966 EXEC-8 manual, whose camera-ready copy appears to have been printed (in all capitals) on a 1004 line printer.

There remain a number of lacunæ. I'd love to add hardware manuals for the FH-432 and FH-1782 drums, the FASTRAND, and the CTMC, and software manuals for the Collector, SECURE, FORTRAN V, and others from the era. If you have any of these gathering dust in your attic, why not dig them out, fire up the scanner, and send them my way? (Please scan with at least 300 dpi resolution. I have a PDF editor which allows me to clean up raw scans and put them in publication-ready form.)

Posted at 21:46 Permalink

Monday, November 6, 2017

Floating Point Benchmark: Back to BASICs

The floating point benchmark was born in BASIC. The progenitor of the benchmark was an interactive optical design and ray tracing application I wrote in 1980 in Marinchip QBASIC [PDF, 19 Mb]. This was, for the time, a computationally intensive process, as analysis of a multi-element lens design required tracing four light rays with different wavelengths and axial incidence through each surface of the assembly, with multiple trigonometric function evaluations for each surface transit. In the days of software floating point, before the advent of math coprocessors or integrated floating point units, this took a while; more than a second for each analysis.

After I became involved in Autodesk, and we began to receive requests from numerous computer manufacturers and resellers to develop versions of AutoCAD for their machines, I perceived a need to be able to evaluate the relative performance of candidate platforms before investing the major effort to do a full port of AutoCAD. It was clear that the main bottleneck in AutoCAD's “REGEN” performance (the time it took to display a view of a drawing file on the screen) was the machine's floating point performance. Unlike many competitors at the time, AutoCAD used floating point numbers (double precision for the 80x86 version of the program) for the coordinates in its database, and mapping these to the screen required a great deal of floating point arithmetic to be performed. It occurred to me that the largely forgotten lens design program might be a good model for AutoCAD's performance, and since it was only a few pages of code, even less when stripped of its interactive user interface and ability to load and save designs, it could be easily ported to new machines. I made a prototype of the benchmark on the Marinchip machine by stripping the lens design program down to its essential inner loop and then, after testing, rewrote the program in C, using the Lattice C compiler we used for the IBM PC and other 8086 versions of AutoCAD.

Other than running a timing test on the Marinchip 9900 to establish that, even in 1986, it was still faster than the IBM PC/AT, the QBASIC version of the benchmark was set aside and is now lost in the mists of time. Throughout the 1980s and '90s, the C version was widely used to test candidate machines and proved unreasonably effective in predicting how fast AutoCAD would run when ported to them. Since by then most machines had compatible C compilers, running the benchmark was simply a matter of recompiling it and running a timing test. From the start, the C version of the benchmark checked the results of the ray trace and analysis to the last (11th) decimal place against the reference results from the original QBASIC program. This was useful in detecting errors in floating point implementations and mathematical function libraries which would be disastrous for AutoCAD and would preclude a port until remedied. The benchmark's accuracy check was shown to be invariant of the underlying floating point format, producing identical results on a variety of implementations.

Later, I became interested in comparing the performance of different programming languages for scientific computation, so I began to port the C benchmark to various languages, resulting in the comparison table at the end of this article. First was a port of the original QBASIC program to the Microsoft/IBM BASICA included with MS-DOS on the IBM PC and PC/AT. This involved transforming QBASIC's relatively clean (for BASIC, anyway) syntax to the gnarly world of line numbers, two character variable names, and GOSUBs which was BASICA. This allowed comparing the speed of BASICA with the C compiler we were using and showed that, indeed, C was much faster. The BASICA version of the benchmark was preserved and has been included in distributions of the benchmark collection for years in the mbasic (for Microsoft BASIC) directory, but due to the obsolescence of the language, no timings of it have been done since the original run on the PC/AT in 1984.

I was curious how this archaic version of the benchmark might perform on a modern machine, so when I happened upon Michael Haardt's Bas, a free implementation of the original BASICA/GW-BASIC language written in portable C, I realised the opportunity for such a comparison might be at hand. I downloaded version 2.4 of Bas and built it without any problems. I was delighted to discover that it ran the Microsoft BASIC version of the benchmark, last saved in 1998, without any modifications, and produced results accurate to the last digit.

Bas is a tokenising interpreter, not a compiler, so it could be expected to run much slower than any compiled language. I started with the usual comparison to C. I ran a preliminary timing test to determine an iteration count which would yield a run time of around five minutes, ran five timing runs on an idle machine, and for 3,056,858 iterations obtained a mean run time of 291.64 seconds, or 95.4052 microseconds per iteration. Compared to the C benchmark, which runs in 1.7856 microseconds per iteration on the same machine, this is 53.42 times slower, as shown in the “BASICA/GW-BASIC” row of the table below. This is still 2.78 times faster than Microsoft QBasic (not to be confused with Marinchip QBASIC) which, when I compared it to C on a Windows XP machine in the early 2000s (running in the MS-DOS console window), ran 148.3 times slower than the C version compiled with the Microsoft Visual C compiler on the same machine.

Since I had benchmarked this program with IBM BASICA in the 1980s, it was possible to do a machine performance comparison. The IBM PC/AT, running at 6 MHz, with BASICA version A3.00 and software floating point, ran 1000 iterations of the benchmark in 3290 seconds (yes, almost an hour), or 3.29 seconds per iteration. Dividing this by the present-day time of 95.4052 microseconds per iteration with Bas, we find that the same program, still running in interpreted mode on a modern machine with hardware floating point, runs 34,484 times faster than 1984's flagship personal computer.

This made me curious how a modern compiled BASIC might stack up. In 2005 Jim White ported the C benchmark to Microsoft Visual BASIC (both version 6 and .NET), and obtained excellent results, with the .NET release actually running faster than Microsoft Visual C on Windows XP. (Well, of course this was a comparison against Monkey C, so maybe I shouldn't be surprised.) These ports of the benchmark are available in the visualbasic directory of the benchmark collection.

FreeBASIC is an open source (GPL) command line compiler for a language which is essentially compatible with Microsoft QuickBasic with some extensions for access to operating system facilities. The compiler produces executable code, using the GNU Binutils suite as its back-end. The compiler runs on multiple platforms including Linux and Microsoft Windows. Since Visual Basic is an extension of QuickBasic, crudded up with Windows-specific junk, I decided to try to port Jim White's Visual Basic version 6 code to FreeBASIC.

This wasn't as easy as I'd hoped, because in addition to stripping out all of the “Form” crap, I had to substantially change the source code due to Microsoft-typical fiddling with the language in their endless quest to torpedo developers foolish enough to invest work in their wasting asset platforms. I restructured the program in the interest of readability and added comments to explain what the program is doing. The “Form” output was rewritten to use conventional “Print using” statements. The internal Microsoft-specific timing code was removed and replaced with external timing. The INTRIG compile option (local math functions written in BASIC) was removed—I'm interested in the performance of the language's math functions, not the language for writing math functions.

After getting the program to compile and verifying that it produced correct output, I once again ran a preliminary timing test, determined an iteration count, and ran the archival timing tests, yielding a mean time of 296.54 seconds for 127,172,531 iterations, or 2.3318 microseconds per iteration. Compared to C's 1.7858 microseconds per iteration, this gives a run time 1.306 times that of C, or almost exactly 30% slower. This is in the middle of the pack for compiled languages, although slower than the heavily optimised ones. See the “FreeBASIC” line in the table for the relative ranking.

The relative performance of the various language implementations (with C taken as 1) is as follows. All language implementations of the benchmark listed below produced identical results to the last (11th) decimal place.

Language             Relative Time   Details
C                    1               GCC 3.2.3 -O3, Linux
JavaScript           0.372           Mozilla Firefox 55.0.2, Linux
                     0.424           Safari 11.0, MacOS X
                     1.334           Brave 0.18.36, Linux
                     1.378           Google Chrome 61.0.3163.91, Linux
                     1.386           Chromium 60.0.3112.113, Linux
                     1.495           Node.js v6.11.3, Linux
Chapel               0.528           Chapel 1.16.0, -fast, Linux
                     0.0314          Parallel, 64 threads
Visual Basic .NET    0.866           All optimisations, Windows XP
C++                  0.939           G++ 5.4.0, -O3, Linux, double
                     0.964           long double (80 bit)
                     31.00           __float128 (128 bit)
                     189.7           MPFR (128 bit)
                     499.9           MPFR (512 bit)
FORTRAN              1.008           GNU Fortran (g77) 3.2.3 -O3, Linux
Pascal               1.027           Free Pascal 2.2.0 -O3, Linux
                     1.077           GNU Pascal 2.1 (GCC 2.95.2) -O3, Linux
Swift                1.054           Swift 3.0.1, -O, Linux
Rust                 1.077           Rust 0.13.0, --release, Linux
Java                 1.121           Sun JDK 1.5.0_04-b05, Linux
Visual Basic 6       1.132           All optimisations, Windows XP
Haskell              1.223           GHC 7.4.1 -O2 -funbox-strict-fields, Linux
Scala                1.263           Scala 2.12.3, OpenJDK 9, Linux
FreeBASIC            1.306           FreeBASIC 1.05.0, Linux
Ada                  1.401           GNAT/GCC 3.4.4 -O3, Linux
Go                   1.481           Go version go1.1.1 linux/amd64, Linux
Simula               2.099           GNU Cim 5.1, GCC 4.8.1 -O2, Linux
Lua                  2.515           LuaJIT 2.0.3, Linux
                     22.7            Lua 5.2.3, Linux
Python               2.633           PyPy 2.2.1 (Python 2.7.3), Linux
                     30.0            Python 2.7.6, Linux
Erlang               3.663           Erlang/OTP 17, emulator 6.0, HiPE [native, {hipe, [o3]}]
                     9.335           Byte code (BEAM), Linux
ALGOL 60             3.951           MARST 2.7, GCC 4.8.1 -O3, Linux
PL/I                 5.667           Iron Spring PL/I 0.9.9b beta, Linux
Lisp                 7.41            GNU Common Lisp 2.6.7, Compiled, Linux
                     19.8            GNU Common Lisp 2.6.7, Interpreted
Smalltalk            7.59            GNU Smalltalk 2.3.5, Linux
Ruby                 7.832           Ruby 2.4.2p198, Linux
Forth                9.92            Gforth 0.7.0, Linux
Prolog               11.72           SWI-Prolog 7.6.0-rc2, Linux
                     5.747           GNU Prolog 1.4.4, Linux (limited iterations)
COBOL                12.5            Micro Focus Visual COBOL 2010, Windows 7
                     46.3            Fixed decimal instead of computational-2
Algol 68             15.2            Algol 68 Genie 2.4.1 -O3, Linux
Perl                 23.6            Perl v5.8.0, Linux
BASICA/GW-BASIC      53.42           Bas 2.4, Linux
QBasic               148.3           MS-DOS QBasic 1.1, Windows XP Console
Mathematica          391.6           Mathematica 10.3.1.0, Raspberry Pi 3, Raspbian

Posted at 14:04 Permalink

Thursday, November 2, 2017

Floating Point Benchmark: C++ Language Added, Multiple Precision Arithmetic

I have posted a new edition of the floating point benchmark collection which adds the C++ language and compares the performance of four floating point implementations with different precisions: standard double (64 bit), long double (80 bit), GNU libquadmath (__float128, 128 bit), and the GNU MPFR multiple-precision library, tested at both 128 and 512 bit precision.

It is, of course, possible to compile the ANSI C version of the benchmark with a C++ compiler, as almost any ANSI C program is a valid C++ program, but this program is a complete rewrite of the benchmark algorithm in C++, using the features of the language as they were intended to improve the readability, modularity, and generality of the program. As with all versions of the benchmark, identical results are produced, to the last decimal place, and the results are checked against a reference to verify correctness.

This benchmark was developed to explore whether writing a program using the features of C++ imposed a speed penalty compared to the base C language, and also to explore the relative performance of four different implementations of floating point arithmetic and mathematical function libraries, with different precision. The operator overloading features of C++ make it possible to easily port code to multiple precision arithmetic libraries without the cumbersome and error-prone function calls such code requires in C.

The resulting program is object-oriented, with objects representing items such as spectral lines, surface boundaries in an optical assembly, a complete lens design, the trace of a ray of light through the lens, and an evaluation of the aberrations of the design compared to acceptable optical quality standards. Each object has methods which perform computation related to its contents. All floating point quantities in the program are declared as type Real, which is typedef-ed to the precision being tested.
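
As a rough sketch of what that selection might look like (the macro names here are illustrative, not necessarily those in the actual benchmark source):

    // Illustrative only: choose the benchmark's floating point type at
    // compile time.  The macro names are hypothetical.
    #if defined(USE_MPFR)
        #include "mpreal.h"             // MPFR C++ bindings
        typedef mpfr::mpreal Real;
    #elif defined(USE_FLOAT128)
        #include <quadmath.h>
        typedef __float128 Real;
    #elif defined(USE_LONG_DOUBLE)
        typedef long double Real;
    #else
        typedef double Real;
    #endif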

The numbers supported by libquadmath and MPFR cannot be directly converted to strings by snprintf() format phrases, so when using these libraries auxiliary code is generated to use those packages' facilities for conversion to strings. In a run of the benchmark which typically runs hundreds of thousands or millions of executions of the inner loop, this code only executes once, so it has negligible impact on run time.

I first tested the program with standard double arithmetic. As always, I do a preliminary run and time it, then compute an iteration count to yield a run time of around five minutes. I then perform five runs on an idle system, time them, and compute the mean run time. Next, the mean time is divided by the iteration count to compute microseconds per iteration. All tests were done with GCC/G++ 5.4.0.
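
The arithmetic which reduces those timings to the figures quoted below is trivial; here is a hypothetical helper (not part of the benchmark) which computes microseconds per iteration and the ratio to the C reference time, using, as example input, the C timings reported elsewhere in these notes:

    #include <cstdio>
    #include <vector>
    #include <numeric>

    // Hypothetical helper, not part of the benchmark: reduce a set of
    // timing runs to microseconds per iteration and a ratio relative to C.
    int main() {
        std::vector<double> runs = { 296.89, 296.37, 296.29, 296.76, 296.37 };  // seconds
        long iterations = 166051660L;      // iterations per run (example values)
        double cMicroseconds = 1.7858;     // C reference, microseconds per iteration

        double mean = std::accumulate(runs.begin(), runs.end(), 0.0) / runs.size();
        double usPerIteration = (mean / iterations) * 1.0e6;

        std::printf("Mean run time:      %.3f seconds\n", mean);
        std::printf("Time per iteration: %.4f microseconds\n", usPerIteration);
        std::printf("Relative to C:      %.4f\n", usPerIteration / cMicroseconds);
        return 0;
    }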

Compared with a run of the ANSI C benchmark, the C++ time was 0.9392 of the C run time. Not only did we not pay a penalty for using C++, we actually picked up around 6% in speed. Presumably, the cleaner structure of the code allowed the compiler to optimise a bit better, whereas the global variables in the original C program might have prevented some optimisations.

Next I tested with a long double data type, which uses the 80 bit internal representation of the Intel floating point unit. I used the same iteration count as with the original double test.

Here, the run time was 0.9636 that of C, still faster, and not that much longer than double. If the extra precision of long double makes a difference for your application, there's little cost in using it. Note that support for long double varies from compiler to compiler and architecture to architecture: whether it's available and, if so, what it means depends upon which compiler and machine you're using. These test results apply only to GCC on the x86 (actually x86_64) architecture.

GCC also provides a nonstandard data type, __float128, which implements 128 bit (quadruple precision) floating point arithmetic in software. The libquadmath library includes its own mathematical functions which end in “q” (for example sinq instead of sin), which must be called instead of the standard library functions, and a quadmath_snprintf function for editing numbers to strings. The benchmark contains conditional code and macro definitions to accommodate these changes.
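
A minimal sketch of what using libquadmath looks like (simplified; the macro shown here is illustrative, not the benchmark's actual conditional code):

    // Minimal libquadmath sketch; build with:  g++ sketch.cpp -lquadmath
    #include <cstdio>
    #include <quadmath.h>

    typedef __float128 Real;
    #define Sin(x) sinq(x)          // map the math function to its "q" version

    int main() {
        Real angle = 0.5;           // 0.5 is exactly representable in binary
        Real s = Sin(angle);

        char buf[64];
        // quadmath_snprintf() understands the "Q" length modifier.
        quadmath_snprintf(buf, sizeof buf, "%.33Qg", s);
        std::printf("sin(0.5) = %s\n", buf);
        return 0;
    }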

This was 31.0031 times slower than C. Here, we pay a heavy price for doing every floating point operation in software instead of using the CPU's built in floating point unit. If you have an algorithm which requires this accuracy, it's important to perform the numerical analysis to determine where the accuracy is actually needed and employ quadruple precision only where necessary.

Finally, I tested the program using the GNU MPFR multiple-precision library, which is built atop the GMP package. I used the MPFR C++ bindings developed by Pavel Holoborodko, which overload the arithmetic operators and define versions of the mathematical functions which make integrating MPFR into a C++ program almost seamless. As with __float128, the output editing code must be rewritten to accommodate MPFR's toString() formatting mechanism. MPFR allows a user-selected precision and rounding mode. I always use the default round to nearest mode, but allow specifying the precision in bits by setting MPFR_PRECISION when the program is compiled. I started with a precision of 128 bits, the same as __float128 above. The result was 189.72 times slower than C. The added generality of MPFR over __float128 comes at a steep price. Clearly, if 128 bits suffices for your application, __float128 is the way to go.
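
Here is a sketch of the flavour of the MPFR C++ bindings (illustrative only; the benchmark wraps all of this in the Real type and conditional code described above):

    // Illustrative MPFR C++ bindings sketch; build with:
    //   g++ -DMPFR_PRECISION=128 sketch.cpp -lmpfr -lgmp
    #include <iostream>
    #include "mpreal.h"

    #ifndef MPFR_PRECISION
    #define MPFR_PRECISION 128      // precision in bits, set at compile time
    #endif

    int main() {
        // Precision, in bits, for mpreal numbers created after this call.
        mpfr::mpreal::set_default_prec(MPFR_PRECISION);

        mpfr::mpreal angle("0.5");   // construct from a string to avoid
                                     // rounding through a double
        mpfr::mpreal s = sin(angle); // overloaded mathematical function

        std::cout << "sin(0.5) = " << s.toString() << std::endl;
        return 0;
    }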

Next, I wanted to see how run time scaled with precision. I rebuilt for 512 bit precision and reran the benchmark. Now we're 499.865 times slower than C—almost exactly 1/500 the speed. This is great to have if you really need it, but you'd be wise to use it sparingly.

The program produced identical output for all choices of floating point precision. By experimentation, I determined that I could reduce MPFR_PRECISION to as low as 47 without getting errors in the least significant digits of the results. At 46 bits and below, errors start to creep in.

The relative performance of the various language implementations (with C taken as 1) is as follows. All language implementations of the benchmark listed below produced identical results to the last (11th) decimal place.

Language             Relative Time   Details
C                    1               GCC 3.2.3 -O3, Linux
JavaScript           0.372           Mozilla Firefox 55.0.2, Linux
                     0.424           Safari 11.0, MacOS X
                     1.334           Brave 0.18.36, Linux
                     1.378           Google Chrome 61.0.3163.91, Linux
                     1.386           Chromium 60.0.3112.113, Linux
                     1.495           Node.js v6.11.3, Linux
Chapel               0.528           Chapel 1.16.0, -fast, Linux
                     0.0314          Parallel, 64 threads
Visual Basic .NET    0.866           All optimisations, Windows XP
C++                  0.939           G++ 5.4.0, -O3, Linux, double
                     0.964           long double (80 bit)
                     31.00           __float128 (128 bit)
                     189.7           MPFR (128 bit)
                     499.9           MPFR (512 bit)
FORTRAN              1.008           GNU Fortran (g77) 3.2.3 -O3, Linux
Pascal               1.027           Free Pascal 2.2.0 -O3, Linux
                     1.077           GNU Pascal 2.1 (GCC 2.95.2) -O3, Linux
Swift                1.054           Swift 3.0.1, -O, Linux
Rust                 1.077           Rust 0.13.0, --release, Linux
Java                 1.121           Sun JDK 1.5.0_04-b05, Linux
Visual Basic 6       1.132           All optimisations, Windows XP
Haskell              1.223           GHC 7.4.1 -O2 -funbox-strict-fields, Linux
Scala                1.263           Scala 2.12.3, OpenJDK 9, Linux
Ada                  1.401           GNAT/GCC 3.4.4 -O3, Linux
Go                   1.481           Go version go1.1.1 linux/amd64, Linux
Simula               2.099           GNU Cim 5.1, GCC 4.8.1 -O2, Linux
Lua                  2.515           LuaJIT 2.0.3, Linux
                     22.7            Lua 5.2.3, Linux
Python               2.633           PyPy 2.2.1 (Python 2.7.3), Linux
                     30.0            Python 2.7.6, Linux
Erlang               3.663           Erlang/OTP 17, emulator 6.0, HiPE [native, {hipe, [o3]}]
                     9.335           Byte code (BEAM), Linux
ALGOL 60             3.951           MARST 2.7, GCC 4.8.1 -O3, Linux
PL/I                 5.667           Iron Spring PL/I 0.9.9b beta, Linux
Lisp                 7.41            GNU Common Lisp 2.6.7, Compiled, Linux
                     19.8            GNU Common Lisp 2.6.7, Interpreted
Smalltalk            7.59            GNU Smalltalk 2.3.5, Linux
Ruby                 7.832           Ruby 2.4.2p198, Linux
Forth                9.92            Gforth 0.7.0, Linux
Prolog               11.72           SWI-Prolog 7.6.0-rc2, Linux
                     5.747           GNU Prolog 1.4.4, Linux (limited iterations)
COBOL                12.5            Micro Focus Visual COBOL 2010, Windows 7
                     46.3            Fixed decimal instead of computational-2
Algol 68             15.2            Algol 68 Genie 2.4.1 -O3, Linux
Perl                 23.6            Perl v5.8.0, Linux
QBasic               148.3           MS-DOS QBasic 1.1, Windows XP Console
Mathematica          391.6           Mathematica 10.3.1.0, Raspberry Pi 3, Raspbian

Posted at 22:45 Permalink

Saturday, October 28, 2017

Floating Point Benchmark: Ruby Language Updated

I originally posted the results from a Ruby language version of my floating point benchmark on 2005-10-18. At that time, the current release of Ruby was version 1.8.3, and it performed toward the lower end of interpreted languages: at 26.1 times slower than C, slower than Python and Perl. In the twelve years since that posting, subsequent releases of Ruby have claimed substantial performance improvements, so I decided to re-run the test with the current stable version, 2.4.2p198, which I built from source code on my x86_64-linux development machine, as its Xubuntu distribution provides the older 2.3.1p112 release.

Performance has, indeed, dramatically improved. I ran the benchmark for 21,215,057 iterations with a mean run time of 296.722 seconds for five runs, with a time per iteration of 13.9864 microseconds. The C benchmark on the same machine, built with GCC 5.4.0, runs at 1.7858 microseconds per iteration, so the current version of Ruby is now 7.832 times slower than C, making it one of the faster interpreted or byte coded languages.

I have updated the language comparison result table in the FBENCH Web page to reflect these results. Here is the table as updated. I have also updated the Ruby version of the benchmark included in the archive to fix two warnings issued when the program was run with the -W2 option.

Language             Relative Time   Details
C                    1               GCC 3.2.3 -O3, Linux
JavaScript           0.372           Mozilla Firefox 55.0.2, Linux
                     0.424           Safari 11.0, MacOS X
                     1.334           Brave 0.18.36, Linux
                     1.378           Google Chrome 61.0.3163.91, Linux
                     1.386           Chromium 60.0.3112.113, Linux
                     1.495           Node.js v6.11.3, Linux
Chapel               0.528           Chapel 1.16.0, -fast, Linux
                     0.0314          Parallel, 64 threads
Visual Basic .NET    0.866           All optimisations, Windows XP
FORTRAN              1.008           GNU Fortran (g77) 3.2.3 -O3, Linux
Pascal               1.027           Free Pascal 2.2.0 -O3, Linux
                     1.077           GNU Pascal 2.1 (GCC 2.95.2) -O3, Linux
Swift                1.054           Swift 3.0.1, -O, Linux
Rust                 1.077           Rust 0.13.0, --release, Linux
Java                 1.121           Sun JDK 1.5.0_04-b05, Linux
Visual Basic 6       1.132           All optimisations, Windows XP
Haskell              1.223           GHC 7.4.1 -O2 -funbox-strict-fields, Linux
Scala                1.263           Scala 2.12.3, OpenJDK 9, Linux
Ada                  1.401           GNAT/GCC 3.4.4 -O3, Linux
Go                   1.481           Go version go1.1.1 linux/amd64, Linux
Simula               2.099           GNU Cim 5.1, GCC 4.8.1 -O2, Linux
Lua                  2.515           LuaJIT 2.0.3, Linux
                     22.7            Lua 5.2.3, Linux
Python               2.633           PyPy 2.2.1 (Python 2.7.3), Linux
                     30.0            Python 2.7.6, Linux
Erlang               3.663           Erlang/OTP 17, emulator 6.0, HiPE [native, {hipe, [o3]}]
                     9.335           Byte code (BEAM), Linux
ALGOL 60             3.951           MARST 2.7, GCC 4.8.1 -O3, Linux
PL/I                 5.667           Iron Spring PL/I 0.9.9b beta, Linux
Lisp                 7.41            GNU Common Lisp 2.6.7, Compiled, Linux
                     19.8            GNU Common Lisp 2.6.7, Interpreted
Smalltalk            7.59            GNU Smalltalk 2.3.5, Linux
Ruby                 7.832           Ruby 2.4.2p198, Linux
Forth                9.92            Gforth 0.7.0, Linux
Prolog               11.72           SWI-Prolog 7.6.0-rc2, Linux
                     5.747           GNU Prolog 1.4.4, Linux (limited iterations)
COBOL                12.5            Micro Focus Visual COBOL 2010, Windows 7
                     46.3            Fixed decimal instead of computational-2
Algol 68             15.2            Algol 68 Genie 2.4.1 -O3, Linux
Perl                 23.6            Perl v5.8.0, Linux
QBasic               148.3           MS-DOS QBasic 1.1, Windows XP Console
Mathematica          391.6           Mathematica 10.3.1.0, Raspberry Pi 3, Raspbian

Posted at 00:11 Permalink

Thursday, October 26, 2017

Floating Point Benchmark: Chapel Language Added

I have posted an update to my trigonometry-intense floating point benchmark which adds the Chapel language.

Chapel (Cascade High Productivity Language) is a programming language developed by Cray, Inc. with the goal of integrating parallel computing into a language without cumbersome function calls or awkward syntax. The language implements both task-based and data-based parallelism: in the first, the programmer explicitly defines the tasks to be run in parallel, while in the second an operation is performed on a collection of data and the compiler and runtime system decide how to partition it among the computing resources available. Both symmetric multiprocessing with shared memory (as on contemporary “multi-core” microprocessors) and parallel architectures with local memory per processor and message passing are supported.

Apart from its parallel processing capabilities, Chapel is a conventional object oriented imperative programming language. Programmers familiar with C++, Java, and other such languages will quickly become accustomed to its syntax and structure.

Because this is the first parallel processing language in which the floating point benchmark has been implemented, I wanted to test its performance in both serial and parallel processing modes. Since the benchmark does not process large arrays of data, I used task parallelism to implement two kinds of parallel processing.

The first is “parallel trace”, enabled by compiling with:
      chpl --fast fbench.chpl --set partrace=true
The ray tracing process propagates light of four different wavelengths through the lens assembly and then uses the object distance and axis slope angle of the rays to compute various aberrations. When partrace is set to true, the computation of these rays is performed in parallel, with four tasks running in a “cobegin” structure. When all of the tasks are complete, their results, stored in shared memory passed to the tasks by reference, are used to compute the aberrations.

The second option is “parallel iteration”, enabled by compiling with:
      chpl --fast fbench.chpl --set pariter=n
where n is the number of tasks among which the specified iteration count will be divided. On a multi-core machine, this should usually be set to the number of processor cores available, which you can determine on most Linux systems with:
      cat /proc/cpuinfo | grep processor | wc -l
(If the number of tasks does not evenly divide the number of iterations, the extra iterations are assigned to one of the tasks.) The parallel iteration model might be seen as cheating, but in a number of applications, such as ray tracing for computer generated image rendering (as opposed to the ray tracing we do in the benchmark for optical design), a large number of computations are done which are independent of one another (for example, every pixel in a generated image is independent of every other), and the job can be parallelised by a simple “farm” algorithm which spreads the work over as many processors as are available. The parallel iteration model allows testing this approach with the floating point benchmark.
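
The apportionment itself is just integer division; here is a sketch (in C++, not Chapel) of how the iteration count might be split among the pariter tasks, with the leftover iterations going to one of them:

    #include <cstdio>

    // Illustrative only (C++, not Chapel): divide the iteration count among
    // pariter tasks; any remainder is assigned to one of the tasks.
    int main() {
        long iterations = 250000000L;   // total iterations (example value)
        int  tasks      = 8;            // corresponds to --set pariter=8
        long share      = iterations / tasks;
        long extra      = iterations % tasks;

        for (int t = 0; t < tasks; t++) {
            long n = share + (t == 0 ? extra : 0);
            std::printf("task %d: %ld iterations\n", t, n);
        }
        return 0;
    }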

If the benchmark is compiled without specifying partrace or pariter, it will run the task serially as in conventional language implementations. The number of iterations is specified on the command line when running the benchmark as:
      ./fbench --iterations=n
where n is the number to be run.

After preliminary timing runs to determine the number of iterations, I ran the serial benchmark for 250,000,000 iterations, with run times in seconds of:

user real sys
301.00 235.73 170.46
299.24 234.26 169.27
297.93 233.67 169.40
301.02 236.05 171.08
298.59 234.45 170.30
Mean 299.56 234.83 170.10

The mean user time comes to 1.1982 microseconds per iteration.

Now, to one accustomed to running this benchmark, these times were distinctly odd if not downright weird. You just don't see real time less than user time, and what's with that huge system time? Well, it turns out that even though I didn't enable any of the explicit parallelisation in the code, it was actually using two threads. (I haven't dug into the generated C code to figure out how it was using them.) The first clue was when I looked at the running program with top and saw:

    PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
    20   0  167900   2152   2012 S 199.7  0.0   0:12.54 fbench
Yup, almost 200% CPU utilisation. I then ran top -H to show threads and saw:
    PR  NI    VIRT    RES    SHR S %CPU %MEM     TIME+ COMMAND
    20   0  167900   2152   2012 R 99.9  0.0   1:43.28 fbench
    20   0  167900   2152   2012 R 99.7  0.0   1:43.27 fbench
so indeed we had two threads. You can control the number of threads with the environment variable CHPL_RT_NUM_THREADS_PER_LOCALE, so I set:
      export CHPL_RT_NUM_THREADS_PER_LOCALE=1
and re-ran the benchmark, verifying with top that it was now using only one thread. I got the following times:

user real sys
235.46 235.47 0.00
236.52 236.55 0.02
235.06 235.07 0.00
235.17 235.20 0.02
236.20 236.21 0.00
Mean 235.68

Now that's more like what we're used to seeing! User and real times are essentially identical, with negligible system time. Note that the user time in the single threaded run was essentially identical to the real time when it was running with two threads. So, all that was accomplished by using two threads was burning up more time on two cores and wasting a lot of time in system calls creating, synchronising, and destroying them. With one thread, the mean user time per iteration was 0.9427 microseconds per iteration.

I then ran the C benchmark for 166,051,660 iterations, yielding run times of (296.89, 296.37, 296.29, 296.76, 296.37) seconds, with mean 296.536, for 1.7858 microseconds per iteration.

Comparing these times gives a ratio of 0.5279 for Chapel to C. In other words, the Chapel program ran about twice as fast as C.

Now, let's explore the various kinds of explicit parallelism. First, we'll enable parallel trace by compiling with “--set partrace=true”. The results are…disastrous. I ran a comparison test with 10,000,000 iterations and measured the timings for CHPL_RT_NUM_THREADS_PER_LOCALE set to the number of threads in the table below:

threads real user sys
1 16.92 16.91 0.00
2 30.74 41.68 18.16
4 43.15 68.23 90.65
5 64.29 112.38 358.88

The amount of computation done in the parallel threads is just not large enough to recover the cost of creating and disposing of the threads. The thread overhead dwarfs the gain from parallelisation, and all we manage to do is keep the CPU cores (the machine on which I'm testing has eight) busy with system and user time overhead, which increases so rapidly the real runtime degrades as we throw more cores at the problem. Interestingly, the partrace=true program, when restricted to one thread, still ran much slower than the serial version of the program, which ran in 9.49 seconds on one thread.

Next, we'll move on to parallel iteration, which models a “farm” algorithm division of processing a large data set. Here, we simply partition the number of iterations of the benchmark and process them in separate threads with Chapel's “coforall” mechanism. Running 250,000,000 iterations with 8 threads (“--set pariter=8”) yields timings of:

user real sys
342.27 48.95 39.84
339.50 48.10 39.76
343.01 49.34 42.19
342.08 48.78 39.90
338.83 47.70 37.30
Mean 341.14 48.57 39.79

Now we're cooking! Going to 8 threads working on the iterations cut the total real runtime from 235.68 seconds for the serial implementation to just 48.57 seconds—almost five times faster. Note that we paid a price for this in additional user computation time across all of the threads: 341.14 seconds as opposed to 235.68, and we incurred around 40 seconds of system overhead, which was negligible in the serial program. But the bottom line is that we “got the answer out” much more quickly, even though the machine was working harder to get to the finish line.

To see just how much performance I could get from parallelism, I moved testing to the main Fourmilab server, Pallas. This machine has 64 CPU cores and runs at about the same speed as the laptop on which I was developing. To confirm this, I ran, on Pallas, the C fbench which had been compiled and statically linked on the laptop, with timings of (301.43, 300.92) seconds: essentially the same speed as the laptop running the same binary.

Next, I built the Chapel benchmark with pariter=32 and ran it with the following settings of CHPL_RT_NUM_THREADS_PER_LOCALE.

threads real user sys
1 459.76 458.86 0.08
16 33.08 523.78 0.12
32 17.17 530.21 0.34
64 25.35 816.64 0.43

Finally, here are timings for a pariter=64 build.

threads real user sys
32 17.12 528.46 0.29
64 14.00 824.79 0.66

By using 64 threads and cores, we are now running 16.8 times faster than the single thread, non-parallel version of the program.

Chapel is an open source software project which runs on a wide variety of computing platforms. Even without its parallel capabilities, it outperforms current releases of GCC for a scientific computation task like the floating point benchmark, and for algorithms which can be readily parallelised, it can deliver large performance increases on multi-core computer systems without awkward or configuration-dependent programming. If you're looking at a computationally intense project where parallel computing may make a difference, it's well worth investigating.

The relative performance of the various language implementations (with C taken as 1) is as follows. All language implementations of the benchmark listed below produced identical results to the last (11th) decimal place.

Language             Relative Time   Details
C                    1               GCC 3.2.3 -O3, Linux
JavaScript           0.372           Mozilla Firefox 55.0.2, Linux
                     0.424           Safari 11.0, MacOS X
                     1.334           Brave 0.18.36, Linux
                     1.378           Google Chrome 61.0.3163.91, Linux
                     1.386           Chromium 60.0.3112.113, Linux
                     1.495           Node.js v6.11.3, Linux
Chapel               0.528           Chapel 1.16.0, -fast, Linux
                     0.0314          Parallel, 64 threads
Visual Basic .NET    0.866           All optimisations, Windows XP
FORTRAN              1.008           GNU Fortran (g77) 3.2.3 -O3, Linux
Pascal               1.027           Free Pascal 2.2.0 -O3, Linux
                     1.077           GNU Pascal 2.1 (GCC 2.95.2) -O3, Linux
Swift                1.054           Swift 3.0.1, -O, Linux
Rust                 1.077           Rust 0.13.0, --release, Linux
Java                 1.121           Sun JDK 1.5.0_04-b05, Linux
Visual Basic 6       1.132           All optimisations, Windows XP
Haskell              1.223           GHC 7.4.1 -O2 -funbox-strict-fields, Linux
Scala                1.263           Scala 2.12.3, OpenJDK 9, Linux
Ada                  1.401           GNAT/GCC 3.4.4 -O3, Linux
Go                   1.481           Go version go1.1.1 linux/amd64, Linux
Simula               2.099           GNU Cim 5.1, GCC 4.8.1 -O2, Linux
Lua                  2.515           LuaJIT 2.0.3, Linux
                     22.7            Lua 5.2.3, Linux
Python               2.633           PyPy 2.2.1 (Python 2.7.3), Linux
                     30.0            Python 2.7.6, Linux
Erlang               3.663           Erlang/OTP 17, emulator 6.0, HiPE [native, {hipe, [o3]}]
                     9.335           Byte code (BEAM), Linux
ALGOL 60             3.951           MARST 2.7, GCC 4.8.1 -O3, Linux
PL/I                 5.667           Iron Spring PL/I 0.9.9b beta, Linux
Lisp                 7.41            GNU Common Lisp 2.6.7, Compiled, Linux
                     19.8            GNU Common Lisp 2.6.7, Interpreted
Smalltalk            7.59            GNU Smalltalk 2.3.5, Linux
Forth                9.92            Gforth 0.7.0, Linux
Prolog               11.72           SWI-Prolog 7.6.0-rc2, Linux
                     5.747           GNU Prolog 1.4.4, Linux (limited iterations)
COBOL                12.5            Micro Focus Visual COBOL 2010, Windows 7
                     46.3            Fixed decimal instead of computational-2
Algol 68             15.2            Algol 68 Genie 2.4.1 -O3, Linux
Perl                 23.6            Perl v5.8.0, Linux
Ruby                 26.1            Ruby 1.8.3, Linux
QBasic               148.3           MS-DOS QBasic 1.1, Windows XP Console
Mathematica          391.6           Mathematica 10.3.1.0, Raspberry Pi 3, Raspbian

Posted at 21:41 Permalink

Monday, October 23, 2017

ISBNiser 1.3 Update Released

I have just posted version 1.3 of ISBNiser, a utility for validating publication numbers in the ISBN-13 and ISBN-10 formats, converting between the formats, and generating Amazon associate links to purchase items with credit to a specified account.

Version 1.3 adds the ability to automatically parse the specified ISBNs and insert delimiters among the elements (unique country code [ISBN-13 only], registration group, registrant, publication, and checksum). If the number supplied contains delimiters, the same delimiter (the first if multiple different delimiters appear) will be used when re-generating the number with delimiters. For example, if all the publisher gives you is “9781481487658”, you can obtain an ISBN-13 or ISBN-10 with proper delimiters with:

$ isbniser 9781481487658
ISBN-13: 978-1-4814-8765-8  9781481487658   ISBN-10: 1481487655  1-4814-8765-5
The rules for properly placing the delimiters in an ISBN are deliciously baroque, with every language and country group having its own way of going about it. ISBNiser implements this standard with a page of ugly code. If confronted with an ISBN that does not conform to the standard (I haven't yet encountered one, but in the wild and woolly world of international publishing it wouldn't surprise me), it issues a warning message and returns a number with no delimiters.
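
The check digits themselves, unlike the delimiters, are easy to compute. Here is a minimal sketch of the two calculations (my own illustration, not ISBNiser's source code):

    #include <cstdio>
    #include <string>

    // Illustrative only, not ISBNiser source.  ISBN-13 check digits use
    // alternating weights 1 and 3 modulo 10; ISBN-10 check digits use
    // weights 10 down to 2 modulo 11, with 'X' standing for a value of 10.
    char isbn13CheckDigit(const std::string &first12) {
        int sum = 0;
        for (int i = 0; i < 12; i++) {
            sum += (first12[i] - '0') * ((i % 2 == 0) ? 1 : 3);
        }
        return static_cast<char>('0' + ((10 - (sum % 10)) % 10));
    }

    char isbn10CheckDigit(const std::string &first9) {
        int sum = 0;
        for (int i = 0; i < 9; i++) {
            sum += (first9[i] - '0') * (10 - i);
        }
        int check = (11 - (sum % 11)) % 11;
        return (check == 10) ? 'X' : static_cast<char>('0' + check);
    }

    int main() {
        // The ISBN-13 from the example above, and its ISBN-10 equivalent
        // (drop the 978 prefix and recompute the check digit).
        std::printf("ISBN-13 check digit: %c\n", isbn13CheckDigit("978148148765"));  // 8
        std::printf("ISBN-10 check digit: %c\n", isbn10CheckDigit("148148765"));     // 5
        return 0;
    }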

If the “-p” option is specified, delimiters in the number given will be preserved, regardless of where they are placed. When the “-n” option is specified, allowing invalid ISBN specifications (for example, when generating links to products on Amazon with ASIN designations), no attempt to insert delimiters is made.

Posted at 14:13 Permalink

Sunday, October 22, 2017

New: Commodore Curiosities

In the late 1980s I became interested in mass market home computers as possible markets for some products I was considering developing. I bought a Commodore 128 and began to experiment with it, writing several programs, some of which were published in Commodore user magazines.

Commodore Curiosities presents three of those programs: a customisable key click generator, a moon phase calculator, and a neural network simulator. Complete source code and a floppy disc image, which can be run on modern machines under the VICE C-64/C-128 emulator, are included for each program.

Posted at 19:50 Permalink

Wednesday, October 18, 2017

WatchFull Updated

WatchFull is a collection of programs, written in Perl, which assist Unix systems administrators in avoiding and responding to file system space exhaustion crises. WatchFull monitors file systems and reports when they fall below a specified percentage of free space. LogJam watches system and application log files (for example Web server access and error logs) and warns when they exceed a threshold size. Top40 scans a file system or directory tree and provides a list of the largest files within it.

I have just posted the first update to WatchFull since its initial release in 2000. Version 1.1 updates the source code to current Perl syntax, corrects several warning messages, and now runs in “use strict;” and “use warnings;” modes. The source code should be compatible with any recent version of Perl 5. The HTML documentation has been updated to XHTML 1.0 Strict, CSS3, and Unicode typography.

WatchFull Home Page

Posted at 19:55 Permalink

Monday, October 16, 2017

New: Marinchip Systems: Documents and Images

Marinchip M9900CPU S-100 board

I have just posted an archive of documents and images about Marinchip Systems, the company I founded and operated from 1977 through 1985. Marinchip delivered, starting in 1978, the first true 16-bit personal computer on the S-100 bus, with the goal of providing its users the same experience as connecting to a commercial timesharing service which cost many times more. While other personal computer companies were providing 8 Kb BASIC, we had a Unix-like operating system, Pascal, and eventually a multi-user system.

Marinchip (named after the Marinship shipyard not far from where I lived, which made Liberty ships during World War II) designed its own hardware and software, with the hardware based upon the Texas Instruments TMS9900 microprocessor and the software written by, well, me.

Texas Instruments (TI) in this era was a quintessential engineering company: “Hey, we've designed something cool. Let's go design something else, entirely different, which is even cooler!” There didn't seem to be anybody who said, “No, first you need to offer follow-on products which allow those who bet on your original product to evolve and grow as technology advances.” TI built a supercomputer, the TI-ASC, at the time one of the fastest in the world, but then they lost interest in it and sold only seven.

The Marinchip 9900 did somewhat better, although its performance and unit cost were more modest. Hughes Radar Systems designed our board into the F-16 radar tester and bought the boards in large quantities for this embedded application. The 9900 processor was one of the only 16-bit processors qualified for use in geosynchronous satellites, and we sold a number of systems to satellite manufacturers for software development because our systems cost a fraction of those sold by Texas Instruments. In 1985, after Autodesk took off and I had no more time for Marinchip, I sold the company, with all of its hardware, software, and manufacturing rights, to Hughes Electronics, which had, by then, been acquired by General Motors, so I said, “I sold my company to General Motors”.

What can you learn from this? Probably not a heck of a lot. Certainly, I learned little. I repeated most of my mistakes from Marinchip in Autodesk, and only learned later, from experience, that there are things which work at one scale which don't when the numbers are ten or a hundred times larger.

Still, if you haven't seen personal computing as it existed while Jimmy Carter was the U.S. president, take a glance. As far as I know, nothing we did at Marinchip contributed in any way to our current technology. Well, almost nothing. There was this curious drafting program one of our customers developed which was the inspiration for AutoCAD….

Marinchip Systems: Documents and Images

Posted at 19:15 Permalink