« Reading List: Always Another Dawn | Main | Reading List: Wrench and Claw »

Sunday, November 3, 2019

Reading List: Sunburst and Luminary

Eyles, Don. Sunburst and Luminary. Boston: Fort Point Press, 2018. ISBN 978-0-9863859-3-3.
In 1966, the author graduated from Boston University with a bachelor's degree in mathematics. He had no immediate job prospects or career plans. He thought he might be interested in computer programming due to a love of solving puzzles, but he had never programmed a computer. When asked, in one of numerous job interviews, how he would go about writing a program to alphabetise a list of names, he admitted he had no idea. One day, walking home from yet another interview, he passed an unimpressive brick building with a sign identifying it as the “MIT Instrumentation Laboratory”. He'd heard a little about the place and, on a lark, walked in and asked if they were hiring. The receptionist handed him a long application form, which he filled out, and was then immediately sent to interview with a personnel officer. Eyles was amazed when the personnel man seemed bent on persuading him to come to work at the Lab. After reference checking, he was offered a choice of two jobs: one in the “analysis group” (whatever that was), and another on the team developing computer software for landing the Apollo Lunar Module (LM) on the Moon. That sounded interesting, and the job had another benefit attractive to a 21 year old just graduating from university: it came with deferment from the military draft, which was going into high gear as U.S. involvement in Vietnam deepened.

Near the start of the Apollo project, MIT's Instrumentation Laboratory, led by the legendary “Doc” Charles Stark Draper, won a sole source contract to design and program the guidance system for the Apollo spacecraft, which came to be known as the “Apollo Primary Guidance, Navigation, and Control System” (PGNCS, pronounced “pings”). Draper and his laboratory had pioneered inertial guidance systems for aircraft, guided missiles, and submarines, and had in-depth expertise in all aspects of the challenging problem of enabling the Apollo spacecraft to navigate from the Earth to the Moon, land on the Moon, and return to the Earth without any assistance from ground-based assets. In a normal mission, it was expected that ground-based tracking and computers would assist those on board the spacecraft, but in the interest of reliability and redundancy it was required that completely autonomous navigation would permit accomplishing the mission.

The Instrumentation Laboratory developed an integrated system composed of an inertial measurement unit consisting of gyroscopes and accelerometers that provided a stable reference from which the spacecraft's orientation and velocity could be determined, an optical telescope which allowed aligning the inertial platform by taking sightings on fixed stars, and an Apollo Guidance Computer (AGC), a general purpose digital computer which interfaced to the guidance system, thrusters and engines on the spacecraft, the astronauts' flight controls, and mission control, and was able to perform the complex calculations for en route maneuvers and the unforgiving lunar landing process in real time.

Every Apollo lunar landing mission carried two AGCs: one in the Command Module and another in the Lunar Module. The computer hardware, basic operating system, and navigation support software were identical, but the mission software was customised due to the different hardware and flight profiles of the Command and Lunar Modules. (The commonality of the two computers proved essential in getting the crew of Apollo 13 safely back to Earth after an explosion in the Service Module cut power to the Command Module and disabled its computer. The Lunar Module's AGC was able to perform the critical navigation and guidance operations to put the spacecraft back on course for an Earth landing.)

By the time Don Eyles was hired in 1966, the hardware design of the AGC was largely complete (although a revision, called Block II, was underway which would increase memory capacity and add some instructions which had been found desirable during the initial software development process), the low-level operating system and support libraries (implementing such functionality as fixed point arithmetic, vector, and matrix computations), and a substantial part of the software for the Command Module had been written. But the software for actually landing on the Moon, which would run in the Lunar Module's AGC, was largely just a concept in the minds of its designers. Turning this into hard code would be the job of Don Eyles, who had never written a line of code in his life, and his colleagues. They seemed undaunted by the challenge: after all, nobody knew how to land on the Moon, so whoever attempted the task would have to make it up as they went along, and they had access, in the Instrumentation Laboratory, to the world's most experienced team in the area of inertial guidance.

Today's programmers may be amazed it was possible to get anything at all done on a machine with the capabilities of the Apollo Guidance Computer, no less fly to the Moon and land there. The AGC had a total of 36,864 15-bit words of read-only core rope memory, in which every bit was hand-woven to the specifications of the programmers. As read-only memory, the contents were completely fixed: if a change was required, the memory module in question (which was “potted” in a plastic compound) had to be discarded and a new one woven from scratch. There was no way to make “software patches”. Read-write storage was limited to 2048 15-bit words of magnetic core memory. The read-write memory was non-volatile: its contents were preserved across power loss and restoration. (Each memory word was actually 16 bits in length, but one bit was used for parity checking to detect errors and not accessible to the programmer.) Memory cycle time was 11.72 microseconds. There was no external bulk storage of any kind (disc, tape, etc.): everything had to be done with the read-only and read-write memory built into the computer.

The AGC software was an example of “real-time programming”, a discipline with which few contemporary programmers are acquainted. As opposed to an “app” which interacts with a user and whose only constraint on how long it takes to respond to requests is the user's patience, a real-time program has to meet inflexible constraints in the real world set by the laws of physics, with failure often resulting in disaster just as surely as hardware malfunctions. For example, when the Lunar Module is descending toward the lunar surface, burning its descent engine to brake toward a smooth touchdown, the LM is perched atop the thrust vector of the engine just like a pencil balanced on the tip of your finger: it is inherently unstable, and only constant corrections will keep it from tumbling over and crashing into the surface, which would be bad. To prevent this, the Lunar Module's AGC runs a piece of software called the digital autopilot (DAP) which, every tenth of a second, issues commands to steer the descent engine's nozzle to keep the Lunar Module pointed flamy side down and adjusts the thrust to maintain the desired descent velocity (the thrust must be constantly adjusted because as propellant is burned, the mass of the LM decreases, and less thrust is needed to maintain the same rate of descent). The AGC/DAP absolutely must compute these steering and throttle commands and send them to the engine every tenth of a second. If it doesn't, the Lunar Module will crash. That's what real-time computing is all about: the computer has to deliver those results in real time, as the clock ticks, and if it doesn't (for example, it decides to give up and flash a Blue Screen of Death instead), then the consequences are not an irritated or enraged user, but actual death in the real world. Similarly, every two seconds the computer must read the spacecraft's position from the inertial measurement unit. If it fails to do so, it will hopelessly lose track of which way it's pointed and how fast it is going. Real-time programmers live under these demanding constraints and, especially given the limitations of a computer such as the AGC, must deploy all of their cleverness to meet them without fail, whatever happens, including transient power failures, flaky readings from instruments, user errors, and completely unanticipated “unknown unknowns”.

The software which ran in the Lunar Module AGCs for Apollo lunar landing missions was called LUMINARY, and in its final form (version 210) used on Apollo 15, 16, and 17, consisted of around 36,000 lines of code (a mix of assembly language and interpretive code which implemented high-level operations), of which Don Eyles wrote in excess of 2,200 lines, responsible for the lunar landing from the start of braking from lunar orbit through touchdown on the Moon. This was by far the most dynamic phase of an Apollo mission, and the most demanding on the limited resources of the AGC, which was pushed to around 90% of its capacity during the final landing phase where the astronauts were selecting the landing spot and guiding the Lunar Module toward a touchdown. The margin was razor-thin, and that's assuming everything went as planned. But this was not always the case.

It was when the unexpected happened that the genius of the AGC software and its ability to make the most of the severely limited resources at its disposal became apparent. As Apollo 11 approached the lunar surface, a series of five program alarms: codes 1201 and 1202, interrupted the display of altitude and vertical velocity being monitored by Buzz Aldrin and read off to guide Neil Armstrong in flying to the landing spot. These codes both indicated out-of-memory conditions in the AGC's scarce read-write memory. The 1201 alarm was issued when all five of the 44-word vector accumulator (VAC) areas were in use when another program requested to use one, and 1202 signalled exhaustion of the eight 12-word core sets required by each running job. The computer had a single processor and could execute only one task at a time, but its operating system allowed lower priority tasks to be interrupted in order to service higher priority ones, such as the time-critical autopilot function and reading the inertial platform every two seconds. Each suspended lower-priority job used up a core set and, if it employed the interpretive mathematics library, a VAC, so exhaustion of these resources usually meant the computer was trying to do too many things at once. Task priorities were assigned so the most critical functions would be completed on time, but computer overload signalled something seriously wrong—a condition in which it was impossible to guarantee all essential work was getting done.

In this case, the computer would throw up its hands, issue a program alarm, and restart. But this couldn't be a lengthy reboot like customers of personal computers with millions of times the AGC's capacity tolerate half a century later. The critical tasks in the AGC's software incorporated restart protection, in which they would frequently checkpoint their current state, permitting them to resume almost instantaneously after a restart. Programmers estimated around 4% of the AGC's program memory was devoted to restart protection, and some questioned its worth. On Apollo 11, it would save the landing mission.

Shortly after the Lunar Module's landing radar locked onto the lunar surface, Aldrin keyed in the code to monitor its readings and immediately received a 1202 alarm: no core sets to run a task; the AGC restarted. On the communications link Armstrong called out “It's a 1202.” and Aldrin confirmed “1202.”. This was followed by fifteen seconds of silence on the “air to ground” loop, after which Armstrong broke in with “Give us a reading on the 1202 Program alarm.” At this point, neither the astronauts nor the support team in Houston had any idea what a 1202 alarm was or what it might mean for the mission. But the nefarious simulation supervisors had cranked in such “impossible” alarms in earlier training sessions, and controllers had developed a rule that if an alarm was infrequent and the Lunar Module appeared to be flying normally, it was not a reason to abort the descent.

At the Instrumentation Laboratory in Cambridge, Massachusetts, Don Eyles and his colleagues knew precisely what a 1202 was and found it was deeply disturbing. The AGC software had been carefully designed to maintain a 10% safety margin under the worst case conditions of a lunar landing, and 1202 alarms had never occurred in any of their thousands of simulator runs using the same AGC hardware, software, and sensors as Apollo 11's Lunar Module. Don Eyles' analysis, in real time, just after a second 1202 alarm occurred thirty seconds later, was:

Again our computations have been flushed and the LM is still flying. In Cambridge someone says, “Something is stealing time.” … Some dreadful thing is active in our computer and we do not know what it is or what it will do next. Unlike Garman [AGC support engineer for Mission Control] in Houston I know too much. If it were in my hands, I would call an abort.

As the Lunar Module passed 3000 feet, another alarm, this time a 1201—VAC areas exhausted—flashed. This is another indication of overload, but of a different kind. Mission control immediately calls up “We're go. Same type. We're go.” Well, it wasn't the same type, but they decided to press on. Descending through 2000 feet, the DSKY (computer display and keyboard) goes blank and stays blank for ten agonising seconds. Seventeen seconds later another 1202 alarm, and a blank display for two seconds—Armstrong's heart rate reaches 150. A total of five program alarms and resets had occurred in the final minutes of landing. But why? And could the computer be trusted to fly the return from the Moon's surface to rendezvous with the Command Module?

While the Lunar Module was still on the lunar surface Instrumentation Laboratory engineer George Silver figured out what happened. During the landing, the Lunar Module's rendezvous radar (used only during return to the Command Module) was powered on and set to a position where its reference timing signal came from an internal clock rather than the AGC's master timing reference. If these clocks were in a worst case out of phase condition, the rendezvous radar would flood the AGC with what we used to call “nonsense interrupts” back in the day, at a rate of 800 per second, each consuming one 11.72 microsecond memory cycle. This imposed an additional load of more than 13% on the AGC, which pushed it over the edge and caused tasks deemed non-critical (such as updating the DSKY) not to be completed on time, resulting in the program alarms and restarts. The fix was simple: don't enable the rendezvous radar until you need it, and when you do, put the switch in the position that synchronises it with the AGC's clock. But the AGC had proved its excellence as a real-time system: in the face of unexpected and unknown external perturbations it had completed the mission flawlessly, while alerting its developers to a problem which required their attention.

The creativity of the AGC software developers and the merit of computer systems sufficiently simple that the small number of people who designed them completely understood every aspect of their operation was demonstrated on Apollo 14. As the Lunar Module was checked out prior to the landing, the astronauts in the spacecraft and Mission Control saw the abort signal come on, which was supposed to indicate the big Abort button on the control panel had been pushed. This button, if pressed during descent to the lunar surface, immediately aborted the landing attempt and initiated a return to lunar orbit. This was a “one and done” operation: no Microsoft-style “Do you really mean it?” tea ceremony before ending the mission. Tapping the switch made the signal come and go, and it was concluded the most likely cause was a piece of metal contamination floating around inside the switch and occasionally shorting the contacts. The abort signal caused no problems during lunar orbit, but if it should happen during descent, perhaps jostled by vibration from the descent engine, it would be disastrous: wrecking a mission costing hundreds of millions of dollars and, coming on the heels of Apollo 13's mission failure and narrow escape from disaster, possibly bring an end to the Apollo lunar landing programme.

The Lunar Module AGC team, with Don Eyles as the lead, was faced with an immediate challenge: was there a way to patch the software to ignore the abort switch, protecting the landing, while still allowing an abort to be commanded, if necessary, from the computer keyboard (DSKY)? The answer to this was obvious and immediately apparent: no. The landing software, like all AGC programs, ran from read-only rope memory which had been woven on the ground months before the mission and could not be changed in flight. But perhaps there was another way. Eyles and his colleagues dug into the program listing, traced the path through the logic, and cobbled together a procedure, then tested it in the simulator at the Instrumentation Laboratory. While the AGC's programming was fixed, the AGC operating system provided low-level commands which allowed the crew to examine and change bits in locations in the read-write memory. Eyles discovered that by setting the bit which indicated that an abort was already in progress, the abort switch would be ignored at the critical moments during the descent. As with all software hacks, this had other consequences requiring their own work-arounds, but by the time Apollo 14's Lunar Module emerged from behind the Moon on course for its landing, a complete procedure had been developed which was radioed up from Houston and worked perfectly, resulting in a flawless landing.

These and many other stories of the development and flight experience of the AGC lunar landing software are related here by the person who wrote most of it and supported every lunar landing mission as it happened. Where technical detail is required to understand what is happening, no punches are pulled, even to the level of bit-twiddling and hideously clever programming tricks such as using an overflow condition to skip over an EXTEND instruction, converting the following instruction from double precision to single precision, all in order to save around forty words of precious non-bank-switched memory. In addition, this is a personal story, set in the context of the turbulent 1960s and early ’70s, of the author and other young people accomplishing things no humans had ever before attempted.

It was a time when everybody was making it up as they went along, learning from experience, and improvising on the fly; a time when a person who had never written a line of computer code would write, as his first program, the code that would land men on the Moon, and when the creativity and hard work of individuals made all the difference. Already, by the end of the Apollo project, the curtain was ringing down on this era. Even though a number of improvements had been developed for the LM AGC software which improved precision landing capability, reduced the workload on the astronauts, and increased robustness, none of these were incorporated in the software for the final three Apollo missions, LUMINARY 210, which was deemed “good enough” and the benefit of the changes not worth the risk and effort to test and incorporate them. Programmers seeking this kind of adventure today will not find it at NASA or its contractors, but instead in the innovative “New Space” and smallsat industries.

Posted at November 3, 2019 13:32