
Friday, April 8, 2005

Linux: Fedora X11 6.8.2-1.FC3.13 on Multiprocessor Systems

As I'm typing this, Swiss television is airing the James Bond film "A View to a Kill"--how appropriate! I recently installed the Fedora Core 3 X11 version 6.8.2-1.FC3.13 packages on the Dell PowerEdge 1850 servers which run this site. The updated X11 server and libraries are not used until you log out and back in, or reboot and log in from the console for the first time after the reboot. What happens then is not pretty: in the process of painting the desktop the mouse freezes, and a few seconds later the Orange Light of Death appears on the front panel. Querying the Remote Access Controller shows "Status processor sensor IERR" for both PROC_1 and PROC_2. No form of clean shutdown will work, and forcing a reboot from the RAC may result in the dismaying consequence of /etc/fstab being deleted in the fsck, forcing recovery from the Rescue CD. (Fortunately, the other server in the farm, which continued to run because I hadn't logged out and back in after installing the fatal update on it, has an identical /etc/fstab, so I was able to start the network interface and copy it over to the damaged machine with sftp. From now on I'll keep a current copy of /etc/fstab in /etc/fstab.backup for future disasters of this sort.)
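Keeping that backup copy current is easy to automate. The following is only a sketch: the function name backup_if_changed is my own invention for illustration, and running it from cron or by hand after editing /etc/fstab is an assumption, not something described above.

```shell
#!/bin/sh
# Copy a file to its backup location only when the two differ
# (or when the backup does not exist yet).
# backup_if_changed is a hypothetical helper name, not a standard tool.
backup_if_changed() {
    src=$1
    dst=$2
    # cmp -s exits non-zero if the files differ or the backup is missing.
    if ! cmp -s "$src" "$dst" 2>/dev/null; then
        # -p preserves the mode and timestamps of the original.
        cp -p "$src" "$dst"
    fi
}

# Example use, as root:
#   backup_if_changed /etc/fstab /etc/fstab.backup
```

Since the backup lives on the same filesystem as the original, it guards against fsck eating one file, not against losing the whole disk.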

Apparently, there is a problem in the Fedora Core 3 X11 6.8.2-1.FC3.13 update which causes it to crash multiprocessor machines. The Fourmilab servers have dual Intel Xeon processors, each of which is "Hyper-Threaded" and hence behaves as a dual-processor system itself. Reports of this problem so far on the Fedora Forum differ on whether it affects Hyper-Threaded machines with a single physical CPU. To see if you have this version installed on your system, use the command "rpm -aq | grep xorg-x11". If you see lines in the output like "xorg-x11-6.8.2-1.FC3.13", then this version is installed on your machine.
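If you have several machines to check, the test can be wrapped in a small filter. This is a sketch, not a polished tool: the function name has_bad_xorg is made up for illustration, and it assumes package names arrive one per line in the name-version-release form shown above.

```shell
#!/bin/sh
# Version of the xorg-x11 packages that triggers the crash.
BAD_VERSION="6.8.2-1.FC3.13"

# Read package names on standard input and print any xorg-x11 package
# at the affected version. The exit status is 0 if at least one was found.
# has_bad_xorg is a hypothetical helper name.
has_bad_xorg() {
    # -F matches the version as a fixed string, so the dots are literal.
    grep '^xorg-x11' | grep -F -- "-$BAD_VERSION"
}

# Example use on a live system (requires the rpm command):
#   rpm -aq | has_bad_xorg && echo "Affected X11 version is installed"
```

Reading from standard input rather than calling rpm directly means you can also run the filter over a saved package list from a machine you would rather not touch.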

This problem can be particularly insidious on multiprocessor servers which are normally run in "headless" mode--with no keyboard, mouse, or monitor. Since the X server doesn't start until you log in from the console, the deadly X version can lie dormant until some other problem causes you to roll over the "crash cart" and hook up the console, whereupon your first attempt to log in will immediately hang the server, giving you another, entirely unrelated, inscrutability to unscrew, with a missing /etc/fstab "to boot" if your luck is like mine.

If you have a multiprocessor machine and have this version installed, either your configuration dodges the bullet or you're lucky enough not to have logged in since you installed it. To back out the new version and revert to the previous 6.8.1-12.FC3.21 release, first make a note of all the 6.8.2-1.FC3.13 packages reported by the rpm command above. Go to the Fedora Core 3 Updates archive and download the .rpm files for the 6.8.1-12.FC3.21 version of each package. Once you've downloaded the packages, you can revert your system by running the command "rpm --oldpackage -Uvh xorg-x11-*6.8.1-12.FC3.21.i386.rpm" as super user. Before actually installing the packages, you can check whether you're missing anything or have any conflicts by running the previous command with the "--test" option.
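The back-out steps above can be sketched as a pair of shell functions. The function names old_rpm_names and revert_xorg are hypothetical, and the .i386.rpm suffix is assumed from the command quoted above; adjust it if your packages are for a different architecture. The first function turns the installed package list into the names of the older files to download, and the second runs the "--test" pass before doing the real downgrade.

```shell
#!/bin/sh
BAD_VERSION="6.8.2-1.FC3.13"     # version to back out
GOOD_VERSION="6.8.1-12.FC3.21"   # version to revert to

# Read installed package names on standard input and print the file name
# of the older .rpm needed for each affected xorg-x11 package.
old_rpm_names() {
    while IFS= read -r pkg; do
        case "$pkg" in
            xorg-x11*-"$BAD_VERSION")
                # Strip the bad version suffix, keeping the package name.
                base=${pkg%-"$BAD_VERSION"}
                printf '%s-%s.i386.rpm\n' "$base" "$GOOD_VERSION"
                ;;
        esac
    done
}

# Run as root in the directory holding the downloaded files.
# The real install runs only if the dry run reports no problems.
revert_xorg() {
    rpm --oldpackage -Uvh --test xorg-x11-*"$GOOD_VERSION".i386.rpm &&
        rpm --oldpackage -Uvh xorg-x11-*"$GOOD_VERSION".i386.rpm
}

# Example use:
#   rpm -aq | old_rpm_names    # list of files to fetch from the archive
#   revert_xorg                # after downloading them
```

Comparing the output of old_rpm_names against what you actually downloaded is a quick way to catch a forgotten package before rpm complains about it.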

This misadventure illustrates why one should always install updates, however apparently innocuous, on a "pathfinder" system first, and then reboot it to see if anything wicked its way comes. Let the pathfinder run in production for a few days before deploying the update to the rest of your servers.

Posted at April 8, 2005 22:37