
Tuesday, December 12, 2006

Applying Check Point FireWall-1 Hotfixes on a Nokia IP265 Network Appliance

(Note: The information in this item is so specialised it is probable that not a single regular reader of this chronicle will find it of interest. Why post it then? Because every time I publish such an item I receive feedback from people who found it with a search engine who write to thank me for pointing them to the solution of the obscure problem in question. That's why I try to give such items long titles with keywords to direct those searching for such information to them.)

Last year at about this time Fourmilab's firewall was upgraded to dual redundant diskless Nokia IP265 network appliances which run Check Point FireWall-1 software. The two Nokia machines are configured as an active/backup high availability cluster using the Virtual Router Redundancy Protocol (VRRP) so that if the active firewall fails, the backup, which constantly monitors its status and mirrors connection information, can take over without even dropping active TCP connections.

All of this worked pretty much as expected, but unfortunately I soon discovered a horrific bungle in the VRRP fail-over implementation. When the active firewall went down, the backup took over, then relinquished control back to the primary unit once it came back up: all well and good. But if you rebooted the backup, the active firewall would cease to forward traffic until the backup returned to service! This meant your entire “high availability” cluster and access to all of the machines behind it were vulnerable to the failure of the backup firewall. What a mess!

After researching this problem for some time, I discovered that Check Point had issued a “hot fix” to correct a problem in which the reboot of a VRRP backup machine would send a bogus gratuitous ARP packet which “could block cluster connectivity”; that certainly sounded like the problem I was having. Check Point periodically releases what they call “Hotfix Accumulators” which are like Sun's omnibus “rollup patches” for Solaris: a large collection of independent patches said to be mutually compatible which, together, constitute a minor release of the software to which they pertain. I downloaded the current such package, which was released in June of 2006 (although its documentation was most recently revised in mid-September 2006) and proceeded to try to install it first on the backup firewall, allowing the primary to remain in production so as not to disrupt access to the site.

Because the Nokia IP265 is a flash-based diskless machine, its file system structure is rather curious. It runs an operating system called IPSO, which is based on FreeBSD, and the output of the “df -k” command is as follows:

    xl5[admin]# df -k
    Filesystem  1K-blocks     Used    Avail Capacity  Mounted on
    /dev/wd0f      127151    40525    76454    35%    /
    v9fs           120224    41752    78472    35%    /image/IPSO-3.8.1-BUILD028-12.02.2004-222502-1518/rfs
    /dev/wd0a       31775       33    29200     0%    /config
    /dev/wd0h      317903   154047   138424    53%    /preserve
    procfs              4        4        0   100%    /proc
    v9fs            88712    10240    78472    12%    /var
    mfs:83           7607        0     6998     0%    /var/tmp2/upgrade
    v9fs           174448    95976    78472    55%    /opt

The file systems on the various partitions of “/dev/wd0” are stored in the non-volatile flash memory. At boot time, they are decompressed and copied into the RAM file systems from which the system runs; changes to the “v9fs” file systems are lost at the next reboot. The Hotfix Accumulator I wished to install was 22.8 MB GZIP compressed. I decided to copy it to the volatile “/var” file system, with about 80 MB of free space, for installation. I copied it, decompressed and extracted the archive, and still had what I thought was plenty of free space on /var.

How wrong I was. The process of installing one of these updates uses a huge amount of intermediate storage, and the installation script does not bother to check whether there's enough space to complete the task before commencing it. Worse, when it does exhaust the free space on the file system, it just keeps blundering on, truncating files and destroying information, and then, as a final boot in the system administrator's face, reports that everything has completed successfully.
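Since the installer does not check for room itself, a check of this kind can be done by hand first. What follows is only a sketch, assuming a POSIX shell and awk are available on the machine and that “df -k” prints the column layout shown above. The function names and the safety factor are illustrative, not anything supplied by Check Point or Nokia. The figures later in this item (with /preserve rising from 37% to a peak of 87% during the update) suggest the peak demand was very roughly six times the size of the compressed archive, but that is one observation, not a rule.

```shell
#!/bin/sh
# Hypothetical helpers, not part of the Check Point installer.

# avail_kb MOUNTPOINT
# Reads "df -k" output on standard input and prints the Avail column
# (in 1K blocks) for the file system mounted on MOUNTPOINT.
avail_kb() {
    awk -v mp="$1" 'NR > 1 && $NF == mp { print $(NF - 2); exit }'
}

# has_room AVAIL_KB ARCHIVE_KB FACTOR
# Succeeds only if AVAIL_KB is at least FACTOR times ARCHIVE_KB.
# FACTOR is an assumed safety margin; choose it conservatively.
has_room() {
    [ "$1" -ge $(( $2 * $3 )) ]
}

# Example use (23347 KB is about 22.8 MB; the factor 6 is an assumption):
#   free=$(df -k | avail_kb /preserve)
#   has_room "$free" 23347 6 || echo "not enough space, do not install" >&2
```

With the listing above, /var has 78472 KB available, which falls well short of six times a 23347 KB archive, so this check would have refused the first attempt.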

When you reboot the firewall after this process, you begin to appreciate the extent of the damage. Essentially nothing works; your installation consists of about an equal mix of old, new, and truncated files, and since the backup files were lost in the disc full incident, you cannot even reverse the process to restore the status quo ante. After surveying the wreckage, I decided the most expeditious course would be to restore the entire contents of the /preserve/opt/packages/installed directory from the most recent backup. (If you haven't previously installed any patches, you can restore it from the “IPSO Wrapper” for the software version you're using.) After restoring the contents of this directory, I was able to reboot the backup firewall and have it resume its backup rôle running the old version.

For the next attempt, I decided to place the update files on the /preserve file system, which is the largest on the machine. Copying them there filled this file system to the 50% level from a starting point of 37%, but that still left far more free space than on /var. My first try at installing the patch from there failed, with the installer claiming that it was already installed. The earlier failed installation had corrupted the registry (shudder) and so I had to edit /preserve/var/opt/CPshared-R55p/registry/HKLM_registry.data and remove the two instances of:

    : (HotFixes
        :HOTFIX_HFA_R55P_08 (1)
    )
left there by the failed installation before the update would apply successfully. During the installation I kept an eye on free space, and at the high-water mark /preserve reached the white-knuckle level of 87% of capacity. But after the installation was complete, it dropped back to just a few percent above where it started.
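The registry edit described above can also be done with a small filter rather than by hand. This is a sketch only: it assumes the stanza always appears as exactly the three lines shown, with arbitrary leading whitespace, and the function name is made up for illustration. I know nothing about the rest of the registry file's format beyond what is quoted here, so work on a copy and compare the result with diff before putting it in place.

```shell
#!/bin/sh
# Hypothetical helper, not a Check Point tool.

# strip_hotfix_stanza NAME
# Copies standard input to standard output, dropping every three-line
# stanza of the form
#     : (HotFixes
#         :NAME (1)
#     )
# Stanzas that list any other entry are passed through unchanged.
strip_hotfix_stanza() {
    awk -v name="$1" '
        # Print any lines held back while a possible match was pending.
        function flush(   i) { for (i = 1; i <= n; i++) print buf[i]; n = 0 }
        {
            line = $0
            gsub(/^[ \t]+|[ \t]+$/, "", line)      # compare ignoring indentation
            if (n == 1 && line == ":" name " (1)") { buf[++n] = $0; next }
            if (n == 2 && line == ")")             { n = 0; next }  # full match: drop it
            flush()                                 # no match: release held lines
            if (line == ": (HotFixes")             { buf[++n] = $0; next }
            print
        }
        END { flush() }
    '
}

# Example use, always on a copy:
#   cp HKLM_registry.data HKLM_registry.data.orig
#   strip_hotfix_stanza HOTFIX_HFA_R55P_08 < HKLM_registry.data.orig > HKLM_registry.data.new
#   diff HKLM_registry.data.orig HKLM_registry.data.new
```

The filter holds back lines only while a match is still possible, so anything that does not complete the exact three-line pattern is written out untouched.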

After the installation was complete, I rebooted the backup firewall, halted the primary, and allowed the new version of the software on the backup to enter production. After a day with no problems, I repeated the process to install the update on the primary, restoring the site to a fully redundant configuration. After all of this I was almost afraid to try the obvious test of rebooting the backup to see if the problem which launched me on this adventure had, in fact, been fixed; I'm not sure I could have maintained my composure if all of this had been for nought. But, after all, system administrators are known more for their ill-tempered meat cleavers than even-tempered composure, so I went ahead and rebooted the backup and, lo and behold, the problem had indeed been fixed (cue choirs of angels singing hallelujahs).

The lesson to take away from this is that when you're installing Check Point Hotfix Accumulators on a Nokia IP265, always place the update directory on the /preserve file system, not one of the others with less capacity. And, of course, be sure you have a complete, current backup of the entire machine (not just the configuration files backed up by Nokia Voyager, which do not include the critical package files modified by the Hotfix installation) before attempting the installation.
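As one piece of such a backup, the package directory that had to be restored in this episode can be archived before touching anything. The following is a sketch under assumptions: that tar and gzip are present on the appliance, that the archive is then copied off the machine, and that the function name and destination path are placeholders of my own choosing. It is not a substitute for a complete backup of the machine.

```shell
#!/bin/sh
# Hypothetical helper; paths in the example are placeholders.

# backup_dir SRC DEST
# Archives directory SRC (stored under its base name) into the gzip
# compressed tar file DEST, then reads the archive back to confirm that
# it is at least listable.
backup_dir() {
    src=$1
    dest=$2
    [ -d "$src" ] || { echo "no such directory: $src" >&2; return 1; }
    tar -cf - -C "$(dirname "$src")" "$(basename "$src")" | gzip -c > "$dest" || return 1
    # The pipeline above reports only the status of gzip, so verify separately.
    gzip -dc "$dest" | tar -tf - > /dev/null
}

# Example use (the destination is an assumed location with enough room):
#   backup_dir /preserve/opt/packages/installed /preserve/installed-backup.tgz
# then copy the archive to another machine before running the installer.
```

Keeping the archive only on the firewall itself would repeat the original mistake, since a full file system is exactly what destroyed the installer's own backup files.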

Posted at December 12, 2006 21:03