
Monday, December 31, 2007

Fourmilab Server Farm: One Year Uptime

On the last day of 2007, the Fourmilab server farm reached a milestone: every machine which provides public services (Web, FTP, HotBits, etc.) has run for one year or more without a reboot or other system-wide service outage. The dual redundant power supplies of the Dell PowerEdge 1850 principal servers made it possible to swap out an Uninterruptible Power Supply (UPS) which had failed to live up to its name, without shutting down the servers to which it provided partial power.

Host name  Function                    Uptime as of 2007-12-31
server1    Active public server        365 days, 10:31 hours
server0    Backup public server        722 days, 17:03 hours
server3    Test/administration server  378 days, 12:57 hours
hotbits0   HotBits generator 0         463 days, 10:12 hours
hotbits1   HotBits generator 1         428 days, 22:42 hours
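
For anyone who wants to check the milestone arithmetic, here is a minimal sketch that parses the uptime figures from the table and confirms that every machine has passed the one-year mark. The function names are illustrative only, and "one year" is assumed to mean 365 days.

```python
import re
from datetime import timedelta

# Uptimes copied from the table above (as of 2007-12-31).
UPTIMES = {
    "server1": "365 days, 10:31 hours",
    "server0": "722 days, 17:03 hours",
    "server3": "378 days, 12:57 hours",
    "hotbits0": "463 days, 10:12 hours",
    "hotbits1": "428 days, 22:42 hours",
}

# Assumption: the milestone "one year" is taken as 365 days.
ONE_YEAR = timedelta(days=365)


def parse_uptime(text):
    """Convert a 'D days, HH:MM hours' string into a timedelta."""
    match = re.fullmatch(r"(\d+) days, (\d+):(\d+) hours", text)
    if match is None:
        raise ValueError(f"unrecognised uptime: {text!r}")
    days, hours, minutes = (int(group) for group in match.groups())
    return timedelta(days=days, hours=hours, minutes=minutes)


def all_up_one_year(uptimes):
    """True if every host has been up for at least ONE_YEAR."""
    return all(parse_uptime(u) >= ONE_YEAR for u in uptimes.values())


def shortest_uptime_host(uptimes):
    """Name of the host that was rebooted most recently."""
    return min(uptimes, key=lambda host: parse_uptime(uptimes[host]))


if __name__ == "__main__":
    print("All hosts up one year or more:", all_up_one_year(UPTIMES))
    print("Most recently rebooted host:", shortest_uptime_host(UPTIMES))
```

The milestone is set by whichever machine was rebooted most recently, which is why the check looks at the minimum uptime and not the average.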

Some people berate sites which rack up lengthy uptime records, claiming that this indicates neglect of preventive software maintenance, in particular keeping systems up to “current patch level”. Now, this is largely an instance of intellectual corruption due to Microsoft, where updating a music player requires rebooting a running system, but some Linux users also assume that frequent kernel updates, and reboots to install them, are essential for a secure system. Fourmilab's philosophy is different: on server farm machines, essentially the only component from the Linux software distribution used in the critical path is the kernel. Everything else (the Web, FTP, mail, DNS, and other servers) is built from source which resides in the server's private “/server” partition and, in keeping with the Unix tradition, any of these components can be updated as required simply by restarting it; no system reboot is required.

When a security or other update to one of the public server packages is released, I build it from source and test it on server3, the “Test/administration server”, which is actually a six-year-old laptop with a software configuration identical to that of the production servers. After testing, the update is deployed on the active and backup production servers with rdist, then put into production by restarting the server process on these machines; the interruption to public requests due to such a restart is on the order of one second. I generally install server updates on the active server first and leave the previous version on the backup server until I'm confident the new release is working well. That way, should the update crash or otherwise become unresponsive under the real-world load, the load balancer will automatically fail over to the previous version running on the backup server.
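
The staged rollout described above can be modelled in a few lines. This is only a sketch of the idea, not the actual load balancer logic: the `Server` class, the `route` function, and the version labels are hypothetical, and a real load balancer would determine responsiveness by health checks instead of a flag.

```python
from dataclasses import dataclass


@dataclass
class Server:
    name: str
    version: str             # hypothetical label for the installed release
    responsive: bool = True  # stand-in for a real health check


def route(active, backup):
    """Pick the server to handle a request, preferring the active one."""
    if active.responsive:
        return active
    if backup.responsive:
        return backup
    raise RuntimeError("no responsive server available")


if __name__ == "__main__":
    # The update goes on the active server first; the backup keeps the old release.
    active = Server("server1", version="new")
    backup = Server("server0", version="previous")
    print(route(active, backup).version)  # normal operation

    # If the new release hangs under load, requests fall back to the old one.
    active.responsive = False
    print(route(active, backup).version)
```

The point of leaving the backup on the previous version is visible in the second call: a failure of the new release sends traffic to software already known to work.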

The Fourmilab firewall is configured to only allow packets from the Internet to reach server farm machines on the ports on which these locally-built server processes listen; all other incoming traffic is discarded, so potentially vulnerable components from the Linux distribution, even if they were listening on some port, cannot be accessed from off-site by would-be attackers.
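
The firewall policy amounts to a default-deny rule with a short list of permitted ports. Here is a minimal sketch of that decision; the port numbers are assumptions chosen for illustration (the conventional ports for FTP, mail, DNS, and Web), since the actual Fourmilab port list is not given here, and `admit` is a hypothetical name.

```python
# Assumed example ports: FTP (21), SMTP (25), DNS (53), HTTP (80).
# The real list of ports open at Fourmilab is not stated in this post.
ALLOWED_PORTS = frozenset({21, 25, 53, 80})


def admit(dst_port, allowed=ALLOWED_PORTS):
    """Default deny: admit an inbound packet only if its destination port is listed."""
    return dst_port in allowed


if __name__ == "__main__":
    for port in (80, 21, 111, 631):
        verdict = "accept" if admit(port) else "discard"
        print(f"port {port}: {verdict}")
```

Under this policy a distribution-supplied service listening on, say, port 111 is never reached from outside, whether or not it has been patched.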

Posted at December 31, 2007 16:45