Monday, February 21, 2005

It . . . is . . . alive!

On Friday, February 18th 2005 around 15:00 UTC, the prototype of the new www.fourmilab.ch server farm was put into pre-production test. The prototype is more of a server jardin potager than a farm, since it consists of only one main server with an identically configured laptop impersonating the second production server until its hardware arrives. The main server (the deep box with the black front and silver top at the bottom of the stack on the bitmobile at the right of the desk) is a Dell PowerEdge 1850 with dual Intel Xeon 3.6 GHz processors. These are "hyper-threading" CPUs, so there are logically four processors as seen by the operating system. Main memory is 8 GB of ECC DDR2 RAM, which permits keeping the Earth and Moon Viewer image databases in memory (essential to avoid page thrashing) even if one memory bank fails power on self-test and is excluded from the configuration. Two 146 GB 10,000 RPM SCSI drives are configured as a RAID1 mirror with hardware RAID support on the motherboard. The server can run "headless", but during the development phase I've attached a flat panel monitor, keyboard, and mouse to connectors on the back panel, which are replicated on the front for use with mobile KVM carts for data centre administration. The server has dual redundant power supplies, each with its own power cord, permitting them to be plugged into separate UPSes; one suffices to run the server, so you can pull one of the plugs at any time without crashing the machine. Both power supplies and hard drives can be "hot swapped" without powering down; when a new hard drive is detected, the RAID firmware will automatically reconstruct it from the mirror on the other drive. The servers are run under the Fedora Core 3 Linux distribution with current kernel (2.6.10-1.760_FC3smp) and utilities releases. The generic, binary distribution SMP (symmetric multi-processor) kernel is used.

Atop the server, elegantly resting on two pieces of salvaged packaging foam, is the Coyote Point Equalizer E350 which front-ends the server farm. When you connect to the IP address of www.fourmilab.ch, the load balancer receives the packet and forwards it to the available server most likely to provide the best response time based on an algorithm which includes the server's own estimate of its load from a Fourmilab custom program. The load balancer is configured to maintain session persistence where required, so if a user generates dynamic content in response to a request (for example, an image from Earth and Moon Viewer), that image will be retrieved from the cache on the same server which generated it. This is the most challenging part of implementing a server farm, and accounted for about 75% of the work to date in migrating the site from a single four processor SPARC/Solaris server to the server farm.
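For readers who want to see the idea in miniature, here is a rough Python sketch of load-aware dispatch with session persistence. It is not the Equalizer's actual algorithm, which isn't described here; the `Server` and `Balancer` classes, the `reported_load` field (standing in for the server's own load estimate), and the client identifiers are all hypothetical names made up for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Server:
    name: str
    reported_load: float      # hypothetical: the server's own load estimate (lower is better)
    available: bool = True


@dataclass
class Balancer:
    servers: list[Server]
    sessions: dict[str, str] = field(default_factory=dict)  # client id -> server name

    def pick(self, client_id: str, sticky: bool = False) -> Server:
        """Return the server that should handle this client's request."""
        up = [s for s in self.servers if s.available]
        if not up:
            raise RuntimeError("no servers available")
        # Session persistence: reuse the server that handled this client
        # before, provided it is still available.
        if sticky and client_id in self.sessions:
            pinned = next((s for s in up if s.name == self.sessions[client_id]), None)
            if pinned is not None:
                return pinned
        # Otherwise choose the available server reporting the lowest load.
        chosen = min(up, key=lambda s: s.reported_load)
        if sticky:
            self.sessions[client_id] = chosen.name
        return chosen


farm = Balancer([Server("server0", 0.7), Server("server1", 0.2)])
first = farm.pick("client-a", sticky=True)   # least-loaded server is chosen
farm.servers[1].reported_load = 0.9          # load shifts afterwards
again = farm.pick("client-a", sticky=True)   # still pinned to the same server
other = farm.pick("client-b")                # a new client goes to the now-lighter server
```

The point of the pinning step is that a request for a cached image goes back to the server which generated it, even when another server has since become less busy. A real load balancer would also expire old sessions and weigh more than a single load figure.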

Yes, the present packaging is less than elegant. These Dell servers are 1U high, but you can't just bung them into any old rack because they're so deep. An 80 cm rack is the absolute minimum, but when you take into account the bend radius of power cords and the need to mount power distribution components on the back rails, you really need a one metre deep rack, so that's what I've ordered from Dell along with the second server. Once that comes to hand, I'll install the current components in the rack and proceed to a fully redundant configuration. This will include a second load balancer in hot spare mode, two switches to which the servers are cross-connected by their dual Ethernet interfaces (using "Bond: Ethernet Bond" mode), and two separate UPSes powering the servers, load balancers, and switches to avoid any single point failure mode.

Of course, there remain single point failures which can take the site down; after all, there was only one ascent engine on the lunar module: sometimes things just have to work. The leased line connection to the ISP, the leased line modem, the gateway router, and the hub which connects the router to the boundary firewalls all remain single points of failure, but then none of these have ever failed in the 10 years Fourmilab has been on the Internet. History has no predictive value whatsoever but, knock on biocomposite, maybe we'll go another decade without any of these weak links rendering the site out of sight.

And since this is a production test, if something at this site doesn't seem to be working, please report it with our feedback form.

Posted at February 21, 2005 01:17