
Thursday, July 19, 2007

Fourmilab Internet outage

At 14:57 UTC (16:57 local time) yesterday, 2007-07-18, Fourmilab's Internet connection went down, with two “Loss of signal” lights illuminated on the leased line modem. I contacted the Internet Service Provider (ISP), who told me that the entire village housing the Point of Presence (POP) to which Fourmilab is connected had experienced a power outage. Their equipment is on a UPS which allows riding out short power blips, but there is no backup generator, so all customers connected to that POP had experienced an interruption in service.

The power outage lasted about three hours. When the power was back on and everything had rebooted, the ISP called me to ask if my service was restored. It wasn't: the very same error lights appeared on the modem, and power cycling it changed nothing. At this point they said they'd have to send a technician to check the equipment at the POP, which could not be done before 06:00 local time today. When the technician arrived, he diagnosed the problem as a genuine outage on the line, and told the ISP service desk to put in a service call to Swisscom, who furnishes the leased line. They promptly did this, but for the wrong line. You see, next week the ISP has scheduled a migration of the line to a new POP located closer to Fourmilab, and has installed a backup line to allow the switchover without more than a brief interruption. They had already updated their records to indicate my connectivity as via the new line, and that's the one they reported down to Swisscom. Swisscom swiftly diagnosed the problem with that line, which does not have a modem connected to it yet. This unleashed a cloud of confusion among Swisscom, the ISP, and me, with multiple puzzled messages back and forth. When I finally managed to explain that it was the “old” line which was still in use and out of order, Swisscom was able to work on the actual problem, which turned out to be that the modem at the ISP's POP somehow misconfigured itself when resetting after the power came back on. Connectivity was restored at 09:59 UTC (11:59 local time) on 2007-07-19.

The router connected to the leased line is supposed to automatically establish an ISDN backup connection when the leased line goes down, but this did not happen; it worked perfectly on previous brief leased line outages, and nothing has changed in the router or telephone system configuration of which I am aware. Since the router is due to be replaced as part of the POP migration next week, there's no point investigating this until the new router is in place. (The ISDN backup isn't remotely fast enough to handle the Web site traffic, but it allows client Internet access, DNS, and E-mail to continue to function, which is handy when you're tracking down problems and communicating with service providers.)

At 19 hours, this is the second-longest Internet outage Fourmilab has experienced in the more than 12 years the site has been on the Web. The longest was in the late 1990s, when the cable bundle carrying the leased line was accidentally cut during an excavation, setting off a Marx Brothers cascade of calamity within the recently-privatised, disorganised, and understaffed Swisscom which took the site down for three full days.

Posted at July 19, 2007 13:53