noesis downtime explained


so noesis had a spell of downtime that lasted about 20 hours yesterday.

the causes are embarrassing, but i'll go ahead and own up to it all here.

basically, i upgraded my router (crusty) from debian woody to sarge on monday, following sarge becoming stable. The upgrade took a long time (1.5hrs?), primarily because the router is an ancient 90 MHz pentium machine, and there was a lot of tweaking that i wanted to do along the way.

When the upgrade started, crusty knew it was going to be upgrading the DHCP server for the LAN. before it upgraded that package, it was sensible enough to disable the DHCP server while the new package was being installed. however, this meant that my LAN had no DHCP server running for at least 30 minutes.

the machine inside the LAN running noesis (smoke) was for some dumb reason set up to use DHCP. Since crusty's DHCP server was configured to offer short leases, smoke's DHCP lease expired during the downtime, and i guess its dhclient gave up on reacquiring. this means it dropped its default gateway from the routing table, but for whatever reason, it kept the local IP address and netmask.
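a back-of-the-envelope way to reason about this, as a rough sketch only: by default a DHCP client first tries to renew at half the lease time (T1 in RFC 2131), so in the worst case a client has only half its lease left when the server disappears. the helper name lease_survives is made up for illustration, and the rule ignores retry timing, so treat it as a rule of thumb rather than a guarantee.

```shell
#!/bin/sh
# lease_survives: hypothetical helper, not part of any DHCP package.
# Usage: lease_survives LEASE_SECS OUTAGE_SECS
# Succeeds if a lease of LEASE_SECS should outlast a DHCP server outage
# of OUTAGE_SECS, assuming the worst case where the outage begins just
# as the client reaches its renewal time T1 (half the lease).
lease_survives() {
    lease_secs=$1
    outage_secs=$2
    # Worst case: only half the lease remains, so require lease > 2 * outage.
    [ "$lease_secs" -gt $(( outage_secs * 2 )) ]
}

# Example: a 30-minute outage (1800s) against a 10-minute lease (600s):
#   lease_survives 600 1800 || echo "lease would likely expire"
```

by this rule, riding out a 30-minute gap would take leases comfortably longer than an hour.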

This means it could successfully communicate with any machine in the LAN, but didn't know how to route packets back out onto the WAN.
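a quick way to spot this state is to look for a default entry in the routing table. the sketch below is only an illustration: has_default_route is a made-up helper name, and it assumes the table arrives on stdin in either `ip route` format (a line starting with "default") or `route -n` format (a line whose destination column is 0.0.0.0).

```shell
#!/bin/sh
# has_default_route: hypothetical helper, not part of any package.
# Reads a routing table on stdin and succeeds if it contains a default
# route. Matches `ip route` output ("default via ...") or `route -n`
# output (destination 0.0.0.0 in the first column).
has_default_route() {
    grep -Eq '^(default|0\.0\.0\.0)[[:space:]]'
}

# Usage on a live machine (assumes net-tools' route is installed):
#   route -n | has_default_route || echo "no default route: LAN only"
```

a machine that fails this check but still answers pings from its LAN neighbors is in the same half-configured state smoke was in.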

the problem was resolved with:

root@smoke# ifdown eth0; ifup eth0

but not until i had done a bunch of dumb panicky shit on crusty, like rebooting it when i didn't need to. Furthermore, pinhead had lost his DHCP lease during the upgrade as well, so i knew that such a thing could happen. However, pinhead's dhclient actually deconfigured the interface entirely, rather than just dropping entries from the routing table. this may be because pinhead is running dhcp3-client and smoke is running dhcp-client, which is version 2.

The lessons:

  1. Set up servers with static routes and IP addresses whenever possible (and if you don't, then at least remember which machines are the exceptions!)
  2. Think about systemic interactions between machines. Changing one machine can trigger or uncover problems in another machine. don't assume that the machine you changed is necessarily the one at fault.
  3. if you see weird behavior on one machine during an upgrade or tuning session, think about which other machines might be hitting the same thing
  4. when doing a system upgrade on a crucial machine, it may be worth it to apt-get install individual services one at a time, to minimize total service outage time. if i had done apt-get install dhcp3-server on crusty before doing the system-wide upgrade, there would have been one small hiccup in DHCP service, which would probably not have been enough to cause anyone to lose their lease.
  5. monitor your services from some other subnet so that the monitoring services will catch routing errors.
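
lesson 5 could be sketched roughly like this. check_reachable is a hypothetical helper (not from any monitoring package), and the hostname in the usage comment is a placeholder. the point is that the probe runs on a box outside the LAN, so a missing default route on the server shows up as a failure even though everything looks fine from inside.

```shell
#!/bin/sh
# check_reachable: hypothetical helper for a monitoring box that lives
# on a different subnet from the service being watched.
# Usage: check_reachable NAME PROBE [ARGS...]
# Runs PROBE with ARGS, discarding its output. Prints "NAME ok" and
# returns 0 if the probe succeeds; prints "NAME DOWN" and returns 1
# otherwise.
check_reachable() {
    name=$1
    shift
    if "$@" >/dev/null 2>&1; then
        echo "$name ok"
    else
        echo "$name DOWN"
        return 1
    fi
}

# Example, run from a host outside the LAN (placeholder hostname):
#   check_reachable noesis ping -c 3 noesis.example.org
```

taking the probe as a command keeps the sketch independent of any one tool: ping tests routing, while an HTTP fetch would test the service itself.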