Derek Atkins
2013-Jun-14 15:05 UTC
ANNOUNCE: Unplanned outage of GnuCash server/services last night
Good morning,

The short: there was a power outage from ~8pm until ~6:30am, knocking out code.gnucash.org; all services were back up and running by 10:30am. If you don't care about the nitty-gritty, you can stop reading now.

The long story: Many of you may have noticed that code went offline last night. We had some pretty severe storms come through the area that knocked out power to at least 100,000 homes, if not more. The power went out shortly before 8pm last night, which also knocked out my network at that time (more on this later). The UPS lasted for over 2 hours, finally expiring a bit after 10pm. The power did not return until around 6:30 this morning.

When I finally got out of bed this morning I noticed a few things. First, my DHCP server's name server didn't start. This seems to be a perpetual issue and I've never been able to figure out why it happens; if I restart it by hand it works fine. *shrugs*

I also noticed that the VM server had not come back online and was throwing LVM errors about not finding one of the Physical Volumes. That appeared to be due to mdadm not being able to assemble /dev/md2 because "sdc and sdc1 looked to be the same device". YIKES! But when I booted the server with a rescue CD the volumes all came up just fine. Apparently there is some issue with mdadm that I've never hit before; the fix was to adjust the mdadm.conf in the initrd to point directly to the device partitions instead of the UUID. I don't particularly like this solution, but it solved the immediate problem.

This just underscores my need to find a new VM solution. The VM host is still running a base OS of Fedora 13, with a Fedora 10 kernel! The reason it's on an F10 kernel is some scheduling and disk IO issues I was hitting with the F13 kernels, and I've been extremely hesitant to perform any other upgrades on the system since hitting that one. One of these years...
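For anyone wondering what "point directly to the device partitions" might look like, here is a minimal sketch of that kind of mdadm.conf change. It is an illustration under assumptions, not the actual file from the server: only /dev/md2 and sdc1 come from the description above, while /dev/sdb1 as the second member, the two-disk layout, and the UUID placeholder are hypothetical.

```conf
# Before (hypothetical): scan every partition and whole disk listed in
# /proc/partitions, and identify the array by its UUID.
# The real UUID is not given above, so a placeholder is shown.
#DEVICE partitions
#ARRAY /dev/md2 uuid=<uuid-of-md2>

# After (hypothetical): only consider the named member partitions,
# so the whole disk (sdc) is never examined alongside sdc1.
# /dev/sdb1 is an assumed second member; substitute the real ones.
DEVICE /dev/sdb1 /dev/sdc1
ARRAY /dev/md2 devices=/dev/sdb1,/dev/sdc1
```

The trade-off, and probably the reason this feels unsatisfying as a fix, is that device names such as sdc can change if disks are added or reordered, whereas a UUID follows the array wherever its members end up. The copy of mdadm.conf inside the initrd also has to be regenerated for the change to take effect at boot.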
Anyways, I got everything fixed and the VM host booting just before 10am, and then it took ~20-30 minutes for all the VMs to fsck and come up. At this time it appears all services are back online and running normally.

In the longer term we plan to get a backup generator, but that's probably still a year or three out. Also, I was able to get Comcast to acknowledge that there is a real problem on my node; over 100 people went offline with the power outage, so hopefully the technicians won't just close the trouble ticket outright again this time. *fingers crossed* I'd love to have my network stay up for at least 60 minutes during a power outage!

Anyways, I need to try to get some real work done now.. Back to your regularly scheduled gnucash hacking here..

-derek

--
Derek Atkins, SB '93 MIT EE, SM '95 MIT Media Laboratory
Member, MIT Student Information Processing Board (SIPB)
URL: http://web.mit.edu/warlord/ PP-ASEL-IA N1NWH
warlord at MIT.EDU PGP key available