Gordon McLellan
2009-Dec-20  03:55 UTC
[CentOS] storage servers crashing, hair being pulled out!
I have a trio of servers that like to reboot during high disk / network IO operations. They don't appear to panic, as I have kernel.panic = 0 in sysctl.conf, which should leave a panicked box halted instead of rebooting. The syslog just shows normal messages, like samba complaining about browse master, and then just syslogd starting up.

The machines seem to crash when I'm not near the console, usually when I'm pulling data off them to another machine running backups. They've also crashed while copying data off them to other servers (via iSCSI), and while on the receiving end of data via NFS.

Two of the servers are linked using DRBD and heartbeat; the third is standalone. CentOS 5.4 x86-64 is the flavor of Linux on all of them, pretty much vanilla except for the DRBD/iSCSI stuff.

I want to go after the motherboard manufacturer, since I'm more willing to suspect three mobos in a bad lot than three CPUs, especially since one CPU is completely different from the other two.

The other variable is that the two machines running DRBD have Promise RAID cards in them. I also have the same RAID card in my personal server at home. That server also has a knack for crashing during heavy disk IO to the RAID volume. The entire OS doesn't crash, just the RAID volume, and the only way to bring it back is a reboot.

I'm really at a loss on what to do next... Any suggestions?

Gordon

The hardware config of the DRBD servers:
Tyan i3210 ICH9 mobo
Intel C2D 7500 CPU
4GB A-Data RAM
Promise ex8650 RAID
Supermicro 742TQ-865 chassis (865W PSU)
8x 1TB Western Digital Green Power drives

The third machine:
Tyan i3210 ICH9 mobo
Intel C2Q 9400 CPU
8GB Mushkin RAM
dmraid 5
Antec something-or-other chassis
550W PC Power and Cooling PSU
7x 250GB Seagate 7200s
William Warren
2009-Dec-20  04:32 UTC
[CentOS] storage servers crashing, hair being pulled out!
I'm looking at the controller myself. Have you tried updating the firmware on the card, the drivers, or both?

On 12/19/2009 10:55 PM, Gordon McLellan wrote:
> The other variable is the two machines running drbd have promise raid
> cards in them. I also have the same raid card in my personal server
> at home. That server also has a knack of crashing during heavy disk IO
> to the raid volume. The entire OS doesn't crash, just the raid
> volume, and the only way to bring it back is a reboot.
>
> I'm really at a loss on what to do next... Any suggestions?
>
William Warren
2009-Dec-20  04:38 UTC
[CentOS] storage servers crashing, hair being pulled out!
Do you have a BBU on this card? Various sites report that the controller has poor write performance without the BBU.

On 12/19/2009 10:55 PM, Gordon McLellan wrote:
> I'm really at a loss on what to do next... Any suggestions?
>
Gordon McLellan wrote:
> I'm really at a loss on what to do next... Any suggestions?

Run hardware diagnostics? Run a burn-in test? I use this for burn-in:

http://sourceforge.net/projects/va-ctcs/

In my experience it takes less than 4 hours at high load with this app to turn up faulty hardware. If it does crash with this, then replace the system or replace components until the crashing stops, run it for a week, and then you can be pretty certain that at least the hardware is stable.

I also noticed you're using pretty poor quality components for a storage server. Promise RAID? Western Digital "green" disks? Not exactly server grade. If you want stability, I suggest you go with Western Digital RE3/4 disks and 3ware RAID (with a BBU so you can enable write-back caching), at least. Seagate has high-grade SATA as well. You don't mention the model you're using, but I'd assume they are of similar quality to the "green" disks, i.e. not made for servers.

I also assume you have a decent UPS on all systems. Never run a computer without a UPS (well, unless it's a laptop).

Did you build the systems yourself or did you buy them preassembled? If you did it yourself, I would verify that the power supplies themselves are of decent quality and provide adequate voltage given the number of disks you're working with. While there are plenty of good power supplies out there, the only one I will go out of my way to put money down on is PC Power & Cooling.

nate
Les Mikesell
2009-Dec-20  06:28 UTC
[CentOS] storage servers crashing, hair being pulled out!
Gordon McLellan wrote:
> I have a trio of servers that like to reboot during high disk /
> network IO operations. They don't appear to panic, as I have
> kernel.panic = 0 in sysctl.conf. The syslog just shows normal
> messages, like samba complaining about browse master and then just
> syslogd starting up.

Did this just start happening after the last update, or have they never been reliable? I have one box that just started crashing after the last kernel update, but it may just be from old age instead. If they have never been reliable, I'd suspect bad RAM first. I've seen cases where you had to run the memory test for a few days to catch it.

--
Les Mikesell
lesmikesell at gmail.com
On Sat, Dec 19, 2009 at 10:55 PM, Gordon McLellan <gordonthree at gmail.com> wrote:
> I have a trio of servers that like to reboot during high disk /
> network IO operations. They don't appear to panic, as I have
> kernel.panic = 0 in sysctl.conf. The syslog just shows normal
> messages, like samba complaining about browse master and then just
> syslogd starting up.

If the box is panicking under high load, you should definitely check the memory / CPU / power supplies. You may also find it beneficial to enable kdump, netdump and sysrq. If the box hangs, you can issue a sysrq magic key sequence to force the box to panic. During the panic process, you should get a core file that you can analyze to see what is going on (crash has some useful options to dump thread stacks, which you can use to search the LKML archives).

- Ryan
--
http://prefetch.net
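Before relying on a sysrq-forced panic, it may help to confirm what the two relevant sysctl keys are actually set to. The following is only a minimal sketch in Python, not anything posted in this thread: the helper names `parse_sysctl` and `crash_debug_report` and the sample text are hypothetical, and treating a missing key as 0 is an assumption made for illustration. On a real box you would pass in the contents of /etc/sysctl.conf.

```python
def parse_sysctl(text):
    """Parse sysctl.conf-style text into a dict of key -> value strings."""
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        # Skip blank lines and comment lines ('#' or ';').
        if not line or line[0] in "#;":
            continue
        key, sep, value = line.partition("=")
        if not sep:
            continue  # not a key = value line
        settings[key.strip()] = value.strip()
    return settings


def crash_debug_report(settings):
    """Summarize the settings that matter when chasing a hang or reboot.

    Assumption for this sketch: a key missing from the file counts as 0.
    """
    panic = int(settings.get("kernel.panic", "0"))
    sysrq = int(settings.get("kernel.sysrq", "0"))
    return {
        # A positive kernel.panic is the number of seconds the kernel
        # waits before rebooting after a panic; 0 leaves it halted.
        "reboots_after_panic": panic > 0,
        # kernel.sysrq = 0 disables the magic SysRq key entirely.
        "sysrq_enabled": sysrq != 0,
    }


# Hypothetical excerpt, for illustration only.
SAMPLE = """\
# sysctl.conf excerpt
kernel.panic = 0
kernel.sysrq = 1
"""

report = crash_debug_report(parse_sysctl(SAMPLE))
print(report)
```

With kernel.panic at 0, a machine that comes back up on its own was probably not rebooted by the panic timeout, which is consistent with the advice above to look at memory, CPU and power.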
Gordon McLellan
2009-Dec-21  13:24 UTC
[CentOS] storage servers crashing, hair being pulled out!
Thank you all for the suggestions. I will grab a test suite or two and do some burn-in testing over the upcoming weekends.

These machines are new, built from scratch. I've been building systems for over fifteen years and haven't had anywhere near this amount of trouble, which is really aggravating! I realize garbage in equals garbage out and some of the chosen components are pretty low-end, but I did spend close to six months researching the components, and couldn't find substantial evidence to dissuade me from any of the choices.

The only parts that are not new are the 250GB Seagates, which are basically left-over parts from an old server that was upgraded. They're all known-good, as that server gave me no trouble through its service life.

Kind Regards,
Gordon