Gordon McLellan
2009-Dec-20 03:55 UTC
[CentOS] storage servers crashing, hair being pulled out!
I have a trio of servers that like to reboot during high disk / network IO operations. They don't appear to panic, as I have kernel.panic = 0 in sysctl.conf. The syslog just shows normal messages, like samba complaining about browse master, and then just syslogd starting up.

The machines seem to crash when I'm not near the console, usually when I'm trying to pull data off them to another machine running backups. But they've also crashed while copying data off to other servers (via iscsi), and while on the receiving end of data via nfs.

Two of the servers are linked using drbd and heartbeat; the third is standalone. CentOS 5.4 x86-64 is the flavor of linux on all of them, pretty much vanilla except for the drbd/iscsi stuff.

I want to go after the motherboard manufacturer, since I'm more willing to suspect three mobos in a bad lot than three CPUs, especially since one CPU is completely different from the other two.

The other variable is the two machines running drbd have Promise raid cards in them. I also have the same raid card in my personal server at home. That server also has a knack of crashing during heavy disk IO to the raid volume. The entire OS doesn't crash, just the raid volume, and the only way to bring it back is a reboot.

I'm really at a loss on what to do next... Any suggestions?

Gordon

The hardware config of the drbd servers:
Tyan i3210 ICH9 mobo
Intel C2D 7500 cpu
4GB A-Data ram
Promise ex8650 raid
Supermicro 742TQ-865 chassis (865w psu)
8x 1Tb western digital green power drives

The third machine:
Tyan i3210 ICH9 mobo
Intel C2Q 9400 cpu
8GB Mushkin ram
dmraid 5
Antec something or other chassis
550W PC Power and Cooling PSU
7x 250gb seagate 7200's
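(One diagnostic angle worth noting: kernel.panic = 0 tells the kernel to hang forever on a panic rather than reboot, so a spontaneous reboot with that setting points at a hardware reset rather than a kernel panic. A minimal sysctl fragment, values illustrative only, that makes panics loud and persistent:)

```shell
# /etc/sysctl.conf fragment -- make kernel panics visible instead of silent.
# With kernel.panic = 0 the kernel hangs on panic, so the panic message
# stays on the console; a machine that reboots anyway was likely reset
# by hardware (PSU, watchdog, RAID card) rather than by the kernel.

kernel.panic = 0            # hang on panic so the console output survives
kernel.panic_on_oops = 1    # promote oopses to full panics so none slip by

# Apply without rebooting:
#   sysctl -p /etc/sysctl.conf
```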
William Warren
2009-Dec-20 04:32 UTC
[CentOS] storage servers crashing, hair being pulled out!
I'm looking at the controller myself. Have you tried updating the firmware on the card, the drivers, or both?

On 12/19/2009 10:55 PM, Gordon McLellan wrote:
> The other variable is the two machines running drbd have promise raid
> cards in them. I also have the same raid card in my personal server
> at home. That server also has a nack of crashing during heavy disk IO
> to the raid volume. The entire OS doesn't crash, just the raid
> volume, and the only way to bring it back is a reboot.
>
> I'm really at a loss on what to do next... Any suggestions?
William Warren
2009-Dec-20 04:38 UTC
[CentOS] storage servers crashing, hair being pulled out!
Do you have a BBU on this card? Various sites report the controller has poor write performance without the BBU.

On 12/19/2009 10:55 PM, Gordon McLellan wrote:
> I'm really at a loss on what to do next... Any suggestions?
Gordon McLellan wrote:
> I'm really at a loss on what to do next... Any suggestions?

Run hardware diagnostics? Run a burn-in test? I use this for burn-in:

http://sourceforge.net/projects/va-ctcs/

In my experience it takes less than 4 hours at high load with this app to turn up faulty hardware. If it does crash with this, replace the system or replace components until the crashing stops, then run it for a week; after that you can be pretty certain at least the hardware is stable.

I also noticed you're using pretty poor quality components for a storage server: Promise raid? Western Digital "green" disks? Not exactly server grade. If you want stability, I suggest you go with Western Digital RE3/RE4 disks and 3ware RAID (with a BBU so you can enable write-back caching), at least. Seagate has high-grade SATA as well; you don't mention the model you're using, but I'd assume they are of similar quality to the "green" disks, i.e. not made for servers.

I also assume you have a decent UPS on all systems; never run a computer without a UPS (well, unless it's a laptop).

Did you build the systems yourself or did you buy them pre-assembled? If you built them yourself, I would verify the power supplies are of decent quality and provide adequate voltage given the number of disks you're working with. While there are plenty of good power supplies out there, the only one I will go out of my way to put money down on is PC Power & Cooling.

nate
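(For a quicker first pass before a multi-hour CTCS run, something along these lines exercises RAM, CPU and disk together. This assumes the memtester and stress utilities are installed, e.g. from EPEL; sizes and durations below are illustrative, not recommendations.)

```shell
# Rough burn-in pass -- memtester and stress are assumed to be available
# (both are packaged for CentOS 5 in third-party repos such as EPEL).

# Lock and pattern-test 2 GiB of RAM, 5 passes (run as root so the
# memory can be mlock'd and isn't swapped out mid-test):
memtester 2048M 5

# Hammer CPU, VM and disk I/O in parallel for 4 hours:
stress --cpu 4 --vm 2 --vm-bytes 1G --io 2 --hdd 2 --timeout 4h
```

A box that survives this plus an overnight CTCS run is much less likely to be failing due to marginal RAM or an overloaded PSU.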
Les Mikesell
2009-Dec-20 06:28 UTC
[CentOS] storage servers crashing, hair being pulled out!
Gordon McLellan wrote:
> I have a trio of servers that like to reboot during high disk /
> network IO operations. They don't appear to panic, as I have
> kernel.panic = 0 in sysctl.conf. The syslog just shows normal
> messages, like samba complaining about browse master and then just
> syslogd starting up.

Did this just start happening after the last update, or have they never been reliable? I have one box that just started crashing after the last kernel update, but it may just be from old age instead. If they have never been reliable, I'd suspect bad RAM first. I've seen cases where you had to run the memory test a few days to catch it.

--
Les Mikesell
lesmikesell at gmail.com
On Sat, Dec 19, 2009 at 10:55 PM, Gordon McLellan <gordonthree at gmail.com> wrote:
> I have a trio of servers that like to reboot during high disk /
> network IO operations. They don't appear to panic, as I have
> kernel.panic = 0 in sysctl.conf. The syslog just shows normal
> messages, like samba complaining about browse master and then just
> syslogd starting up.

If the box is panicking under high load, you should definitely check the memory, CPU and power supplies.

You may also find it beneficial to enable kdump, netdump and sysrq. If the box hangs, you can issue a sysrq magic key sequence to force the box to panic. During the panic process, you should get a core file that you can analyze to see what is going on (crash has some useful options to dump thread stacks, which you can use to search the LKML archives).

- Ryan

--
http://prefetch.net
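(Concretely, the sysrq/kdump setup described above looks roughly like this on a CentOS 5 box. The crashkernel reservation size is illustrative and depends on installed memory; run these as root.)

```shell
# Enable the magic sysrq keys for the running session
# (persist it by adding kernel.sysrq = 1 to /etc/sysctl.conf):
echo 1 > /proc/sys/kernel/sysrq

# Install and enable kdump; this reserves memory for a capture kernel.
yum install kexec-tools
# Then append e.g. crashkernel=128M@16M to the kernel line in
# /boot/grub/grub.conf and reboot so the reservation takes effect.
chkconfig kdump on
service kdump start

# When the box wedges, force a panic so kdump captures a vmcore:
echo c > /proc/sysrq-trigger

# Afterwards, analyze the dump under /var/crash/ with the crash utility
# (needs the matching kernel-debuginfo package for symbols).
```

If the machine resets before any of this fires, that again points toward hardware rather than a kernel problem.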
Gordon McLellan
2009-Dec-21 13:24 UTC
[CentOS] storage servers crashing, hair being pulled out!
Thank you all for the suggestions. I will grab a test suite or two and do some burn-in testing over the upcoming weekends.

These machines are new, built from scratch. I've been building systems for over fifteen years and haven't had anywhere near this amount of trouble, which is really aggravating! I realize garbage in equals garbage out, and some of the chosen components are pretty low-end, but I did spend close to six months researching the components and couldn't find substantial evidence to dissuade me from any of the choices.

The only parts not new are the 250G Seagates, which are basically left-over parts from an old server that was upgraded. They're all known-good, as that server gave me no trouble through its service life.

Kind Regards,
Gordon