thr3ads.net - CentOS - [CentOS] Server hangs on CentOS 5.5 [Mar 2011]

Michael Eager

2011-Mar-08 17:24 UTC

[CentOS] Server hangs on CentOS 5.5

Hi --

I'm running a server which is usually stable, but every
once in a while it hangs.  The server is used as a file
store using NFS and to run VMware machines.

I don't see anything in /var/log/messages or elsewhere
to indicate any problem or offer any clue why the system
was hung.

Any suggestions where I might look for a clue?

-- 
Michael Eager	 eager at eagercon.com
1960 Park Blvd., Palo Alto, CA 94306  650-325-8077

Les Mikesell

2011-Mar-08 17:41 UTC

[CentOS] Server hangs on CentOS 5.5

On 3/8/2011 11:24 AM, Michael Eager wrote:
> Hi --
>
> I'm running a server which is usually stable, but every
> once in a while it hangs.  The server is used as a file
> store using NFS and to run VMware machines.
>
> I don't see anything in /var/log/messages or elsewhere
> to indicate any problem or offer any clue why the system
> was hung.
>
> Any suggestions where I might look for a clue?
Probably something hardware related.  Bad memory, overheating, power
supply, etc.  I've even seen some rare cases where a BIOS update would
fix it, although it didn't make much sense for a machine to run for
years, then need a firmware change.

-- 
   Les Mikesell
    lesmikesell at gmail.com

compdoc

2011-Mar-08 17:44 UTC

[CentOS] Server hangs on CentOS 5.5

>I'm running a server which is usually stable, but every
>once in a while it hangs.

There can be many reasons for that. One thing I'm curious about - try
looking at the reallocated sector count, and current pending sector count
for your drives with smartctl.
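
A rough sketch of that check in Python, in case it helps: it pulls the raw values of those two attributes out of `smartctl -A` output. The attribute names (`Reallocated_Sector_Ct`, `Current_Pending_Sector`) and the ten-column table layout are what smartmontools typically prints for ATA drives; the sample text, the function names and the device path are made up for illustration, and some drives report these attributes under other names or not at all.

```python
import subprocess

# Attribute names as smartmontools usually prints them for ATA drives (assumption).
WATCHED = ("Reallocated_Sector_Ct", "Current_Pending_Sector")


def parse_smart_attributes(text, names=WATCHED):
    """Return {attribute_name: raw_value} for the watched attributes."""
    found = {}
    for line in text.splitlines():
        fields = line.split()
        # Attribute rows have ten columns:
        # ID# NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
        if len(fields) >= 10 and fields[1] in names:
            try:
                found[fields[1]] = int(fields[9])
            except ValueError:
                # Raw value in a format this sketch does not understand.
                found[fields[1]] = None
    return found


def read_drive(device):
    """Run smartctl on one device (needs root) and parse its attribute table."""
    # smartctl's exit status is a bit mask, so a non-zero status is not treated as fatal.
    result = subprocess.run(["smartctl", "-A", device],
                            capture_output=True, text=True, check=False)
    return parse_smart_attributes(result.stdout)


# Made-up sample in the usual layout, so the parser can be tried without a drive.
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       8
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
"""

if __name__ == "__main__":
    print(parse_smart_attributes(SAMPLE))
```

On the real machine you would call something like `read_drive("/dev/sda")` for each disk. Non-zero counts, and especially counts that keep growing, are the thing to watch for.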

Brian Mathis

2011-Mar-08 17:50 UTC

[CentOS] Server hangs on CentOS 5.5

On Tue, Mar 8, 2011 at 12:24 PM, Michael Eager <eager at eagerm.com> wrote:
> Hi --
>
> I'm running a server which is usually stable, but every
> once in a while it hangs.  The server is used as a file
> store using NFS and to run VMware machines.
>
> I don't see anything in /var/log/messages or elsewhere
> to indicate any problem or offer any clue why the system
> was hung.
>
> Any suggestions where I might look for a clue?
Please be more specific when you say it "hangs".  Does it just pause
for a minute and then continue working, or does it freeze completely
until you reboot it?  Does it respond to a "soft" reboot like
Ctrl-Alt-Del, or do you need to hard power it off?

Since this is an NFS server I'm going to guess there might be a lot of
IO.  Maybe there is some large IO load going on, like maybe all your
VMs are running anti-virus scan at the same time, or something like
that.

To troubleshoot, I recommend installing the 'sar' utilities (yum
install sysstat) and then reviewing the collected data using the
'ksar' utility (http://sourceforge.net/projects/ksar/).  sar/ksar are
good for tracking down acute problems.
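
If you would rather scan the numbers than graph them, here is a minimal Python sketch that flags intervals with high I/O wait in the text output of `sar -u`. It assumes the report has an `%iowait` column and whitespace-separated fields; the sample report, the host name in it, the function name and the 20% threshold are all invented for illustration.

```python
def high_iowait(sar_text, threshold=20.0):
    """Return (time, iowait) pairs for intervals at or above the threshold."""
    rows = []
    col = None
    for line in sar_text.splitlines():
        fields = line.split()
        if "%iowait" in fields:
            # Index from the right-hand end, so an extra AM/PM field on the
            # data rows does not shift the column.
            col = fields.index("%iowait") - len(fields)
            continue
        if col is None or len(fields) < -col:
            continue
        if fields[0].startswith("Average"):
            continue
        try:
            value = float(fields[col])
        except ValueError:
            continue
        if value >= threshold:
            rows.append((fields[0], value))
    return rows


# Invented sample in the layout sar -u typically prints.
SAMPLE = """\
Linux 2.6.18-194.el5 (fileserver)  03/08/2011

12:00:01 AM       CPU     %user     %nice   %system   %iowait    %steal     %idle
12:10:01 AM       all      1.20      0.00      0.80      2.10      0.00     95.90
12:20:01 AM       all      2.00      0.00      1.50     41.30      0.00     55.20
Average:          all      1.60      0.00      1.15     21.70      0.00     75.55
"""

if __name__ == "__main__":
    for when, value in high_iowait(SAMPLE):
        print(when, value)
```

The time is returned as the first field only, so on a 12-hour locale the AM/PM marker is lost. That is fine for a quick look, but not for anything you want to correlate precisely with the hang.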

Dr. Ed Morbius

2011-Mar-08 21:44 UTC

[CentOS] Server hangs on CentOS 5.5

on 09:24 Tue 08 Mar, Michael Eager (eager at eagerm.com) wrote:
> Hi --
> 
> I'm running a server which is usually stable, but every
> once in a while it hangs.  The server is used as a file
> store using NFS and to run VMware machines.
> 
> I don't see anything in /var/log/messages or elsewhere
> to indicate any problem or offer any clue why the system
> was hung.
> 
> Any suggestions where I might look for a clue?
I'd very strongly recommend you configure netconsole.  Though not entirely
clear from the name, it's actually an in-kernel network logging module,
which is very useful for kicking out kernel panics which otherwise
aren't logged to disk and can't be seen on a (nonresponsive) monitor.

Alternately, a serial console which actually retains all output sent to
it (some remote access systems support this, some don't) may help.

Barring that, I'd start looking at individual HW components, starting
with RAM.

The trick is in passing the appropriate parameters to the module at load
time.  I found it helpful to have an @boot cronjob to do this.

You'll need to pass the local port, local system IP, local network
device, remote syslog UDP port, remote syslog IP, and the /gateway/ MAC
address, where gateway is the syslogd (if on a contiguous ethernet
segment), or your network gateway host, if not.  Some parsing magic can
determine these values for you.
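
To make the parameter list concrete, here is a small Python sketch that assembles the module option from those six values. The layout (source port @ source IP / device, then target port @ target IP / target MAC) is the one described in the kernel's netconsole documentation; the addresses, ports, MAC and function names below are placeholders I made up, not values from anyone's network.

```python
def netconsole_param(local_port, local_ip, device,
                     remote_port, remote_ip, remote_mac):
    """Build the value of the netconsole= module parameter.

    Layout: src-port@src-ip/dev,tgt-port@tgt-ip/tgt-mac
    """
    return "{}@{}/{},{}@{}/{}".format(
        local_port, local_ip, device, remote_port, remote_ip, remote_mac)


def modprobe_command(param):
    """Return the argument list for loading the module with that parameter."""
    return ["modprobe", "netconsole", "netconsole=" + param]


# Placeholder values: replace with your own hosts. remote_mac is the MAC of
# the log host if it is on the same segment, otherwise of your gateway.
EXAMPLE = netconsole_param(
    local_port=6665, local_ip="192.168.1.10", device="eth0",
    remote_port=514, remote_ip="192.168.1.20",
    remote_mac="00:11:22:33:44:55")

if __name__ == "__main__":
    print(" ".join(modprobe_command(EXAMPLE)))
```

Printing the command rather than running it is deliberate: it lets you check the string by eye before putting it in a boot-time job.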

Good article describing configuration:

    http://www.cyberciti.biz/tips/linux-netconsole-log-management-tutorial.html

If you're not already remote-logging all other activity, I'd do that as
well.  You might catch the start of the hang, if not all of it.

-- 
Dr. Ed Morbius, Chief Scientist /            |
  Robot Wrangler / Staff Psychologist        | When you seek unlimited power
Krell Power Systems Unlimited                |                  Go to Krell!

Leen de Braal

2011-Mar-09 09:34 UTC

[CentOS] Server hangs on CentOS 5.5

> On Wed, Mar 9, 2011 at 10:24 AM, Leen de Braal <ldb at braha.nl> wrote:
>>> m.roth at 5-cent.us wrote:
>>>> Michael Eager wrote:
>>>
>>>>> House-built, Gigabyte MB, AMD Phenom II X6, 6Gb RAM.
>>>>
>>>> Any chance the problem's with the video card?
>>>
>>> Video is on the MB.  It doesn't seem likely that it's
>>> the video, since the system doesn't respond to network
>>> when it crashes.
>>>
>>> It could be anything.  That's why I'm looking for
>>> something that would give me a bit of a hint what
>>> to look at.  With an infrequent failure, it's not
>>> practical to replace components piecemeal.
>>
>> While you open the case, check for the bulging capacitor problem.
>> Will have the effect you describe, freezing up the system so that even
>> bios routines don't work (your fans).
>> If that's the case, replace mainboard.
>>
>
>
> Or replace the CAPS if you're not afraid of a soldering iron :)
Very often resulting in a damaged board, because you damage the vias when
pulling the caps. But it is worth a try.
>
>
>
> --
> Kind Regards
> Rudi Ahlers
> SoftDux
>
> Website: http://www.SoftDux.com
> Technical Blog: http://Blog.SoftDux.com
> Office: 087 805 9573
> Cell: 082 554 7532
>

-- 
L. de Braal
BraHa Systems
NL - Terneuzen
T +31 115 649333
F +31 115 649444

Lamar Owen

2011-Mar-09 15:05 UTC

[CentOS] Server hangs on CentOS 5.5

On Tuesday, March 08, 2011 04:44:54 pm Dr. Ed Morbius wrote:
> I'd very strongly recommend you configure netconsole.
Ok, now this is useful indeed.  Thanks for the information, even though I'm
not the OP....  While I suspected the facility might be there, I hadn't
really dug for it, but if this will catch things after filesystems go r/o (ext3
journal things, ya know) it could be worth its weight in gold for catching
kernel errors from VMware guests (serial console not really an option with the
hosts I have, although I'm sure some enterprising soul has figured out how
to redirect the VM guest serial port to something else....).

Michael Eager

2011-Mar-09 15:06 UTC

[CentOS] Server hangs on CentOS 5.5

Dr. Ed Morbius wrote:
> on 09:24 Tue 08 Mar, Michael Eager (eager at eagerm.com) wrote:
>> Hi --
>>
>> I'm running a server which is usually stable, but every
>> once in a while it hangs.  The server is used as a file
>> store using NFS and to run VMware machines.
>>
>> I don't see anything in /var/log/messages or elsewhere
>> to indicate any problem or offer any clue why the system
>> was hung.
>>
>> Any suggestions where I might look for a clue?
> 
> I'd very strongly recommend you configure netconsole.  Though not entirely
> clear from the name, it's actually an in-kernel network logging module,
> which is very useful for kicking out kernel panics which otherwise
> aren't logged to disk and can't be seen on a (nonresponsive) monitor.
I'll take a look at netconsole.
> Alternately, a serial console which actually retains all output sent to
> it (some remote access systems support this, some don't) may help.
> 
> Barring that, I'd start looking at individual HW components, starting
> with RAM.
The problem with randomly replacing various components, other than
the downtime and nuisance, is that there's no way to know that the
change actually fixed any problem.  When the base rate is one
unknown system hang every few weeks, how many weeks should I wait
without a failure to conclude that the replaced component was the
cause?  A failure which happens infrequently isn't really amenable
to a random diagnostic approach.
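
One rough way to put a number on that question, assuming hangs arrive independently at a steady average rate (a Poisson model, which is an assumption and not something established in this thread): if the fault is still present, the chance of seeing no hang in t weeks is exp(-t / mean_interval). Waiting until that chance falls below 5% works out to roughly three times the usual interval between hangs. The function names and the 2, 3 and 4 week intervals are only illustrative.

```python
import math


def weeks_to_wait(mean_weeks_between_hangs, confidence=0.95):
    """Hang-free weeks needed before a still-faulty system would have hung
    with probability `confidence`, under the Poisson assumption."""
    return -math.log(1.0 - confidence) * mean_weeks_between_hangs


def chance_of_quiet_period(weeks, mean_weeks_between_hangs):
    """Probability that a still-faulty system shows no hang for `weeks`."""
    return math.exp(-weeks / mean_weeks_between_hangs)


if __name__ == "__main__":
    # "Every few weeks" is not a precise rate, so try a few candidates.
    for interval in (2, 3, 4):
        print(interval, round(weeks_to_wait(interval), 1))
```

With a hang about every three weeks this comes to roughly nine quiet weeks per swapped component, which is the arithmetic behind the objection: each trial replacement costs months of waiting.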

-- 
Michael Eager	 eager at eagercon.com
1960 Park Blvd., Palo Alto, CA 94306  650-325-8077

Lamar Owen

2011-Mar-09 15:22 UTC

[CentOS] Server hangs on CentOS 5.5

On Wednesday, March 09, 2011 03:24:48 am Leen de Braal wrote:
> While you open the case, check for the bulging capacitor problem.
> Will have the effect you describe, freezing up the system so that even
> bios routines don't work (your fans).
> If that's the case, replace mainboard.
I've seen capacitor problems in the past, and they can be rather
interesting.

What the caps do is open up (electrically speaking) meaning they no longer can
smooth out the ripple in the output of the switching regulator; this ripple is
very high frequency due to the switching regulator's design.  As the CPU
draws more current (which happens when it's loaded, of course, since MOS
gates by design consume the most power during the switching period (capacitor
charging time constants on the gates of the transistors themselves)), the
switching regulator has to supply more current, and if the caps are open they
can't smooth out the deeper ripple.

I actually had one motherboard blow two caps; one of the cases of one of the
blown capacitors was violently ejected off of the 'guts' of the cap,
hard enough that it dented the PC's case from the inside.

The PC kept running, until it was put under load, then it would lock up.  When
the second cap blew, about an hour later, the PC hung; it would power up and run
POST, and even run the BIOS setup's memory check and health check, but as
soon as the CPU was shifted into protected mode as the OS booted it would hard
hang due to the CPU's increased current draw overwhelming the ripple
absorbing capacity of the remaining good capacitors on the CPU's switching
regulator.

There's really only one way to determine this, and that's by putting an
oscilloscope on the CPU's power supply output rails and looking for ripple
while running a CPU burnin program.  The hard part of that is actually finding a
good place to measure the output, thanks to the typical motherboard's
multilayer design.

And while with the proper desoldering equipment and training/experience one can
re-cap a motherboard, I would not recommend doing so for a critical server,
unless you want and can assume personal liability for that server's
operation.  Better to get a new motherboard with a warranty.  For a personal
server that if it breaks isn't going to open you up to personal liability,
sure, you can re-cap if you'd like and have the patience, time, equipment,
and experience necessary to work on 6 to 8 layer PC boards, which may be soldered
with RoHS lead-free solder, which requires special techniques.  Otherwise, as
you said, you can damage the 'vias' (that is, the plated through holes
the capacitor leads solder to, which may be used to connect to internal layers
that you can't resolder) very easily.

Lamar Owen

2011-Mar-09 15:37 UTC

[CentOS] Server hangs on CentOS 5.5

On Wednesday, March 09, 2011 10:16:34 am Brunner, Brian T. wrote:
> This would be far cheaper than the time spent troubleshooting the
> running (sometimes hanging) system.
Let me interject here, that from a budgeting standpoint 'cheaper' has to
be interpreted in the context of which budget the costs are coming out of.  New
hardware is capex, and thus would come out of the capital budget, and admin time
is opex, and thus would come out of the operating budget.  There may be
sufficient funds in the operating budget to pay an admin $x,000 but the funds in
the capital budget may be insufficient to buy a server costing $y,000, where
y=x.  And if this is an educational institution, and there are grants involved,
it may be the reverse situation.  So 'cheaper' only has meaning when the
costs are coming out of the same budget.  So, yes, while it's easy for a
single-budget entity to make this decision, it's not so easy when you have
multiple budgets involved with different spending parameters and different
funding entities.
> Starting with RAM and Power Supply is not random ... They're "The Usual
> Suspects".
This is a very true statement.  

Heat and airflow are two others.

Michael Eager

2011-Mar-09 23:17 UTC

[CentOS] Server hangs on CentOS 5.5

Rudi Ahlers wrote:
> On Thu, Mar 10, 2011 at 12:31 AM, Michael Eager <eager at eagerm.com> wrote:
>> Dr. Ed Morbius wrote:
>>
>>> If the issue is repeated but rare system failures on one of a set of
>>> similarly configured hosts, I'd RMA the box and get a replacement.  End
>>> of story.
>> I'll repeat:  this is a house-made system.  There's no vendor to RMA to.
>
> I don't know where you are, but in our country we can RMA anything and
> everything. Apart from CPU's. So, even a cheap desktop mobo could be
> RMA'd, as long as I can prove to the suppliers it's faulty, and it's
> within the warranty period
I responded to Dr. Morbius' suggestion that I "RMA the box".
There is no vendor to RMA the box to.

If I knew that it was a motherboard problem, I could RMA it.
Or disk, or PSU, or network card, or whatever.  But, as I've mentioned,
there's no indication what causes the system to hang.  There is no
way at this point to prove that it is a defective motherboard.


-- 
Michael Eager	 eager at eagercon.com
1960 Park Blvd., Palo Alto, CA 94306  650-325-8077

Michael Eager

2011-Mar-10 00:04 UTC

[CentOS] Server hangs on CentOS 5.5

Rudi Ahlers wrote:
> As far as I can see you were given a bucket load of advice, which you
> haven't even bothered to follow yet. You're the only one who could
> actually do anything about the problem.
I have followed quite a bit of the advice, which I have
appreciated and noted.  I've set up the monitor so that it
will not be blanked on a crash, installed monitoring software,
and checked a number of conditions which people have suggested.

No, I have not responded to the philosophical discussions
about vendor management, nor to the suggestions to RMA
something to somebody for unknown reasons.  No, I'm not
going to replace RAM or capacitors here and there on the off
chance that something might be bad.  (But I will look for
capacitors which show signs of bulging or leaking.)
> No amount of suggestions made on this list will fix the problem for
> you. You need to actually take apart the server and see what's going
> on.
I wasn't interested in anyone fixing the server for me.
I did ask for suggestions on how to improve the diagnostics
for the problem, which several people have responded to.
Again, I appreciate their suggestions greatly.

As I've said, I have a list of things to check when the
server is next taken down.

-- 
Michael Eager	 eager at eagercon.com
1960 Park Blvd., Palo Alto, CA 94306  650-325-8077

Lamar Owen

2011-Mar-10 14:07 UTC

[CentOS] Server hangs on CentOS 5.5

On Thursday, March 10, 2011 05:35:29 am Rudi Ahlers wrote:
> I prefer to use a dust blower instead. It doesn't risk pulling loose
> components with "dry" or loose "soldering"
I use both: antistatic canned air to blow the dust and a metal-tubed vacuum
rested on a part of the case away from any boards to grab the dust that's
being blown.  Works great, and you don't 'recycle' the dust.....
