thr3ads.net - CentOS - [CentOS] CentOS 6 spontaneous reboots [May 2016]

Bill Gee

2016-May-30 00:10 UTC

[CentOS] CentOS 6 spontaneous reboots

Hello everyone -

My CentOS 6.8 server has been rebooting itself every 2 to 4 hours for the last 
several days.  I do not know where to look for logs that might give a clue what 
the problem is.  There are no unusual entries in /var/log/messages.  I looked 
over other log files in /var/log and found nothing suggestive.  Where else can I
look?

By luck I saw the beginning of a reboot on the server console.  Normally I have 
other systems up on the KVM switch.  It appears to have dumped core.  I don't
know where to look for the core dump files.  They are not in /root.

The problem started while the server was still 6.7.  It had almost 290 days of 
uptime when the problem started.  I have tried the following, none of which 
made any difference.  

I ran the upgrade to 6.8.

I tried stopping non-essential services two at a time.

I ran MemTest 86+.  No memory errors were found.

I turned off swap.

I unplugged the USB hard drive that I use to hold daily and weekly backups.  It 
was being recognized as /dev/sda for some reason.

Lm_sensors shows the processor running between 45 and 50C.  Hddtemp 
shows the hard drives running between 35 and 40C.  LM_Sensors does not 
produce valid data on fan speed, but a visual check shows all fans running 
normally and no build-up of dust.

The system is behind a big UPS that runs several other systems.  The UPS log 
file does not record any power failures and none of the other systems are 
rebooting at random.

What else can I look at?

Thanks - Bill Gee

Keith Keller

2016-May-30 00:48 UTC

Hi Bill,

On 2016-05-30, Bill Gee <bgee at campercaver.net> wrote:
> By luck I saw the beginning of a reboot on the server console.  Normally I have
> other systems up on the KVM switch.  It appears to have dumped core.  I don't
> know where to look for the core dump files.  They are not in /root.

One place you might check is under /var/lib.  I think there may be a
/var/lib/crash directory which contains core dumps.

> I ran MemTest 86+.  No memory errors were found.

Another option is to try Advanced Cluster Breakin, which runs other
tests besides memory.

http://www.advancedclustering.com/products/software/breakin/

I've had it find problems that memtest hasn't (and vice-versa).

> Lm_sensors shows the processor running between 45 and 50C.

If the system supports IPMI, check those sensors and logs, there may be
something useful there.  If you don't have IPMI, there may still be
something in the BIOS logs (how you get to those varies wildly, you may
need to boot into the BIOS to do it).
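
For a system that does have a BMC, the IPMI check suggested above might look
like the sketch below.  It assumes the ipmitool package is installed and the
kernel IPMI drivers are loaded; the function name ipmi_report and the IPMITOOL
override variable are made up for illustration.

```shell
#!/bin/sh
# Print the BMC's event log and its temperature sensors.
# IPMITOOL may be set to the path of a different ipmitool binary.
ipmi_report() {
    tool="${IPMITOOL:-ipmitool}"
    if ! command -v "$tool" >/dev/null 2>&1; then
        echo "$tool not found; install the ipmitool package" >&2
        return 1
    fi
    # System Event Log: hardware events the BMC has recorded, with timestamps.
    "$tool" sel elist
    # Temperature readings from the sensor data repository.
    "$tool" sdr type Temperature
}
```

Run ipmi_report as root on the affected server and compare the event
timestamps with the times of the reboots.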

I hope that helps!

--keith

-- 
kkeller at wombat.san-francisco.ca.us

FrancisM

2016-May-30 00:59 UTC

Check the hardware health.  A faulty component could be triggering the
reboots, or the processor could be overheating.  Check that the fans are
still working.

Anthony K

2016-May-30 03:20 UTC

On 30/05/16 10:10, Bill Gee wrote:
> What else can I look at?

TL;DR

sar -m TEMP | less
(sar can be found in the sysstat package)

---

I have a Debian based media server that was exhibiting similar symptoms 
after having served me well for more than 4 years.  During my 
troubleshooting, I came across Brendan Gregg's website [0] via his 
2-part YouTube video [1] on Linux Performance Tools.

Suffice it to say that I was able to nail my problem by installing the 
package 'sysstat' and enabling all performance counters. Within 2 days, 
I identified the root cause - temperature spikes during media cataloging 
which occurred on a periodic basis at 3am and whenever I added a 
video to my library.  The following command was instrumental in 
determining this:

sar -m TEMP | less

I was able to correlate the output of the command above with my Plex 
logs.  Since I did not have time to open the system up and do the 
thorough cleaning that was necessary, I wrote a script to monitor the 
temperature and to throttle the CPU down if the temperature hit 85% of 
max (the AMD CPU in the media server doesn't throttle internally).  
After time availed itself, I cleaned the server innards and changed the 
thermal paste which had dried up and wasn't performing optimally.
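
The monitoring script itself is not shown in this post.  As a rough sketch of
the 85%-of-max check only, something like the following could work.  The
function names are hypothetical, and it assumes the temperature is read in
millidegrees Celsius from a file such as
/sys/class/thermal/thermal_zone0/temp, which not every system provides.

```shell
#!/bin/sh
# Succeed (exit status 0) when the current temperature has reached 85% of
# the maximum.  Both arguments are integers in millidegrees Celsius.
over_threshold() {
    current="$1"
    max="$2"
    [ $((current * 100)) -ge $((max * 85)) ]
}

# Read one temperature file and apply the check.
# Returns 2 if the reading is missing or not a plain non-negative integer.
check_zone() {
    zone_file="$1"
    max="$2"
    current=$(cat "$zone_file" 2>/dev/null) || return 2
    case "$current" in
        ''|*[!0-9]*) return 2 ;;
    esac
    over_threshold "$current" "$max"
}
```

A cron job or loop could call check_zone with the CPU's rated maximum and
lower the CPU frequency whenever it succeeds; how to throttle depends on the
CPU and distribution, so that part is left out here.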

In any case, Brendan Gregg's website and Youtube video were (and still 
are) very helpful.


My $0.02
ak.

[0]    http://www.brendangregg.com/index.html
[1]    https://www.youtube.com/watch?v=FJW8nGV4jxY

Bill Gee

2016-May-30 19:46 UTC

Hello everyone -

I found the core dumps.  They are in /var/crash.  This directory contains a 
directory for each crash, named by IP address-date-time.  Each directory 
contains a vmcore and a vmcore-dmesg.txt file.

The vmcore-dmesg.txt files are mostly the kernel initialization stuff, same as
you would see in dmesg.  At the end, though, is some information about the 
process that was executing when the crash happened.  

I reviewed several of those and found a common process - aiccu!  That seems 
very odd since I have been running aiccu and Sixxs for over five years.  It has 
never given me any trouble before.  The package I have on this server came 
from the EPEL repository and has not changed for several years.  The Sixxs web 
site also shows no change in aiccu for many years.
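
Reviewing the dumps by hand can be shortened with a small tally across all
crash directories.  This is only a sketch: it assumes the crash text in each
vmcore-dmesg.txt contains a "comm: <name>" field naming the process, which is
how kernels of this era usually print it, and the function name crash_comms
is made up for illustration.

```shell
#!/bin/sh
# Count how often each process name appears as the crashing process.
# Argument: the directory holding one subdirectory per crash
# (defaults to /var/crash).
crash_comms() {
    crash_dir="${1:-/var/crash}"
    for f in "$crash_dir"/*/vmcore-dmesg.txt; do
        [ -f "$f" ] || continue
        # Keep only the last match in each file; earlier matches may come
        # from unrelated warnings logged before the crash.
        grep -o 'comm: [^ ]*' "$f" | tail -n 1
    done | sort | uniq -c | sort -rn
}
```

Running crash_comms with no argument scans /var/crash and prints one line per
process name, most frequent first.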

I also found, by chance, an operation that seems to always trigger the crash.
If I go to my main workstation (Fedora 23) and tell Akregator to "refresh all
feeds", that is guaranteed to produce a crash.  There are probably other
operations that can force a crash, but I have not found them.  

For now I have turned off ipv6 forwarding and stopped the radvd service.  That 
should keep aiccu from handling anything.
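
The exact commands used are not given in this post.  On a stock CentOS 6
system the two steps would probably look like the sketch below.  The wrapper
functions run and stop_ipv6_routing are hypothetical, and the script only
prints the commands unless DRY_RUN=0 is set.

```shell
#!/bin/sh
# Dry-run by default: print each command instead of executing it.
# Set DRY_RUN=0 and run as root to apply the changes.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

stop_ipv6_routing() {
    # Stop forwarding IPv6 packets between interfaces.
    run sysctl -w net.ipv6.conf.all.forwarding=0
    # Stop sending router advertisements (CentOS 6 uses SysV init scripts).
    run service radvd stop
}
```

Note that sysctl -w does not survive a reboot; a persistent change would go
in /etc/sysctl.conf, and chkconfig radvd off would keep the service from
starting at boot.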

It is nice to know it is not some funky hardware problem.  Still, it would be
nice to have it working.  Any thoughts?

Thanks - Bill Gee
