PJ
2011-Mar-11 17:33 UTC
[CentOS] Server locking up everyday around 3:30 AM - (INFO: task wget:13608 blocked for more than 120 seconds) need sleep, help.
This may or may not be CentOS related, but I am out of ideas at this point and wanted to bounce this off the list.

I'm running a CentOS 5.5 server with the latest kernel, 2.6.18-194.32.1.el5.

Almost every day around 3:30 AM the server completely locks up and has to be power cycled before it will come back online. (This means someone has to wake up and reboot the server. Oh, how I love being an internet janitor! :)

It smells like a hardware issue to me too, but I went through all of the Dell diagnostics and updated the firmware, and everything checks out as okay: RAID, disks, RAM, etc. I spent an hour on the phone with a Dell tech. No hardware issues, at least none that we were able to find.

There are no cron jobs that run at 3:30 and no backups; the server has a load of 0, and nothing is scheduled around that time.

The only crontab entry at all is:

*/5 * * * * wget -q www.websitedomain.com/cron.php >/dev/null 2>&1

They are running Magento for commerce purposes, and this runs every 5 minutes.

Why does the server only lock up around 3:30 AM? Because it knows I am fast asleep?

I was able to pull this from /var/log/messages. It is logged just seconds before the server locks up completely:

Mar 8 03:33:18 web1 kernel: INFO: task wget:13608 blocked for more than 120 seconds.
Mar 8 03:33:19 web1 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Mar 8 03:33:19 web1 kernel: wget D ffff810001004420 0 13608 13607 (NOTLB)
Mar 8 03:33:19 web1 kernel: ffff81007bc7bc78 0000000000000086 ffff81007bc7bd88 ffff81000100d3f8
Mar 8 03:33:19 web1 kernel: ffff81007bc7bbf0 0000000000000007 ffff8100849db0c0 ffffffff80308b60
Mar 8 03:33:19 web1 kernel: 00013a2964cdf439 0000000000003237 ffff8100849db2a8 0000000064c82eae
Mar 8 03:33:19 web1 kernel: Call Trace:
Mar 8 03:33:20 web1 kernel: [<ffffffff80063c6f>] __mutex_lock_slowpath+0x60/0x9b
Mar 8 03:33:20 web1 kernel: [<ffffffff80063cb9>] .text.lock.mutex+0xf/0x14
Mar 8 03:33:20 web1 kernel: [<ffffffff8000cf82>] do_lookup+0x90/0x1e6
Mar 8 03:33:20 web1 kernel: [<ffffffff8000a29c>] __link_path_walk+0xa01/0xf5b
Mar 8 03:33:20 web1 kernel: [<ffffffff8000ea4b>] link_path_walk+0x42/0xb2
Mar 8 03:33:20 web1 kernel: [<ffffffff8000cd72>] do_path_lookup+0x275/0x2f1
Mar 8 03:33:23 web1 kernel: [<ffffffff80012851>] getname+0x15b/0x1c2
Mar 8 03:33:23 web1 kernel: [<ffffffff800239d1>] __user_walk_fd+0x37/0x4c
Mar 8 03:33:23 web1 kernel: [<ffffffff80028905>] vfs_stat_fd+0x1b/0x4a
Mar 8 03:33:23 web1 kernel: [<ffffffff80023703>] sys_newstat+0x19/0x31
Mar 8 03:33:23 web1 kernel: [<ffffffff8005d116>] system_call+0x7e/0x83

If anyone has some advice on where to go from here, it would be greatly appreciated.

Thanks in advance.

--
PJF
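To see whether the hang really clusters around 3:30 AM, it may help to pull every hung-task warning out of the log and compare the timestamps across days. The following is only a rough sketch: hung_task_times is a made-up helper name, and it assumes the syslog line layout shown in the message above.

```shell
#!/bin/sh
# Sketch only: hung_task_times is a hypothetical helper, not a standard tool.
# It assumes syslog lines shaped like the ones quoted in this thread.
hung_task_times() {
    # $1: log file to scan (defaults to /var/log/messages)
    logfile="${1:-/var/log/messages}"
    # For each hung-task warning, print month, day, time and "task:pid".
    awk '/blocked for more than/ {
        for (i = 1; i < NF; i++)
            if ($i == "task") { print $1, $2, $3, $(i + 1); break }
    }' "$logfile"
}
```

Running hung_task_times with no argument as root reads /var/log/messages; rotated logs such as /var/log/messages.1 can be passed as the first argument.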
Boris Epstein
2011-Mar-11 17:42 UTC
[CentOS] Server locking up everyday around 3:30 AM - (INFO: task wget:13608 blocked for more than 120 seconds) need sleep, help.
On Fri, Mar 11, 2011 at 12:33 PM, PJ <pauljerome at gmail.com> wrote:
> There are no cron jobs that run at 3:30, no backups, the server has a
> load of 0, nothing is scheduled around that time...
>
> The only crontab entry at all is "*/5 * * * * wget -q
> www.websitedomain.com/cron.php >/dev/null 2>&1"

Have you tried disabling the cron job you think is at fault to see if the lockup goes away?

Also, have you checked all the users' crontabs?

Boris.
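One way to check every user's crontab at once is to walk the cron spool directory as root. This is a minimal sketch, assuming the per-user crontabs live in /var/spool/cron (the usual location on CentOS); list_user_crontabs is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch only: list_user_crontabs is a hypothetical helper.
# Assumes per-user crontabs are stored as files named after the user.
list_user_crontabs() {
    # $1: spool directory (defaults to /var/spool/cron; reading it needs root)
    spool="${1:-/var/spool/cron}"
    for f in "$spool"/*; do
        [ -f "$f" ] || continue
        echo "== $(basename "$f")"
        # Hide comments and blank lines so only live entries are shown.
        # grep exits non-zero when nothing is left, hence the "|| true".
        grep -v -e '^[[:space:]]*#' -e '^[[:space:]]*$' "$f" || true
    done
}
```

The system-wide entries in /etc/crontab and /etc/cron.d/ are not covered by this and need a separate look.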
Denniston, Todd A CIV NAVSURFWARCENDIV Crane
2011-Mar-11 21:31 UTC
[CentOS] Server locking up everyday around 3:30 AM - (INFO: task wget:13608 blocked for more than 120 seconds) need sleep, help.
> -----Original Message-----
> From: centos-bounces at centos.org [mailto:centos-bounces at centos.org] On Behalf Of PJ
> Sent: Friday, March 11, 2011 12:34
> To: centos at centos.org
> Subject: [CentOS] Server locking up everyday around 3:30 AM - (INFO: task wget:13608 blocked for more than 120 seconds) need sleep, help.

<SNIP>
> There are no cron jobs that run at 3:30, no backups, the server has a
> load of 0, nothing is scheduled around that time...
<SNIP>

Are you sure the jobs in /etc/cron.daily/ have finished by then, or have not started yet? It could be something like mlocate or makewhatis chewing up CPU or memory.

IIRC the scripts in /etc/cron.daily/ run in alphabetical order, so: are you (root) getting the logwatch messages, and at what time?
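A quick way to answer both questions is to list the daily jobs in sorted order and look up when run-parts is scheduled to start them. This is a sketch under assumptions: the helper names are made up, and it assumes run-parts walks the directory in sorted order as recalled above (the exact collation may depend on the locale cron runs with).

```shell
#!/bin/sh
# Sketch only: both function names are hypothetical helpers.

# List the scripts in a cron directory in C-locale sorted order.
daily_jobs_in_order() {
    # $1: directory to list (defaults to /etc/cron.daily)
    dir="${1:-/etc/cron.daily}"
    LC_ALL=C ls -1 "$dir"
}

# Show the crontab lines that start run-parts, including their times.
run_parts_schedule() {
    # $1: system crontab to read (defaults to /etc/crontab)
    crontab_file="${1:-/etc/crontab}"
    grep 'run-parts' "$crontab_file"
}
```

If the start time printed by run_parts_schedule falls shortly before the lockups, the job that sits next in the daily_jobs_in_order listing after the last one that logged anything is a reasonable suspect.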
Ross Walker
2011-Mar-12 02:07 UTC
[CentOS] Server locking up everyday around 3:30 AM - (INFO: task wget:13608 blocked for more than 120 seconds) need sleep, help.
On Mar 11, 2011, at 12:33 PM, PJ <pauljerome at gmail.com> wrote:
> Almost everyday around 3:30 AM the server completely locks up and has
> to be power cycled before it will come back online.
>
> The only crontab entry at all is "*/5 * * * * wget -q
> www.websitedomain.com/cron.php >/dev/null 2>&1"
>
> If anyone has some advice on where to go from here it would be greatly
> appreciated.

Do an fsck of the file system wget is writing to, as there might be corruption that it hits only on the 3:30 AM run, since that's when the other vendor dumps data to be downloaded.

You could also check whether a RAID patrol read (scrub/predictive failure detection) is happening around this time, and disable or reschedule it.

-Ross
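Before running fsck you need to know which device backs the directory wget writes into. A minimal sketch, assuming df is available and using a hypothetical helper name, fs_device_for; the directory passed to it is a placeholder for wherever the cron job actually runs.

```shell
#!/bin/sh
# Sketch only: fs_device_for is a hypothetical helper.
fs_device_for() {
    # $1: any path on the file system of interest
    # df -P prints a header plus one POSIX-format line for the path;
    # the first field of that line is the backing device.
    df -P "$1" | awk 'NR == 2 { print $1 }'
}
```

For example, fs_device_for /root prints the device holding /root. A read-only pass such as fsck -n on that device reports problems without changing anything, though results on a mounted file system can be misleading; an actual repair should be done with the file system unmounted. The patrol read check is controller-specific, so no example is given for it here.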
Alexander Georgiev
2011-Mar-12 08:07 UTC
[CentOS] Server locking up everyday around 3:30 AM - (INFO: task wget:13608 blocked for more than 120 seconds) need sleep, help.
> Almost everyday around 3:30 AM the server completely locks up and has
> to be power cycled before it will come back online.
>
> There are no cron jobs that run at 3:30, no backups, the server has a
> load of 0, nothing is scheduled around that time...

Do you have smartd set to run short/long hard disk self-tests during the night? That is configured in /etc/smartd.conf, not via cron.
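Scheduled self-tests show up in smartd.conf as a -s directive with a regular expression in T/MM/DD/d/HH form, so the active ones can be listed with a couple of greps. This is a rough sketch; smartd_schedules is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch only: smartd_schedules is a hypothetical helper.
smartd_schedules() {
    # $1: config file to read (defaults to /etc/smartd.conf)
    conf="${1:-/etc/smartd.conf}"
    # Drop commented-out lines, then keep directives carrying a -s schedule.
    grep -v '^[[:space:]]*#' "$conf" | grep -e ' -s '
}
```

In the schedule expression the last field is the hour, so an entry like S/../.././03 would mean a short self-test every day during the 03:00 hour, which would line up with the lockups.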