Every so often, one of our servers will go into what I can only describe
as an undefined state: it pings, but there's zero access - you can't ssh
in, and if I go plug a keyboard and monitor into the server itself, you
can see the monitor's live, it's not the "monitor turned off" color, but
there is zero response to the keyboard. The upshot is that I wind up
having to power cycle it.
Well, it just happened again on one of our servers Friday evening, as I
found this morning. Looking at the logs this morning, I see that sar last
shows
10:20:01 PM all 34.38 0.00 8.29 0.00 0.00 57.33
One of my users dropped me an email at 22:45 that it was "off", and the
last things I see in /var/log/messages are one of those annoying
Feb 21 22:26:23 <server> kernel: INFO: task perl:20596 blocked for more than 120 seconds.
Feb 21 22:26:23 <server> kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
I also see
Feb 21 22:26:23 <server> kernel: perl D ffffffff80158250 0 20596 20557
which, as I just found by googling perl NOTLD, means that this is in a
kernel uninterruptible state.
In addition, the stack trace has some nfs messages
Feb 21 22:26:23 <server> kernel: [<ffffffff886b58d1>] :nfs:nfs_wait_bit_uninterruptible+0x0/0xd
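For anyone wanting to pull these entries out of a long /var/log/messages, here is a minimal Python sketch. It assumes the log lines look like the ones quoted above; the function names `find_hung_tasks` and `nfs_frames` are made up for illustration, and other kernel versions may word the messages differently.

```python
import re

# Matches the hung-task warning in the form quoted above, e.g.
# "kernel: INFO: task perl:20596 blocked for more than 120 seconds."
HUNG_TASK = re.compile(
    r"kernel: INFO: task (?P<comm>\S+):(?P<pid>\d+) "
    r"blocked for more than (?P<secs>\d+) seconds"
)

# Matches stack-trace frames attributed to the nfs module, e.g.
# ":nfs:nfs_wait_bit_uninterruptible+0x0/0xd"
NFS_FRAME = re.compile(r":nfs:(?P<symbol>\w+)\+0x")


def _flatten(log_text):
    # Collapse all whitespace so that an entry wrapped across lines
    # (as mail clients tend to do) still matches.
    return " ".join(log_text.split())


def find_hung_tasks(log_text):
    """Return a list of (command, pid, seconds) for each hung-task warning."""
    return [
        (m.group("comm"), int(m.group("pid")), int(m.group("secs")))
        for m in HUNG_TASK.finditer(_flatten(log_text))
    ]


def nfs_frames(log_text):
    """Return the nfs module symbols that appear in stack-trace lines."""
    return [m.group("symbol") for m in NFS_FRAME.finditer(_flatten(log_text))]


if __name__ == "__main__":
    with open("/var/log/messages") as f:
        text = f.read()
    print(find_hung_tasks(text))
    print(nfs_frames(text))
```

If the same command and the same nfs symbol keep turning up across incidents, that would at least be consistent with the NFS theory.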
So, it *appears* to be either an NFS issue, or a NIC issue. The user's
home directory server is running CentOS 6.5, and the server that hung is
5.10. Running mount on the formerly hung server, su'd to his account,
shows merely nfs, so I'm guessing it's NFS3. Looking at lsmod and
/var/log/dmesg, I see it's running the tg3 NIC driver.
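Rather than guessing the NFS version from the mount type, the mount options can be read from /proc/mounts. The following is a rough Python sketch, not a definitive check: it assumes the kernel lists a `vers=` (or `nfsvers=`) option for NFS mounts, which may not hold on every kernel, and it reports no version rather than guess when the option is absent. The function name `nfs_mounts` is hypothetical.

```python
def nfs_mounts(mounts_text):
    """Return (mount_point, fstype, version) for each NFS entry.

    mounts_text is expected in /proc/mounts format:
        device mount_point fstype options dump pass
    version is None when no version option is listed.
    """
    found = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4 or fields[2] not in ("nfs", "nfs4"):
            continue
        version = None
        for opt in fields[3].split(","):
            key, _, value = opt.partition("=")
            # Assumption: the version shows up as vers=N or nfsvers=N.
            if key in ("vers", "nfsvers") and value:
                version = value
        if version is None and fields[2] == "nfs4":
            version = "4"
        found.append((fields[1], fields[2], version))
    return found


if __name__ == "__main__":
    with open("/proc/mounts") as f:
        for mount_point, fstype, version in nfs_mounts(f.read()):
            print(mount_point, fstype, version or "version not listed")
```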
Anyone else seeing this, and if so, any thoughts on the matter? Note that
I've had this on Penguins, which are all Supermicro, and they're using the
igb NIC driver, but the one this past weekend is a Dell, so it's not just
one system.
mark
On Mon, 24 Feb 2014, m.roth at 5-cent.us wrote:
> Anyone else seeing this, and if so, any thoughts on the matter? Note that
> I've had this on Penguins, which are all Supermicro, and they're using the
> igb NIC driver, but the one this past weekend is a Dell, so it's not just
> one system.

What CPU's do these systems have? AMD or Intel.
What kernel are the server and client running?

-Connie Sieh
It seems the system was in a hung state. The message

> Feb 21 22:26:23 <server> kernel: INFO: task perl:20596 blocked for more than 120 seconds.

just indicates that process 20596 was blocked in uninterruptible sleep
(D state) for more than 120 seconds.

To begin troubleshooting, I would suggest you check what this process
does, and whether it requires any remote storage/disk access. By the way,
is it always the same perl process that goes into D state first?

If the perl application needs no remote storage/disk access and you are
running it as root, try running it as a normal user to which resource
limits apply via limits.conf.

If the process requires storage/NFS access, you may want to check the
disk/storage status at the time the application moved to D state.

I understand that you can't predict when the issue will occur in order to
perform all the checks mentioned above. AFAIK, to find the root cause of
this problem, you may want to analyse a core dump collected at the time of
the issue.

Cheers,
Dominic
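One way to see whether the same process keeps landing in D state is to list uninterruptible tasks periodically while the machine still responds. The Python sketch below reads /proc/<pid>/stat; the function names are hypothetical, and it is only useful before the box stops responding (for example, run from cron with the output logged).

```python
import os


def parse_stat(stat_text):
    """Return (pid, comm, state) from the contents of /proc/<pid>/stat."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so split on the last ')' rather than on whitespace.
    head, _, tail = stat_text.rpartition(")")
    pid_text, _, comm = head.partition("(")
    return int(pid_text), comm, tail.split()[0]


def uninterruptible_tasks(proc="/proc"):
    """Return (pid, comm) for every process currently in state D."""
    tasks = []
    for entry in os.listdir(proc):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc, entry, "stat")) as f:
                pid, comm, state = parse_stat(f.read())
        except (OSError, ValueError, IndexError):
            continue  # process exited or the stat line was unreadable
        if state == "D":
            tasks.append((pid, comm))
    return tasks


if __name__ == "__main__":
    for pid, comm in uninterruptible_tasks():
        print(pid, comm)
```

A process that shows up briefly in D state is normal during I/O; what matters here is one that stays there across several samples.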
Here are some suggestions:
1. Enable and configure kdump
2. Enable Magic SysRq
3. Consider enabling "kernel.softlockup_panic" and "vm.panic_on_oom",
but doing so will cause your server to crash sooner than it would
normally. It depends upon whether you want to capture the first
instance (e.g. smoking gun) or wait until the system is completely
hosed (and may have more evidence of the issue).
Then test and verify that Magic SysRq can be used to generate a
kernel core dump.
Then, sit back and wait .....
I do this on all my production servers -- saving the pain of having
to do this under pressure plus capturing the vmcore on the first
instance is very much worth the effort ....
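To confirm that the settings from the list above are actually in effect on a given box, something like the following Python sketch can be run. It only reads values under /proc/sys and changes nothing; the function names are hypothetical, it does not check whether kdump itself is configured, and a setting that does not exist on a particular kernel is simply reported as unreadable.

```python
import os

# sysctl names from the list above, mapped to their paths under /proc/sys.
SETTINGS = {
    "kernel.sysrq": "kernel/sysrq",
    "kernel.softlockup_panic": "kernel/softlockup_panic",
    "vm.panic_on_oom": "vm/panic_on_oom",
}


def read_settings(root="/proc/sys"):
    """Return {sysctl_name: value_string, or None if unreadable}."""
    values = {}
    for name, rel_path in SETTINGS.items():
        try:
            with open(os.path.join(root, rel_path)) as f:
                values[name] = f.read().strip()
        except OSError:
            values[name] = None
    return values


def disabled(values):
    """Return the sorted names whose value is "0" or could not be read."""
    return sorted(n for n, v in values.items() if v in (None, "0"))


if __name__ == "__main__":
    current = read_settings()
    for name in sorted(current):
        print(name, "=", current[name])
    for name in disabled(current):
        print("not enabled:", name)
```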
HTH
-rak-