Warren Young
2016-Feb-27 01:17 UTC
[CentOS] Hung kernel task on CentOS 7.2, fprintd and libvirtd
A CentOS 7 box here recently started being very slow to give the Password prompt when using sudo. (25 seconds!) I eventually tracked this down to the following complaint in the kernel message log:> Feb 26 12:55:05 badboy kernel: INFO: task fprintd:1932 blocked for more than 120 seconds. > Feb 26 12:55:05 badboy kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. > Feb 26 12:55:05 badboy kernel: fprintd D ffff88021f7300f8 0 1932 1 0x00000080 > Feb 26 12:55:05 badboy kernel: ffff88022354bdc0 0000000000000086 ffff8800c54e8000 ffff88022354bfd8 > Feb 26 12:55:05 badboy kernel: ffff88022354bfd8 ffff88022354bfd8 ffff8800c54e8000 ffff88021f7300f0 > Feb 26 12:55:05 badboy kernel: ffff88021f7300f4 ffff8800c54e8000 00000000ffffffff ffff88021f7300f8 > Feb 26 12:55:05 badboy kernel: Call Trace: > Feb 26 12:55:05 badboy kernel: [<ffffffff8163b9e9>] schedule_preempt_disabled+0x29/0x70 > Feb 26 12:55:05 badboy kernel: [<ffffffff816396e5>] __mutex_lock_slowpath+0xc5/0x1c0 > Feb 26 12:55:05 badboy kernel: [<ffffffff81638b4f>] mutex_lock+0x1f/0x2f > Feb 26 12:55:05 badboy kernel: [<ffffffff81449729>] read_descriptors+0x39/0x110 > Feb 26 12:55:05 badboy kernel: [<ffffffff8125c28b>] ? read+0x6b/0x1f0 > Feb 26 12:55:05 badboy kernel: [<ffffffff8125c2fb>] read+0xdb/0x1f0 > Feb 26 12:55:05 badboy kernel: [<ffffffff811de43c>] vfs_read+0x9c/0x170 > Feb 26 12:55:05 badboy kernel: [<ffffffff811def8f>] SyS_read+0x7f/0xe0 > Feb 26 12:55:05 badboy kernel: [<ffffffff81645909>] system_call_fastpath+0x16/0x1bSince fprintd is the fingerprint reader daemon and this server doesn?t have such a peripheral, I removed the service. Now sudo is fast again. [*] Although that solves my immediate problem, while debugging all of this I noticed substantially the same error message also appearing for libvirtd:> Feb 26 12:57:05 badboy kernel: INFO: task libvirtd:1514 blocked for more than 120 seconds. > Feb 26 12:57:05 badboy kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. > Feb 26 12:57:05 badboy kernel: libvirtd D ffff88021f7300f8 0 1514 1 0x00000080 > Feb 26 12:57:05 badboy kernel: ffff8800c08ffde0 0000000000000086 ffff8800c550e780 ffff8800c08fffd8 > Feb 26 12:57:05 badboy kernel: ffff8800c08fffd8 ffff8800c08fffd8 ffff8800c550e780 ffff88021f7300f0 > Feb 26 12:57:05 badboy kernel: ffff88021f7300f4 ffff8800c550e780 00000000ffffffff ffff88021f7300f8 > Feb 26 12:57:05 badboy kernel: Call Trace: > Feb 26 12:57:05 badboy kernel: [<ffffffff8163b9e9>] schedule_preempt_disabled+0x29/0x70 > Feb 26 12:57:05 badboy kernel: [<ffffffff816396e5>] __mutex_lock_slowpath+0xc5/0x1c0 > Feb 26 12:57:05 badboy kernel: [<ffffffff81638b4f>] mutex_lock+0x1f/0x2f > Feb 26 12:57:05 badboy kernel: [<ffffffff81449100>] manufacturer_show+0x20/0x50 > Feb 26 12:57:05 badboy kernel: [<ffffffff813f1c20>] dev_attr_show+0x20/0x60 > Feb 26 12:57:05 badboy kernel: [<ffffffff8125928a>] sysfs_read_file+0x9a/0x1a0 > Feb 26 12:57:05 badboy kernel: [<ffffffff811de43c>] vfs_read+0x9c/0x170 > Feb 26 12:57:05 badboy kernel: [<ffffffff811def8f>] SyS_read+0x7f/0xe0 > Feb 26 12:57:05 badboy kernel: [<ffffffff81645909>] system_call_fastpath+0x16/0x1bI don?t want to employ the same solution in this case ? i.e. disable libvirtd ? because while I don?t yet use KVM on that server, I actually had plans to do so soon. I could host those VMs elsewhere, but that feels like sweeping the problem under the rug. I want to fix this, but I?m now stuck. I?ve ruled out several possible causes already: 1. It isn?t DNS. DNS responds quickly. 2. It isn?t a flaky OS drive. Swapping the spinning rust drive out for an SSD was already on the wish list for this system, so I took the opportunity to reinstall the OS fresh on a brand new SSD. The same problem occurred shortly after booting CentOS 7.2 for the first time, and it continues to happen periodically. The only thing I?ve copied over from the old spinning drive so far is /home and /usr/local, and I can?t see how that could affect the kernel. Besides, the problem still occurs while booted into single-user mode. 3. It isn?t the RAID card. A RAID verify pass completed successfully, as did smartctl -t long tests on all of the individual drives. I even disabled the RAID card?s OPROM and blacklisted its driver at one point, just to see if the driver or RAID BIOS were the cause of the problem. I haven?t actually pulled the card yet, but that seems unlikely to solve anything, given that the array seems to be storing data reliably. 4. It isn?t the RAM. It just passed a memtest86+ run. That probably exonerates the CPU and north bridge, too. (I used v4.20 from the CentOS 7.2 install ISO.) 5. It isn?t stale packages. The OS reinstall proves that. It?s currently running the tip of CentOS 7.2 and still showing the same symptoms. 6. It isn?t high-uptime kernel space cruft, since it?s been rebooted a bunch of times during all of this, including several hard reboots. Rebooting is especially painful at the moment because the system hangs on shutdown waiting for libvirtd to respond. While I had fprintd installed, that was causing 3-minute hangs on boot, too. Very frustrating! [*] In case it isn?t clear how fprintd could affect sudo, it?s because fprintd installs a PAM module, and sudo uses PAM for authentication. If fprintd is stuck, the PAM call stalls until it gives up and moves on. I tracked this down with strace. The 25 seconds above is the timeout value passed to poll(2); that timeout is hit while sudo (via PAM) is trying to talk to dbus, since apparently fprintd communicates via dbus. Because the problem is in PAM and not sudo, it also affected su. It did not affect console or ssh user logins for some reason. I?m posting these details in case this diagnosis helps someone. While chasing this, I found a bunch of postings on the net from people who have run into such problems before, but none of the threads mentioned this particular failure mode. For all I know, I?m the first person it?s ever happened to, so this needs to be in the archives.
Gordon Messmer
2016-Feb-27 03:17 UTC
[CentOS] Hung kernel task on CentOS 7.2, fprintd and libvirtd
On 02/26/2016 05:17 PM, Warren Young wrote:> I tracked this down with strace. The 25 seconds above is the timeout value passed to poll(2); that timeout is hit while sudo (via PAM) is trying to talk to dbus, since apparently fprintd communicates via dbus.If you can still replicate the problem, it might be instructive to use "dbus-monitor" to find out what is (and isn't) happening during authentication.
Warren Young
2016-Mar-01 00:50 UTC
[CentOS] Hung kernel task on CentOS 7.2, fprintd and libvirtd
On Feb 26, 2016, at 8:17 PM, Gordon Messmer <gordon.messmer at gmail.com> wrote:> > On 02/26/2016 05:17 PM, Warren Young wrote: >> I tracked this down with strace. The 25 seconds above is the timeout value passed to poll(2); that timeout is hit while sudo (via PAM) is trying to talk to dbus, since apparently fprintd communicates via dbus. > > If you can still replicate the problem, it might be instructive to use "dbus-monitor" to find out what is (and isn't) happening during authentication.That didn?t turn up anything enlightening, so I went back to strace, this time on libvirtd, and NAILED IT. (Don'tcha love that feeling?) It turns out that libvirtd was blocking on a read(2) call for /sys/devices/pci0000:00/0000:00:1c.0/0000:02:00.0/usb2/manufacturer That call should complete near-instantly. Hmm!! So, I said lsusb, and *it blocked, too!* I had the lid off this box recently, and remembered seeing another card beside the RAID card, a cheap 2-port USB-3 card we dropped in there a few years ago for a one-off project and then never used it again. I removed that card, rebooted the server, and libvirtd is now behaving correctly. Presumably this is also what was going wrong with fprintd: there must be USB-attached fingerprint readers, so if it went and tried to enumerate the USB bus looking for one, it would get stuck just like libvirtd did. I?ve had more problems with cheap USB gear?grrrr.