thr3ads.net - CentOS - [CentOS] Hung kernel task on CentOS 7.2, fprintd and libvirtd [Feb 2016]

If this information is useful, please help other people find it:
Share via:

Warren Young

2016-Feb-27 01:17 UTC

[CentOS] Hung kernel task on CentOS 7.2, fprintd and libvirtd

A CentOS 7 box here recently started being very slow to give the Password prompt
when using sudo.  (25 seconds!)  I eventually tracked this down to the following
complaint in the kernel message log:
> Feb 26 12:55:05 badboy kernel: INFO: task fprintd:1932 blocked for more
than 120 seconds.
> Feb 26 12:55:05 badboy kernel: "echo 0 >
/proc/sys/kernel/hung_task_timeout_secs" disables this message.
> Feb 26 12:55:05 badboy kernel: fprintd         D ffff88021f7300f8     0 
1932      1 0x00000080
> Feb 26 12:55:05 badboy kernel: ffff88022354bdc0 0000000000000086
ffff8800c54e8000 ffff88022354bfd8
> Feb 26 12:55:05 badboy kernel: ffff88022354bfd8 ffff88022354bfd8
ffff8800c54e8000 ffff88021f7300f0
> Feb 26 12:55:05 badboy kernel: ffff88021f7300f4 ffff8800c54e8000
00000000ffffffff ffff88021f7300f8
> Feb 26 12:55:05 badboy kernel: Call Trace:
> Feb 26 12:55:05 badboy kernel: [<ffffffff8163b9e9>]
schedule_preempt_disabled+0x29/0x70
> Feb 26 12:55:05 badboy kernel: [<ffffffff816396e5>]
__mutex_lock_slowpath+0xc5/0x1c0
> Feb 26 12:55:05 badboy kernel: [<ffffffff81638b4f>]
mutex_lock+0x1f/0x2f
> Feb 26 12:55:05 badboy kernel: [<ffffffff81449729>]
read_descriptors+0x39/0x110
> Feb 26 12:55:05 badboy kernel: [<ffffffff8125c28b>] ? read+0x6b/0x1f0
> Feb 26 12:55:05 badboy kernel: [<ffffffff8125c2fb>] read+0xdb/0x1f0
> Feb 26 12:55:05 badboy kernel: [<ffffffff811de43c>]
vfs_read+0x9c/0x170
> Feb 26 12:55:05 badboy kernel: [<ffffffff811def8f>]
SyS_read+0x7f/0xe0
> Feb 26 12:55:05 badboy kernel: [<ffffffff81645909>]
system_call_fastpath+0x16/0x1b

Since fprintd is the fingerprint reader daemon and this server doesn?t have such
a peripheral, I removed the service.  Now sudo is fast again. [*]

Although that solves my immediate problem, while debugging all of this I noticed
substantially the same error message also appearing for libvirtd:
> Feb 26 12:57:05 badboy kernel: INFO: task libvirtd:1514 blocked for more
than 120 seconds.
> Feb 26 12:57:05 badboy kernel: "echo 0 >
/proc/sys/kernel/hung_task_timeout_secs" disables this message.
> Feb 26 12:57:05 badboy kernel: libvirtd        D ffff88021f7300f8     0 
1514      1 0x00000080
> Feb 26 12:57:05 badboy kernel: ffff8800c08ffde0 0000000000000086
ffff8800c550e780 ffff8800c08fffd8
> Feb 26 12:57:05 badboy kernel: ffff8800c08fffd8 ffff8800c08fffd8
ffff8800c550e780 ffff88021f7300f0
> Feb 26 12:57:05 badboy kernel: ffff88021f7300f4 ffff8800c550e780
00000000ffffffff ffff88021f7300f8
> Feb 26 12:57:05 badboy kernel: Call Trace:
> Feb 26 12:57:05 badboy kernel: [<ffffffff8163b9e9>]
schedule_preempt_disabled+0x29/0x70
> Feb 26 12:57:05 badboy kernel: [<ffffffff816396e5>]
__mutex_lock_slowpath+0xc5/0x1c0
> Feb 26 12:57:05 badboy kernel: [<ffffffff81638b4f>]
mutex_lock+0x1f/0x2f
> Feb 26 12:57:05 badboy kernel: [<ffffffff81449100>]
manufacturer_show+0x20/0x50
> Feb 26 12:57:05 badboy kernel: [<ffffffff813f1c20>]
dev_attr_show+0x20/0x60
> Feb 26 12:57:05 badboy kernel: [<ffffffff8125928a>]
sysfs_read_file+0x9a/0x1a0
> Feb 26 12:57:05 badboy kernel: [<ffffffff811de43c>]
vfs_read+0x9c/0x170
> Feb 26 12:57:05 badboy kernel: [<ffffffff811def8f>]
SyS_read+0x7f/0xe0
> Feb 26 12:57:05 badboy kernel: [<ffffffff81645909>]
system_call_fastpath+0x16/0x1b

I don?t want to employ the same solution in this case ? i.e. disable libvirtd ?
because while I don?t yet use KVM on that server, I actually had plans to do so
soon.  I could host those VMs elsewhere, but that feels like sweeping the
problem under the rug.

I want to fix this, but I?m now stuck.

I?ve ruled out several possible causes already:

1. It isn?t DNS.  DNS responds quickly.

2. It isn?t a flaky OS drive.  Swapping the spinning rust drive out for an SSD
was already on the wish list for this system, so I took the opportunity to
reinstall the OS fresh on a brand new SSD.  The same problem occurred shortly
after booting CentOS 7.2 for the first time, and it continues to happen
periodically.

The only thing I?ve copied over from the old spinning drive so far is /home and
/usr/local, and I can?t see how that could affect the kernel.  Besides, the
problem still occurs while booted into single-user mode.

3. It isn?t the RAID card.  A RAID verify pass completed successfully, as did
smartctl -t long tests on all of the individual drives.  I even disabled the
RAID card?s OPROM and blacklisted its driver at one point, just to see if the
driver or RAID BIOS were the cause of the problem.  I haven?t actually pulled
the card yet, but that seems unlikely to solve anything, given that the array
seems to be storing data reliably.

4. It isn?t the RAM.  It just passed a memtest86+ run.  That probably exonerates
the CPU and north bridge, too.  (I used v4.20 from the CentOS 7.2 install ISO.)

5. It isn?t stale packages.  The OS reinstall proves that.  It?s currently
running the tip of CentOS 7.2 and still showing the same symptoms.

6. It isn?t high-uptime kernel space cruft, since it?s been rebooted a bunch of
times during all of this, including several hard reboots.

Rebooting is especially painful at the moment because the system hangs on
shutdown waiting for libvirtd to respond.  While I had fprintd installed, that
was causing 3-minute hangs on boot, too.

Very frustrating!



[*] In case it isn?t clear how fprintd could affect sudo, it?s because fprintd
installs a PAM module, and sudo uses PAM for authentication.  If fprintd is
stuck, the PAM call stalls until it gives up and moves on.

I tracked this down with strace.  The 25 seconds above is the timeout value
passed to poll(2); that timeout is hit while sudo (via PAM) is trying to talk to
dbus, since apparently fprintd communicates via dbus.

Because the problem is in PAM and not sudo, it also affected su.  It did not
affect console or ssh user logins for some reason.

I?m posting these details in case this diagnosis helps someone.  While chasing
this, I found a bunch of postings on the net from people who have run into such
problems before, but none of the threads mentioned this particular failure mode.
For all I know, I?m the first person it?s ever happened to, so this needs to be
in the archives.

Gordon Messmer

2016-Feb-27 03:17 UTC

head link

[CentOS] Hung kernel task on CentOS 7.2, fprintd and libvirtd

On 02/26/2016 05:17 PM, Warren Young wrote:> I tracked this down with strace.  The 25 seconds above is the timeout value
passed to poll(2); that timeout is hit while sudo (via PAM) is trying to talk to
dbus, since apparently fprintd communicates via dbus.
If you can still replicate the problem, it might be instructive to use 
"dbus-monitor" to find out what is (and isn't) happening during 
authentication.

Warren Young

2016-Mar-01 00:50 UTC

head link

[CentOS] Hung kernel task on CentOS 7.2, fprintd and libvirtd

On Feb 26, 2016, at 8:17 PM, Gordon Messmer <gordon.messmer at gmail.com>
wrote:> 
> On 02/26/2016 05:17 PM, Warren Young wrote:
>> I tracked this down with strace.  The 25 seconds above is the timeout
value passed to poll(2); that timeout is hit while sudo (via PAM) is trying to
talk to dbus, since apparently fprintd communicates via dbus.
> 
> If you can still replicate the problem, it might be instructive to use
"dbus-monitor" to find out what is (and isn't) happening during
authentication.
That didn?t turn up anything enlightening, so I went back to strace, this time
on libvirtd, and NAILED IT.  (Don'tcha love that feeling?)

It turns out that libvirtd was blocking on a read(2) call for 

  /sys/devices/pci0000:00/0000:00:1c.0/0000:02:00.0/usb2/manufacturer

That call should complete near-instantly.  Hmm!!

So, I said lsusb, and *it blocked, too!*

I had the lid off this box recently, and remembered seeing another card beside
the RAID card, a cheap 2-port USB-3 card we dropped in there a few years ago for
a one-off project and then never used it again.  I removed that card, rebooted
the server, and libvirtd is now behaving correctly.

Presumably this is also what was going wrong with fprintd: there must be
USB-attached fingerprint readers, so if it went and tried to enumerate the USB
bus looking for one, it would get stuck just like libvirtd did.

I?ve had more problems with cheap USB gear?grrrr.

Apparently Analagous Threads

Search for more seemingly similar threads

CentOS - Feb 2016 - Hung kernel task on CentOS 7.2, fprintd and libvirtd

[CentOS] Hung kernel task on CentOS 7.2, fprintd and libvirtd

[CentOS] Hung kernel task on CentOS 7.2, fprintd and libvirtd

[CentOS] Hung kernel task on CentOS 7.2, fprintd and libvirtd

Apparently Analagous Threads