Luis Chamberlain
2021-Nov-11 22:48 UTC
update_balloon_size_func blocked for more than 120 seconds
I get the following splats with a kvm guest in idle, after a few seconds
it starts:

[  242.412806] INFO: task kworker/6:2:271 blocked for more than 120 seconds.
[  242.415790]       Tainted: G            E     5.15.0-next-20211111 #68
[  242.417755] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  242.418332] task:kworker/6:2     state:D stack:    0 pid:  271 ppid:     2 flags:0x00004000
[  242.418954] Workqueue: events_freezable update_balloon_size_func [virtio_balloon]
[  242.419518] Call Trace:
[  242.419709]  <TASK>
[  242.419873]  __schedule+0x2fd/0x990
[  242.420142]  schedule+0x4e/0xc0
[  242.420382]  tell_host+0xaa/0xf0 [virtio_balloon]
[  242.420757]  ? do_wait_intr_irq+0xa0/0xa0
[  242.421065]  update_balloon_size_func+0x2c9/0x2e0 [virtio_balloon]
[  242.421527]  process_one_work+0x1e5/0x3c0
[  242.421833]  worker_thread+0x50/0x3b0
[  242.422204]  ? rescuer_thread+0x370/0x370
[  242.422507]  kthread+0x169/0x190
[  242.422754]  ? set_kthread_struct+0x40/0x40
[  242.423073]  ret_from_fork+0x1f/0x30
[  242.423347]  </TASK>

And this goes on endlessly. The last one says it blocked for more than 1208
seconds. This was not happening until the last few weeks, but I see no
relevant recent commits for virtio_balloon, so the related change could be
elsewhere.

I could bisect, but first I figured I'd check to see if someone had already
spotted this.

  Luis
David Hildenbrand
2021-Nov-12 09:13 UTC
update_balloon_size_func blocked for more than 120 seconds
On Thu, Nov 11, 2021 at 11:49 PM Luis Chamberlain <mcgrof at kernel.org> wrote:
>
> I get the following splats with a kvm guest in idle, after a few seconds
> it starts:
>
> [  242.412806] INFO: task kworker/6:2:271 blocked for more than 120 seconds.
> [  242.415790]       Tainted: G            E     5.15.0-next-20211111 #68
> [  242.417755] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [  242.418332] task:kworker/6:2     state:D stack:    0 pid:  271 ppid:     2 flags:0x00004000
> [  242.418954] Workqueue: events_freezable update_balloon_size_func [virtio_balloon]
> [  242.419518] Call Trace:
> [  242.419709]  <TASK>
> [  242.419873]  __schedule+0x2fd/0x990
> [  242.420142]  schedule+0x4e/0xc0
> [  242.420382]  tell_host+0xaa/0xf0 [virtio_balloon]
> [  242.420757]  ? do_wait_intr_irq+0xa0/0xa0
> [  242.421065]  update_balloon_size_func+0x2c9/0x2e0 [virtio_balloon]
> [  242.421527]  process_one_work+0x1e5/0x3c0
> [  242.421833]  worker_thread+0x50/0x3b0
> [  242.422204]  ? rescuer_thread+0x370/0x370
> [  242.422507]  kthread+0x169/0x190
> [  242.422754]  ? set_kthread_struct+0x40/0x40
> [  242.423073]  ret_from_fork+0x1f/0x30
> [  242.423347]  </TASK>
>
> And this goes on endlessly. The last one says it blocked for more than 1208
> seconds. This was not happening until the last few weeks, but I see no
> relevant recent commits for virtio_balloon, so the related change could be
> elsewhere.

We're stuck somewhere in:

  wq: update_balloon_size_func()->fill_balloon()->tell_host()

Most probably in wait_event(). I am no waitqueue expert, but my best guess
would be that we're waiting more than 2 minutes on a host reply in
TASK_UNINTERRUPTIBLE. At least that's my interpretation.

In case we're really stuck for more than 2 minutes, the hypervisor might
not be processing our requests anymore -- or the request is not getting
processed for some other reason (or the waitqueue is not getting woken up
due to some other BUG).

IIUC, we could sleep longer via wait_event_interruptible();
TASK_UNINTERRUPTIBLE seems to be what triggers the warning. But changing
that might just hide the fact that we're waiting more than 2 minutes on a
reply.

> I could bisect, but first I figured I'd check to see if someone had already
> spotted this.

Bisecting would be awesome; then we'd at least know whether this is a guest
or a hypervisor issue. Note that the environment matters: the hypervisor
seems to be requesting the guest to inflate the balloon right when booting
up, so you might not be able to reproduce this in a different environment.
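
For context, tell_host() in drivers/virtio/virtio_balloon.c is roughly the
following (a from-memory sketch, not copied from the exact -next tree under
discussion): it posts the PFN array to the inflate/deflate virtqueue and
then sleeps uninterruptibly until the host acks the buffer:

  static void tell_host(struct virtio_balloon *vb, struct virtqueue *vq)
  {
          struct scatterlist sg;
          unsigned int len;

          sg_init_one(&sg, vb->pfns, sizeof(vb->pfns[0]) * vb->num_pfns);

          /* We should always be able to add one buffer to an empty queue. */
          virtqueue_add_outbuf(vq, &sg, 1, vb, GFP_KERNEL);
          virtqueue_kick(vq);

          /*
           * Completes via the balloon_ack vq callback once the host has
           * consumed the buffer; until then we sit in TASK_UNINTERRUPTIBLE,
           * which is what the hung task detector watches.
           */
          wait_event(vb->acked, virtqueue_get_buf(vq, &len));
  }

So swapping in wait_event_interruptible() (or wait_event_freezable()) here
would silence the detector, which only tracks TASK_UNINTERRUPTIBLE sleepers,
but as said above it would only paper over the host never acking the buffer.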