Christopher S. Aker
2008-Nov-21 16:54 UTC
[Xen-devel] High Net and Disk Use == stuck domain
For the past year or so we've been seeing a bug whereby a domU's CPU would spin up to a steady 100, 200, 300 or 400% (4 vcpus), console would freeze, and some or all of the network-facing services within the domU would connect but block without any output. Disk IO would flatline. The domU would never recover and required rebooting.

Since pv_ops hasn't always been around, we previously had only seen this behavior with xen-patched domUs (2.6.18.x), but now we're seeing it with pv_ops. Identical symptoms. And, I have a user that is able to reliably reproduce it on 2.6.27.4!

His recipe is downloading an ISO from a very fast and close-by news server using nzbget. The trigger appears to be a combination of high network use and high disk use (like download from a very fast mirror) -- because we weren't able to reproduce the problem when saving to a tmpfs mount.

I was able to grab the output of sysrq t while it was in the bad state:

http://theshore.net/~caker/xen/BUGS/D-state/console.log

The number of processes in D state (39) is quite suspicious.

Let me know if there's anything else I can provide.

-Chris

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel
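The D-state count above came from sysrq t output on the console. As a rough complement, here is a minimal sketch of counting processes in uninterruptible sleep from inside a Linux guest by reading /proc/<pid>/stat. It assumes the guest is still responsive enough to run a script, which may not hold once a domU is fully stuck; the function names are illustrative and not part of any Xen tooling.

```python
import os
from collections import Counter


def state_from_stat(stat_line):
    """Return the one-letter process state from a /proc/<pid>/stat line."""
    # The comm field is wrapped in parentheses and may itself contain
    # spaces or ')', so split only after the last closing parenthesis.
    rest = stat_line[stat_line.rindex(")") + 1:].split()
    return rest[0]


def count_states(stat_lines):
    """Tally process states (R, S, D, ...) over an iterable of stat lines."""
    return Counter(state_from_stat(line) for line in stat_lines)


def read_proc_stat_lines(proc_root="/proc"):
    """Yield the contents of every /proc/<pid>/stat that can be read."""
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, entry, "stat")) as f:
                yield f.read()
        except OSError:
            # The process exited between listdir() and open(); skip it.
            continue


if __name__ == "__main__":
    counts = count_states(read_proc_stat_lines())
    print("processes in D state:", counts.get("D", 0))
```

Running this periodically during the download would show whether the D-state count climbs before the console freezes.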
Stefan de Konink
2008-Nov-21 17:07 UTC
Re: [Xen-devel] High Net and Disk Use == stuck domain
Christopher S. Aker wrote:
> Let me know if there's anything else I can provide.

iSCSI/loop/blktap?

Stefan
Christopher S. Aker
2008-Nov-21 17:16 UTC
Re: [Xen-devel] High Net and Disk Use == stuck domain
Stefan de Konink wrote:
> iSCSI/loop/blktap?

Local LVM volumes exported via "phy:" in the domU's config.

-Chris
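For readers unfamiliar with the "phy:" form: xm-style domU config files use Python syntax, and a physical-device export generally looks like the fragment below. This is only a sketch; the volume group, logical volume and guest device names are hypothetical and were not given in the thread.

```python
# Hypothetical xm-style domU config fragment; all names are illustrative.
# Each entry is "phy:<backend device in dom0>,<device name in domU>,<mode>".
disk = [
    "phy:/dev/vg0/domu-root,xvda,w",  # LVM logical volume used as the root disk
    "phy:/dev/vg0/domu-swap,xvdb,w",  # LVM logical volume used as swap
]
```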
Christopher S. Aker
2008-Dec-01 15:05 UTC
Re: [Xen-devel] High Net and Disk Use == stuck domain
Christopher S. Aker wrote:
> For the past year or so we've been seeing a bug whereby a domU's CPU
> would spin up to a steady 100, 200, 300 or 400% (4 vcpus), console would
> freeze, and some or all of the network-facing services within the domU
> would connect but block without any output. Disk IO would flatline. The
> domU would never recover and required rebooting.
>
> Since pv_ops hasn't always been around, we previously had only seen this
> behavior with xen-patched domUs (2.6.18.x), but now we're seeing it with
> pv_ops. Identical symptoms. And, I have a user that is able to
> reliably reproduce it on 2.6.27.4!
>
> His recipe is downloading an ISO from a very fast and close-by news
> server using nzbget. The trigger appears to be a combination of high
> network use and high disk use (like download from a very fast mirror) --
> because we weren't able to reproduce the problem when saving to a tmpfs
> mount.
>
> I was able to grab the output of sysrq t while it was in the bad state:
>
> http://theshore.net/~caker/xen/BUGS/D-state/console.log
>
> The number of processes in D state (39) is quite suspicious.
>
> Let me know if there's anything else I can provide.
>
> -Chris

Jeremy,

Did this one slip by you? I figured a reproducible bug would be just too tantalizing to resist.

What's the correct venue for these issues that overlap xen-devel, lkml, and virtualization/pv_ops stuff -- should I be blasting these to everybody?

-Chris
Jeremy Fitzhardinge
2008-Dec-01 20:19 UTC
Re: [Xen-devel] High Net and Disk Use == stuck domain
Christopher S. Aker wrote:
> Jeremy,
>
> Did this one slip by you? I figured a reproducible bug would be just
> too tantalizing to resist.

Hoping it would go away by itself? ;)

I'm trying to repro it now, copying ISOs at 25 Mbytes/sec. How long does it take to happen?

> What's the correct venue for these issues that overlap xen-devel,
> lkml, and virtualization/pv_ops stuff -- should I be blasting these to
> everybody?

Me and xen-devel are a good start, and posting in a bugzilla cc:ing me if it looks like it's been dropped on the floor.

J
Christopher S. Aker
2008-Dec-01 21:00 UTC
Re: [Xen-devel] High Net and Disk Use == stuck domain
Jeremy Fitzhardinge wrote:
>> Did this one slip by you? I figured a reproducible bug would be just
>> too tantalizing to resist.
>
> Hoping it would go away by itself? ;)
>
> I'm trying to repro it now, copying ISOs at 25 Mbytes/sec. How long
> does it take to happen?

Under a few minutes, usually within 30 seconds. The affected kernel binary is here:

http://theshore.net/~caker/xen/BUGS/D-state/2.6.27.4-linode14

This was built with my non-broken toolchain, too, btw :)

Meanwhile, I'll try to reproduce it in a new environment and come up with a better recipe.

>> What's the correct venue for these issues that overlap xen-devel,
>> lkml, and virtualization/pv_ops stuff -- should I be blasting these to
>> everybody?
>
> Me and xen-devel are a good start, and posting in a bugzilla cc:ing me
> if it looks like it's been dropped on the floor.

OK -- targets acquired!

Thanks,
-Chris