Hi,

There's a problem I'm struggling with for quite some time in our Xen
hosting environment. Basically, after a couple of months' smooth
running time, suddenly most virtual machines get stuck into r state
and stop responding to anything, including xm console and xm sysrq.
It happens rather regularly, but I can't reproduce it by taxing the
domUs or the dom0 with disk I/O, CPU or console I/O.

However, a couple of days ago it turned out that this situation can be
cured by restarting xenconsoled! After that, xm console spit out the
previous random typing, sysrq help strings and whatnot for the domUs
which weren't stuck in r state, and the stuck ones also started to
respond and run normally (spending most of their time in b state) again.

The whole phenomenon looked like xenconsoled stopped emptying the domU
console buffers, and those domUs which were constantly writing to
their consoles quickly filled it up and started busy-looping trying to
put more characters onto their consoles, not caring to respond to
ping, even. But those domUs which didn't write to their consoles,
stayed functional until the desperate operator forced them to create
enough console output to fill up their buffers as well, and then they
stuck into r state just like the others. After restarting xenconsoled
all were able to recover successfully.

Of course the above is just guessing, I don't know the details of Xen
console handling. But I wonder if it rings any bells here, or maybe
this issue is known and fixed already. Oh, I experience this under
Xen 3.2 and pv-ops guests (2.6.26+patches).

--
Thanks,
Feri.

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel
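The guess in the message above (a full console buffer makes the guest spin until something drains it) can be illustrated with a toy model. This is only a sketch: `ConsoleRing`, `guest_print` and the 16-byte size are hypothetical names and values chosen for the illustration, and the real Xen console ring and guest driver are not modelled here.

```python
class ConsoleRing:
    """Toy fixed-size ring shared by a guest (producer) and a daemon (consumer)."""

    def __init__(self, size):
        self.size = size
        self.prod = 0  # total bytes ever written by the guest
        self.cons = 0  # total bytes ever drained by the daemon

    def free_space(self):
        return self.size - (self.prod - self.cons)

    def write(self, nbytes):
        """Guest side: accept as many bytes as fit and return that count."""
        accepted = min(nbytes, self.free_space())
        self.prod += accepted
        return accepted

    def drain(self):
        """Daemon side: consume everything currently in the ring."""
        drained = self.prod - self.cons
        self.cons = self.prod
        return drained


def guest_print(ring, nbytes, max_spins=1000):
    """Try to write nbytes; return how many retries made no progress.

    A real guest would have no max_spins limit, which is the point:
    with nobody draining the ring, the loop never ends.
    """
    spins = 0
    while nbytes > 0 and spins < max_spins:
        accepted = ring.write(nbytes)
        nbytes -= accepted
        if accepted == 0:
            spins += 1  # ring is full, busy-wait for the consumer
    return spins


if __name__ == "__main__":
    ring = ConsoleRing(16)
    print("first write, spins:", guest_print(ring, 10))   # fits in the ring
    print("second write, spins:", guest_print(ring, 10))  # ring fills, guest spins
    print("daemon drained:", ring.drain())                # "restarting xenconsoled"
    print("third write, spins:", guest_print(ring, 4))    # guest runs again
```

In this model a guest that writes little keeps working until its ring finally fills, which matches the behaviour described for the quiet domUs.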
On Fri, May 29, 2009 at 08:26:33PM +0200, Ferenc Wagner wrote:
> Hi,
>
> There's a problem I'm struggling with for quite some time in our Xen
> hosting environment. Basically, after a couple of months' smooth
> running time, suddenly most virtual machines get stuck into r state
> and stop responding to anything, including xm console and xm sysrq.
> It happens rather regularly, but I can't reproduce it by taxing the
> domUs or the dom0 with disk I/O, CPU or console I/O.
>
> However, a couple of days ago it turned out that this situation can be
> cured by restarting xenconsoled! After that, xm console spit out the
> previous random typing, sysrq help strings and whatnot for the domUs
> which weren't stuck in r state, and the stuck ones also started to
> respond and run normally (spending most of their time in b state) again.
>
> The whole phenomenon looked like xenconsoled stopped emptying the domU
> console buffers, and those domUs which were constantly writing to
> their consoles quickly filled it up and started busy-looping trying to
> put more characters onto their consoles, not caring to respond to
> ping, even. But those domUs which didn't write to their consoles,
> stayed functional until the desperate operator forced them to create
> enough console output to fill up their buffers as well, and then they
> stuck into r state just like the others. After restarting xenconsoled
> all were able to recover successfully.
>
> Of course the above is just guessing, I don't know the details of Xen
> console handling. But I wonder if it rings any bells here, or maybe
> this issue is known and fixed already. Oh, I experience this under
> Xen 3.2 and pv-ops guests (2.6.26+patches).

I've seen the exact same bug/problem with Xen in RHEL5/CentOS (5.0, 5.1, 5.2).
I believe it's also in 5.3.

I reported the problem to xen-devel, but I couldn't provide the needed
strace/backtrace to figure out the reason _why_ that happens.. (I had
already restarted xenconsoled..)

I think developers would need more information to figure out what the
actual bug is.

-- Pasi
On 29/05/2009 22:53, "Pasi Kärkkäinen" <pasik@iki.fi> wrote:
> I've seen the exact same bug/problem with Xen in RHEL5/CentOS (5.0, 5.1, 5.2).
> I believe it's also in 5.3.
>
> I reported the problem to xen-devel, but I couldn't provide the needed
> strace/backtrace to figure out the reason _why_ that happens.. (I had
> already restarted xenconsoled..)
>
> I think developers would need more information to figure out what the
> actual bug is.

Yes, I think any kind of xenconsoled hang can eventually result in guests
spinning waiting for their console buffers to be emptied. It might be
interesting to build xenconsoled with debug symbols (-g compile option) and
attach gdb when it gets in this state. Without that kind of info it'll be
hard to track down.

-- Keir
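For anyone wanting to capture that information before restarting the daemon, here is a minimal sketch that only builds the command lines for attaching gdb and strace to a running process. The helper names, the PID 1234 and the log file name are placeholders invented for this example; the gdb and strace options used (`-p`, `-batch`, `-ex`, `-f`, `-tt`, `-o`) are standard ones, but check them against the versions installed on your dom0.

```python
import shlex


def gdb_backtrace_command(pid):
    """Return argv for a batch-mode gdb run that prints every thread's backtrace."""
    return ["gdb", "-p", str(pid), "-batch", "-ex", "thread apply all bt"]


def strace_attach_command(pid, logfile):
    """Return argv for attaching strace to a running process, logging to a file."""
    return ["strace", "-f", "-tt", "-p", str(pid), "-o", logfile]


if __name__ == "__main__":
    pid = 1234  # placeholder: substitute the PID of the stuck xenconsoled
    print(shlex.join(gdb_backtrace_command(pid)))
    print(shlex.join(strace_attach_command(pid, "xenconsoled.strace")))
```

The backtrace is only readable if the binary was built with debug symbols, as suggested above. The commands are printed rather than executed so they can be reviewed first.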
Pasi Kärkkäinen <pasik@iki.fi> writes:

> On Fri, May 29, 2009 at 08:26:33PM +0200, Ferenc Wagner wrote:
>
>> There's a problem I'm struggling with for quite some time in our Xen
>> hosting environment. Basically, after a couple of months' smooth
>> running time, suddenly most virtual machines get stuck into r state
>> and stop responding to anything, including xm console and xm sysrq.
>> It happens rather regularly, but I can't reproduce it by taxing the
>> domUs or the dom0 with disk I/O, CPU or console I/O.
>>
>> However, a couple of days ago it turned out that this situation can be
>> cured by restarting xenconsoled! After that, xm console spit out the
>> previous random typing, sysrq help strings and whatnot for the domUs
>> which weren't stuck in r state, and the stuck ones also started to
>> respond and run normally (spending most of their time in b state) again.
>>
>> The whole phenomenon looked like xenconsoled stopped emptying the domU
>> console buffers, and those domUs which were constantly writing to
>> their consoles quickly filled it up and started busy-looping trying to
>> put more characters onto their consoles, not caring to respond to
>> ping, even. But those domUs which didn't write to their consoles,
>> stayed functional until the desperate operator forced them to create
>> enough console output to fill up their buffers as well, and then they
>> stuck into r state just like the others. After restarting xenconsoled
>> all were able to recover successfully.
>>
>> Of course the above is just guessing, I don't know the details of Xen
>> console handling. But I wonder if it rings any bells here, or maybe
>> this issue is known and fixed already. Oh, I experience this under
>> Xen 3.2 and pv-ops guests (2.6.26+patches).
>
> I've seen the exact same bug/problem with Xen in RHEL5/CentOS (5.0, 5.1, 5.2).
> I believe it's also in 5.3.
>
> I reported the problem to xen-devel, but I couldn't provide the needed
> strace/backtrace to figure out the reason _why_ that happens.. (I had
> already restarted xenconsoled..)
>
> I think developers would need more information to figure out what the
> actual bug is.

Indeed, I have found your report now. This means you've been running for
almost a year without experiencing this! I get it much more often, but
still pretty rarely.

I also noticed that the more or less regular

  WARN: Gmain_timeout_dispatch: Dispatch function for send local status took too long to execute: 200 ms (> 50 ms) (GSource: 0x811bf80)

messages from heartbeat came 50 times more often while xenstored was
stuck (it didn't take any significant CPU at least). However, four
domUs constantly in r state surely sucked up all the CPU power of the
4-way host machine.

And this phenomenon is always triggered by some extra load, typically
by tiger starting an md5sum check of the installed packages at the
same time on a couple of domUs. (Btw. doesn't some randomized crond
exist to help with this in general?)

--
Cheers,
Feri.
Keir Fraser <keir.fraser@eu.citrix.com> writes:

> On 29/05/2009 22:53, "Pasi Kärkkäinen" <pasik@iki.fi> wrote:
>
>> I've seen the exact same bug/problem with Xen in RHEL5/CentOS (5.0, 5.1, 5.2).
>> I believe it's also in 5.3.
>>
>> I reported the problem to xen-devel, but I couldn't provide the needed
>> strace/backtrace to figure out the reason _why_ that happens.. (I had
>> already restarted xenconsoled..)
>>
>> I think developers would need more information to figure out what the
>> actual bug is.
>
> Yes, I think any kind of xenconsoled hang can eventually result in guests
> spinning waiting for their console buffers to be emptied. It might be
> interesting to build xenconsoled with debug symbols (-g compile option) and
> attach gdb when it gets in this state. Without that kind of info it'll be
> hard to track down.

I haven't had the opportunity to run xenconsoled with debugging
enabled yet, but disaster struck again while I was on holiday. My
co-workers restarted some stuck domains, but left a couple around.
Attaching strace to xenconsoled showed a pretty large timeout on
select:

  select(43, [6 8 9 11 12 14 15 18 20 21 24 26 27 29 30 32 33 35 36 38 39 41 42], [9 12 21 24], NULL, {4144869, 572000} <unfinished ...>

which may or may not be a clue.
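To put a number on "pretty large", the strace line can be taken apart mechanically. The sketch below assumes the `{seconds, microseconds}` timeval notation that strace prints for `select`; the function name `parse_select` and the regular expression are written for this one line and are not a general strace parser.

```python
import re
from datetime import timedelta

# The select() line exactly as strace printed it in the message above.
STRACE_LINE = (
    "select(43, [6 8 9 11 12 14 15 18 20 21 24 26 27 29 30 32 33 35 36 38 "
    "39 41 42], [9 12 21 24], NULL, {4144869, 572000} <unfinished ...>"
)

SELECT_RE = re.compile(
    r"select\((\d+), \[([\d ]*)\], \[([\d ]*)\], NULL, \{(\d+), (\d+)\}"
)


def parse_select(line):
    """Split a strace select() line into nfds, the fd sets and the timeout."""
    match = SELECT_RE.search(line)
    if match is None:
        raise ValueError("not a select() line in the expected format")
    nfds, readfds, writefds, sec, usec = match.groups()
    return {
        "nfds": int(nfds),
        "read": [int(fd) for fd in readfds.split()],
        "write": [int(fd) for fd in writefds.split()],
        "timeout": timedelta(seconds=int(sec), microseconds=int(usec)),
    }


if __name__ == "__main__":
    call = parse_select(STRACE_LINE)
    print("timeout:", call["timeout"])
    print("read fds:", len(call["read"]), "write fds:", call["write"])
```

The timeout works out to 47 days, 23 hours, 21 minutes and about 9.6 seconds, so the daemon was prepared to sleep for close to 48 days if none of the 23 read descriptors or 4 write descriptors became ready.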
The lsof output seemed reasonable:

COMMAND    PID USER  FD  TYPE     DEVICE    SIZE       NODE NAME
xenconsol 4566 root cwd   DIR      253,4    4096        128 /
xenconsol 4566 root rtd   DIR      253,4    4096        128 /
xenconsol 4566 root txt   REG      253,2   21296     577488 /usr/lib/xen-3.2-1/bin/xenconsoled
xenconsol 4566 root mem   REG        0,3         2147483647 /proc/xen/privcmd (path inode=4026533301)
xenconsol 4566 root mem   REG      253,4  116414    3175190 /lib/i686/cmov/libpthread-2.7.so
xenconsol 4566 root mem   REG      253,4 1413540    3170117 /lib/i686/cmov/libc-2.7.so
xenconsol 4566 root mem   REG      253,2   15300    2621918 /usr/lib/libxenstore.so.3.0.0
xenconsol 4566 root mem   REG      253,2   71684    3217152 /usr/lib/xen-3.2-1/lib/libxenctrl.so
xenconsol 4566 root mem   REG      253,4    9684    3175197 /lib/i686/cmov/libutil-2.7.so
xenconsol 4566 root mem   REG      253,4  113248    1050535 /lib/ld-2.7.so
xenconsol 4566 root  0u   CHR        1,3                936 /dev/null
xenconsol 4566 root  1u   CHR        1,3                936 /dev/null
xenconsol 4566 root  2u   CHR        1,3                936 /dev/null
xenconsol 4566 root 3uW   REG      253,3       5    1573306 /var/run/xenconsoled.pid
xenconsol 4566 root  4u  unix 0xcfb47180              10030 socket
xenconsol 4566 root  5u   REG        0,3       0 4026533301 /proc/xen/privcmd
xenconsol 4566 root  6r  FIFO        0,6              10032 pipe
xenconsol 4566 root  7w  FIFO        0,6              10032 pipe
xenconsol 4566 root  8u   CHR      10,63               1491 /dev/xen/evtchn
xenconsol 4566 root  9u   CHR        5,2               1538 /dev/ptmx
xenconsol 4566 root 10u   CHR      136,1                  3 /dev/pts/1
xenconsol 4566 root 11u   CHR      10,63               1491 /dev/xen/evtchn
xenconsol 4566 root 12u   CHR        5,2               1538 /dev/ptmx
xenconsol 4566 root 13u   CHR      136,2                  4 /dev/pts/2
xenconsol 4566 root 14u   CHR      10,63               1491 /dev/xen/evtchn
xenconsol 4566 root 15u   CHR        5,2               1538 /dev/ptmx
xenconsol 4566 root 16u   CHR      136,3                  5 /dev/pts/3
xenconsol 4566 root 17u   CHR      10,63               1491 /dev/xen/evtchn
xenconsol 4566 root 18u   CHR        5,2               1538 /dev/ptmx
xenconsol 4566 root 19u   CHR      136,4                  6 /dev/pts/4
xenconsol 4566 root 20u   CHR      10,63               1491 /dev/xen/evtchn
xenconsol 4566 root 21u   CHR        5,2               1538 /dev/ptmx
xenconsol 4566 root 22u   CHR      136,5                  7 /dev/pts/5
xenconsol 4566 root 23u   CHR      10,63               1491 /dev/xen/evtchn
xenconsol 4566 root 24u   CHR        5,2               1538 /dev/ptmx
xenconsol 4566 root 25u   CHR      136,6                  8 /dev/pts/6
xenconsol 4566 root 26u   CHR      10,63               1491 /dev/xen/evtchn
xenconsol 4566 root 27u   CHR        5,2               1538 /dev/ptmx
xenconsol 4566 root 28u   CHR      136,7                  9 /dev/pts/7
xenconsol 4566 root 29u   CHR      10,63               1491 /dev/xen/evtchn
xenconsol 4566 root 30u   CHR        5,2               1538 /dev/ptmx
xenconsol 4566 root 31u   CHR      136,8                 10 /dev/pts/8
xenconsol 4566 root 32u   CHR      10,63               1491 /dev/xen/evtchn
xenconsol 4566 root 33u   CHR        5,2               1538 /dev/ptmx
xenconsol 4566 root 34u   CHR      136,9                 11 /dev/pts/9
xenconsol 4566 root 35u   CHR      10,63               1491 /dev/xen/evtchn
xenconsol 4566 root 36u   CHR        5,2               1538 /dev/ptmx
xenconsol 4566 root 37u   CHR     136,10                 12 /dev/pts/10
xenconsol 4566 root 38u   CHR      10,63               1491 /dev/xen/evtchn
xenconsol 4566 root 39u   CHR        5,2               1538 /dev/ptmx
xenconsol 4566 root 40u   CHR     136,11                 13 /dev/pts/11
xenconsol 4566 root 41u   CHR      10,63               1491 /dev/xen/evtchn
xenconsol 4566 root 42u   CHR        5,2               1538 /dev/ptmx
xenconsol 4566 root 43u   CHR     136,12                 14 /dev/pts/12

After restarting xenconsoled, the stuck domain said:

  [1052088.070488] BUG: soft lockup - CPU#0 stuck for 136469s! [nscd:1796]

pretty much as expected. I still plan to investigate this, but
sending now just in case it rings a bell somewhere...

--
Regards,
Feri.
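The descriptor sets from the strace line and the lsof listing can be cross-referenced. In the listing, descriptors 8 to 43 come in groups of three per console (an event channel, a `/dev/ptmx` master, and the matching `/dev/pts/N`), so the sketch below rebuilds that layout and checks which consoles appear in which `select` set. The constants are copied from the two messages above; the function names are made up for the example, and reading the result as "the daemon had stopped listening to these guests" is an assumption about xenconsoled's select loop that nothing in this thread confirms.

```python
from datetime import timedelta

# fd sets copied from the strace select() line earlier in the thread
READ_SET = {6, 8, 9, 11, 12, 14, 15, 18, 20, 21, 24, 26, 27,
            29, 30, 32, 33, 35, 36, 38, 39, 41, 42}
WRITE_SET = {9, 12, 21, 24}


def console_fds(n_consoles=12, first_fd=8):
    """Rebuild the per-console fd triples seen in lsof: evtchn, ptmx, pts."""
    return [
        {
            "evtchn": first_fd + 3 * i,
            "ptmx": first_fd + 3 * i + 1,
            "pts": f"/dev/pts/{i + 1}",
        }
        for i in range(n_consoles)
    ]


def unpolled_consoles():
    """Consoles whose event-channel fd is absent from the select() read set."""
    return [c["pts"] for c in console_fds() if c["evtchn"] not in READ_SET]


def consoles_with_pending_output():
    """Consoles whose ptmx fd sits in the select() write set."""
    return [c["pts"] for c in console_fds() if c["ptmx"] in WRITE_SET]


if __name__ == "__main__":
    print("evtchn not in read set:", unpolled_consoles())
    print("ptmx in write set:", consoles_with_pending_output())
    # The soft lockup message reports the stall in seconds.
    print("lockup duration:", timedelta(seconds=136469))
```

With the data above, the event channels for the consoles on /dev/pts/4 and /dev/pts/6 (descriptors 17 and 23) are the only ones missing from the read set, while the masters for /dev/pts/1, /dev/pts/2, /dev/pts/5 and /dev/pts/6 are in the write set. The 136469 seconds in the soft lockup message come to 1 day, 13:54:29. It might be worth comparing these consoles with the domains that were actually stuck.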