Hi,

"xm save" doesn't work for me, the command just blocks forever. In the
process list it looks like this:

   6567 ? Ssl 0:00 python /usr/sbin/xend restart
   7978 ? SL 0:00 \_ /usr/lib/xen/bin/xc_save 14 17 10 0 0 0

xc_save is blocked in a write call to stderr:

   master-xen root /vm/ttylinux# strace -p7978
   Process 7978 attached - interrupt to quit
   write(2, "FNI 28 : [10000004,815] pte=2b92"..., 80 <unfinished ...>
   Process 7978 detached

stderr is a pipe to xend:

   master-xen root /vm/ttylinux# ll /proc/7978/fd/2
   l-wx------ 1 root root 64 Nov 1 17:25 /proc/7978/fd/2 -> pipe:[40419]
   master-xen root /vm/ttylinux# ll /proc/*/fd/* | grep 40419
   lr-x------ 1 root root 64 Nov 1 17:25 /proc/6567/fd/22 -> pipe:[40419]
   l-wx------ 1 root root 64 Nov 1 17:25 /proc/7978/fd/2 -> pipe:[40419]

xend in turn doesn't read from the pipe but is waiting for some lock:

   master-xen root /vm/ttylinux# strace -p6567
   Process 6567 attached - interrupt to quit
   futex(0x8087370, FUTEX_WAIT, 0, NULL <unfinished ...>
   Process 6567 detached

Any ideas what is going on here?

  Gerd

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel
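The `ll /proc/*/fd/* | grep 40419` step above can also be scripted. The following is only a sketch: it assumes a Linux `/proc` filesystem, and `find_pipe_holders` is a hypothetical helper name, not part of xend or any Xen tool. It lists every (pid, fd) pair whose descriptor points at a given pipe inode.

```python
import os


def find_pipe_holders(inode):
    """Return (pid, fd) pairs whose descriptor refers to pipe:[inode].

    Hypothetical helper; relies on the Linux /proc/<pid>/fd layout.
    """
    target = "pipe:[%d]" % inode
    holders = []
    for pid in os.listdir("/proc"):
        if not pid.isdigit():
            continue  # skip non-process entries such as /proc/meminfo
        fd_dir = os.path.join("/proc", pid, "fd")
        try:
            fds = os.listdir(fd_dir)
        except OSError:
            continue  # process exited, or we lack permission
        for fd in fds:
            try:
                if os.readlink(os.path.join(fd_dir, fd)) == target:
                    holders.append((int(pid), int(fd)))
            except OSError:
                continue  # descriptor was closed while we were scanning
    return holders
```

Called as `find_pipe_holders(40419)` on the machine above, it should report the same two ends that the grep found. Run it as root to see descriptors of other users' processes.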
> xend in turn doesn't read from the pipe but is waiting for some lock:
>
>    master-xen root /vm/ttylinux# strace -p6567
>    Process 6567 attached - interrupt to quit
>    futex(0x8087370, FUTEX_WAIT, 0, NULL <unfinished ...>
>    Process 6567 detached

Oh, xend is multithreaded:

   master-xen root /vm/ttylinux# ls /proc/6567/task
   . .. 6567 6568 6569 6570 6571 6581 7977

7977 seems to be responsible for the xc_save and does this:

   master-xen root /vm/ttylinux# strace -p7977
   Process 7977 attached - interrupt to quit
   read(20, <unfinished ...>
   Process 7977 detached

fd 20 is the other end of the *stdout* pipe, whereas xc_save writes
stuff to *stderr*. Hmm. Maybe xend causes the deadlock by simply
reading from the wrong file handle?

Some of the other threads behave in a strange way as well:

   master-xen root /vm/ttylinux# strace -p6568
   Process 6568 attached - interrupt to quit
   select(4, [3], [], [], {0, 960000}) = 0 (Timeout)
   futex(0x80e53b8, FUTEX_WAKE, 1) = 0
   accept(3, 0x408193f8, [110]) = -1 EAGAIN (Resource temporarily unavailable)

There is no point in calling accept(3) unless select() flags file handle
#3 as readable. Looks like I'll go browse some python code tomorrow ...

  Gerd
On Tue, Nov 01, 2005 at 06:15:27PM +0100, Gerd Knorr wrote:

> > xend in turn doesn't read from the pipe but is waiting for some lock:
> >
> >    master-xen root /vm/ttylinux# strace -p6567
> >    Process 6567 attached - interrupt to quit
> >    futex(0x8087370, FUTEX_WAIT, 0, NULL <unfinished ...>
> >    Process 6567 detached
>
> Oh, xend is multithreaded:
>
>    master-xen root /vm/ttylinux# ls /proc/6567/task
>    . .. 6567 6568 6569 6570 6571 6581 7977
>
> 7977 seems to be responsible for the xc_save and does this:
>
>    master-xen root /vm/ttylinux# strace -p7977
>    Process 7977 attached - interrupt to quit
>    read(20, <unfinished ...>
>    Process 7977 detached
>
> fd 20 is the other end of the *stdout* pipe, whereas xc_save writes
> stuff to *stderr*. Hmm. Maybe xend causes the deadlock by simply
> reading from the wrong file handle?

The code that does this is in XendCheckpoint.py:forkHelper. It's using
select.poll() and file.readline() to read from both the stdout and the
stderr. This is a pretty daft thing to do -- there's definitely potential for
deadlock here.

I'll rewrite this to use a separate thread to pull the data from stderr, which
should solve the problem.

Thanks for your diagnostic efforts,

Ewan.
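The threaded approach described above might look roughly like the following. This is a minimal sketch in modern Python 3 using `subprocess`, not the actual forkHelper code or the fix that was committed; `run_helper` is a hypothetical name. The point is only that stderr is drained in its own thread, so the child can never stall on a full stderr pipe while the parent waits on stdout.

```python
import subprocess
import threading


def run_helper(cmd):
    """Run cmd and return (stdout_lines, stderr_lines, exit_status).

    Hypothetical helper, not the real XendCheckpoint.forkHelper.
    """
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE,
                            stderr=subprocess.PIPE, text=True)
    err_lines = []

    def drain_stderr():
        # Runs in a separate thread: keeps emptying the stderr pipe
        # until the child closes it.
        for line in proc.stderr:
            err_lines.append(line.rstrip("\n"))

    drainer = threading.Thread(target=drain_stderr)
    drainer.start()

    # The calling thread is now free to block on stdout.
    out_lines = [line.rstrip("\n") for line in proc.stdout]

    drainer.join()
    status = proc.wait()
    return out_lines, err_lines, status
```

With this shape, a helper that prints a large amount of diagnostics to stderr before writing anything to stdout still runs to completion, because neither pipe is left unread.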
On Tue, Nov 01, 2005 at 06:15:27PM +0100, Gerd Knorr wrote:

> Some of the other threads behave in a strange way as well:
>
>    master-xen root /vm/ttylinux# strace -p6568
>    Process 6568 attached - interrupt to quit
>    select(4, [3], [], [], {0, 960000}) = 0 (Timeout)
>    futex(0x80e53b8, FUTEX_WAKE, 1) = 0
>    accept(3, 0x408193f8, [110]) = -1 EAGAIN (Resource temporarily unavailable)
>
> There is no point in calling accept(3) unless select() flags file handle
> #3 as readable.

This mindboggling piece of loveliness is in xen/web/connection.py. If you can
unpick it, a patch would be more than welcome!

Ewan.
Hi,

> The code that does this is in XendCheckpoint.py:forkHelper. It's using
> select.poll() and file.readline() to read from both the stdout and the
> stderr. This is a pretty daft thing to do -- there's definitely potential for
> deadlock here.
>
> I'll rewrite this to use a separate thread to pull the data from stderr, which
> should solve the problem.

Should be fixable without a new thread, I'll have a look.

log.debug("stuff") ends up in /var/log/xend.log I guess?

  Gerd
On Wed, Nov 02, 2005 at 10:25:36AM +0100, Gerd Knorr wrote:

> Hi,
>
> > The code that does this is in XendCheckpoint.py:forkHelper. It's using
> > select.poll() and file.readline() to read from both the stdout and the
> > stderr. This is a pretty daft thing to do -- there's definitely potential
> > for deadlock here.
> >
> > I'll rewrite this to use a separate thread to pull the data from stderr,
> > which should solve the problem.
>
> Should be fixable without a new thread, I'll have a look.

I've done a threaded fix already. You're welcome to have a go at doing it
without a thread if you want, but I think it'll be messy.

> log.debug("stuff") ends up in /var/log/xend.log I guess?

Yes, it does.

I've assigned this bug #378.

Ewan.
> I've done a threaded fix already. You're welcome to have a go at doing it
> without a thread if you want, but I think it'll be messy.

Looks like it, yes. Mixing the high-level buffered file I/O with
select() (and non-blocking fds) usually doesn't work out very well.
Dropping down to os.read() instead likely makes the code more complex
than using one thread per file descriptor ...

cheers,

  Gerd
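For comparison, the non-threaded variant discussed above could be sketched as follows: `select()` combined with raw `os.read()` on the descriptors, bypassing the buffered file objects entirely. This is an illustration under assumptions (Unix only, modern Python 3, hypothetical name `run_helper_select`), not code from xend.

```python
import os
import select
import subprocess


def run_helper_select(cmd):
    """Run cmd and return (stdout_bytes, stderr_bytes, exit_status).

    Hypothetical single-threaded helper. select() on pipes works on
    Unix-like systems only.
    """
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE,
                            stderr=subprocess.PIPE)
    out_fd = proc.stdout.fileno()
    err_fd = proc.stderr.fileno()
    chunks = {out_fd: [], err_fd: []}
    open_fds = [out_fd, err_fd]

    while open_fds:
        # Block until at least one pipe has data or has reached EOF.
        readable, _, _ = select.select(open_fds, [], [])
        for fd in readable:
            data = os.read(fd, 4096)  # unbuffered read of what is there
            if data:
                chunks[fd].append(data)
            else:
                open_fds.remove(fd)   # empty read means the child closed it

    status = proc.wait()
    proc.stdout.close()
    proc.stderr.close()
    return b"".join(chunks[out_fd]), b"".join(chunks[err_fd]), status
```

Note what the sketch leaves out: it returns raw bytes, so any line splitting that `readline()` used to provide has to be redone by hand, which is the extra complexity mentioned above.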
>>    master-xen root /vm/ttylinux# strace -p6568
>>    Process 6568 attached - interrupt to quit
>>    select(4, [3], [], [], {0, 960000}) = 0 (Timeout)
>>    futex(0x80e53b8, FUTEX_WAKE, 1) = 0
>>    accept(3, 0x408193f8, [110]) = -1 EAGAIN (Resource temporarily unavailable)
>>
>> There is no point in calling accept(3) unless select() flags file handle
>> #3 as readable.
>
> This mindboggling piece of loveliness is in xen/web/connection.py. If you can
> unpick it, a patch would be more than welcome!

Can someone explain the comment at the start of the file?

<quote>
"""We make sockets non-blocking so that operations like accept()
don't block. We also select on a timeout. Otherwise we have no way
of getting the threads to shutdown.
"""
</quote>

What exactly is the thread shutdown problem here? Why is the timeout
needed in the first place?

cheers,

  Gerd
> I've done a threaded fix already. You're welcome to have a go at doing it
> without a thread if you want, but I think it'll be messy.

Ok, while waiting for the fix to show up in the public mercurial tree I've
worked around the issue with a wrapper script which redirects xc_save
stderr to a file. There I get this:

   FNI 21 : [10000003,768] pte=25fae063, mfn=00025fae, pfn=00001b00 [mfn]=deadbeef
   FNI 21 : [10000003,769] pte=25faf063, mfn=00025faf, pfn=00001b01 [mfn]=deadbeef
   FNI 21 : [10000003,770] pte=25fb0063, mfn=00025fb0, pfn=00001b02 [mfn]=deadbeef
   [ ... many more of these ... ]

In the source code, where the message is printed, I find a comment
saying "/* I don't think this should ever happen */". Hmm. It does.
And probably it is a problem.

"xm save" works now, but I can't restore the domain:

   master-xen root /tmp# xm restore /vm/ttylinux/suspend.img
   Error: Could not read store/console MFN

Ideas anyone? This is a ttylinux instance running out of a ramdisk.

  Gerd
On Wed, Nov 02, 2005 at 04:35:00PM +0100, Gerd Knorr wrote:

> > I've done a threaded fix already. You're welcome to have a go at doing it
> > without a thread if you want, but I think it'll be messy.
>
> Ok, while waiting for the fix to show up in the public mercurial tree I've
> worked around the issue with a wrapper script which redirects xc_save
> stderr to a file. There I get this:
>
>    FNI 21 : [10000003,768] pte=25fae063, mfn=00025fae, pfn=00001b00 [mfn]=deadbeef
>    FNI 21 : [10000003,769] pte=25faf063, mfn=00025faf, pfn=00001b01 [mfn]=deadbeef
>    FNI 21 : [10000003,770] pte=25fb0063, mfn=00025fb0, pfn=00001b02 [mfn]=deadbeef
>    [ ... many more of these ... ]
>
> In the source code, where the message is printed, I find a comment
> saying "/* I don't think this should ever happen */". Hmm. It does.
> And probably it is a problem.

In and of itself, this diagnostic message is harmless, despite the comment to
the contrary.

> "xm save" works now, but I can't restore the domain:
>
>    master-xen root /tmp# xm restore /vm/ttylinux/suspend.img
>    Error: Could not read store/console MFN

It is trying to read two values that are output by the xc_restore helper
program on its stdout. Have you inadvertently lost xc_restore's stdout? If
not, then xc_restore is broken -- check for corresponding diagnostic
information in /var/log/xend.log, /var/log/xend-debug.log, and dmesg.

Ewan.
>> "xm save" works now, but I can't restore the domain:
>>
>>    master-xen root /tmp# xm restore /vm/ttylinux/suspend.img
>>    Error: Could not read store/console MFN
>
> It is trying to read two values that are output by the xc_restore helper
> program on its stdout. Have you inadvertently lost xc_restore's stdout? If
> not, then xc_restore is broken -- check for corresponding diagnostic
> information in /var/log/xend.log, /var/log/xend-debug.log, and dmesg.

Well, it depends on when I suspend the ttylinux domain.

When I suspend it quickly, so that I catch it during kernel boot, suspend and
resume work ok. The resumed domain soon stops making progress, though, and
starts eating CPU time. I also see the mfn messages in the stdout logfile
when starting xc_restore using the logging wrapper script.

When I suspend it once it has booted to the login prompt, the resume
doesn't work. The stdout log is also empty.

BTW: what is the latest linux kernel code? Ian announced
linux-2.6-xen.hg some weeks ago and also mentioned that the sparse trees in
the xen-unstable.hg repository will be kept in sync. Is that still the
case?

cheers,

  Gerd
> What exactly is the thread shutdown problem here? Why the timeout is
> needed in the first place?

I didn't see an answer on this thread so I'll take a stab.

If you do a select without a timeout and no activity occurs on the file
descriptors the thread may have no way of exiting cleanly.

-Kip
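To illustrate the reasoning above, here is a small sketch of an accept loop that uses the select() timeout purely as a chance to re-check a shutdown flag. It is not the code in xen/web/connection.py; `accept_loop`, `stop_event`, and `handle` are hypothetical names. It also calls accept() only after select() has reported the listening socket readable, so it avoids the EAGAIN pattern seen in the strace output earlier in the thread.

```python
import select
import socket
import threading


def accept_loop(listener, stop_event, handle, timeout=1.0):
    """Accept connections until stop_event is set.

    Hypothetical sketch: the timeout bounds how long the thread can go
    without noticing a shutdown request.
    """
    while not stop_event.is_set():
        readable, _, _ = select.select([listener], [], [], timeout)
        if not readable:
            continue  # timed out: loop round and re-check stop_event
        conn, _addr = listener.accept()  # a connection is pending
        handle(conn)
```

Another thread can then call `stop_event.set()`, and the loop exits within at most `timeout` seconds even if no client ever connects.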
Kip Macy wrote:

> What exactly is the thread shutdown problem here? Why is the timeout
> needed in the first place?
>
>
> I didn't see an answer on this thread so I'll take a stab.
>
> If you do a select without a timeout and no activity occurs on the file
> descriptors the thread may have no way of exiting cleanly.

Hmm, it's still not clear to me how this is supposed to work. How is it
signaled to the threads that they should exit? What I see when
stracing the thread and then running "xend stop" in another tty is that the
thread is simply killed off with SIGHUP, with no cleanup being done by
the thread.

The select() system call will also return on signals (with errno=EINTR)
unless you explicitly set SA_RESTART when calling sigaction(2). So if
SIGHUP is used to tell the thread that it should exit, the timeout can go
away. Probably the whole select() can go away as well, since accept()
also returns on signals, so just sitting in the accept syscall
should work just fine too. At the moment I still don't see the point in
using select() in the first place when there is one thread per socket
anyway ...

cheers,

  Gerd