thr3ads.net - Xen devel - [Xen-devel] "xm save" trouble -- deadlock? [Nov 2005]

Gerd Knorr

2005-Nov-01 16:43 UTC

[Xen-devel] "xm save" trouble -- deadlock?

Hi,

"xm save" doesn't work for me, the command just blocks forever.  In the
process list it looks like this:

  6567 ?        Ssl    0:00 python /usr/sbin/xend restart
  7978 ?        SL     0:00  \_ /usr/lib/xen/bin/xc_save 14 17 10 0 0 0

xc_save is blocked, in a write call to stderr:

   master-xen root /vm/ttylinux# strace -p7978
   Process 7978 attached - interrupt to quit
   write(2, "FNI 28 : [10000004,815] pte=2b92"..., 80 <unfinished ...>
   Process 7978 detached

stderr is a pipe to xend:

   master-xen root /vm/ttylinux# ll /proc/7978/fd/2
   l-wx------  1 root root 64 Nov  1 17:25 /proc/7978/fd/2 -> pipe:[40419]
   master-xen root /vm/ttylinux# ll /proc/*/fd/* | grep 40419
   lr-x------  1 root root 64 Nov  1 17:25 /proc/6567/fd/22 -> pipe:[40419]
   l-wx------  1 root root 64 Nov  1 17:25 /proc/7978/fd/2 -> pipe:[40419]

xend in turn doesn't read from the pipe but is waiting for some lock:

   master-xen root /vm/ttylinux# strace -p6567
   Process 6567 attached - interrupt to quit
   futex(0x8087370, FUTEX_WAIT, 0, NULL <unfinished ...>
   Process 6567 detached
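
The blocked write(2, ...) above is what a full pipe looks like from the writer's side: a pipe has a finite kernel buffer, and once it is full a blocking writer sleeps until the other end is read. A minimal Python sketch, assuming a POSIX system, that measures how much a fresh pipe holds before a write would block. The function name pipe_capacity is made up for illustration and is not part of Xen:

```python
import os


def pipe_capacity(chunk_size=4096):
    """Return how many bytes fit into a fresh pipe before a write would block."""
    read_fd, write_fd = os.pipe()
    try:
        # Non-blocking mode: a full pipe raises BlockingIOError instead of hanging.
        os.set_blocking(write_fd, False)
        chunk = b"x" * chunk_size
        total = 0
        while True:
            try:
                total += os.write(write_fd, chunk)
            except BlockingIOError:
                # Kernel buffer is full. A blocking writer (like the helper
                # writing to stderr) would sleep here until someone reads.
                return total
    finally:
        os.close(read_fd)
        os.close(write_fd)
```

Calling pipe_capacity() returns the buffer size in bytes on the running system. Any helper that writes more diagnostics than that to a pipe nobody drains will stall in write().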

Any ideas what is going on here?

   Gerd

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel

Gerd Knorr

2005-Nov-01 17:15 UTC

Re: [Xen-devel] "xm save" trouble -- deadlock?

> xend in turn doesn't read from the pipe but is waiting for some lock:
> 
>   master-xen root /vm/ttylinux# strace -p6567
>   Process 6567 attached - interrupt to quit
>   futex(0x8087370, FUTEX_WAIT, 0, NULL <unfinished ...>
>   Process 6567 detached

Oh, xend is multithreaded:

   master-xen root /vm/ttylinux# ls /proc/6567/task
   .  ..  6567  6568  6569  6570  6571  6581  7977

7977 seems to be responsible for the xc_save and does this:

   master-xen root /vm/ttylinux# strace -p7977
   Process 7977 attached - interrupt to quit
   read(20,  <unfinished ...>
   Process 7977 detached

fd 20 is the other end of the *stdout* pipe, whereas xc_save writes 
stuff to *stderr*.  Hmm.  Maybe xend causes the deadlock by simply 
reading from the wrong file handle?

Some of the other threads behave in a strange way as well:

   master-xen root /vm/ttylinux# strace -p6568
   Process 6568 attached - interrupt to quit
   select(4, [3], [], [], {0, 960000})     = 0 (Timeout)
   futex(0x80e53b8, FUTEX_WAKE, 1)         = 0
   accept(3, 0x408193f8, [110])            = -1 EAGAIN (Resource temporarily unavailable)

There is no point in calling accept(3) unless select() flags file handle 
#3 as readable.

Looks like I'll go browse some python code tomorrow ...

   Gerd


Ewan Mellor

2005-Nov-01 18:54 UTC

Re: [Xen-devel] "xm save" trouble -- deadlock?

On Tue, Nov 01, 2005 at 06:15:27PM +0100, Gerd Knorr wrote:
> >xend in turn doesn't read from the pipe but is waiting for some lock:
> >
> >  master-xen root /vm/ttylinux# strace -p6567
> >  Process 6567 attached - interrupt to quit
> >  futex(0x8087370, FUTEX_WAIT, 0, NULL <unfinished ...>
> >  Process 6567 detached
> 
> Oh, xend is multithreaded:
> 
>   master-xen root /vm/ttylinux# ls /proc/6567/task
>   .  ..  6567  6568  6569  6570  6571  6581  7977
> 
> 7977 seems to be responsible for the xc_save and does this:
> 
>   master-xen root /vm/ttylinux# strace -p7977
>   Process 7977 attached - interrupt to quit
>   read(20,  <unfinished ...>
>   Process 7977 detached
> 
> fd 20 is the other end of the *stdout* pipe, whereas xc_save writes 
> stuff to *stderr*.  Hmm.  Maybe xend causes the deadlock by simply 
> reading from the wrong file handle?

The code that does this is in XendCheckpoint.py:forkHelper.  It's using
select.poll() and file.readline() to read from both the stdout and the
stderr.  This is a pretty daft thing to do -- there's definitely potential
for deadlock here.

I'll rewrite this to use a separate thread to pull the data from stderr,
which should solve the problem.
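
The deadlock arises when the reader sits in a blocking read on one pipe while the helper is stuck writing to the other, full one. A minimal sketch of the separate-thread approach described above, written against Python 3's subprocess module. This is not the actual xend change, and run_helper is a hypothetical name:

```python
import subprocess
import threading


def run_helper(argv):
    """Run argv and return (exit_code, stdout_lines, stderr_lines)."""
    proc = subprocess.Popen(
        argv, stdout=subprocess.PIPE, stderr=subprocess.PIPE, text=True
    )
    stderr_lines = []

    def drain_stderr():
        # Runs in its own thread, so stderr is consumed even while the
        # main thread is blocked reading stdout.
        for line in proc.stderr:
            stderr_lines.append(line.rstrip("\n"))

    drainer = threading.Thread(target=drain_stderr, daemon=True)
    drainer.start()

    # The main thread can use plain blocking line reads on stdout.
    stdout_lines = [line.rstrip("\n") for line in proc.stdout]

    drainer.join()
    exit_code = proc.wait()
    proc.stdout.close()
    proc.stderr.close()
    return exit_code, stdout_lines, stderr_lines
```

With one thread per pipe, neither read can starve the other, so the helper never blocks on a full stderr pipe no matter how much it logs.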

Thanks for your diagnostic efforts,

Ewan.

Ewan Mellor

2005-Nov-01 18:58 UTC

Re: [Xen-devel] "xm save" trouble -- deadlock?

On Tue, Nov 01, 2005 at 06:15:27PM +0100, Gerd Knorr wrote:
> Some of the other threads behave in a strange way as well:
> 
>   master-xen root /vm/ttylinux# strace -p6568
>   Process 6568 attached - interrupt to quit
>   select(4, [3], [], [], {0, 960000})     = 0 (Timeout)
>   futex(0x80e53b8, FUTEX_WAKE, 1)         = 0
>   accept(3, 0x408193f8, [110])            = -1 EAGAIN (Resource temporarily unavailable)
> 
> There is no point in calling accept(3) unless select() flags file handle 
> #3 as readable.

This mindboggling piece of loveliness is in xen/web/connection.py.  If you can
unpick it, a patch would be more than welcome!

Ewan.

Gerd Knorr

2005-Nov-02 09:25 UTC

Re: [Xen-devel] "xm save" trouble -- deadlock?

Hi,

> The code that does this is in XendCheckpoint.py:forkHelper.  It's using
> select.poll() and file.readline() to read from both the stdout and the
> stderr.  This is a pretty daft thing to do -- there's definitely potential
> for deadlock here.
>
> I'll rewrite this to use a separate thread to pull the data from stderr,
> which should solve the problem.

Should be fixable without a new thread, I'll have a look.
log.debug("stuff") ends up in /var/log/xend.log I guess?

   Gerd


Ewan Mellor

2005-Nov-02 10:04 UTC

Re: [Xen-devel] "xm save" trouble -- deadlock?

On Wed, Nov 02, 2005 at 10:25:36AM +0100, Gerd Knorr wrote:
>   Hi,
> 
> >The code that does this is in XendCheckpoint.py:forkHelper.  It's using
> >select.poll() and file.readline() to read from both the stdout and the
> >stderr.  This is a pretty daft thing to do -- there's definitely potential
> >for deadlock here.
> >
> >I'll rewrite this to use a separate thread to pull the data from stderr,
> >which should solve the problem.
> 
> Should be fixable without a new thread, I'll have a look. 

I've done a threaded fix already.  You're welcome to have a go at doing it
without a thread if you want, but I think it'll be messy.

> log.debug("stuff") ends up in /var/log/xend.log I guess?

Yes, it does.

I've assigned this bug #378.

Ewan.

Gerd Knorr

2005-Nov-02 11:24 UTC

Re: [Xen-devel] "xm save" trouble -- deadlock?

> I've done a threaded fix already.  You're welcome to have a go at doing it
> without a thread if you want, but I think it'll be messy.

Looks like it, yes.  Mixing the high-level buffered file I/O together with 
select() (and non-blocking fds) usually doesn't work out very well.
Dropping down to os.read() instead would likely make the code more complex 
than using one thread per file descriptor ...
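
For comparison, a rough sketch of the single-threaded alternative: select() on both pipes and unbuffered os.read() on the raw descriptors, so no buffered file object can hide data from select(). The name run_helper_select is hypothetical, the code assumes a POSIX system, and it is not taken from xend:

```python
import os
import select
import subprocess


def run_helper_select(argv):
    """Run argv in one thread; return (exit_code, stdout_bytes, stderr_bytes)."""
    proc = subprocess.Popen(argv, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    out_fd = proc.stdout.fileno()
    err_fd = proc.stderr.fileno()
    buffers = {out_fd: bytearray(), err_fd: bytearray()}
    open_fds = {out_fd, err_fd}

    while open_fds:
        # Wait until at least one pipe has data or has reached EOF.
        readable, _, _ = select.select(list(open_fds), [], [])
        for fd in readable:
            # os.read() returns whatever is available (up to 4096 bytes);
            # it never waits for a complete line the way readline() does.
            data = os.read(fd, 4096)
            if data:
                buffers[fd] += data
            else:
                open_fds.discard(fd)  # empty read means EOF on this pipe

    exit_code = proc.wait()
    proc.stdout.close()
    proc.stderr.close()
    return exit_code, bytes(buffers[out_fd]), bytes(buffers[err_fd])
```

The extra complexity shows up as soon as the caller wants lines rather than raw bytes: partial lines have to be buffered and split by hand, which the threaded version gets for free from the file objects.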

cheers,

   Gerd


Gerd Knorr

2005-Nov-02 11:34 UTC

Re: [Xen-devel] "xm save" trouble -- deadlock?

>>   master-xen root /vm/ttylinux# strace -p6568
>>   Process 6568 attached - interrupt to quit
>>   select(4, [3], [], [], {0, 960000})     = 0 (Timeout)
>>   futex(0x80e53b8, FUTEX_WAKE, 1)         = 0
>>   accept(3, 0x408193f8, [110])            = -1 EAGAIN (Resource temporarily unavailable)
>>
>> There is no point in calling accept(3) unless select() flags file handle
>> #3 as readable.
> 
> This mindboggling piece of loveliness is in xen/web/connection.py.  If you can
> unpick it, a patch would be more than welcome!

Can someone explain the comment at the start of the file?

<quote>
    """We make sockets non-blocking so that operations like accept()
    don't block. We also select on a timeout. Otherwise we have no way
    of getting the threads to shutdown.
    """
</quote>

What exactly is the thread shutdown problem here?  Why is the timeout 
needed in the first place?

cheers,

   Gerd

Gerd Knorr

2005-Nov-02 15:35 UTC

Re: [Xen-devel] "xm save" trouble -- deadlock?

> I've done a threaded fix already.  You're welcome to have a go at doing it
> without a thread if you want, but I think it'll be messy.

OK, while waiting for the fix to show up in the public mercurial tree I've 
worked around the issue with a wrapper script which redirects xc_save 
stderr to a file.  There I get this:

    FNI 21 : [10000003,768] pte=25fae063, mfn=00025fae, pfn=00001b00 [mfn]=deadbeef
    FNI 21 : [10000003,769] pte=25faf063, mfn=00025faf, pfn=00001b01 [mfn]=deadbeef
    FNI 21 : [10000003,770] pte=25fb0063, mfn=00025fb0, pfn=00001b02 [mfn]=deadbeef
    [ ... many more of these ... ]
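
A wrapper of the kind mentioned above could look roughly like this in Python. The original wrapper script was not posted, so this is only a sketch; run_with_stderr_logged and any paths passed to it are assumptions for illustration:

```python
import subprocess


def run_with_stderr_logged(argv, log_path):
    """Run argv with its stderr appended to log_path; return the exit code.

    stdout and stdin are inherited unchanged, so whatever the caller reads
    from the helper's stdout still arrives.
    """
    with open(log_path, "ab") as log:
        # close_fds=False keeps any descriptors the caller handed to the
        # wrapper open for the helper, as a plain shell wrapper would.
        return subprocess.call(argv, stderr=log, close_fds=False)
```

A wrapper script would call this with the path of the real helper binary followed by its own command-line arguments, and exit with the returned code. Because stderr goes to a regular file rather than a pipe, the helper can never block on diagnostics.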

In the source code where this message is printed I find a comment 
saying "/* I don't think this should ever happen */".  Hmm.  It does.
And it probably is a problem.  "xm save" works now, but I can't restore
the domain:

   master-xen root /tmp# xm restore /vm/ttylinux/suspend.img
   Error: Could not read store/console MFN

Ideas anyone?  This is a ttylinux instance running out of a ramdisk.

   Gerd

Ewan Mellor

2005-Nov-02 15:41 UTC

Re: [Xen-devel] "xm save" trouble -- deadlock?

On Wed, Nov 02, 2005 at 04:35:00PM +0100, Gerd Knorr wrote:
> >I've done a threaded fix already.  You're welcome to have a go at doing it
> >without a thread if you want, but I think it'll be messy.
> 
> OK, while waiting for the fix to show up in the public mercurial tree I've
> worked around the issue with a wrapper script which redirects xc_save 
> stderr to a file.  There I get this:
> 
>    FNI 21 : [10000003,768] pte=25fae063, mfn=00025fae, pfn=00001b00 [mfn]=deadbeef
>    FNI 21 : [10000003,769] pte=25faf063, mfn=00025faf, pfn=00001b01 [mfn]=deadbeef
>    FNI 21 : [10000003,770] pte=25fb0063, mfn=00025fb0, pfn=00001b02 [mfn]=deadbeef
>    [ ... many more of these ... ]
> 
> In the source code where this message is printed I find a comment 
> saying "/* I don't think this should ever happen */".  Hmm.  It does.
> And it probably is a problem.

In and of itself, this diagnostic message is harmless, despite the comment to
the contrary.

> "xm save" works now, but I can't restore the domain:
> 
>   master-xen root /tmp# xm restore /vm/ttylinux/suspend.img
>   Error: Could not read store/console MFN

It is trying to read two values that are output by the xc_restore helper
program on its stdout.  Have you inadvertently lost xc_restore's stdout?  If
not, then xc_restore is broken -- check for corresponding diagnostic
information in /var/log/xend.log, /var/log/xend-debug.log, and dmesg.

Ewan.

Gerd Knorr

2005-Nov-02 17:23 UTC

Re: [Xen-devel] "xm save" trouble -- deadlock?

>> "xm save" works now, but I can't restore the domain:
>>
>>   master-xen root /tmp# xm restore /vm/ttylinux/suspend.img
>>   Error: Could not read store/console MFN
> 
> It is trying to read two values that are output by the xc_restore helper
> program on its stdout.  Have you inadvertently lost xc_restore's stdout?  If
> not, then xc_restore is broken -- check for corresponding diagnostic
> information in /var/log/xend.log, /var/log/xend-debug.log, and dmesg.

Well, it depends on the point in time at which I suspend the ttylinux domain.

When suspending it quickly, so I catch it during kernel boot, suspend 
and resume work OK.  The resumed domain quickly stops though and starts 
eating CPU time.  I also see the mfn messages in the stdout logfile when 
starting xc_restore using the logging wrapper script.

When suspending it once it has booted to the login prompt, the resume doesn't
work.  The stdout log also is empty.

BTW: what is the latest linux kernel code?  Ian announced 
linux-2.6-xen.hg some weeks ago and also mentioned the sparse trees in 
the xen-unstable.hg repository will be kept in sync.  Is that still the 
case?

cheers,

   Gerd

Kip Macy

2005-Nov-02 18:28 UTC

Re: [Xen-devel] "xm save" trouble -- deadlock?

> What exactly is the thread shutdown problem here? Why the timeout is
> needed in the first place?

I didn't see an answer on this thread so I'll take a stab.

If you do a select without a timeout and no activity occurs on the file
descriptors the thread may have no way of exiting cleanly.
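
A minimal sketch of the pattern being described, assuming shutdown is requested through a threading.Event. The names serve, stop_event and handle are illustrative only and are not taken from xen/web/connection.py. Note that accept() is only called once select() reports the listening socket as readable:

```python
import select
import socket
import threading


def serve(listener, stop_event, handle, poll_interval=1.0):
    """Accept connections on listener until stop_event is set."""
    listener.setblocking(False)
    while not stop_event.is_set():
        # The timeout bounds how long the thread can sleep without
        # re-checking stop_event; without it an idle thread never exits.
        readable, _, _ = select.select([listener], [], [], poll_interval)
        if not readable:
            continue  # timed out with no pending connection
        try:
            conn, addr = listener.accept()
        except BlockingIOError:
            continue  # the connection went away between select() and accept()
        handle(conn, addr)
```

The cost of this design is a periodic wakeup per listening thread, and shutdown latency of up to poll_interval seconds.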

-Kip


Gerd Knorr

2005-Nov-03 08:53 UTC

Re: [Xen-devel] "xm save" trouble -- deadlock?

Kip Macy wrote:

>     What exactly is the thread shutdown problem here?  Why the timeout is
>     needed in the first place? 
> 
> I didn't see an answer on this thread so I'll take a stab.
> 
> If you do a select without a timeout and no activity occurs on the file 
> descriptors the thread may have no way of exiting cleanly.

Hmm, it's still not clear to me how this is supposed to work.  How is it
signaled to the threads that they should exit?  What I see when stracing 
the thread, then running "xend stop" in another tty, is that the thread is
simply killed off with SIGHUP, with no cleanup being done by the thread.

The select() system call will also return on signals (with errno=EINTR); 
on Linux it does so even when SA_RESTART was set in sigaction(2).  So if 
SIGHUP is used to signal the thread that it should exit, the timeout can go 
away.  Probably the whole select() can go away as well, as accept() 
will return on signals too (as long as the handler is installed without 
SA_RESTART), so just sitting in the accept syscall should work just fine.

At the moment I still don't see the point in using select() in the first
place when there is one thread per socket anyway ...

cheers,

   Gerd
