thr3ads.net - Xen devel - [Xen-devel] xm pause causing lockup [Apr 2005]

Kip Macy

2005-Apr-14 02:46 UTC

[Xen-devel] xm pause causing lockup

Is there some critical section where "xm pause" might cause a lockup?
When I run FreeBSD to some point (I don't know where - the console
isn't giving output) - and then pause it, the machine locks up. I
can't even get info from xen apart from switching the serial input.


kmacy@siml4 ssh moe
kmacy@moe xm list
Name              Id  Mem(MB)  CPU  State  Time(s)  Console
Domain-0           0      507    0  r----     54.5        
xen-vm2            1      128    1  r----      0.1    9601
kmacy@moe xm pause 1

<....>
Note that this is with domu_debug = n.

The changeset I'm using is from -unstable on the 11th.


         -Kip

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
lists.xensource.com/xen-devel

Kip Macy

2005-Apr-14 03:03 UTC

Re: [Xen-devel] xm pause causing lockup

Dom0 is unpingable, all network connections timeout.
               
                             -Kip 

> Does dom0 wedge completely (e.g., unpingable) or might it just be xend
> that goes awol? The pause code isn't that complicated, and most of it
> is used in various not uncommon situations, so I'll be surprised if
> this is the hypervisor's fault.

Keir Fraser

2005-Apr-14 03:03 UTC

Re: [Xen-devel] xm pause causing lockup

On 14 Apr 2005, at 03:46, Kip Macy wrote:
> Is there some critical section where "xm pause" might cause a lockup?
> When I run FreeBSD to some point (I don't know where - the console
> isn't giving output) - and then pause it, the machine locks up. I
> can't even get info from xen apart from switching the serial input.

Does dom0 wedge completely (e.g., unpingable) or might it just be xend
that goes awol? The pause code isn't that complicated, and most of it
is used in various not uncommon situations, so I'll be surprised if
this is the hypervisor's fault.

  -- Keir



Keir Fraser

2005-Apr-14 03:16 UTC

Re: [Xen-devel] xm pause causing lockup

Probably easiest way to trace this is with printk's in Xen. The guts of
the work is done by domain_pause_by_systemcontroller() in xen/sched.h.
This in turn calls domain_sleep() in common/schedule.c. A particularly
interesting place to look will be the synchronous spin loop at the end
of domain_sleep -- if the paused domain isn't descheduled for some
weird reason then the spin loop would never exit and domain0 would
hang.

  -- Keir
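
To make the failure mode described in this message concrete, here is a minimal sketch of a synchronous wait that only returns once some other party clears a flag. This is not Xen's domain_sleep(); the function name, the `still_running` flag and the spin bound are assumptions made purely for illustration. The real loop is described as having no bound, which is why a domain that is never descheduled would hang the caller.

```c
/*
 * Hypothetical model of a synchronous "wait until descheduled" loop.
 * still_running stands in for whatever per-domain state the scheduler
 * clears once the paused domain is off its CPU (an assumption, not a
 * real Xen field).
 * Returns 1 if the flag was seen cleared, 0 if max_spins ran out.
 */
int wait_until_descheduled(const volatile int *still_running,
                           unsigned long max_spins)
{
    unsigned long spins;

    for (spins = 0; spins < max_spins; spins++) {
        if (!*still_running)
            return 1;   /* descheduled: the pause can complete */
    }
    /* An unbounded version of this loop would spin here forever. */
    return 0;
}
```

With a bound in place, a caller could print a diagnostic when the function returns 0 instead of wedging silently.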


Kip Macy

2005-Apr-14 03:18 UTC

Re: [Xen-devel] xm pause causing lockup

On 4/13/05, Keir Fraser <Keir.Fraser@cl.cam.ac.uk> wrote:
> Probably easiest way to trace this is with printk's in Xen. The guts of
> the work is done by domain_pause_by_systemcontroller() in xen/sched.h.
> This in turn calls domain_sleep() in common/schedule.c.

I traced through that code a while back when trying to decide what to
call from the int3 handler.

> A particularly interesting place to look will be the synchronous spin
> loop at the end of domain_sleep -- if the paused domain isn't
> descheduled for some weird reason then the spin loop would never exit
> and domain0 would hang.

Good point. It will be interesting to see.

I sometimes wonder if I should keep some of the buggy versions of
FreeBSD around for regression testing as they trigger some interesting
behaviours in xen and xend.

           -Kip


Kip Macy

2005-Apr-14 19:41 UTC

Re: [Xen-devel] xm pause causing lockup

I haven't tracked down the problem yet, but I thought the following
was sufficiently interesting to post:

kmacy@curly while (1)
while? xm list
while? sleep 5
while? end
Name              Id  Mem(MB)  CPU  State  Time(s)  Console
Domain-0           0      507    0  r----     67.9        
xen-vm2            1      128    1  r----      4.0    9601
Name              Id  Mem(MB)  CPU  State  Time(s)  Console
Domain-0           0      507    0  r----     68.1        
xen-vm2            1      128    1  r----      4.0    9601
Name              Id  Mem(MB)  CPU  State  Time(s)  Console
Domain-0           0      507    0  r----     68.3        
xen-vm2            1      128    1  r----      4.0    9601
Name              Id  Mem(MB)  CPU  State  Time(s)  Console
Domain-0           0      507    0  r----     68.5        
xen-vm2            1      128    1  r----      4.0    9601
Name              Id  Mem(MB)  CPU  State  Time(s)  Console
Domain-0           0      507    0  r----     68.7        
xen-vm2            1      128    1  r----      4.0    9601
Name              Id  Mem(MB)  CPU  State  Time(s)  Console
Domain-0           0      507    0  r----     68.9        
xen-vm2            1      128    1  r----      4.0    9601

xen-vm2 is always shown as running, but its time is not increasing.

               -Kip
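
The observation above can be turned into a simple check. The sketch below is only an illustration: `cpu_time_stalled` is a hypothetical helper, not part of xm or xend, and its input would be the Time(s) column from repeated `xm list` runs, converted to tenths of a second so that no floating-point comparison is needed.

```c
#include <stddef.h>

/*
 * Returns 1 if the cumulative CPU time never advanced across the
 * samples (at least two samples are required), 0 otherwise.
 * Times are given in tenths of a second.
 */
int cpu_time_stalled(const unsigned long *tenths, size_t n)
{
    size_t i;

    if (n < 2)
        return 0;
    for (i = 1; i < n; i++) {
        if (tenths[i] != tenths[0])
            return 0;   /* time moved, so the domain did run */
    }
    return 1;
}
```

Fed the six samples above, Domain-0 (67.9 through 68.9) is reported as advancing while xen-vm2 (4.0 every time) is reported as stalled, even though both show state `r`.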




Kip Macy

2005-Apr-14 21:04 UTC

Re: [Xen-devel] xm pause causing lockup

I think there may be a bug in your page pinning validation logic - the
lockup occurs when stepping through xen_pgd_pin. I don't know if I'm
really passing in 0, as register locals can quickly get overwritten,
but it is certainly worth checking.

Breakpoint 15, pmap_pinit (pmap=0xc06900c0) at
../../../i386-xen/i386-xen/pmap.c:1206
1206                    xen_pgd_pin(ma);
(gdb) 
Continuing.

Breakpoint 8, xen_pgd_pin (ma=0x0) at
../../../i386-xen/i386-xen/xen_machdep.c:490
490         op.cmd = MMUEXT_PIN_L2_TABLE;
(gdb) s
491         op.mfn = ma >> PAGE_SHIFT;
(gdb) 
492         xen_flush_queue();
(gdb) 

Breakpoint 4, xen_flush_queue () at ../../../i386-xen/i386-xen/xen_machdep.c:431
431         if (XPQ_IDX != 0) _xen_flush_queue();
(gdb) 
432     }
(gdb) 
xen_pgd_pin (ma=0x630f) at hypervisor.h:72
72      {
(gdb) 
76          __asm__ __volatile__ (
(gdb) 



Kip Macy

2005-Apr-15 04:36 UTC

Re: [Xen-devel] xm pause causing lockup

To further check this I added:
 printk("%s %d %d %d %d %d\n", __FUNCTION__, op->cmd, op->mfn, count,
        success_count, domid);
to HYPERVISOR_mmuext_op and something similar to mmu_update.


HYPERVISOR_mmu_update 0xc0200ba0 1 0 32752
HYPERVISOR_mmu_update 0xc0200ba0 1 0 32752
HYPERVISOR_mmuext_op 7 -1069543424 1 0 32752
HYPERVISOR_mmu_update 0xc0200ba0 1 0 32752
HYPERVISOR_mmu_update 0xc0200ba0 1 0 32752
HYPERVISOR_mmuext_op 7 -955666432 1 0 32752
HYPERVISOR_mmuext_op 1 25359 1 0 32752
<lockup>

I'm not sure where I could add printks to
get_page_and_type_from_pagenr without making DOM0 take forever to
boot. Suggestions are welcome. Alternatively you could do me a favor
and just run my FreeBSD binary locally.
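
A side note on reading the printk output in this message: the second field of the mmuext_op lines is printed with %d, so large 32-bit values show up as negative numbers. The sketch below (the helper name is made up for illustration) recovers the unsigned value. Under that reading the two cmd=7 lines carry 0xc0401000 and 0xc709b000, and the final cmd=1 line carries 25359 = 0x630f, the same value gdb showed for xen_pgd_pin earlier in the thread.

```c
#include <stdint.h>

/*
 * Reinterpret a value that was printed with %d as the unsigned 32-bit
 * quantity it was before formatting.
 */
uint32_t as_unsigned32(int32_t printed)
{
    return (uint32_t)printed;
}
```

Printing such fields with an unsigned hex conversion in the first place would avoid the manual step.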



Ian Pratt

2005-Apr-15 06:53 UTC

RE: [Xen-devel] xm pause causing lockup

> -----Original Message-----
> From: xen-devel-bounces@lists.xensource.com 
> [mailto:xen-devel-bounces@lists.xensource.com] On Behalf Of Kip Macy
> Sent: 15 April 2005 05:36
> To: Keir Fraser
> Cc: xen-devel
> Subject: Re: [Xen-devel] xm pause causing lockup
> 
> To further check this I added:
>  printk("%s %d %d %d %d %d\n", __FUNCTION__, op->cmd, 
> op->mfn, count, success_count, domid); to 
> HYPERVISOR_mmuext_op and something similar to mmu_update.
Is your hypothesis that Xen gets stuck in either the mmuext_op or
mmu_update loops?
Are you running with watchdog enabled?

It might be good to add a printk at the end so that you can prove this. 

Hitting 'd' on the debug console will give us an EIP on CPU 1.

Ian


Kip Macy

2005-Apr-15 17:12 UTC

Re: [Xen-devel] xm pause causing lockup

Great, thanks. I'm now running a completely fresh tree from last night.

Over the course of several minutes I hit 'd' a number of times. The
addresses I got were:

0xfc51c742
0xfc51c746
0xfc51c74b
0xfc51c740

(gdb) x/i 0xfc51c742
0xfc51c742 <get_page_type+1218>:        mov    0x40(%esp,1),%eax
(gdb) x/i 0xfc51c746
0xfc51c746 <get_page_type+1222>:        mov    0x14(%eax),%ebx
(gdb) x/i 0xfc51c74b
0xfc51c74b <get_page_type+1227>:        je     0xfc51c740 <get_page_type+1216>
(gdb) x/i 0xfc51c740
0xfc51c740 <get_page_type+1216>:        repz nop 


               -Kip
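
One way to read the samples above: if every sampled EIP falls within a handful of bytes, the CPU is very likely sitting in a tight loop rather than making progress. A minimal sketch of that check follows; the helper name is an assumption for illustration, and the addresses are the four listed in this message.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Distance in bytes between the lowest and highest sampled EIP.
 * Returns 0 for an empty sample set.
 */
uint32_t eip_span(const uint32_t *eips, size_t n)
{
    uint32_t lo, hi;
    size_t i;

    if (n == 0)
        return 0;
    lo = hi = eips[0];
    for (i = 1; i < n; i++) {
        if (eips[i] < lo)
            lo = eips[i];
        if (eips[i] > hi)
            hi = eips[i];
    }
    return hi - lo;
}
```

For the four samples the span is 0xb bytes, from the `repz nop` at get_page_type+1216 to the `je` at get_page_type+1227 that jumps back to it.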


Ian Pratt

2005-Apr-15 17:25 UTC

RE: [Xen-devel] xm pause causing lockup

Wild! It really is looping in get_page_type.

Any chance you could use the serial debugger to find out what x, nx and
y are in the cmpxchg?

I've tried to think of duff inputs that could cause it to loop, but I'm
not smart enough.

Ian 

Kip Macy

2005-Apr-15 18:57 UTC

Re: [Xen-devel] xm pause causing lockup

Without the ability to continue and only a very basic understanding of
the page typing code there is not a whole lot to go on. Let me know if
there is some other bit of information that I can provide you with.

         -Kip

Before attaching:
(XEN) 'd' pressed -> dumping registers
(XEN) CPU:    1
(XEN) EIP:    0808:[<fc52d59f>]      
(XEN) EFLAGS: 00000246   CONTEXT: hypervisor
(XEN) eax: 40000001   ebx: 00000000   ecx: fcfe3740   edx: fcfe3740
(XEN) esi: 00007ff0   edi: 00000001   ebp: fcffbda0   esp: fcffbd58
(XEN) ds: 0810   es: 0810   fs: 0810   gs: 0810   ss: 0810   cs: 0808
(XEN) Stack trace from ESP=fcffbd58:
(XEN)    80000003 00000001 fcfe3740 fcfe3740 fcfe3740 80000003 80000004 80000003
(XEN)    00000000 00007ff0 fcffbda0 [fc52bfec] fd494968 fcfe3740 fcffbdc0 40000001
(XEN)    40000001 40000002 fcffbdd0 [fc52c07b] fd494968 25fe0000 00000000 00000000
(XEN)    000003d1 00000000 fcffbde0 [fc52bcec] 00000000 fd494968 fcffbe00 [fc52c52e]
(XEN)    0000630f 25fe0000 fcfe3740 [fc52d100] fffffffc 00000000 fcffe000 00000001
(XEN)    00000001 ff85b000 fcffbe40 [fc52c889] 0630f061 0000630f fcfe3740 000002ff
(XEN)    00000001 f0000000 f0000000 00000004 f0000001 f0000000 000002ff ff85b000
(XEN)    0000630f fcfe3740 fcffbe60 [fc52d0f0] fd494968 000001fa fc5b20c0 [fc53185d]
(XEN)    40000000 00000002 fcffbeb0 [fc52d771] fd494968 40000000 fcfe3740 fcfe3740
(XEN)    fcfe3740 80000002 80000003 00000004 00000000 f0000000 f0000000 00000004
(XEN)    40000001 f0000000 fd49497c f0000000 f0000000 40000001 fcffbee0 [fc52c07b]
(XEN)    fd494968 40000000 002ed518 00000000 a089075b 00000001 fcfe3740 00000000
(XEN)    00007ff0 fd494968 fcffbfb0 [fc52df98] 0000630f 40000000 fcfe3740 00000292
(XEN)    fc5781c0 00000001 0019b901 00000000 00804e95 00000000 a089075b 000000a1
(XEN)    a10955f0 000000a1 00000001 fcfea040 00007ff0 00000001 fcffbf80 00000000
(XEN)    fcfe3740 00000000 fcfe3740 00000000 a10955f0 000000a1 00000000 fcffbf98
(XEN)    c0293bac 0000000c 00000003 [fc515bfc] a08902cd 000000a1 00000002 fcfe3740
(XEN)    fcfea040 fd494968 00000000 40000000 00000001 00000001 00000000 00000000
(XEN)    00000001 0000630f c018a19b 00000001 fcfea040 00007ff0 c0293bc8 [fc54e923]
(XEN)    c0293bac 00000001 00000000 00007ff0 00000001 c0293bc8 0000001a 00000000
(XEN) Call Trace from ESP=fcffbd58:
(XEN)    [<fc52bfec>] [<fc52c07b>] [<fc52bcec>] [<fc52c52e>] [<fc52d100>] [<fc52c889>]
(XEN)    [<fc52d0f0>] [<fc53185d>] [<fc52d771>] [<fc52c07b>] [<fc52df98>] [<fc515bfc>]
(XEN)    [<fc54e923>]
(XEN) Waiting for GDB to attach to XenDBG


(gdb) bt
#0  0xfc52d59f in get_page_type (page=0xfd494968, type=0x25fe0000) at mm.c:1235
#1  0xfc52c07b in get_page_and_type_from_pagenr (page_nr=0x630f, type=0x25fe0000, d=0xfcfe3740) at mm.c:360
#2  0xfc52c52e in get_page_from_l2e (l2e={l2_lo = 0x630f061}, pfn=0x630f, d=0xfcfe3740, va_idx=0x2ff) at mm.c:495
#3  0xfc52c889 in alloc_l2_table (page=0xfd494968) at mm.c:679
#4  0xfc52d0f0 in alloc_page_type (page=0xfd494968, type=0x40000000) at mm.c:1083
#5  0xfc52d771 in get_page_type (page=0xfd494968, type=0x40000000) at mm.c:1269
#6  0xfc52c07b in get_page_and_type_from_pagenr (page_nr=0x630f, type=0x40000000, d=0xfcfe3740) at mm.c:360
#7  0xfc52df98 in do_mmuext_op (uops=0xc0293bac, count=0x1, pdone=0x0, foreigndom=0x7ff0) at mm.c:1499
#8  0xfc54e923 in test_all_events () at bitops.h:239
#9  0xc0293bac in ?? ()

(gdb) f 7
#7  0xfc52df98 in do_mmuext_op (uops=0xc0293bac, count=0x1, pdone=0x0, foreigndom=0x7ff0)  at mm.c:1499
1499                okay = get_page_and_type_from_pagenr(op.mfn, type, FOREIGNDOM);
(gdb) p op
$9 = {
  cmd = 0x1,
  {
    mfn = 0x630f,
    linear_addr = 0x630f
  },
  {
    nr_ents = 0xc018a19b,
    cpuset = 0xc018a19b
  }
}
(gdb) p x
$1 = 0x40000001
(gdb) x nx
0x40000002:     Ignoring packet error, continuing...
Reply contains invalid hex digit 40
(gdb) p y
$2 = 0x40000001
(gdb) p page->u.inuse.type_info
$3 = 0x40000001
(gdb) p x
$4 = 0x40000001
(gdb) p nx
$5 = 0x40000002
(gdb) p y
$6 = 0x40000001
(gdb) p x
$7 = 0x40000001
(gdb) p sizeof(page->u.inuse.type_info)
$8 = 0x4
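
One thing the backtrace in this message shows directly is that frames #0 and #5 are both get_page_type() on page=0xfd494968, and frames #1 and #6 both pass page_nr=0x630f. That would be consistent with the inner call waiting on work that the outer call has not finished, though that interpretation is a guess rather than something established in the thread. A small sketch of checking a backtrace for this pattern follows; the struct and function names are made up for illustration, and the frame data would be transcribed by hand from the gdb output.

```c
#include <stddef.h>
#include <string.h>

/* One backtrace frame: function name plus the page argument it was
 * called with (0 when the frame has no page argument). */
struct bt_frame {
    const char   *func;
    unsigned long page;
};

/*
 * Returns 1 if func appears in two different frames with the same
 * non-zero page argument, i.e. the function was re-entered for a page
 * it is already working on; returns 0 otherwise.
 */
int reentered_on_same_page(const struct bt_frame *bt, size_t n,
                           const char *func)
{
    size_t i, j;

    for (i = 0; i < n; i++) {
        if (bt[i].page == 0 || strcmp(bt[i].func, func) != 0)
            continue;
        for (j = i + 1; j < n; j++) {
            if (bt[j].page == bt[i].page &&
                strcmp(bt[j].func, func) == 0)
                return 1;
        }
    }
    return 0;
}
```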




Ian Pratt

2005-Apr-15 19:29 UTC

RE: [Xen-devel] xm pause causing lockup

I need to think about this more, but it looks like you have an L2 page
that has a type count of 1 but hasn't been validated. You're then
looping when you try and increment it to 2 thinking that you're racing
someone else.

Does this happen if you boot with 'nosmp'? I don't really believe it's a
race, but might be worth checking.

Also, it's worth adding a printk into this loop just to check that that
is where you're getting caught.

            /* Someone else is updating validation of this page. Wait... */
            while ( (y = page->u.inuse.type_info) == x )
                cpu_relax();
            goto again;

We need to figure out how the type count managed to get to one without
the page being validated. I presume you're doing a debug=y build of
Xen? Do you get any warnings about illegal mmu_update attempts when you
boot FreeBSD?

Ian
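
The reading of type_info given in this message (count of 1, not validated) can be sketched as a small decoder. The bit layout below is an assumption for illustration and should be checked against the mm.h of the tree in use: only the value 0x40000000 for an L2 page table is taken from the backtrace earlier in the thread, while the position of the validated bit and the width of the count field are guesses, and the SKETCH_ names are hypothetical.

```c
#include <stdint.h>

/* ASSUMED layout of type_info, for illustration only. */
#define SKETCH_TYPE_MASK   0xe0000000u  /* assumed: top three bits = type */
#define SKETCH_L2_TABLE    0x40000000u  /* type=0x40000000 in the backtrace */
#define SKETCH_VALIDATED   0x10000000u  /* assumed: bit 28 */
#define SKETCH_COUNT_MASK  0x0000ffffu  /* assumed: low 16 bits */

uint32_t type_count(uint32_t type_info)
{
    return type_info & SKETCH_COUNT_MASK;
}

int is_validated(uint32_t type_info)
{
    return (type_info & SKETCH_VALIDATED) != 0;
}

/*
 * A second user of an already-typed page has to wait while the first
 * user holds a reference but has not finished validation.
 */
int second_user_must_wait(uint32_t type_info)
{
    return type_count(type_info) != 0 && !is_validated(type_info);
}
```

Under these assumptions the observed value 0x40000001 decodes as an L2 page table with a count of 1 and the validated bit clear, which is the state in which the wait loop quoted in this message keeps spinning.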
> Without the ability to continue and only a very basic 
> understanding of the page typing code there is not a whole 
> lot to go on. Let me know if there is some other bit of 
> information that I can provide you with.
> 
>          -Kip
> 
> Before attaching:
> (XEN) ''d'' pressed -> dumping registers
> (XEN) CPU:    1
> (XEN) EIP:    0808:[<fc52d59f>]      
> (XEN) EFLAGS: 00000246   CONTEXT: hypervisor
> (XEN) eax: 40000001   ebx: 00000000   ecx: fcfe3740   edx: fcfe3740
> (XEN) esi: 00007ff0   edi: 00000001   ebp: fcffbda0   esp: fcffbd58
> (XEN) ds: 0810   es: 0810   fs: 0810   gs: 0810   ss: 0810   cs: 0808
> (XEN) Stack trace from ESP=fcffbd58:
> (XEN)    80000003 00000001 fcfe3740 fcfe3740 fcfe3740 80000003
> 80000004 80000003
> (XEN)    00000000 00007ff0 fcffbda0 [fc52bfec] fd494968 fcfe3740
> fcffbdc0 40000001
> (XEN)    40000001 40000002 fcffbdd0 [fc52c07b] fd494968 25fe0000
> 00000000 00000000
> (XEN)    000003d1 00000000 fcffbde0 [fc52bcec] 00000000 fd494968
> fcffbe00 [fc52c52e]
> (XEN)    0000630f 25fe0000 fcfe3740 [fc52d100] fffffffc 00000000
> fcffe000 00000001
> (XEN)    00000001 ff85b000 fcffbe40 [fc52c889] 0630f061 0000630f
> fcfe3740 000002ff
> (XEN)    00000001 f0000000 f0000000 00000004 f0000001 f0000000
> 000002ff ff85b000
> (XEN)    0000630f fcfe3740 fcffbe60 [fc52d0f0] fd494968 000001fa
> fc5b20c0 [fc53185d]
> (XEN)    40000000 00000002 fcffbeb0 [fc52d771] fd494968 40000000
> fcfe3740 fcfe3740
> (XEN)    fcfe3740 80000002 80000003 00000004 00000000 f0000000
> f0000000 00000004
> (XEN)    40000001 f0000000 fd49497c f0000000 f0000000 40000001
> fcffbee0 [fc52c07b]
> (XEN)    fd494968 40000000 002ed518 00000000 a089075b 00000001
> fcfe3740 00000000
> (XEN)    00007ff0 fd494968 fcffbfb0 [fc52df98] 0000630f 40000000
> fcfe3740 00000292
> (XEN)    fc5781c0 00000001 0019b901 00000000 00804e95 00000000
> a089075b 000000a1
> (XEN)    a10955f0 000000a1 00000001 fcfea040 00007ff0 00000001
> fcffbf80 00000000
> (XEN)    fcfe3740 00000000 fcfe3740 00000000 a10955f0 000000a1
> 00000000 fcffbf98
> (XEN)    c0293bac 0000000c 00000003 [fc515bfc] a08902cd 000000a1
> 00000002 fcfe3740
> (XEN)    fcfea040 fd494968 00000000 40000000 00000001 00000001
> 00000000 00000000
> (XEN)    00000001 0000630f c018a19b 00000001 fcfea040 00007ff0
> c0293bc8 [fc54e923]
> (XEN)    c0293bac 00000001 00000000 00007ff0 00000001 c0293bc8
> 0000001a 00000000
> (XEN) Call Trace from ESP=fcffbd58:
> (XEN)    [<fc52bfec>] [<fc52c07b>] [<fc52bcec>]
[<fc52c52e>]
> [<fc52d100>] [<fc52c889>]
> (XEN)    [<fc52d0f0>] [<fc53185d>] [<fc52d771>]
[<fc52c07b>]
> [<fc52df98>] [<fc515bfc>]
> (XEN)    [<fc54e923>] 
> (XEN) Waiting for GDB to attach to XenDBG
> 
> 
> gdb) bt
> #0  0xfc52d59f in get_page_type (page=0xfd494968, 
> type=0x25fe0000) at mm.c:1235
> #1  0xfc52c07b in get_page_and_type_from_pagenr 
> (page_nr=0x630f, type=0x25fe0000, d=0xfcfe3740) at mm.c:360
> #2  0xfc52c52e in get_page_from_l2e (l2e={l2_lo = 0x630f061}, 
> pfn=0x630f, d=0xfcfe3740, va_idx=0x2ff) at mm.c:495
> #3  0xfc52c889 in alloc_l2_table (page=0xfd494968) at mm.c:679
> #4  0xfc52d0f0 in alloc_page_type (page=0xfd494968, 
> type=0x40000000) at mm.c:1083
> #5  0xfc52d771 in get_page_type (page=0xfd494968, 
> type=0x40000000) at mm.c:1269
> #6  0xfc52c07b in get_page_and_type_from_pagenr 
> (page_nr=0x630f, type=0x40000000, d=0xfcfe3740) at mm.c:360
> #7  0xfc52df98 in do_mmuext_op (uops=0xc0293bac, count=0x1, pdone=0x0,
> foreigndom=0x7ff0) at mm.c:1499
> #8  0xfc54e923 in test_all_events () at bitops.h:239
> #9  0xc0293bac in ?? ()
> 
> (gdb) f 7
> #7  0xfc52df98 in do_mmuext_op (uops=0xc0293bac, count=0x1, pdone=0x0,
> foreigndom=0x7ff0)  at mm.c:1499
> 1499                okay = get_page_and_type_from_pagenr(op.mfn, type,
> FOREIGNDOM);
> (gdb) p op
> $9 = {
>   cmd = 0x1,
>   {
>     mfn = 0x630f,
>     linear_addr = 0x630f
>   },
>   {
>     nr_ents = 0xc018a19b,
>     cpuset = 0xc018a19b
>   }
> }
> (gdb) p x
> $1 = 0x40000001
> (gdb) x nx
> 0x40000002:     Ignoring packet error, continuing...
> Reply contains invalid hex digit 40
> (gdb) p y
> $2 = 0x40000001
> (gdb) p page->u.inuse.type_info
> $3 = 0x40000001
> (gdb) p x
> $4 = 0x40000001
> (gdb) p nx
> $5 = 0x40000002
> (gdb) p y
> $6 = 0x40000001
> (gdb) p x
> $7 = 0x40000001
> (gdb) p sizeof(page->u.inuse.type_info)
> $8 = 0x4
> 
> 
> 
> On 4/15/05, Ian Pratt <m+Ian.Pratt@cl.cam.ac.uk> wrote:
> > Wild! It really is looping in get_page_type.
> > 
> > Any chance you could use the serial debugger to find out what x, nx 
> > and y are in the cmpxchg?
> > 
> > I''ve tried to think of duff inputs that could cause it to
loop, but
> > I''m not smart enough.
> > 
> > Ian
> > 
> > > -----Original Message-----
> > > From: Kip Macy [mailto:kip.macy@gmail.com]
> > > Sent: 15 April 2005 18:13
> > > To: Ian Pratt
> > > Cc: Keir Fraser; xen-devel; ian.pratt@cl.cam.ac.uk
> > > Subject: Re: [Xen-devel] xm pause causing lockup
> > >
> > > Great, thanks. I'm now running a completely fresh tree from last
> > > night.
> > >
> > > Over the course of several minutes I hit 'd' a number of times. The
> > > addresses I got were:
> > >
> > > 0xfc51c742
> > > 0xfc51c746
> > > 0xfc51c74b
> > > 0xfc51c740
> > >
> > > (gdb) x/i 0xfc51c742
> > > 0xfc51c742 <get_page_type+1218>:        mov    0x40(%esp,1),%eax
> > > (gdb) x/i 0xfc51c746
> > > 0xfc51c746 <get_page_type+1222>:        mov    0x14(%eax),%ebx
> > > (gdb) x/i 0xfc51c74b
> > > 0xfc51c74b <get_page_type+1227>:        je     0xfc51c740 <get_page_type+1216>
> > > (gdb) x/i 0xfc51c740
> > > 0xfc51c740 <get_page_type+1216>:        repz nop
> > >
> > >
> > >                -Kip
> > >
> > > On 4/14/05, Ian Pratt <m+Ian.Pratt@cl.cam.ac.uk> wrote:
> > > >
> > > >
> > > > > -----Original Message-----
> > > > > From: xen-devel-bounces@lists.xensource.com
> > > > > [mailto:xen-devel-bounces@lists.xensource.com] On Behalf
> > > > > Of Kip Macy
> > > > > Sent: 15 April 2005 05:36
> > > > > To: Keir Fraser
> > > > > Cc: xen-devel
> > > > > Subject: Re: [Xen-devel] xm pause causing lockup
> > > > >
> > > > > To further check this I added:
> > > > >  printk("%s %d %d %d %d %d\n", __FUNCTION__, op->cmd,
> > > > > op->mfn, count, success_count, domid); to
> > > > > HYPERVISOR_mmuext_op and something similar to mmu_update.
> > > >
> > > > Is your hypothesis that Xen gets stuck in either the mmuext_op or
> > > > mmu_update loops?
> > > > Are you running with watchdog enabled?
> > > >
> > > > It might be good to add a printk at the end so that you can
> > > > prove this.
> > > >
> > > > Hitting 'd' on the debug console will give us an EIP on CPU 1.
> > > >
> > > > Ian
> > > >
> > >
> >
> 

Kip Macy

2005-Apr-15 21:04 UTC

Re: [Xen-devel] xm pause causing lockup

> Does this happen if you boot with 'nosmp'? I don't really believe it's a
> race, but might be worth checking.
Yes, it still happens. I would have found it quite astonishing if it
were a race.
(XEN) EIP:    0808:[<fc52d5a3>]
(gdb) x/i 0xfc52d5a3
0xfc52d5a3 <get_page_type+265>: mov    0x14(%eax),%eax
(gdb) info line *0xfc52d5a3
Line 1236 of "mm.c" starts at address 0xfc52d5a0 <get_page_type+262>
and ends at 0xfc52d5b0 <get_page_type+278>.
(gdb) 

Line 1236-1240 of local mm.c:
            while ( (y = page->u.inuse.type_info) == x )
                cpu_relax();
            counter++;
            printk("page was not validated");
            goto again;
> Also, it's worth adding a printk into this loop just to check that that
> is where you're getting caught.
Obviously I wasn't thinking and stuck it in the wrong place.
Nonetheless, even without the printk I think I've proven my point.

> 
>             /* Someone else is updating validation of this page. Wait... */
>             while ( (y = page->u.inuse.type_info) == x )
>                 cpu_relax();
>             goto again;
Yep.
> 
> We need to figure out how the type count managed to get to one without
> the page being validated. I presume you're doing a debug=y build of Xen?
Correct. Nothing comes out on the console apart from debug output from FreeBSD.
> Do you get any warnings about illegal mmu_update attempts when you boot
> FreeBSD?
No, I don't. This is the offending code snippet from pmap_pinit:

        /* install self-referential address mapping entry(s) */
	for (i = 0; i < NPGPTD; i++) {
		ma = xpmap_ptom(VM_PAGE_TO_PHYS(ptdpg[i]));
		pmap->pm_pdir[PTDPTDI + i] = ma | PG_V | PG_A | PG_M;
#ifdef PAE
		pmap->pm_pdpt[i] = ma | PG_V;
#endif
		/* re-map page directory read-only */
		PT_SET_MA(pmap->pm_pdir,
		    *vtopte((vm_offset_t)pmap->pm_pdir) & ~PG_RW);
		xen_pgd_pin(ma);
	}

PT_SET_MA is just a wrapper for update_va_mapping. Have there been any
recent changes to the page typing code that would cause it to get
confused by a self-referential mapping?
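
For anyone following the gdb output, here is a minimal sketch (not the Xen source) of why a type_info of 0x40000001 keeps a caller in the wait loop quoted above. The type value 0x40000000 is taken from the backtrace. The mask names, the position of the validated bit (bit 28) and the width of the count field are assumptions made for illustration. wait_for_change is a hypothetical bounded variant of the loop, not something that exists in mm.c.

```c
#include <stdbool.h>
#include <stdint.h>

/* L2 page-table type, as seen in the backtrace (type=0x40000000). */
#define PGT_L2_PAGE_TABLE 0x40000000u
/* ASSUMPTION: the top three bits of type_info hold the page type. */
#define PGT_TYPE_MASK     0xe0000000u
/* ASSUMPTION: bit 28 is set once the page has been validated. */
#define PGT_VALIDATED     0x10000000u
/* ASSUMPTION: the low bits hold the type reference count. */
#define PGT_COUNT_MASK    0x07ffffffu

static uint32_t type_count(uint32_t type_info)
{
    return type_info & PGT_COUNT_MASK;
}

static bool is_validated(uint32_t type_info)
{
    return (type_info & PGT_VALIDATED) != 0;
}

/*
 * True when a caller asking for wanted_type would have to wait:
 * the page already has that type and a non-zero count, but nobody
 * has marked it validated yet.
 */
static bool would_spin(uint32_t type_info, uint32_t wanted_type)
{
    return (type_info & PGT_TYPE_MASK) == wanted_type &&
           type_count(type_info) != 0 &&
           !is_validated(type_info);
}

/*
 * Hypothetical bounded version of the wait loop. Returns false if
 * *type_info still equals x after max_spins polls, i.e. nobody else
 * is going to finish the validation.
 */
static bool wait_for_change(const volatile uint32_t *type_info,
                            uint32_t x, unsigned long max_spins)
{
    while (*type_info == x) {
        if (max_spins-- == 0)
            return false;
    }
    return true;
}
```

With the values printed by gdb, type_count(0x40000001) is 1 and would_spin(0x40000001, PGT_L2_PAGE_TABLE) is true. If no other CPU ever changes type_info, which is the situation with 'nosmp', the unbounded loop cannot exit. A bounded poll like wait_for_change would at least report the stuck state instead of hanging.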

                          -Kip


Kip Macy

2005-Apr-16 19:59 UTC

buggy linear page table handling Re: [Xen-devel] xm pause causing lockup

I went through a few quick iterations to test page table reference
counting. In short, if I L2 pin a zeroed page that I've re-mapped
read-only the pin succeeds. If the page has a self-referential mapping
before it is remapped read-only the pin never returns. It is probably
safe to conclude that the type count is not correctly changed when the
page is re-mapped if there is a self-referential entry. This used to
work, thus it is also safe to say that this is a regression introduced
some time between 3/22 and 4/11. Test code from pmap_pinit below.

                          -Kip 


	/* ***** TEMP \/ ********** */
	ma = xpmap_ptom(VM_PAGE_TO_PHYS(ptdpg[0]));
#if 0
	/* works */
	pmap_qremove((vm_offset_t)pmap->pm_pdir, NPGPTD);
#elif 0
 	/* works */
	PT_SET_MA(pmap->pm_pdir, 0);
#elif 0
	/* works */
	PT_SET_MA(pmap->pm_pdir, ma | PG_V | PG_A);
#else 		
	/* causes lockup on pin */
	pmap->pm_pdir[PTDPTDI + i] = ma | PG_V | PG_A | PG_M;
	PT_SET_MA(pmap->pm_pdir, ma | PG_V | PG_A);
#endif
	
	printk("pinning %p - pass 0\n", ma);
	xen_pgd_pin(xpmap_ptom(VM_PAGE_TO_PHYS(ptdpg[0])));
	printk("pinned %p - pass 0\n", ma);
	/* ***** TEMP ^ ********** */
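
To make the difference between the working and failing cases concrete, here is a small stand-alone sketch of what "self-referential" means for a non-PAE i386 page directory. It is an illustration only, not FreeBSD or Xen code. The PG_* bit values are the standard i386 page-table flag bits. DEMO_PTDPTDI is a hypothetical slot index (the real PTDPTDI depends on the kernel configuration), and both helper functions are invented for this example. The frame number 0x630f is the mfn from the gdb output.

```c
#include <stddef.h>
#include <stdint.h>

#define PAGE_SHIFT   12
#define NPDEPG       1024         /* entries in a non-PAE page directory */
#define PG_V         0x001u       /* present */
#define PG_RW        0x002u       /* writable */
#define PG_A         0x020u       /* accessed */
#define PG_M         0x040u       /* modified (dirty) */
#define PG_FRAME     0xfffff000u  /* frame address bits of an entry */

/* HYPOTHETICAL slot index, used here only for the demonstration. */
#define DEMO_PTDPTDI 767

/* Install an entry in the directory that points back at the directory. */
static void install_self_ref(uint32_t *pdir, size_t slot, uint32_t self_ma)
{
    pdir[slot] = (self_ma & PG_FRAME) | PG_V | PG_A | PG_M;
}

/* Count the present entries whose frame is the directory's own frame. */
static size_t count_self_refs(const uint32_t *pdir, uint32_t self_ma)
{
    size_t n = 0;

    for (size_t i = 0; i < NPDEPG; i++) {
        if ((pdir[i] & PG_V) &&
            (pdir[i] & PG_FRAME) == (self_ma & PG_FRAME))
            n++;
    }
    return n;
}
```

A zeroed directory has no such entries, which corresponds to the cases marked "works" above. After install_self_ref(pdir, DEMO_PTDPTDI, 0x630f << PAGE_SHIFT) the count is one, which corresponds to the case that locks up on pin: whoever validates the page as an L2 table has to handle an entry that refers to the very page being validated.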

On 4/15/05, Kip Macy <kip.macy@gmail.com> wrote:
> > Does this happen if you boot with 'nosmp'? I don't really believe it's a
> > race, but might be worth checking.
> 
> Yes, it still happens. I would have found it quite astonishing if it
> were a race.
> (XEN) EIP:    0808:[<fc52d5a3>]
> (gdb) x/i 0xfc52d5a3
> 0xfc52d5a3 <get_page_type+265>: mov    0x14(%eax),%eax
> (gdb) info line *0xfc52d5a3
> Line 1236 of "mm.c" starts at address 0xfc52d5a0 <get_page_type+262>
> and ends at 0xfc52d5b0 <get_page_type+278>.
> (gdb)
> 
> Line 1236-1240 of local mm.c:
>             while ( (y = page->u.inuse.type_info) == x )
>                 cpu_relax();
>             counter++;
>             printk("page was not validated");
>             goto again;
> 
> > Also, it's worth adding a printk into this loop just to check that that
> > is where you're getting caught.
> 
> Obviously I wasn't thinking and stuck it in the wrong place.
> Nonetheless, even without the printk I think I've proven my point.
> 
> 
> >
> >             /* Someone else is updating validation of this page. Wait... */
> >             while ( (y = page->u.inuse.type_info) == x )
> >                 cpu_relax();
> >             goto again;
> 
> Yep.
> 
> >
> > We need to figure out how the type count managed to get to one without
> > the page being validated. I presume you're doing a debug=y build of Xen?
> 
> Correct. Nothing comes out on the console apart from debug output from
> FreeBSD.
> 
> > Do you get any warnings about illegal mmu_update attempts when you boot
> > FreeBSD?
> 
> No, I don't. This is the offending code snippet from pmap_pinit:
> 
>         /* install self-referential address mapping entry(s) */
>         for (i = 0; i < NPGPTD; i++) {
>                 ma = xpmap_ptom(VM_PAGE_TO_PHYS(ptdpg[i]));
>                 pmap->pm_pdir[PTDPTDI + i] = ma | PG_V | PG_A | PG_M;
> #ifdef PAE
>                 pmap->pm_pdpt[i] = ma | PG_V;
> #endif
>                 /* re-map page directory read-only */
>                 PT_SET_MA(pmap->pm_pdir,
>                     *vtopte((vm_offset_t)pmap->pm_pdir) & ~PG_RW);
>                 xen_pgd_pin(ma);
>         }
> 
> PT_SET_MA is just a wrapper for update_va_mapping. Have there been any
> recent changes to the page typing code that would cause it to get
> confused by a self-referential mapping?
> 
>                           -Kip
>
