thr3ads.net - Xen devel - [Xen-devel] DomU lockups after resume from S3 on Core i5 processors [Jul 2010]

Joanna Rutkowska

2010-Jul-05 10:38 UTC

[Xen-devel] DomU lockups after resume from S3 on Core i5 processors

I'm experiencing very reproducible DomU lockups that occur after I
resume the system from an S3 sleep. Strangely, this seems to happen only
on my Core i5 systems (tested on two different machines), but not on
older Core 2 Duo systems.

Usually this causes the apps (e.g. Firefox) running in DomUs to become
unresponsive, but sometimes some very limited functionality of the app
is still available (e.g. I can open/close tabs in Firefox, but cannot
do much more). Also, when I log in to the DomU via xm console, I can
usually see the login prompt and enter the username, but then the
console hangs.

I tried to attach to such a hung DomU using gdbserver-xen, but when I
subsequently try to attach to the server from gdb (via the target
remote 127.0.0.1:9999 command), my gdb segfaults (how funny!).

I'm running Xen 3.4.3, and a fairly recent pvops0 kernel in DomU. In
Dom0 I run a 2.6.34-xenlinux kernel (openSUSE patches), but I doubt it
is relevant in any way.

This seems like a scheduling problem, and, because it seems to affect
Core i5 processors, but not Core 2 Duos, it might have something to do
with Hyperthreading perhaps?

joanna.



_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel

Joanna Rutkowska

2010-Jul-05 21:28 UTC

[Xen-devel] Re: DomU lockups after resume from S3 on Core i5 processors

On 07/05/10 12:38, Joanna Rutkowska wrote:
> I tried to attach to such a hung DomU using gdbserver-xen, but when I
> subsequently try to attach to the server from gdb (via the target
> remote 127.0.0.1:9999 command), my gdb segfaults (how funny!).
> [...]
> This seems like a scheduling problem, and, because it seems to affect
> Core i5 processors, but not Core 2 Duos, it might have something to do
> with Hyperthreading perhaps?

OK, I finally got gdbserver working. This is the backtrace I get when
attaching to a locked-up DomU after resume:

#0  0xffffffff810093aa in ?? ()
#1  0xffffffff8168be18 in ?? ()
#2  0xffff880003a21600 in ?? ()
#3  0xffffffff8100ee63 in HYPERVISOR_sched_op ()
    at /usr/src/debug/kernel-2.6.32/linux-2.6.32.x86_64/arch/x86/include/asm/xen/hypercall.h:292
#4  xen_safe_halt () at arch/x86/xen/irq.c:104
#5  0xffffffff8100c33e in raw_safe_halt ()
    at /usr/src/debug/kernel-2.6.32/linux-2.6.32.x86_64/arch/x86/include/asm/paravirt.h:110
#6  xen_idle () at arch/x86/xen/setup.c:193
#7  0xffffffff81011cdd in cpu_idle () at arch/x86/kernel/process_64.c:143
#8  0xffffffff8144b997 in rest_init () at init/main.c:445
#9  0xffffffff81824ddc in start_kernel () at init/main.c:695
#10 0xffffffff818242c1 in x86_64_start_reservations (real_mode_data=<value optimized out>)
    at arch/x86/kernel/head64.c:123
#11 0xffffffff81828160 in xen_start_kernel () at arch/x86/xen/enlighten.c:1300
#12 0xffffffff838f3000 in ?? ()
#13 0xffffffff838f4000 in ?? ()
#14 0xffffffff838f5000 in ?? ()

Any ideas?

joanna.
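
Note on reading the backtrace: the resolved frames (cpu_idle, xen_idle,
raw_safe_halt/xen_safe_halt, HYPERVISOR_sched_op) are the kernel idle
path, which suggests this vCPU is blocked waiting for an event rather
than spinning. When comparing several vCPUs, a small helper can make
that check mechanical. The following is only a sketch in Python: the
function names in IDLE_FUNCS are taken from the trace above, while
frame_functions() and is_idle() are hypothetical helper names, not part
of gdb or Xen.

```python
import re

# Matches gdb frame lines such as
#   "#4  xen_safe_halt () at arch/x86/xen/irq.c:104"
#   "#3  0xffffffff8100ee63 in HYPERVISOR_sched_op ()"
# The address part is optional; unresolved frames show "??" as the name.
FRAME_RE = re.compile(r"^#(\d+)\s+(?:0x[0-9a-fA-F]+\s+in\s+)?(\S+)\s*\(")

# Idle-path functions as they appear in the backtrace above.
IDLE_FUNCS = {"xen_safe_halt", "xen_idle", "cpu_idle"}


def frame_functions(backtrace: str) -> list[str]:
    """Return the resolved function names, innermost frame first."""
    funcs = []
    for line in backtrace.splitlines():
        match = FRAME_RE.match(line.strip())
        if match and match.group(2) != "??":
            funcs.append(match.group(2))
    return funcs


def is_idle(backtrace: str) -> bool:
    """True if any resolved frame belongs to the idle path."""
    return bool(IDLE_FUNCS & set(frame_functions(backtrace)))
```

Continuation lines of a frame (the ones starting with "at") are simply
ignored, so the text can be pasted as gdb prints it.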




Joanna Rutkowska

2010-Jul-05 22:07 UTC

Re: [Xen-devel] Re: DomU lockups after resume from S3 on Core i5 processors

On 07/05/10 23:28, Joanna Rutkowska wrote:
> OK, I finally got gdbserver working. This is the backtrace I get when
> attaching to a locked-up DomU after resume:
> [...]
> Any ideas?

... and when I disabled Hyperthreading in the BIOS, the problem seems to
be gone. Obviously this is not a desired solution...

joanna.




Jeremy Fitzhardinge

2010-Jul-05 22:43 UTC

Re: [Xen-devel] Re: DomU lockups after resume from S3 on Core i5 processors

On 07/05/2010 03:07 PM, Joanna Rutkowska wrote:
> ... and when I disabled Hyperthreading in the BIOS, the problem seems to
> be gone. Obviously this is not a desired solution...

HT has historically been very good at flushing out race conditions which
would normally be tricky to hit on SMP systems.  I assume your domain is
single CPU?  Do you know what's going on in it that it might be waiting
for?  Is it no longer getting timer events or something?  Does the Xen
'q' debug-key make it do anything?

    J


Joanna Rutkowska

2010-Jul-05 22:52 UTC

Re: [Xen-devel] Re: DomU lockups after resume from S3 on Core i5 processors

On 07/06/10 00:43, Jeremy Fitzhardinge wrote:
> HT has historically been very good at flushing out race conditions which
> would normally be tricky to hit on SMP systems.  I assume your domain is
> single CPU?

Actually, no. It used to be, but then I thought that might be the
issue and assigned 2 vCPUs to it, but it was still locking up.

> Do you know what's going on in it that it might be waiting
> for?

No idea. My guess is that it is a different kernel subsystem each
time -- e.g. when I'm lucky and the apps get only "partially" locked
up, I can open new tabs in Google Chrome and see some thumbnails of my
popular websites, but without their contents. This would suggest the
networking subsystem is dead, but at the same time Chrome is apparently
communicating fine with the X server in the DomU (which in turn talks
fine with Dom0 over Xen shared memory/event channels).

I also experienced the above behavior when I had only one vCPU per DomU.

> Is it no longer getting timer events or something?  Does the Xen
> 'q' debug-key make it do anything?

Ah, that's some secret option I've never heard of... Is it in gdb,
when used with gdbserver-xen?

joanna.




Jeremy Fitzhardinge

2010-Jul-05 23:17 UTC

Re: [Xen-devel] Re: DomU lockups after resume from S3 on Core i5 processors

On 07/05/2010 03:52 PM, Joanna Rutkowska wrote:
> On 07/06/10 00:43, Jeremy Fitzhardinge wrote:
>> HT has historically been very good at flushing out race conditions
>> which would normally be tricky to hit on SMP systems.  I assume your
>> domain is single CPU?
>
> Actually, no. It used to be, but then I thought that might be the
> issue and assigned 2 vCPUs to it, but it was still locking up.

Does the other CPU have the same backtrace into idle?

>> Do you know what's going on in it that it might be waiting
>> for?
>
> No idea. My guess is that it is a different kernel subsystem each
> time -- e.g. when I'm lucky and the apps get only "partially" locked
> up, I can open new tabs in Google Chrome and see some thumbnails of my
> popular websites, but without their contents. This would suggest the
> networking subsystem is dead, but at the same time Chrome is apparently
> communicating fine with the X server in the DomU (which in turn talks
> fine with Dom0 over Xen shared memory/event channels).
>
> I also experienced the above behavior when I had only one vCPU per DomU.

I've seen similar things with just normal domain save/restore, where the
timer interrupt seems to be failing.  Can you ssh into the domain?  I
found that I couldn't do an interactive ssh (hung at the prompt), but a
non-interactive command would work, so I could cat /proc/interrupts.

This was on my non-HT i7 box, and it affected both pvops domUs and
CentOS 5 ones.
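
Note: one way to act on the /proc/interrupts suggestion is to take two
snapshots a few seconds apart and see which counters moved. The sketch
below, in Python, assumes the usual layout of that file (a header row
of CPU names, then one row per interrupt with a count per CPU followed
by a description). parse_interrupts() and deltas() are hypothetical
helper names.

```python
def parse_interrupts(text: str) -> dict[str, int]:
    """Map 'label description' to the interrupt count summed over all CPUs."""
    lines = text.splitlines()
    if not lines:
        return {}
    ncpus = len(lines[0].split())  # header row: CPU0 CPU1 ...
    totals = {}
    for line in lines[1:]:
        if not line.strip():
            continue
        label, _, rest = line.partition(":")
        fields = rest.split()
        # Leading numeric fields are per-CPU counts; the rest is the description.
        counts = []
        for field in fields[:ncpus]:
            if not field.isdigit():
                break
            counts.append(int(field))
        name = " ".join(fields[len(counts):])
        totals[f"{label.strip()} {name}".strip()] = sum(counts)
    return totals


def deltas(before: dict[str, int], after: dict[str, int]) -> dict[str, int]:
    """How far each counter advanced between the two snapshots."""
    return {key: after.get(key, 0) - count for key, count in before.items()}
```

A timer line whose delta stays at 0 while other sources keep advancing
would be consistent with the failing timer interrupt described above.
Many interrupt sources are legitimately quiet, so only the timer rows
are of interest here.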

>> Is it no longer getting timer events or something?  Does the Xen
>> 'q' debug-key make it do anything?
>
> Ah, that's some secret option I've never heard of... Is it in gdb,
> when used with gdbserver-xen?

No, on the Xen console: type ^A^A^A to switch input to Xen, then press q
(h gets a list of other magic keys).  ^A^A^A switches the console back
to dom0.  You can also trigger it with "xm debug-key q" and look at
"xm dmesg" to see the results if you can't get to the Xen console.

    J
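
Note: since "xm dmesg" prints the whole hypervisor log, it can help to
capture only what the debug key added. A minimal sketch in Python,
assuming it runs in dom0 with the xm tool on the PATH; the only commands
it issues are the two named in the message above, and new_output() and
dump_debug_key() are hypothetical helper names.

```python
import subprocess


def new_output(before: str, after: str) -> str:
    """Return the part of 'after' that was appended since 'before'."""
    # If the log buffer wrapped, the old text is no longer a prefix,
    # so fall back to returning everything.
    return after[len(before):] if after.startswith(before) else after


def dump_debug_key(key: str = "q", run=subprocess.run) -> str:
    """Send a Xen debug key and return what it added to 'xm dmesg'."""
    if len(key) != 1:
        raise ValueError("a debug key is a single character")

    def xm(*args: str) -> str:
        result = run(["xm", *args], capture_output=True, text=True, check=True)
        return result.stdout

    before = xm("dmesg")
    xm("debug-key", key)
    return new_output(before, xm("dmesg"))
```

The run parameter exists only so that the function can be exercised
without a Xen host by passing a stand-in for subprocess.run.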


Jan Beulich

2010-Jul-06 08:41 UTC

Re: [Xen-devel] Re: DomU lockups after resume from S3 on Core i5 processors

>>> On 06.07.10 at 01:17, Jeremy Fitzhardinge <jeremy@goop.org> wrote:
> On 07/05/2010 03:52 PM, Joanna Rutkowska wrote:
>> On 07/06/10 00:43, Jeremy Fitzhardinge wrote:
>>> Do you know what's going on in it that it might be waiting
>>> for?
>>
>> No idea. My guess is that it is a different kernel subsystem each
>> time -- e.g. when I'm lucky and the apps get only "partially" locked
>> up, I can open new tabs in Google Chrome and see some thumbnails of my
>> popular websites, but without their contents. This would suggest the
>> networking subsystem is dead, but at the same time Chrome is apparently
>> communicating fine with the X server in the DomU (which in turn talks
>> fine with Dom0 over Xen shared memory/event channels).
>>
>> I also experienced the above behavior when I had only one vCPU per DomU.
>
> I've seen similar things with just normal domain save/restore, where the
> timer interrupt seems to be failing.  Can you ssh into the domain?  I
> found that I couldn't do an interactive ssh (hung at the prompt), but a
> non-interactive command would work, so I could cat /proc/interrupts.

Did either of you try disabling the setting of sched_clock_stable in
arch/x86/kernel/cpu/intel.c:early_init_intel()? I found this to be a
requirement in our pv kernels (though in connection with the use of
C-states, not with S3).

Jan



Joanna Rutkowska

2010-Jul-06 08:59 UTC

Re: [Xen-devel] Re: DomU lockups after resume from S3 on Core i5 processors

On 07/06/10 10:41, Jan Beulich wrote:
> Did either of you try disabling the setting of sched_clock_stable in
> arch/x86/kernel/cpu/intel.c:early_init_intel()? I found this to be a
> requirement in our pv kernels (though in connection with the use of
> C-states, not with S3).

Before I try it -- can you explain what would be the theory behind it,
specifically how this would be related to HT? Clearly it is an HT
problem, and intuitively, I would expect this to be a Xen-side problem,
rather than DomU-side?

joanna.




Jan Beulich

2010-Jul-06 09:57 UTC

Re: [Xen-devel] Re: DomU lockups after resume from S3 on Core i5 processors

>>> On 06.07.10 at 10:59, Joanna Rutkowska <joanna@invisiblethingslab.com> wrote:
> On 07/06/10 10:41, Jan Beulich wrote:
>> Did either of you try disabling the setting of sched_clock_stable in
>> arch/x86/kernel/cpu/intel.c:early_init_intel()? I found this to be a
>> requirement in our pv kernels (though in connection with the use of
>> C-states, not with S3).
>
> Before I try it -- can you explain what would be the theory behind it,
> specifically how this would be related to HT? Clearly it is an HT
> problem, and intuitively, I would expect this to be a Xen-side problem,
> rather than DomU-side?

The HT connection is only a vague one, as Jeremy also hinted at in
his reply. The issue is that with sched_clock_stable set,
sched_clock_cpu() (and cpu_clock()) get short-circuited to
sched_clock(), and hence become susceptible to any non-monotonic
behavior of that function.

In any case, it's only a wild guess, based on the partially-hung
behavior you observed matching what I saw (on new Intel CPUs only)
before addressing this issue.

Jan
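
Note: "non-monotonic" here means that a later clock reading can be
smaller than an earlier one, so any code computing "now - last" may see
a negative interval. The Python sketch below only illustrates that
property on a list of readings; it is not the kernel's logic, and
backward_steps() and sample() are hypothetical helper names.

```python
import time


def backward_steps(samples: list[int]) -> list[tuple[int, int, int]]:
    """Return (index, previous, current) for each reading below its predecessor."""
    return [
        (index, prev, cur)
        for index, (prev, cur) in enumerate(zip(samples, samples[1:]), start=1)
        if cur < prev
    ]


def sample(read=time.monotonic_ns, count: int = 1000) -> list[int]:
    """Collect 'count' consecutive readings from the given clock function."""
    return [read() for _ in range(count)]
```

For example, backward_steps([100, 200, 150, 300]) reports the single
step from 200 back to 150, where an elapsed-time calculation would
come out as -50.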



Joanna Rutkowska

2010-Jul-08 14:04 UTC

Re: [Xen-devel] Re: DomU lockups after resume from S3 on Core i5 processors

On 07/06/10 00:07, Joanna Rutkowska wrote:
> ... and when I disabled Hyperthreading in the BIOS, the problem seems to
> be gone. Obviously this is not a desired solution...

I've added a simple hook to pm-utils, so that it does xm pause for all
the running DomUs just before suspend, and later, just after resume, it
does xm unpause for all paused DomUs. The problem seems to be gone now,
after a dozen or more suspend/resume cycles.

The actual pm-utils script can be seen here:

https://qubes-os.org/gitweb/?p=joanna/core.git;a=blob;f=dom0/pm-utils/02qubes-pause-vms;h=5da1be84a86c2e3a95548e52e4672e988d6779a8;hb=c8ef500588452d39b4b41e9f38066c22c6b832ad

It uses the Qubes-specific qvm-run command, but I guess it would be easy
to implement the same functionality in the xm command, e.g.:

xm pause all_running

and

xm unpause all_paused

joanna.
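
Note: without the Qubes tools, the same workaround can be approximated
with plain xm commands. The Python sketch below is not the linked
script. It assumes "xm list" prints a header row followed by the columns
Name, ID, Mem, VCPUs, State, Time(s), that a "p" in the State flags
marks a paused domain, and that the hook receives "suspend" or "resume"
as its first argument. All function names are hypothetical.

```python
import subprocess


def parse_xm_list(text: str) -> dict[str, str]:
    """Map domain name to its state flags (e.g. 'r-----', '--p---')."""
    domains = {}
    for line in text.splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) >= 5:
            domains[fields[0]] = fields[4]
    return domains


def domus_to_pause(domains: dict[str, str]) -> list[str]:
    """DomUs that are not paused yet; Domain-0 is never touched."""
    return sorted(n for n, s in domains.items() if n != "Domain-0" and "p" not in s)


def domus_to_unpause(domains: dict[str, str]) -> list[str]:
    """DomUs that are currently paused."""
    return sorted(n for n, s in domains.items() if n != "Domain-0" and "p" in s)


def main(action: str, run=subprocess.run) -> None:
    """Pause DomUs before suspend and unpause them after resume."""
    if action not in ("suspend", "resume"):
        return
    listing = run(["xm", "list"], capture_output=True, text=True, check=True)
    domains = parse_xm_list(listing.stdout)
    if action == "suspend":
        for name in domus_to_pause(domains):
            run(["xm", "pause", name], check=True)
    else:
        for name in domus_to_unpause(domains):
            run(["xm", "unpause", name], check=True)
```

A hook script would call main(sys.argv[1]). As in the description
above, resume unpauses every paused DomU, including any that were
paused by hand before the suspend; remembering the list of domains
paused by the hook would avoid that.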



