Graham, Simon
2007-Feb-14 03:42 UTC
[Xen-devel] DomU crash during migration when suspending source domain
Just ran into an odd DomU crash doing live migration of a 4-VCPU domain (with 3.0.4, but the code looks the same in 2.6.18/unstable to me). The actual panic is attached at the end of this, but the bottom line is that the code in cache_remove_shared_cpu_map (in arch/i386/kernel/cpu/intel_cacheinfo.c) is attempting to clean up the cache info for a processor that does not yet have this info set up -- the code is dereferencing a pointer in the cpuid4_info[] array, and looking at the dump I can see that this is NULL.

My working theory here is that we attempted the migration way early and the array of cache info pointers had not yet been set up for all processors; it would be relatively easy to protect against this by checking for NULL, but I'm not sure whether this is the correct solution -- if anyone is familiar with this code and can comment on an appropriate fix, I'd be grateful.

One thing I'm really not sure about is the timing of marking the CPUs online relative to the 'Initializing CPU#n' trace (see console output below) -- I can see that all four VCPUs are set up in the cpu_sys_devices array (which is populated by the code that prints the 'Initializing CPU#n' trace), but the array of cache info structures only has an entry for VCPU 0:

crash> cpu_sys_devices
cpu_sys_devices = $3
 {0xc0464448, 0xc046448c, 0xc04644d0, 0xc0464514, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}
crash> cpuid4_info
cpuid4_info = $4
 {0xc7971180, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}

Any suggestions for appropriate fixes here?

Simon

--- console output ---

Enabling SMP...
Initializing CPU#3
Initializing CPU#2
Initializing CPU#1
eth0: no IPv6 routers present
Unable to handle kernel NULL pointer dereference at virtual address 00000010
 printing eip:
c010dd3a
0204a000 -> *pde = 00000001:0d8ec001
06a9c000 -> *pme = 00000000:00000000
Oops: 0000 [#1]
SMP
Modules linked in: ipv6 parport_pc lp parport autofs4 i2c_dev i2c_core binfmt_misc dm_mirror dm_mod bnx2 ata_piix libata mptscsih mptfc mptspi mptsas mptscsi scsi_mod mptbase
CPU:    0
EIP:    0061:[<c010dd3a>]    Tainted: GF     VLI
EFLAGS: 00010202   (2.6.16.29-xen #1)
EIP is at cache_remove_shared_cpu_map+0x1a/0x90
eax: 00000000   ebx: 00000001   ecx: 00000001   edx: 00000000
esi: 00000000   edi: 00000010   ebp: c3913f14   esp: c3913f08
ds: 007b   es: 007b   ss: 0069
Process suspend (pid: 4038, threadinfo=c3912000 task=c2244570)
Stack: <0>00000001 00000001 00000000 c3913f28 c010e3ba 00000007 00000001 00000007
       c3913f34 c010e425 c03bd804 c3913f48 c012fae8 ffffffea 00000001 c568c570
       c3913f7c c013b889 c3913fc0 00000002 00000001 00000000 00000003 00000000
Call Trace:
 [<c0105401>] show_stack_log_lvl+0xa1/0xe0
 [<c01055f1>] show_registers+0x181/0x200
 [<c0105810>] die+0x100/0x1a0
 [<c01156f6>] do_page_fault+0x3c6/0x8b1
 [<c0105067>] error_code+0x2b/0x30
 [<c010e3ba>] cache_remove_dev+0x2a/0x60
 [<c010e425>] cacheinfo_cpu_callback+0x35/0x40
 [<c012fae8>] notifier_call_chain+0x18/0x40
 [<c013b889>] cpu_down+0x139/0x260
 [<c028bc9f>] smp_suspend+0x7f/0x100
 [<c028ca80>] __do_suspend+0x40/0x180
 [<c0136a06>] kthread+0x96/0xe0
 [<c0102e95>] kernel_thread_helper+0x5/0x10
Code: 0c 5b 5e 5f 5d c3 8d 74 26 00 8d bc 27 00 00 00 00 55 89 e5 57 56 89 d6 53 89 c3 8d 04 92 8b 14 9d 20 4d 46 c0 8d 04 82 8d 78 10 <8b> 40 10 ba 20 00 00 00 85 c0 74 03 0f bc d0 83 fa 21 b9 20 00

-and-

crash> bt
PID: 4038   TASK: c2244570   CPU: 0   COMMAND: "suspend"
 #0 [c3913ddc] xen_panic_event at c010a527
 #1 [c3913df8] notifier_call_chain at c012fae6
 #2 [c3913e0c] panic at c0120b16
 #3 [c3913e20] die at c0105866
 #4 [c3913e6c] do_page_fault at c01156f1
 #5 [c3913ed0] error_code (via page_fault) at c0105065
    EAX: 00000000  EBX: 00000001  ECX: 00000001  EDX: 00000000  EBP: c3913f14
    DS:  007b      ESI: 00000000  ES:  007b      EDI: 00000010
    CS:  0061      EIP: c010dd3a  ERR: ffffffff  EFLAGS: 00010202
 #6 [c3913f04] cache_remove_shared_cpu_map at c010dd3a
 #7 [c3913f18] cache_remove_dev at c010e3b5
 #8 [c3913f2c] cacheinfo_cpu_callback at c010e420
 #9 [c3913f38] notifier_call_chain at c012fae6
#10 [c3913f4c] cpu_down at c013b884
#11 [c3913f80] smp_suspend at c028bc9a
#12 [c3913f98] __do_suspend at c028ca7b
#13 [c3913fc4] kthread at c0136a03
#14 [c3913fe8] kernel_thread_helper at c0102e93
crash>
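For context, a minimal sketch of the kind of NULL check Simon proposes, assuming the 2.6.16-era layout of arch/i386/kernel/cpu/intel_cacheinfo.c; the identifiers (cpuid4_info, CPUID4_INFO_IDX, the cpumask helpers) are recalled from that era and may not match the tree in question exactly. This is not the actual patch.

/*
 * Sketch only, not the real fix.  Assumes cpuid4_info[cpu] stays NULL
 * until cache_add_dev() has run (and its allocation succeeded) for that
 * CPU -- the NULL that the oops above ends up dereferencing.
 */
static void cache_remove_shared_cpu_map(unsigned int cpu, int index)
{
	struct _cpuid4_info *this_leaf, *sibling_leaf;
	int sibling;

	if (cpuid4_info[cpu] == NULL)
		return;			/* cache info never set up for this CPU */

	this_leaf = CPUID4_INFO_IDX(cpu, index);
	for_each_cpu_mask(sibling, this_leaf->shared_cpu_map) {
		if (cpuid4_info[sibling] == NULL)
			continue;	/* sibling never initialised either */
		sibling_leaf = CPUID4_INFO_IDX(sibling, index);
		cpu_clear(cpu, sibling_leaf->shared_cpu_map);
	}
}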
Keir Fraser
2007-Feb-14 10:13 UTC
Re: [Xen-devel] DomU crash during migration when suspending source domain
Is this with a 2.6.16 guest from 3.0.4? This would most likely be a CPU hotplug issue in Linux, but we did do lots of testing of that...

 -- Keir

On 14/2/07 03:42, "Graham, Simon" <Simon.Graham@stratus.com> wrote:

> Just ran into an odd DomU crash doing live migration of a 4-VCPU domain
> (with 3.0.4, but the code looks the same in 2.6.18/unstable to me). The
> actual panic is attached at the end of this, but the bottom line is that
> the code in cache_remove_shared_cpu_map (in
> arch/i386/kernel/cpu/intel_cacheinfo.c) is attempting to clean up the
> cache info for a processor that does not yet have this info set up --
> the code is dereferencing a pointer in the cpuid4_info[] array, and
> looking at the dump I can see that this is NULL.
> [...]
Keir Fraser
2007-Feb-14 10:36 UTC
Re: [Xen-devel] DomU crash during migration when suspending source domain
Your theory that the cpu_down() is happening too early sounds plausible, except that cpu_up()/cpu_down() are both entirely protected by the hotplug lock. See their definitions in kernel/cpu.c.

The notifier calls of interest are CPU_ONLINE and CPU_DEAD. These are the events that the cacheinfo code cares about. You can see that both notifications are broadcast under the cpu_hotplug_lock, so there should be no race possible in which a CPU starts to be taken down before all notification work associated with it coming online has completed.

 -- Keir

On 14/2/07 10:13, "Keir Fraser" <Keir.Fraser@cl.cam.ac.uk> wrote:

> Is this with a 2.6.16 guest from 3.0.4? This would most likely be a CPU
> hotplug issue in Linux, but we did do lots of testing of that...
> [...]
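As a rough illustration of the notifier pattern Keir refers to, the cacheinfo callback in the 2.6.16-era file looks something like the sketch below (reconstructed from memory, so names and details may differ slightly). Both branches run under the hotplug lock, which is why the online and offline paths for a given CPU should not be able to overlap.

/* Sketch of the 2.6.16-era hotplug callback; not copied from the tree
 * under discussion. */
static int cacheinfo_cpu_callback(struct notifier_block *nfb,
				  unsigned long action, void *hcpu)
{
	unsigned int cpu = (unsigned long)hcpu;
	struct sys_device *sys_dev = get_cpu_sysdev(cpu);

	switch (action) {
	case CPU_ONLINE:
		cache_add_dev(sys_dev);		/* allocates cpuid4_info[cpu] */
		break;
	case CPU_DEAD:
		cache_remove_dev(sys_dev);	/* tears it down again */
		break;
	}
	return NOTIFY_OK;
}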
Keir Fraser
2007-Feb-14 10:48 UTC
Re: [Xen-devel] DomU crash during migration when suspending source domain
Are you migrating between unlike boxes? My guess is that the original box has processors supporting the cacheinfo CPUID leaves and the target box does not. Migrating to older, less capable CPUs is definitely hit-and-miss, I'm afraid. It really is best not to do it!

 -- Keir

On 14/2/07 10:36, "Keir Fraser" <Keir.Fraser@cl.cam.ac.uk> wrote:

> Your theory that the cpu_down() is happening too early sounds plausible,
> except that cpu_up()/cpu_down() are both entirely protected by the
> hotplug lock. See their definitions in kernel/cpu.c.
> [...]
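By way of illustration only (this is not from the thread, and is user-space code rather than kernel code): whether a processor exposes the cacheinfo leaves Keir mentions can be checked by walking CPUID leaf 4 until the cache-type field in EAX[4:0] comes back zero. On CPUs whose maximum basic CPUID level is below 4, the kernel never allocates cpuid4_info entries at all.

/* Stand-alone user-space sketch: count the cache description leaves a
 * CPU reports via CPUID leaf 4.  Uses GCC's <cpuid.h> helpers. */
#include <stdio.h>
#include <cpuid.h>

int main(void)
{
	unsigned int eax, ebx, ecx, edx;
	int leaves = 0;

	if (__get_cpuid_max(0, NULL) < 4) {
		printf("CPUID leaf 4 not supported\n");
		return 0;
	}

	/* Subleaf n of leaf 4 describes one cache; type 0 terminates the list. */
	for (;;) {
		__cpuid_count(4, leaves, eax, ebx, ecx, edx);
		if ((eax & 0x1f) == 0)
			break;
		leaves++;
	}
	printf("CPU reports %d cache leaves\n", leaves);
	return 0;
}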
Graham, Simon
2007-Feb-14 13:02 UTC
RE: [Xen-devel] DomU crash during migration when suspending source domain
> Is this with a 2.6.16 guest from 3.0.4? This would most likely be a CPU
> hotplug issue in Linux, but we did do lots of testing of that...
>
> -- Keir

Yes -- in fact, we're using the 3.0.4 kernel for both Dom0 and DomU.

I've some more info from failures last night:

1. I was wrong; it isn't a way-early migration, it's a second migration --
   it seems that the first migration leaves the system in a bad way.

2. It is only happening on specific machines -- interestingly enough, in
   every case we migrate between dual-core Xeon and quad-core E5310
   systems; these systems have different cache hierarchies, I think.
   Maybe the CPU hotplug code can't handle a change in the processor type?

Simon
Graham, Simon
2007-Feb-15 02:47 UTC
RE: [Xen-devel] DomU crash during migration when suspending source domain
> A broader set of fixes might get accepted upstream because
> cache_add_dev() can fail for other reasons too (at least out-of-memory)
> and any such failure will cause cache_remove_dev() to barf. But it's not
> such a simple thing to fix and it does not solve the general problem for
> us.

OK -- after some work today, I have a patch that would perhaps be acceptable upstream. I started off looking into ways of ensuring the information on the cache got reset properly after migration, but quickly decided that this was not going to be at all easy (plus it wouldn't address the larger issues of processor compatibility), so instead I concentrated on handling all the cases where pointers might be NULL because of an allocation failure (which also happens to fix the crash we see here).

Attached is a patch to intel_cacheinfo-xen.c in unstable -- I basically found everywhere that you might have a NULL due to an allocation failure and tested for it. I've run this a little and confirmed that it solves the original crash migrating back and forth between the systems with different processor families. I will run more extensive regressions on it tonight and tomorrow, but I thought I'd send it now for review; I will resend with a Signed-off-by and comment if you think it is OK.

Simon
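The patch itself was sent as an attachment and is not reproduced in the archive. To make the shape of the change concrete, here is a hypothetical sketch of the same defensive style applied to the teardown path, with identifiers again recalled from the 2.6.16-era intel_cacheinfo.c rather than taken from the actual patch.

/* Sketch only: bail out of teardown when the corresponding setup never
 * completed, rather than assuming every allocation succeeded. */
static void cache_remove_dev(struct sys_device *sys_dev)
{
	unsigned int cpu = sys_dev->id;
	unsigned long i;

	if (cpuid4_info[cpu] == NULL)
		return;		/* cache_add_dev() never ran, or its kmalloc() failed */

	for (i = 0; i < num_cache_leaves; i++)
		cache_remove_shared_cpu_map(cpu, i);

	/* sysfs/kobject teardown omitted for brevity */
	free_cache_attributes(cpu);
}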
Keir Fraser
2007-Feb-15 08:12 UTC
Re: [Xen-devel] DomU crash during migration when suspending source domain
On 15/2/07 02:47, "Graham, Simon" <Simon.Graham@stratus.com> wrote:

> Attached is a patch to intel_cacheinfo-xen.c in unstable -- I basically
> found everywhere that you might have a NULL due to an allocation failure
> and tested for it. I've run this a little and confirmed that it solves
> the original crash migrating back and forth between the systems with
> different processor families. I will run more extensive regressions on
> it tonight and tomorrow, but I thought I'd send it now for review; I
> will resend with a Signed-off-by and comment if you think it is OK.

The question is whether it has a chance of being accepted upstream. Since it's rather bigger than the one-liner we originally hoped for, we don't want to carry it if it has no chance of acceptance.

 -- Keir
Graham, Simon
2007-Feb-15 12:45 UTC
RE: [Xen-devel] DomU crash during migration when suspending source domain
> The question is whether it has a chance of being accepted upstream.
> Since it's rather bigger than the one-liner we originally hoped for, we
> don't want to carry it if it has no chance of acceptance.

You have a better idea of that than I do -- however, it seemed to me that a complete fix for the cases where allocation failed would be more likely to be accepted than something that only fixes one place.

Another thing to consider -- hitting allocation failures is MUCH more likely in the Xen environment when live migration is in use; how often is a processor taken down and then restarted outside of live migration? However, since live migration can occur when the domain has been up for a long time, it's much more likely for the allocation to fail (although presumably still somewhat low)...

Excuse my ignorance, but what's the process for pushing something like this upstream?

/simgr
Keir Fraser
2007-Feb-15 13:22 UTC
Re: [Xen-devel] DomU crash during migration when suspending source domain
On 15/2/07 12:45, "Graham, Simon" <Simon.Graham@stratus.com> wrote:

> Excuse my ignorance, but what's the process for pushing something like
> this upstream?

It can be a bit of a pain: you need to port it to at least the latest git-snapshot patch from kernel.org. Quite often you're lucky and the surrounding code hasn't changed that much. Then you post to the linux-kernel mailing list, perhaps cc'ing some likely interested parties (a couple of the email addresses from the top of the file you've modified would make sense). Then sit back and hope for some responses, or for the patch to be picked up. Make sure your email subject starts with [PATCH], or you'll likely be ignored.

 -- Keir
Jeremy Fitzhardinge
2007-Feb-15 19:52 UTC
Re: [Xen-devel] DomU crash during migration when suspending source domain
Graham, Simon wrote:
> You have a better idea of that than I do -- however, it seemed to me
> that a complete fix for the cases where allocation failed would be more
> likely to be accepted than something that only fixes one place.
> [...]
> Excuse my ignorance, but what's the process for pushing something like
> this upstream?

In addition to all the things Keir mentioned, the other important thing is to make sure that your comments at the top of the patch explain 1) what the problem you're trying to solve is, and 2) how this patch solves it. And a good, concise patch subject line helps as well. See http://www.zip.com.au/~akpm/linux/patches/stuff/tpp.txt for all the details.

It seems to me that this patch should be pretty easy to get accepted, since it is local, a pretty obvious bug, and not Xen-specific. But it may need some stylistic attention: for one, the kernel style tries to avoid nesting things in deep if() statements, and an early return or a goto to a common exit point might be more appropriate.

Andi Kleen <ak@muc.de> is probably the best person to send it to (as well as cc'ing it to linux-kernel).

J
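A generic illustration of the style point Jeremy raises, using made-up names that have nothing to do with the actual patch: the same validity check written first with deep nesting and then with an early return.

#include <errno.h>

struct thing { char *buf; int len; };			/* made-up example type */
int do_work(struct thing *t) { return t->len; }		/* placeholder */

/* Deeply nested version -- the style kernel reviewers tend to dislike. */
int frob_nested(struct thing *t)
{
	int ret = -EINVAL;

	if (t) {
		if (t->buf) {
			if (t->len > 0)
				ret = do_work(t);
		}
	}
	return ret;
}

/* Same logic with an early return: the main path stays unindented. */
int frob_early(struct thing *t)
{
	if (!t || !t->buf || t->len <= 0)
		return -EINVAL;

	return do_work(t);
}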