Hi,

I have a recurring crash using Xen 4.3.1-RC2 and Ubuntu 12.04 as Dom0 (3.2.0-55-generic). I have software RAID 5 with LVMs. The DomU (also Ubuntu 12.04, 3.2.0-55 kernel) has a dedicated logical volume, which is backed up by shutting down the DomU, creating an LVM snapshot, restarting the DomU, and then dd'ing the snapshot to another logical volume. The snapshot is then removed and the second LV is dd'ed through gzip and onto DAT tape.

I currently have this running every hour (unless it's already running) for testing purposes. After 6-12 runs of this, the Dom0 kernel crashes with the output below.

When I perform the same procedure booted into the same kernel standalone, the problem does not occur.

Can anyone please suggest what I am doing wrong, or identify whether it is a bug?

Thanks in advance,

Ian.

[24149.786053] general protection fault: 0000 [#1] SMP
[24149.786070] CPU 0
[24149.786073] Modules linked in: dm_snapshot xt_physdev iptable_filter ip_tables x_tables xen_pciback xen_netback xen_blkback xen_gntalloc xen_gntdev xen_evtchn xenfs bridge stp ppdev dm_multipath snd_hda_codec_realtek nouveau ttm drm_kms_helper drm i2c_algo_bit edac_core mxm_wmi video k8temp edac_mce_amd serio_raw osst st snd_hda_intel snd_hda_codec snd_hwdep snd_pcm snd_timer snd soundcore snd_page_alloc i2c_nforce2 parport_pc wmi mac_hid lp parport pata_jmicron raid10 raid456 async_pq async_xor xor async_memcpy async_raid6_recov aic7xxx forcedeth pata_amd raid6_pq async_tx raid1 raid0 multipath linear
[24149.786164]
[24149.786169] Pid: 0, comm: swapper/0 Not tainted 3.2.0-55-generic #85-Ubuntu To Be Filled By O.E.M. To Be Filled By O.E.M./939N68PV-GLAN
[24149.786181] RIP: e030:[<ffffffff8142655d>] [<ffffffff8142655d>] scsi_dispatch_cmd+0x6d/0x2e0
[24149.786197] RSP: e02b:ffff88001fc03c80 EFLAGS: 00010206
[24149.786202] RAX: 0000000020000000 RBX: ffff880018478000 RCX: ffff8800184c1f38
[24149.786208] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff880018478000
[24149.786215] RBP: ffff88001fc03cb0 R08: 0000000000000001 R09: 0000000000000000
[24149.786221] R10: 0000000000000028 R11: 0000000000000003 R12: ffff880003c22800
[24149.786227] R13: 0100000000000800 R14: ffff8800184c12b0 R15: ffff880018478000
[24149.786238] FS: 00007f02a854d7c0(0000) GS:ffff88001fc00000(0000) knlGS:0000000000000000
[24149.786245] CS: e033 DS: 0000 ES: 0000 CR0: 000000008005003b
[24149.786250] CR2: 00007f3c3b232000 CR3: 0000000012417000 CR4: 0000000000000660
[24149.786258] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[24149.786264] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[24149.786271] Process swapper/0 (pid: 0, threadinfo ffffffff81c00000, task ffffffff81c0d020)
[24149.786278] Stack:
[24149.786281]  ffff88001fc03ca0 ffff880003ead000 ffff880003c22800 ffff880003d2b368
[24149.786291]  ffff8800184c12b0 ffff880018478000 ffff88001fc03d10 ffffffff8142da62
[24149.786301]  ffff880003c22828 0000000000000000 ffff880003ead138 ffff880003ead048
[24149.786311] Call Trace:
[24149.786315]  <IRQ>
[24149.786323]  [<ffffffff8142da62>] scsi_request_fn+0x3a2/0x470
[24149.786333]  [<ffffffff812f1a28>] blk_run_queue+0x38/0x60
[24149.786339]  [<ffffffff8142c416>] scsi_run_queue+0xd6/0x1b0
[24149.786347]  [<ffffffff8142e822>] scsi_next_command+0x42/0x60
[24149.786354]  [<ffffffff8142ea52>] scsi_io_completion+0x1b2/0x630
[24149.786363]  [<ffffffff816611fe>] ? _raw_spin_unlock_irqrestore+0x1e/0x30
[24149.786371]  [<ffffffff81424b5c>] scsi_finish_command+0xcc/0x130
[24149.786378]  [<ffffffff8142e7ae>] scsi_softirq_done+0x13e/0x150
[24149.786386]  [<ffffffff812fb6b3>] blk_done_softirq+0x83/0xa0
[24149.786394]  [<ffffffff8106fa38>] __do_softirq+0xa8/0x210
[24149.786402]  [<ffffffff8166ba6c>] call_softirq+0x1c/0x30
[24149.786410]  [<ffffffff810162f5>] do_softirq+0x65/0xa0
[24149.786416]  [<ffffffff8106fe1e>] irq_exit+0x8e/0xb0
[24149.786428]  [<ffffffff813aecd5>] xen_evtchn_do_upcall+0x35/0x50
[24149.786436]  [<ffffffff8166babe>] xen_do_hypervisor_callback+0x1e/0x30
[24149.786441]  <EOI>
[24149.786449]  [<ffffffff810013aa>] ? hypercall_page+0x3aa/0x1000
[24149.786456]  [<ffffffff810013aa>] ? hypercall_page+0x3aa/0x1000
[24149.786464]  [<ffffffff8100a500>] ? xen_safe_halt+0x10/0x20
[24149.786472]  [<ffffffff8101c913>] ? default_idle+0x53/0x1d0
[24149.786478]  [<ffffffff81013236>] ? cpu_idle+0xd6/0x120
[24149.786485]  [<ffffffff8162792e>] ? rest_init+0x72/0x74
[24149.786494]  [<ffffffff81cfcc06>] ? start_kernel+0x3b5/0x3c2
[24149.786501]  [<ffffffff81cfc388>] ? x86_64_start_reservations+0x132/0x136
[24149.786509]  [<ffffffff81cffde8>] ? xen_start_kernel+0x48b/0x492
[24149.786514] Code: 00 00 0f b6 90 a1 00 00 00 84 d2 74 1e 80 fa 03 7f 19 48 8b 4f 50 8b 80 84 00 00 00 0f b6 51 01 c1 e0 05 83 e2 1f 09 d0 88 41 01 <41> 8b b5 e0 00 00 00 4d 8b a5 e8 00 00 00 85 f6 74 17 48 8b 05
[24149.786588] RIP  [<ffffffff8142655d>] scsi_dispatch_cmd+0x6d/0x2e0
[24149.786596]  RSP <ffff88001fc03c80>
[24149.786834] ---[ end trace 706afe7abd423bbf ]---
[24149.786840] Kernel panic - not syncing: Fatal exception in interrupt
[24149.786847] Pid: 0, comm: swapper/0 Tainted: G      D      3.2.0-55-generic #85-Ubuntu
[24149.786853] Call Trace:
[24149.786856]  <IRQ>  [<ffffffff8164869c>] panic+0x91/0x1a4
[24149.786868]  [<ffffffff8166239a>] oops_end+0xea/0xf0
[24149.786875]  [<ffffffff810178b8>] die+0x58/0x90
[24149.786882]  [<ffffffff81661ee2>] do_general_protection+0x162/0x170
[24149.786889]  [<ffffffff81661905>] general_protection+0x25/0x30
[24149.786896]  [<ffffffff8142655d>] ? scsi_dispatch_cmd+0x6d/0x2e0
[24149.786904]  [<ffffffff8142da62>] scsi_request_fn+0x3a2/0x470
[24149.786911]  [<ffffffff812f1a28>] blk_run_queue+0x38/0x60
[24149.786918]  [<ffffffff8142c416>] scsi_run_queue+0xd6/0x1b0
[24149.786925]  [<ffffffff8142e822>] scsi_next_command+0x42/0x60
[24149.786932]  [<ffffffff8142ea52>] scsi_io_completion+0x1b2/0x630
[24149.786939]  [<ffffffff816611fe>] ? _raw_spin_unlock_irqrestore+0x1e/0x30
[24149.786947]  [<ffffffff81424b5c>] scsi_finish_command+0xcc/0x130
[24149.786954]  [<ffffffff8142e7ae>] scsi_softirq_done+0x13e/0x150
[24149.786962]  [<ffffffff812fb6b3>] blk_done_softirq+0x83/0xa0
[24149.786968]  [<ffffffff8106fa38>] __do_softirq+0xa8/0x210
[24149.786975]  [<ffffffff8166ba6c>] call_softirq+0x1c/0x30
[24149.786982]  [<ffffffff810162f5>] do_softirq+0x65/0xa0
[24149.786988]  [<ffffffff8106fe1e>] irq_exit+0x8e/0xb0
[24149.786994]  [<ffffffff813aecd5>] xen_evtchn_do_upcall+0x35/0x50
[24149.787002]  [<ffffffff8166babe>] xen_do_hypervisor_callback+0x1e/0x30
[24149.787007]  <EOI>  [<ffffffff810013aa>] ? hypercall_page+0x3aa/0x1000
[24149.787019]  [<ffffffff810013aa>] ? hypercall_page+0x3aa/0x1000
[24149.787026]  [<ffffffff8100a500>] ? xen_safe_halt+0x10/0x20
[24149.787033]  [<ffffffff8101c913>] ? default_idle+0x53/0x1d0
[24149.787039]  [<ffffffff81013236>] ? cpu_idle+0xd6/0x120
[24149.787045]  [<ffffffff8162792e>] ? rest_init+0x72/0x74
[24149.787052]  [<ffffffff81cfcc06>] ? start_kernel+0x3b5/0x3c2
[24149.787059]  [<ffffffff81cfc388>] ? x86_64_start_reservations+0x132/0x136
[24149.787067]  [<ffffffff81cffde8>] ? xen_start_kernel+0x48b/0x492
(XEN) Domain 0 crashed: rebooting machine in 5 seconds.
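For concreteness, the hourly backup cycle described above corresponds roughly to the following shell sketch. All names (vg0, domu, domu-disk, backup, /dev/st0) are illustrative placeholders, not taken from the report, and xl is used for the domain control commands:

    #!/bin/bash
    # Minimal sketch of the backup cycle described in the report.
    set -e

    xl shutdown -w domu                                  # stop the guest, wait for it
    lvcreate -s -L 2G -n domu-snap /dev/vg0/domu-disk    # snapshot its LV
    xl create /etc/xen/domu.cfg                          # restart the guest immediately
    dd if=/dev/vg0/domu-snap of=/dev/vg0/backup bs=1M    # copy the snapshot to a second LV
    lvremove -f /dev/vg0/domu-snap                       # drop the snapshot
    dd if=/dev/vg0/backup bs=1M | gzip -c > /dev/st0     # gzip onto DAT tape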
>>> On 05.11.13 at 12:58, Ian Murray <murrayie@yahoo.co.uk> wrote:
> I have a recurring crash using Xen 4.3.1-RC2 and Ubuntu 12.04 as Dom0
> (3.2.0-55-generic). I have software RAID 5 with LVMs. The DomU (also
> Ubuntu 12.04, 3.2.0-55 kernel) has a dedicated logical volume, which is
> backed up by shutting down the DomU, creating an LVM snapshot, restarting
> the DomU, and then dd'ing the snapshot to another logical volume. The
> snapshot is then removed and the second LV is dd'ed through gzip and onto
> DAT tape.
>
> I currently have this running every hour (unless it's already running)
> for testing purposes. After 6-12 runs of this, the Dom0 kernel crashes
> with the output below.
>
> When I perform the same procedure booted into the same kernel standalone,
> the problem does not occur.

Likely because the action that triggers this doesn't get performed in that case?

> Can anyone please suggest what I am doing wrong, or identify whether it
> is a bug?

Considering that exception address ...

> RIP: e030:[<ffffffff8142655d>] [<ffffffff8142655d>] scsi_dispatch_cmd+0x6d/0x2e0

... and call stack ...

> [24149.786311] Call Trace:
> [24149.786315]  <IRQ>
> [24149.786323]  [<ffffffff8142da62>] scsi_request_fn+0x3a2/0x470
> [24149.786333]  [<ffffffff812f1a28>] blk_run_queue+0x38/0x60
> [24149.786339]  [<ffffffff8142c416>] scsi_run_queue+0xd6/0x1b0
> [24149.786347]  [<ffffffff8142e822>] scsi_next_command+0x42/0x60
> [24149.786354]  [<ffffffff8142ea52>] scsi_io_completion+0x1b2/0x630
> [24149.786363]  [<ffffffff816611fe>] ? _raw_spin_unlock_irqrestore+0x1e/0x30
> [24149.786371]  [<ffffffff81424b5c>] scsi_finish_command+0xcc/0x130
> [24149.786378]  [<ffffffff8142e7ae>] scsi_softirq_done+0x13e/0x150
> [24149.786386]  [<ffffffff812fb6b3>] blk_done_softirq+0x83/0xa0
> [24149.786394]  [<ffffffff8106fa38>] __do_softirq+0xa8/0x210
> [24149.786402]  [<ffffffff8166ba6c>] call_softirq+0x1c/0x30
> [24149.786410]  [<ffffffff810162f5>] do_softirq+0x65/0xa0
> [24149.786416]  [<ffffffff8106fe1e>] irq_exit+0x8e/0xb0
> [24149.786428]  [<ffffffff813aecd5>] xen_evtchn_do_upcall+0x35/0x50
> [24149.786436]  [<ffffffff8166babe>] xen_do_hypervisor_callback+0x1e/0x30
> [24149.786441]  <EOI>
> [24149.786449]  [<ffffffff810013aa>] ? hypercall_page+0x3aa/0x1000
> [24149.786456]  [<ffffffff810013aa>] ? hypercall_page+0x3aa/0x1000
> [24149.786464]  [<ffffffff8100a500>] ? xen_safe_halt+0x10/0x20
> [24149.786472]  [<ffffffff8101c913>] ? default_idle+0x53/0x1d0
> [24149.786478]  [<ffffffff81013236>] ? cpu_idle+0xd6/0x120

... point into the SCSI subsystem, this is likely the wrong list to ask for help on.

Jan
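As an aside, the faulting instruction can be resolved to a source line once the matching debug symbols are installed, which helps when taking this to the SCSI/block maintainers. A sketch, assuming Ubuntu's -dbgsym package for this kernel provides the vmlinux path shown:

    # Resolve scsi_dispatch_cmd+0x6d from the oops to a source line:
    gdb -batch /usr/lib/debug/boot/vmlinux-3.2.0-55-generic \
        -ex 'list *(scsi_dispatch_cmd+0x6d)'

    # Or feed addr2line the raw RIP value from the oops:
    addr2line -e /usr/lib/debug/boot/vmlinux-3.2.0-55-generic ffffffff8142655d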
On 05/11/13 13:16, Jan Beulich wrote:
>>>> On 05.11.13 at 12:58, Ian Murray <murrayie@yahoo.co.uk> wrote:
>> I have a recurring crash using Xen 4.3.1-RC2 and Ubuntu 12.04 as Dom0
>> (3.2.0-55-generic). I have software RAID 5 with LVMs. The DomU (also
>> Ubuntu 12.04, 3.2.0-55 kernel) has a dedicated logical volume, which is
>> backed up by shutting down the DomU, creating an LVM snapshot, restarting
>> the DomU, and then dd'ing the snapshot to another logical volume. The
>> snapshot is then removed and the second LV is dd'ed through gzip and onto
>> DAT tape.
>>
>> I currently have this running every hour (unless it's already running)
>> for testing purposes. After 6-12 runs of this, the Dom0 kernel crashes
>> with the output below.
>>
>> When I perform the same procedure booted into the same kernel standalone,
>> the problem does not occur.
> Likely because the action that triggers this doesn't get performed
> in that case?

Thanks for the response.

I am obviously comparing apples and oranges, but I have tried to keep the two cases as similar as possible: I limited kernel memory to 512M, as I do with Dom0, and ran a background task writing /dev/urandom to the LV that the DomU would normally be using. The only differences are that it isn't running under Xen and I don't have a DomU running in the background. I will repeat the exercise with no DomU running, but under Xen.

>> Can anyone please suggest what I am doing wrong, or identify whether it
>> is a bug?
> Considering that exception address ...
>
>> RIP: e030:[<ffffffff8142655d>] [<ffffffff8142655d>] scsi_dispatch_cmd+0x6d/0x2e0
> ... and call stack ...
>
>> [24149.786311] Call Trace:
>> [24149.786315]  <IRQ>
>> [24149.786323]  [<ffffffff8142da62>] scsi_request_fn+0x3a2/0x470
>> [24149.786333]  [<ffffffff812f1a28>] blk_run_queue+0x38/0x60
>> [24149.786339]  [<ffffffff8142c416>] scsi_run_queue+0xd6/0x1b0
>> [24149.786347]  [<ffffffff8142e822>] scsi_next_command+0x42/0x60
>> [24149.786354]  [<ffffffff8142ea52>] scsi_io_completion+0x1b2/0x630
>> [24149.786363]  [<ffffffff816611fe>] ? _raw_spin_unlock_irqrestore+0x1e/0x30
>> [24149.786371]  [<ffffffff81424b5c>] scsi_finish_command+0xcc/0x130
>> [24149.786378]  [<ffffffff8142e7ae>] scsi_softirq_done+0x13e/0x150
>> [24149.786386]  [<ffffffff812fb6b3>] blk_done_softirq+0x83/0xa0
>> [24149.786394]  [<ffffffff8106fa38>] __do_softirq+0xa8/0x210
>> [24149.786402]  [<ffffffff8166ba6c>] call_softirq+0x1c/0x30
>> [24149.786410]  [<ffffffff810162f5>] do_softirq+0x65/0xa0
>> [24149.786416]  [<ffffffff8106fe1e>] irq_exit+0x8e/0xb0
>> [24149.786428]  [<ffffffff813aecd5>] xen_evtchn_do_upcall+0x35/0x50
>> [24149.786436]  [<ffffffff8166babe>] xen_do_hypervisor_callback+0x1e/0x30
>> [24149.786441]  <EOI>
>> [24149.786449]  [<ffffffff810013aa>] ? hypercall_page+0x3aa/0x1000
>> [24149.786456]  [<ffffffff810013aa>] ? hypercall_page+0x3aa/0x1000
>> [24149.786464]  [<ffffffff8100a500>] ? xen_safe_halt+0x10/0x20
>> [24149.786472]  [<ffffffff8101c913>] ? default_idle+0x53/0x1d0
>> [24149.786478]  [<ffffffff81013236>] ? cpu_idle+0xd6/0x120
> ... point into the SCSI subsystem, this is likely the wrong list to
> ask for help on.

... but the right list to confirm that I am on the wrong list? :)

Seriously, the specific evidence may suggest it's a non-Xen issue/bug, but Xen is the only measurable/visible difference so far. I referred it to this list because the demarcation between hypervisor, PVOPS and regular kernel code interaction is likely best understood here.

Thanks again for your response.

> Jan
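The bare-metal approximation Ian describes amounts to something like the following, with vg0/domu-disk again a placeholder for the guest's LV:

    # Boot the same kernel bare-metal, capped to Dom0's memory size:
    #   (kernel command line)  mem=512M
    # Then keep the guest's LV busy while the backup loop runs:
    dd if=/dev/urandom of=/dev/vg0/domu-disk bs=1M &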
On Tue, Nov 05, 2013 at 10:29:49PM +0000, Ian Murray wrote:
> On 05/11/13 13:16, Jan Beulich wrote:
> >>>> On 05.11.13 at 12:58, Ian Murray <murrayie@yahoo.co.uk> wrote:
> >> I have a recurring crash using Xen 4.3.1-RC2 and Ubuntu 12.04 as Dom0
> >> (3.2.0-55-generic). I have software RAID 5 with LVMs. The DomU (also
> >> Ubuntu 12.04, 3.2.0-55 kernel) has a dedicated logical volume, which is
> >> backed up by shutting down the DomU, creating an LVM snapshot,
> >> restarting the DomU, and then dd'ing the snapshot to another logical
> >> volume. The snapshot is then removed and the second LV is dd'ed through
> >> gzip and onto DAT tape.
> >>
> >> I currently have this running every hour (unless it's already running)
> >> for testing purposes. After 6-12 runs of this, the Dom0 kernel crashes
> >> with the output below.
> >>
> >> When I perform the same procedure booted into the same kernel
> >> standalone, the problem does not occur.
> > Likely because the action that triggers this doesn't get performed
> > in that case?
>
> Thanks for the response.
>
> I am obviously comparing apples and oranges, but I have tried to keep the
> two cases as similar as possible: I limited kernel memory to 512M, as I
> do with Dom0, and ran a background task writing /dev/urandom to the LV
> that the DomU would normally be using. The only differences are that it
> isn't running under Xen and I don't have a DomU running in the
> background. I will repeat the exercise with no DomU running, but under
> Xen.
>
> >> Can anyone please suggest what I am doing wrong, or identify whether
> >> it is a bug?
> > Considering that exception address ...
> >
> >> RIP: e030:[<ffffffff8142655d>] [<ffffffff8142655d>] scsi_dispatch_cmd+0x6d/0x2e0
> > ... and call stack ...
> >
> >> [24149.786311] Call Trace:
> >> [24149.786315]  <IRQ>
> >> [24149.786323]  [<ffffffff8142da62>] scsi_request_fn+0x3a2/0x470
> >> [24149.786333]  [<ffffffff812f1a28>] blk_run_queue+0x38/0x60
> >> [24149.786339]  [<ffffffff8142c416>] scsi_run_queue+0xd6/0x1b0
> >> [24149.786347]  [<ffffffff8142e822>] scsi_next_command+0x42/0x60
> >> [24149.786354]  [<ffffffff8142ea52>] scsi_io_completion+0x1b2/0x630
> >> [24149.786363]  [<ffffffff816611fe>] ? _raw_spin_unlock_irqrestore+0x1e/0x30
> >> [24149.786371]  [<ffffffff81424b5c>] scsi_finish_command+0xcc/0x130
> >> [24149.786378]  [<ffffffff8142e7ae>] scsi_softirq_done+0x13e/0x150
> >> [24149.786386]  [<ffffffff812fb6b3>] blk_done_softirq+0x83/0xa0
> >> [24149.786394]  [<ffffffff8106fa38>] __do_softirq+0xa8/0x210
> >> [24149.786402]  [<ffffffff8166ba6c>] call_softirq+0x1c/0x30
> >> [24149.786410]  [<ffffffff810162f5>] do_softirq+0x65/0xa0
> >> [24149.786416]  [<ffffffff8106fe1e>] irq_exit+0x8e/0xb0
> >> [24149.786428]  [<ffffffff813aecd5>] xen_evtchn_do_upcall+0x35/0x50
> >> [24149.786436]  [<ffffffff8166babe>] xen_do_hypervisor_callback+0x1e/0x30
> >> [24149.786441]  <EOI>
> >> [24149.786449]  [<ffffffff810013aa>] ? hypercall_page+0x3aa/0x1000
> >> [24149.786456]  [<ffffffff810013aa>] ? hypercall_page+0x3aa/0x1000
> >> [24149.786464]  [<ffffffff8100a500>] ? xen_safe_halt+0x10/0x20
> >> [24149.786472]  [<ffffffff8101c913>] ? default_idle+0x53/0x1d0
> >> [24149.786478]  [<ffffffff81013236>] ? cpu_idle+0xd6/0x120
> > ... point into the SCSI subsystem, this is likely the wrong list to
> > ask for help on.
> ... but the right list to confirm that I am on the wrong list? :)

:-)

> Seriously, the specific evidence may suggest it's a non-Xen issue/bug,
> but Xen is the only measurable/visible difference so far.
> I referred it to this list because the demarcation between hypervisor,
> PVOPS and regular kernel code interaction is likely best understood here.

But you wouldn't do the same workload under baremetal, though?

Here is a thought. If you just do the "LV is dd'ed through gzip and onto DAT tape" step 15 times under baremetal, do you see the same issue?

And is there something particular about this DAT? Is it just a generic /dev/st device?

Lastly, complete shot in the dark - try increasing the swiotlb size. Do 'swiotlb=65543' on the Linux command line when booting under Xen.

> Thanks again for your response.
>
> > Jan
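To make the last suggestion concrete: on an Ubuntu Dom0 the parameter would typically be added via /etc/default/grub, followed by update-grub and a reboot. A sketch only; which variable ends up on the Dom0 kernel line depends on the grub-xen setup in use:

    # /etc/default/grub on the Dom0; the 20_linux_xen script appends this
    # to the Dom0 kernel line under the Xen boot entries. Append to the
    # existing contents rather than replacing them, e.g.:
    GRUB_CMDLINE_LINUX_DEFAULT="quiet splash swiotlb=65543"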