Hi,

I have a recurring crash using Xen 4.3.1-RC2 and Ubuntu 12.04 as Dom0 (3.2.0-55-generic). I have software RAID 5 with LVMs. The DomU (also Ubuntu 12.04, 3.2.0-55 kernel) has a dedicated logical volume, which is backed up by shutting down the DomU, creating an LVM snapshot, restarting the DomU, and then dd'ing the snapshot to another logical volume. The snapshot is then removed and the second LV is dd'ed through gzip and onto DAT tape.

I currently have this running every hour (unless it's already running) for testing purposes. After 6-12 runs of this, the Dom0 kernel crashes with the output below.

When I perform the same procedure booted into the same kernel standalone, the problem does not occur.

Can anyone please suggest what I am doing wrong, or identify whether it is a bug?

Thanks in advance,

Ian.

[24149.786053] general protection fault: 0000 [#1] SMP
[24149.786070] CPU 0
[24149.786073] Modules linked in: dm_snapshot xt_physdev iptable_filter ip_tables x_tables xen_pciback xen_netback xen_blkback xen_gntalloc xen_gntdev xen_evtchn xenfs bridge stp ppdev dm_multipath snd_hda_codec_realtek nouveau ttm drm_kms_helper drm i2c_algo_bit edac_core mxm_wmi video k8temp edac_mce_amd serio_raw osst st snd_hda_intel snd_hda_codec snd_hwdep snd_pcm snd_timer snd soundcore snd_page_alloc i2c_nforce2 parport_pc wmi mac_hid lp parport pata_jmicron raid10 raid456 async_pq async_xor xor async_memcpy async_raid6_recov aic7xxx forcedeth pata_amd raid6_pq async_tx raid1 raid0 multipath linear
[24149.786164]
[24149.786169] Pid: 0, comm: swapper/0 Not tainted 3.2.0-55-generic #85-Ubuntu To Be Filled By O.E.M. To Be Filled By O.E.M./939N68PV-GLAN
[24149.786181] RIP: e030:[<ffffffff8142655d>] [<ffffffff8142655d>] scsi_dispatch_cmd+0x6d/0x2e0
[24149.786197] RSP: e02b:ffff88001fc03c80 EFLAGS: 00010206
[24149.786202] RAX: 0000000020000000 RBX: ffff880018478000 RCX: ffff8800184c1f38
[24149.786208] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff880018478000
[24149.786215] RBP: ffff88001fc03cb0 R08: 0000000000000001 R09: 0000000000000000
[24149.786221] R10: 0000000000000028 R11: 0000000000000003 R12: ffff880003c22800
[24149.786227] R13: 0100000000000800 R14: ffff8800184c12b0 R15: ffff880018478000
[24149.786238] FS: 00007f02a854d7c0(0000) GS:ffff88001fc00000(0000) knlGS:0000000000000000
[24149.786245] CS: e033 DS: 0000 ES: 0000 CR0: 000000008005003b
[24149.786250] CR2: 00007f3c3b232000 CR3: 0000000012417000 CR4: 0000000000000660
[24149.786258] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[24149.786264] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[24149.786271] Process swapper/0 (pid: 0, threadinfo ffffffff81c00000, task ffffffff81c0d020)
[24149.786278] Stack:
[24149.786281]  ffff88001fc03ca0 ffff880003ead000 ffff880003c22800 ffff880003d2b368
[24149.786291]  ffff8800184c12b0 ffff880018478000 ffff88001fc03d10 ffffffff8142da62
[24149.786301]  ffff880003c22828 0000000000000000 ffff880003ead138 ffff880003ead048
[24149.786311] Call Trace:
[24149.786315]  <IRQ>
[24149.786323]  [<ffffffff8142da62>] scsi_request_fn+0x3a2/0x470
[24149.786333]  [<ffffffff812f1a28>] blk_run_queue+0x38/0x60
[24149.786339]  [<ffffffff8142c416>] scsi_run_queue+0xd6/0x1b0
[24149.786347]  [<ffffffff8142e822>] scsi_next_command+0x42/0x60
[24149.786354]  [<ffffffff8142ea52>] scsi_io_completion+0x1b2/0x630
[24149.786363]  [<ffffffff816611fe>] ? _raw_spin_unlock_irqrestore+0x1e/0x30
[24149.786371]  [<ffffffff81424b5c>] scsi_finish_command+0xcc/0x130
[24149.786378]  [<ffffffff8142e7ae>] scsi_softirq_done+0x13e/0x150
[24149.786386]  [<ffffffff812fb6b3>] blk_done_softirq+0x83/0xa0
[24149.786394]  [<ffffffff8106fa38>] __do_softirq+0xa8/0x210
[24149.786402]  [<ffffffff8166ba6c>] call_softirq+0x1c/0x30
[24149.786410]  [<ffffffff810162f5>] do_softirq+0x65/0xa0
[24149.786416]  [<ffffffff8106fe1e>] irq_exit+0x8e/0xb0
[24149.786428]  [<ffffffff813aecd5>] xen_evtchn_do_upcall+0x35/0x50
[24149.786436]  [<ffffffff8166babe>] xen_do_hypervisor_callback+0x1e/0x30
[24149.786441]  <EOI>
[24149.786449]  [<ffffffff810013aa>] ? hypercall_page+0x3aa/0x1000
[24149.786456]  [<ffffffff810013aa>] ? hypercall_page+0x3aa/0x1000
[24149.786464]  [<ffffffff8100a500>] ? xen_safe_halt+0x10/0x20
[24149.786472]  [<ffffffff8101c913>] ? default_idle+0x53/0x1d0
[24149.786478]  [<ffffffff81013236>] ? cpu_idle+0xd6/0x120
[24149.786485]  [<ffffffff8162792e>] ? rest_init+0x72/0x74
[24149.786494]  [<ffffffff81cfcc06>] ? start_kernel+0x3b5/0x3c2
[24149.786501]  [<ffffffff81cfc388>] ? x86_64_start_reservations+0x132/0x136
[24149.786509]  [<ffffffff81cffde8>] ? xen_start_kernel+0x48b/0x492
[24149.786514] Code: 00 00 0f b6 90 a1 00 00 00 84 d2 74 1e 80 fa 03 7f 19 48 8b 4f 50 8b 80 84 00 00 00 0f b6 51 01 c1 e0 05 83 e2 1f 09 d0 88 41 01 <41> 8b b5 e0 00 00 00 4d 8b a5 e8 00 00 00 85 f6 74 17 48 8b 05
[24149.786588] RIP  [<ffffffff8142655d>] scsi_dispatch_cmd+0x6d/0x2e0
[24149.786596]  RSP <ffff88001fc03c80>
[24149.786834] ---[ end trace 706afe7abd423bbf ]---
[24149.786840] Kernel panic - not syncing: Fatal exception in interrupt
[24149.786847] Pid: 0, comm: swapper/0 Tainted: G      D      3.2.0-55-generic #85-Ubuntu
[24149.786853] Call Trace:
[24149.786856]  <IRQ>  [<ffffffff8164869c>] panic+0x91/0x1a4
[24149.786868]  [<ffffffff8166239a>] oops_end+0xea/0xf0
[24149.786875]  [<ffffffff810178b8>] die+0x58/0x90
[24149.786882]  [<ffffffff81661ee2>] do_general_protection+0x162/0x170
[24149.786889]  [<ffffffff81661905>] general_protection+0x25/0x30
[24149.786896]  [<ffffffff8142655d>] ? scsi_dispatch_cmd+0x6d/0x2e0
[24149.786904]  [<ffffffff8142da62>] scsi_request_fn+0x3a2/0x470
[24149.786911]  [<ffffffff812f1a28>] blk_run_queue+0x38/0x60
[24149.786918]  [<ffffffff8142c416>] scsi_run_queue+0xd6/0x1b0
[24149.786925]  [<ffffffff8142e822>] scsi_next_command+0x42/0x60
[24149.786932]  [<ffffffff8142ea52>] scsi_io_completion+0x1b2/0x630
[24149.786939]  [<ffffffff816611fe>] ? _raw_spin_unlock_irqrestore+0x1e/0x30
[24149.786947]  [<ffffffff81424b5c>] scsi_finish_command+0xcc/0x130
[24149.786954]  [<ffffffff8142e7ae>] scsi_softirq_done+0x13e/0x150
[24149.786962]  [<ffffffff812fb6b3>] blk_done_softirq+0x83/0xa0
[24149.786968]  [<ffffffff8106fa38>] __do_softirq+0xa8/0x210
[24149.786975]  [<ffffffff8166ba6c>] call_softirq+0x1c/0x30
[24149.786982]  [<ffffffff810162f5>] do_softirq+0x65/0xa0
[24149.786988]  [<ffffffff8106fe1e>] irq_exit+0x8e/0xb0
[24149.786994]  [<ffffffff813aecd5>] xen_evtchn_do_upcall+0x35/0x50
[24149.787002]  [<ffffffff8166babe>] xen_do_hypervisor_callback+0x1e/0x30
[24149.787007]  <EOI>  [<ffffffff810013aa>] ? hypercall_page+0x3aa/0x1000
[24149.787019]  [<ffffffff810013aa>] ? hypercall_page+0x3aa/0x1000
[24149.787026]  [<ffffffff8100a500>] ? xen_safe_halt+0x10/0x20
[24149.787033]  [<ffffffff8101c913>] ? default_idle+0x53/0x1d0
[24149.787039]  [<ffffffff81013236>] ? cpu_idle+0xd6/0x120
[24149.787045]  [<ffffffff8162792e>] ? rest_init+0x72/0x74
[24149.787052]  [<ffffffff81cfcc06>] ? start_kernel+0x3b5/0x3c2
[24149.787059]  [<ffffffff81cfc388>] ? x86_64_start_reservations+0x132/0x136
[24149.787067]  [<ffffffff81cffde8>] ? xen_start_kernel+0x48b/0x492
(XEN) Domain 0 crashed: rebooting machine in 5 seconds.
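For concreteness, the hourly backup cycle described above corresponds roughly to the following shell sketch. All names (vg0, domu, domu-disk, backup, /dev/st0) are illustrative placeholders, not taken from the report, and xl is used for the domain control commands:

    #!/bin/bash
    # Minimal sketch of the backup cycle described in the report.
    set -e

    xl shutdown -w domu                                  # stop the guest, wait for it
    lvcreate -s -L 2G -n domu-snap /dev/vg0/domu-disk    # snapshot its LV
    xl create /etc/xen/domu.cfg                          # restart the guest immediately
    dd if=/dev/vg0/domu-snap of=/dev/vg0/backup bs=1M    # copy the snapshot to a second LV
    lvremove -f /dev/vg0/domu-snap                       # drop the snapshot
    dd if=/dev/vg0/backup bs=1M | gzip -c > /dev/st0     # gzip onto DAT tape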
>>> On 05.11.13 at 12:58, Ian Murray <murrayie@yahoo.co.uk> wrote:
> I have a recurring crash using Xen 4.3.1-RC2 and Ubuntu 12.04 as Dom0
> (3.2.0-55-generic). I have software RAID 5 with LVMs. The DomU (also
> Ubuntu 12.04, 3.2.0-55 kernel) has a dedicated logical volume, which is
> backed up by shutting down the DomU, creating an LVM snapshot, restarting
> the DomU, and then dd'ing the snapshot to another logical volume. The
> snapshot is then removed and the second LV is dd'ed through gzip and onto
> DAT tape.
>
> I currently have this running every hour (unless it's already running)
> for testing purposes. After 6-12 runs of this, the Dom0 kernel crashes
> with the output below.
>
> When I perform the same procedure booted into the same kernel standalone,
> the problem does not occur.

Likely because the action that triggers this doesn't get performed in that case?

> Can anyone please suggest what I am doing wrong, or identify whether it
> is a bug?

Considering that exception address ...

> RIP: e030:[<ffffffff8142655d>] [<ffffffff8142655d>] scsi_dispatch_cmd+0x6d/0x2e0

... and call stack ...

> [24149.786311] Call Trace:
> [24149.786315]  <IRQ>
> [24149.786323]  [<ffffffff8142da62>] scsi_request_fn+0x3a2/0x470
> [24149.786333]  [<ffffffff812f1a28>] blk_run_queue+0x38/0x60
> [24149.786339]  [<ffffffff8142c416>] scsi_run_queue+0xd6/0x1b0
> [24149.786347]  [<ffffffff8142e822>] scsi_next_command+0x42/0x60
> [24149.786354]  [<ffffffff8142ea52>] scsi_io_completion+0x1b2/0x630
> [24149.786363]  [<ffffffff816611fe>] ? _raw_spin_unlock_irqrestore+0x1e/0x30
> [24149.786371]  [<ffffffff81424b5c>] scsi_finish_command+0xcc/0x130
> [24149.786378]  [<ffffffff8142e7ae>] scsi_softirq_done+0x13e/0x150
> [24149.786386]  [<ffffffff812fb6b3>] blk_done_softirq+0x83/0xa0
> [24149.786394]  [<ffffffff8106fa38>] __do_softirq+0xa8/0x210
> [24149.786402]  [<ffffffff8166ba6c>] call_softirq+0x1c/0x30
> [24149.786410]  [<ffffffff810162f5>] do_softirq+0x65/0xa0
> [24149.786416]  [<ffffffff8106fe1e>] irq_exit+0x8e/0xb0
> [24149.786428]  [<ffffffff813aecd5>] xen_evtchn_do_upcall+0x35/0x50
> [24149.786436]  [<ffffffff8166babe>] xen_do_hypervisor_callback+0x1e/0x30
> [24149.786441]  <EOI>
> [24149.786449]  [<ffffffff810013aa>] ? hypercall_page+0x3aa/0x1000
> [24149.786456]  [<ffffffff810013aa>] ? hypercall_page+0x3aa/0x1000
> [24149.786464]  [<ffffffff8100a500>] ? xen_safe_halt+0x10/0x20
> [24149.786472]  [<ffffffff8101c913>] ? default_idle+0x53/0x1d0
> [24149.786478]  [<ffffffff81013236>] ? cpu_idle+0xd6/0x120

... point into the SCSI subsystem, this is likely the wrong list to ask for help on.

Jan
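As an aside, the faulting instruction can be resolved to a source line once the matching debug symbols are installed, which helps when taking this to the SCSI/block maintainers. A sketch, assuming Ubuntu's -dbgsym package for this kernel provides the vmlinux path shown:

    # Resolve scsi_dispatch_cmd+0x6d from the oops to a source line:
    gdb -batch /usr/lib/debug/boot/vmlinux-3.2.0-55-generic \
        -ex 'list *(scsi_dispatch_cmd+0x6d)'

    # Or feed addr2line the raw RIP value from the oops:
    addr2line -e /usr/lib/debug/boot/vmlinux-3.2.0-55-generic ffffffff8142655d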
On 05/11/13 13:16, Jan Beulich wrote:
>>>> On 05.11.13 at 12:58, Ian Murray <murrayie@yahoo.co.uk> wrote:
>> I have a recurring crash using Xen 4.3.1-RC2 and Ubuntu 12.04 as Dom0
>> (3.2.0-55-generic). I have software RAID 5 with LVMs. The DomU (also
>> Ubuntu 12.04, 3.2.0-55 kernel) has a dedicated logical volume, which is
>> backed up by shutting down the DomU, creating an LVM snapshot, restarting
>> the DomU, and then dd'ing the snapshot to another logical volume. The
>> snapshot is then removed and the second LV is dd'ed through gzip and onto
>> DAT tape.
>>
>> I currently have this running every hour (unless it's already running)
>> for testing purposes. After 6-12 runs of this, the Dom0 kernel crashes
>> with the output below.
>>
>> When I perform the same procedure booted into the same kernel standalone,
>> the problem does not occur.
> Likely because the action that triggers this doesn't get performed
> in that case?

Thanks for the response.

I am obviously comparing apples and oranges, but I have tried to keep the two cases as similar as possible: I limited kernel memory to 512M, as I do with Dom0, and ran a background task writing /dev/urandom to the LV that the DomU would normally be using. The only differences are that it isn't running under Xen and I don't have a DomU running in the background. I will repeat the exercise with no DomU running, but under Xen.

>> Can anyone please suggest what I am doing wrong, or identify whether it
>> is a bug?
> Considering that exception address ...
>
>> RIP: e030:[<ffffffff8142655d>] [<ffffffff8142655d>] scsi_dispatch_cmd+0x6d/0x2e0
> ... and call stack ...
>
>> [24149.786311] Call Trace:
>> [24149.786315]  <IRQ>
>> [24149.786323]  [<ffffffff8142da62>] scsi_request_fn+0x3a2/0x470
>> [24149.786333]  [<ffffffff812f1a28>] blk_run_queue+0x38/0x60
>> [24149.786339]  [<ffffffff8142c416>] scsi_run_queue+0xd6/0x1b0
>> [24149.786347]  [<ffffffff8142e822>] scsi_next_command+0x42/0x60
>> [24149.786354]  [<ffffffff8142ea52>] scsi_io_completion+0x1b2/0x630
>> [24149.786363]  [<ffffffff816611fe>] ? _raw_spin_unlock_irqrestore+0x1e/0x30
>> [24149.786371]  [<ffffffff81424b5c>] scsi_finish_command+0xcc/0x130
>> [24149.786378]  [<ffffffff8142e7ae>] scsi_softirq_done+0x13e/0x150
>> [24149.786386]  [<ffffffff812fb6b3>] blk_done_softirq+0x83/0xa0
>> [24149.786394]  [<ffffffff8106fa38>] __do_softirq+0xa8/0x210
>> [24149.786402]  [<ffffffff8166ba6c>] call_softirq+0x1c/0x30
>> [24149.786410]  [<ffffffff810162f5>] do_softirq+0x65/0xa0
>> [24149.786416]  [<ffffffff8106fe1e>] irq_exit+0x8e/0xb0
>> [24149.786428]  [<ffffffff813aecd5>] xen_evtchn_do_upcall+0x35/0x50
>> [24149.786436]  [<ffffffff8166babe>] xen_do_hypervisor_callback+0x1e/0x30
>> [24149.786441]  <EOI>
>> [24149.786449]  [<ffffffff810013aa>] ? hypercall_page+0x3aa/0x1000
>> [24149.786456]  [<ffffffff810013aa>] ? hypercall_page+0x3aa/0x1000
>> [24149.786464]  [<ffffffff8100a500>] ? xen_safe_halt+0x10/0x20
>> [24149.786472]  [<ffffffff8101c913>] ? default_idle+0x53/0x1d0
>> [24149.786478]  [<ffffffff81013236>] ? cpu_idle+0xd6/0x120
> ... point into the SCSI subsystem, this is likely the wrong list to
> ask for help on.

... but the right list to confirm that I am on the wrong list? :)

Seriously, the specific evidence may suggest it's a non-Xen issue/bug, but Xen is the only measurable/visible difference so far. I referred it to this list because the demarcation between hypervisor, PVOPS and regular kernel code interaction is likely best understood here.

Thanks again for your response.

> Jan
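The bare-metal approximation Ian describes amounts to something like the following, with vg0/domu-disk again a placeholder for the guest's LV:

    # Boot the same kernel bare-metal, capped to Dom0's memory size:
    #   (kernel command line)  mem=512M
    # Then keep the guest's LV busy while the backup loop runs:
    dd if=/dev/urandom of=/dev/vg0/domu-disk bs=1M &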
On Tue, Nov 05, 2013 at 10:29:49PM +0000, Ian Murray wrote:
> On 05/11/13 13:16, Jan Beulich wrote:
> >>>> On 05.11.13 at 12:58, Ian Murray <murrayie@yahoo.co.uk> wrote:
> >> I have a recurring crash using Xen 4.3.1-RC2 and Ubuntu 12.04 as Dom0
> >> (3.2.0-55-generic). I have software RAID 5 with LVMs. The DomU (also
> >> Ubuntu 12.04, 3.2.0-55 kernel) has a dedicated logical volume, which is
> >> backed up by shutting down the DomU, creating an LVM snapshot,
> >> restarting the DomU, and then dd'ing the snapshot to another logical
> >> volume. The snapshot is then removed and the second LV is dd'ed through
> >> gzip and onto DAT tape.
> >>
> >> I currently have this running every hour (unless it's already running)
> >> for testing purposes. After 6-12 runs of this, the Dom0 kernel crashes
> >> with the output below.
> >>
> >> When I perform the same procedure booted into the same kernel
> >> standalone, the problem does not occur.
> > Likely because the action that triggers this doesn't get performed
> > in that case?
>
> Thanks for the response.
>
> I am obviously comparing apples and oranges, but I have tried to keep the
> two cases as similar as possible: I limited kernel memory to 512M, as I
> do with Dom0, and ran a background task writing /dev/urandom to the LV
> that the DomU would normally be using. The only differences are that it
> isn't running under Xen and I don't have a DomU running in the
> background. I will repeat the exercise with no DomU running, but under
> Xen.
>
> >> Can anyone please suggest what I am doing wrong, or identify whether
> >> it is a bug?
> > Considering that exception address ...
> >
> >> RIP: e030:[<ffffffff8142655d>] [<ffffffff8142655d>] scsi_dispatch_cmd+0x6d/0x2e0
> > ... and call stack ...
> >
> >> [24149.786311] Call Trace:
> >> [24149.786315]  <IRQ>
> >> [24149.786323]  [<ffffffff8142da62>] scsi_request_fn+0x3a2/0x470
> >> [24149.786333]  [<ffffffff812f1a28>] blk_run_queue+0x38/0x60
> >> [24149.786339]  [<ffffffff8142c416>] scsi_run_queue+0xd6/0x1b0
> >> [24149.786347]  [<ffffffff8142e822>] scsi_next_command+0x42/0x60
> >> [24149.786354]  [<ffffffff8142ea52>] scsi_io_completion+0x1b2/0x630
> >> [24149.786363]  [<ffffffff816611fe>] ? _raw_spin_unlock_irqrestore+0x1e/0x30
> >> [24149.786371]  [<ffffffff81424b5c>] scsi_finish_command+0xcc/0x130
> >> [24149.786378]  [<ffffffff8142e7ae>] scsi_softirq_done+0x13e/0x150
> >> [24149.786386]  [<ffffffff812fb6b3>] blk_done_softirq+0x83/0xa0
> >> [24149.786394]  [<ffffffff8106fa38>] __do_softirq+0xa8/0x210
> >> [24149.786402]  [<ffffffff8166ba6c>] call_softirq+0x1c/0x30
> >> [24149.786410]  [<ffffffff810162f5>] do_softirq+0x65/0xa0
> >> [24149.786416]  [<ffffffff8106fe1e>] irq_exit+0x8e/0xb0
> >> [24149.786428]  [<ffffffff813aecd5>] xen_evtchn_do_upcall+0x35/0x50
> >> [24149.786436]  [<ffffffff8166babe>] xen_do_hypervisor_callback+0x1e/0x30
> >> [24149.786441]  <EOI>
> >> [24149.786449]  [<ffffffff810013aa>] ? hypercall_page+0x3aa/0x1000
> >> [24149.786456]  [<ffffffff810013aa>] ? hypercall_page+0x3aa/0x1000
> >> [24149.786464]  [<ffffffff8100a500>] ? xen_safe_halt+0x10/0x20
> >> [24149.786472]  [<ffffffff8101c913>] ? default_idle+0x53/0x1d0
> >> [24149.786478]  [<ffffffff81013236>] ? cpu_idle+0xd6/0x120
> > ... point into the SCSI subsystem, this is likely the wrong list to
> > ask for help on.
> ... but the right list to confirm that I am on the wrong list? :)

:-)

> Seriously, the specific evidence may suggest it's a non-Xen issue/bug,
> but Xen is the only measurable/visible difference so far.
> I referred it to this list because the demarcation between hypervisor,
> PVOPS and regular kernel code interaction is likely best understood here.

But you wouldn't do the same workload under baremetal, though?

Here is a thought. If you just do the "LV is dd'ed through gzip and onto DAT tape" step 15 times under baremetal, do you see the same issue?

And is there something particular about this DAT? Is it just a generic /dev/st device?

Lastly, complete shot in the dark - try increasing the swiotlb size. Do 'swiotlb=65543' on the Linux command line when booting under Xen.

> Thanks again for your response.
>
> > Jan
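To make the last suggestion concrete: on an Ubuntu Dom0 the parameter would typically be added via /etc/default/grub, followed by update-grub and a reboot. A sketch only; which variable ends up on the Dom0 kernel line depends on the grub-xen setup in use:

    # /etc/default/grub on the Dom0; the 20_linux_xen script appends this
    # to the Dom0 kernel line under the Xen boot entries. Append to the
    # existing contents rather than replacing them, e.g.:
    GRUB_CMDLINE_LINUX_DEFAULT="quiet splash swiotlb=65543"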