Hello,
I have encountered a crash in the dom0 kernel while booting a domU from an AOE
device. I haven't seen such crashes when booting from local partitions, LVM
volumes, or loopback file systems. I also haven't seen such a crash when I did
repetitive I/O to these AOE devices. As the call trace indicates, the crash is
in the xenolinux kernel, and it is predictably reproducible.
I am currently using Xen 3.0.1, but I saw the same thing happen in 3.0.2 some
time back. If time permits, I can try to reproduce it on the latest Xen builds.
The domU's disks look like this:
'phy:/dev/etherd/e0.4,sda1,w'
'phy:/dev/etherd/e1.4,sda2,w'
Inside the domU, sda1 is treated as the root device and sda2 is treated as swap.
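For context, Xen 3.0 domU configuration files use Python syntax, so the two disk lines above would typically sit in a config roughly like the sketch below. Only the two disk entries and the choice of sda1 as root come from this report. The name, memory, and kernel values, and the "ro" flag on root, are placeholders assumed for illustration.

```python
# Sketch of a Xen 3.0 domU config file (plain Python assignments).
# Only the disk entries are taken from the report; the rest is assumed.
name = "aoe-domu"               # hypothetical domain name
memory = 256                    # hypothetical memory size in MB
kernel = "/boot/vmlinuz-xenU"   # hypothetical kernel path

# Each entry is "backend-device,frontend-device,mode".
disk = [
    "phy:/dev/etherd/e0.4,sda1,w",  # AOE shelf 0, slot 4 -> root
    "phy:/dev/etherd/e1.4,sda2,w",  # AOE shelf 1, slot 4 -> swap
]

# sda1 is the root device per the report; "ro" is an assumed mount flag.
root = "/dev/sda1 ro"
```

Both entries use the phy: backend, so the block backend in dom0 reads and writes the /dev/etherd/ devices directly.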
The AOE setup consists of vblade servers running on a server machine that
exports some disks over AOE. The dom0 instance in question is a client of
this AOE server. It has the 'aoe' module loaded, and the aoe-tools version
is 10.
The stack trace of the crash is as follows:
Unable to handle kernel NULL pointer dereference at virtual address 00000004
printing eip:
c012cc32
*pde = ma 8da99067 pa 32e99067
*pte = ma 00000000 pa 55555000
Oops: 0002 [#1]
SMP
Modules linked in: ipt_physdev iptable_filter ip_tables aoe bridge nfs lockd
ppdev vmnet vmmon sg parport_pc lp parport autofs4 sunrpc af_packet
binfmt_misc dm_mirror dm_multipath video thermal processor fan button
battery ac ipv6 md ohci1394 ieee1394 uhci_hcd intel_agp agpgart i2c_i801
i2c_core pci_hotplug snd_intel8x0 snd_ac97_codec snd_pcm_oss snd_mixer_oss
snd_pcm snd_timer snd soundcore snd_page_alloc e1000 floppy unix sd_mod
aacraid scsi_mod ext3 jbd dm_mod
CPU: 0
EIP: 0061:[<c012cc32>] Tainted: P VLI
EFLAGS: 00010012 (2.6.12.6-xen)
EIP is at run_timer_softirq+0xa2/0x1c0
eax: 00000000 ebx: 00000000 ecx: f33dbe00 edx: c03f3f0c
esi: 00000100 edi: c26deda0 ebp: 00000000 esp: c03f3ef8
ds: 007b es: 007b ss: 0069
Process swapper (pid: 0, threadinfo=c03f2000 task=c0369fc0)
Stack: 00000000 c03f3f7c 00000100 c01438a0 c03f2000 f33dbe00 c0449260 20000000
00000011 c03ecda8 c0420ea0 00000000 c0127ee6 c03ecda8 0000000a c03f2000
00000001 00000000 00000000 c0128005 00000000 fbf7e000 c010ef32 c0105a00
Call Trace:
[<c01438a0>] handle_IRQ_event+0x60/0xb0
[<c0127ee6>] __do_softirq+0x96/0x130
[<c0128005>] do_softirq+0x85/0xa0
[<c010ef32>] do_IRQ+0x22/0x30
[<c0105a00>] evtchn_do_upcall+0x90/0x100
[<c010a88c>] hypervisor_callback+0x2c/0x34
[<c01082aa>] xen_idle+0x4a/0xa0
[<c0108369>] cpu_idle+0x69/0xb0
[<c03f49fa>] start_kernel+0x1ca/0x220
[<c03f4370>] unknown_bootoption+0x0/0x1f0
Code: 00 8b 53 04 8d 6c 24 14 8b 44 24 14 89 69 04 89 4c 24 14 89 50 04 89 02 89 5b 04 89 5e 0c eb 66 8b 51 04 8b 01 8b 69 14 8b 59 18 <89> 50 04 89 02 c7 41 04 00 02 20 00 c7 01 00 01 10 00 89 4f 08
<0>Kernel panic - not syncing: Fatal exception in interrupt
(XEN) Domain 0 shutdown: rebooting machine.
(XEN) Reboot disabled on cmdline: require manual reset
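As an aid to reading the trace above, the "Oops: 0002" value is the i386 page-fault error code, and its low three bits can be decoded as in the sketch below. The function name is made up for illustration; the bit meanings are the standard x86 ones.

```python
def decode_oops_error_code(code: int) -> dict:
    """Decode the low three bits of an i386 page-fault error code."""
    return {
        # bit 0: 0 = page not present, 1 = protection violation
        "cause": "protection violation" if code & 0x1 else "page not present",
        # bit 1: 0 = read access, 1 = write access
        "access": "write" if code & 0x2 else "read",
        # bit 2: 0 = kernel mode, 1 = user mode
        "mode": "user" if code & 0x4 else "kernel",
    }

# The oops line prints the code in hexadecimal.
info = decode_oops_error_code(int("0002", 16))
print(info)
```

For 0002 this gives a kernel-mode write to a page that is not present. That appears consistent with the rest of the dump: the marked bytes <89> 50 04 encode a store of edx to offset 4 from eax, eax is 00000000, and the faulting address is 00000004.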
Before the crash, I get some warnings on the serial console that look like
the following:
Uninitialised timer!
This is just a warning. Your computer is OK
function=0xc02344b0, data=0xf1b9d460
But I guess these have nothing to do with the crash.
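One way to find out which code owns the timer in that warning is to look up the function= address in the System.map of the running dom0 kernel. The sketch below shows the lookup. The map lines and symbol names in it are invented placeholders, not taken from a real 2.6.12.6-xen System.map, and the helper names are hypothetical.

```python
import bisect

def load_symbols(lines):
    """Parse System.map-style lines ("address type name") into a sorted list."""
    symbols = []
    for line in lines:
        parts = line.split()
        if len(parts) >= 3:
            symbols.append((int(parts[0], 16), parts[2]))
    symbols.sort()
    return symbols

def resolve(symbols, address):
    """Return "name+0xoffset" for the nearest symbol at or below address."""
    addresses = [addr for addr, _ in symbols]
    index = bisect.bisect_right(addresses, address) - 1
    if index < 0:
        return None  # address lies below the first known symbol
    base, name = symbols[index]
    return f"{name}+0x{address - base:x}"

# Placeholder map contents; a real run would read the System.map file instead.
SAMPLE_MAP = """\
c0234400 t hypothetical_fn_a
c02344b0 t hypothetical_timer_fn
c0234520 t hypothetical_fn_b
""".splitlines()

symbols = load_symbols(SAMPLE_MAP)
print(resolve(symbols, 0xc02344b0))
```

With the real System.map, the symbol returned for 0xc02344b0 would name the timer callback that was registered without initialisation. Addresses inside loadable modules are not listed in System.map, so a module's function would need /proc/kallsyms or similar instead.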
I also watched the AOE traffic with tcpdump while the crash occurred. Nothing
seemed unusual to my eyes; the packets simply stopped flowing after the AOE
client dom0 crashed. Furthermore, there is no problem with the AOE servers:
after a reboot I can start using the same AOE devices again (apart from the
inconsistent file system). My past attempts at putting printk's in the AOE
driver source also didn't reveal any helpful information.
Please let me know if any bug fixes were made in recent versions in the area
where this crash is seen (EIP is at run_timer_softirq; handle_IRQ_event is at
the top of the call trace). Any other suggestions for tackling the problem are
welcome.
Thanks,
--
Jayesh