Ray Barnes
2008-Apr-08 12:35 UTC
[Xen-devel] multipathd confusion leads to kernel panic in Xen 3.2.1-rc2
Resending, apparently this did not get picked up by the listserv during my two previous attempts
---

Hi all. While playing with iSCSI Enterprise Target + multipathd on CentOS 5.1 (both the target and the initiator/multipath/xen box are Cent 5.1), I encountered a strange fault condition that leads to a kernel panic in a version of Xen 3.2.1-rc2 pulled from a couple of days ago. My lab consists of two Clovertown machines with dual GigE into separate switches. The target box is softraid5 (although I was able to reproduce this using a single drive on the target), and runs a default config of IET, i.e. 'yum -y install scsi-target-utils ; /etc/init.d/tgtd start'. The config on the initiator needed to reproduce this is default to the best of my recollection. Xen was compiled with 'make world XEN_TARGET_X86_PAE=y vmxassist=n'. /etc/multipath.conf is as follows:

    defaults {
            udev_dir                /dev
            polling_interval        2
            selector                "round-robin 0"
            path_grouping_policy    multibus
    #       getuid_callout          "/sbin/scsi_id -g -u -s /block/%n"
            prio_callout            /bin/true
            path_checker            readsector0
            rr_min_io               10
            rr_weight               priorities
            failback                2
            no_path_retry           fail
            user_friendly_name      no
    }

The default parameter node.conn[0].timeo.noop_out_interval = 10 on the initiator tells it to "ping" the target once every 10 seconds, then per node.conn[0].timeo.noop_out_timeout = 15, wait 15 seconds before marking the target down. So most of the time it can figure out a path is down in about 20 seconds, but if you catch it just wrong it'll take 25 seconds. Add to that the 2 second polling interval in multipathd.

What seems to happen is that when I yank an Ethernet cable, multipathd gets confused and takes 30+ seconds to figure things out (this could be a bug in multipathd). But when that happens, a kernel panic ensues (see below). I have been able to reproduce this in the version of Xen that comes with Cent 5.1, as well as 3.2.0 and 3.2.1-rc2 pulled from hg a couple of days ago with a fresh pull of 2.6.18.8.
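The timing arithmetic above can be sketched as follows. This is a minimal illustration, assuming the link fails at a uniformly random point between two noop-out pings and that multipathd adds between zero and one polling interval on top; the function name `detection_window` and this simplified model are assumptions for illustration, not anything taken from open-iscsi or multipathd.

```python
def detection_window(noop_interval, noop_timeout, polling_interval=0):
    """Return (best, average, worst) seconds until a dead path is noticed.

    Simplified model (an assumption, not the real initiator logic):
    the path dies somewhere between two noop-out pings, the next ping
    is sent on schedule, and the path is marked down once that ping has
    gone unanswered for noop_timeout seconds. A path checker that polls
    every polling_interval seconds adds between 0 and polling_interval.
    """
    # Best case: the path dies just before the next ping goes out.
    best = noop_timeout
    # Worst case: the path dies just after a ping was answered.
    worst = noop_interval + noop_timeout + polling_interval
    # Average case: the failure lands midway through both intervals.
    average = noop_timeout + noop_interval / 2 + polling_interval / 2
    return best, average, worst


# Values from the post: noop_out_interval = 10, noop_out_timeout = 15,
# multipathd polling_interval = 2.
print(detection_window(10, 15))                      # (15, 20.0, 25)
print(detection_window(10, 15, polling_interval=2))  # (15, 21.0, 27)
```

Under this model the expected detection time stays below 30 seconds even with the multipathd polling interval included, which is consistent with the observation that a 30+ second delay looks abnormal.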
I can very easily reproduce this every time while installing Cent 5.1 into a domU; it has probably happened 10 times thus far. I can also reproduce it easily with a 'dd' inside of a domU that gets its filesystem from the initiator/multipathd, simply by yanking and replugging one of the Ethernet cables a few times. I also reproduced it once just running 'dd' directly against the multipathed target device in /dev/mapper from within the dom0. However, I tried very hard to reproduce this inside the latest non-Xen kernel of CentOS 5.1 and I could not. It appears to be a Xen issue, and under no circumstances should it crash the entire box.

In a final effort to add more substance and background to this, I attempted to yank both cables while running 'dd' in the dom0 to the target. Although it threw a bunch of errors, I could not make it panic. After multipathd marked both paths down, the 'dd' process failed with an I/O error, which is expected behavior. Same thing inside the domU.

Hopefully this helps *someone*. Rather than filing a bug report first, I wanted to describe this here so you guys could maybe blame it on multipath or tell me to go jump in Lake Minnetonka. If I can provide any more background on this, please let me know, as I should have this lab set up for several more days.

Sincerely,
Ray Barnes

p.s. I'm *extremely* pleased with the quality and quantity of good work going on with Xen in the public domain nowadays; keep up the good work!

BUG: unable to handle kernel NULL pointer dereference at virtual address 00000000
 printing eip:
c0302709
27935000 -> *pde = 00000001:17898001
27298000 -> *pme = 00000000:00000000
Oops: 0002 [#1]
SMP
Modules linked in: xt_physdev iptable_filter ip_tables bridge autofs4 hidp rfcomm l2cap bluetooth sunrpc ip6t_REJECT xt_tcpudp ip6table_filter ip6_tables x_tables ipv6 ib_iser rdma_cm ib_addr ib_cm ib_sa ib_mad ib_core iscsi_tcp libiscsi scsi_transport_iscsi binfmt_misc dm_mirror dm_round_robin dm_multipath dm_mod video thermal sbs processor i2c_ec fan container button battery asus_acpi ac lp nvram sg evdev e1000 parport_pc parport i2c_i801 i2c_core pcspkr piix serio_raw sisfb shpchp pci_hotplug 8250_pnp 8250 serial_core rtc ide_disk ata_piix libata sd_mod scsi_mod ext3 jbd ehci_hcd ohci_hcd uhci_hcd usbcore
CPU:    1
EIP:    0061:[<c0302709>]    Not tainted VLI
EFLAGS: 00010286   (2.6.18.8-xen #1)
EIP is at iret_exc+0xc6a/0x105e
eax: 00000000   ebx: 00000000   ecx: 00000007   edx: ed470b40
esi: ed470b50   edi: e68c6490   ebp: 000001f0   esp: ed7a3c84
ds: 007b   es: 007b   ss: 0069
Process swapper (pid: 0, ti=ed7a2000 task=ed79f080 task.ti=ed7a2000)
Stack: 00000034 000001f0 ed470000 c0296f70 ed470b20 e68c6460 000001f0 00000000
       00000000 00000000 e68c6460 00000034 e7de98ac 00000000 00000034 00000514
       e7f8d594 c0294149 e68c642c ed7a3dbc e7f73440 00000000 c02df3d4 00000224
Call Trace:
 [<c0296f70>] skb_copy_and_csum_bits+0x140/0x320
 [<c0294149>] sock_alloc_send_skb+0x169/0x1c0
 [<c02df3d4>] icmp_glue_bits+0x34/0xa0
 [<c02be7b3>] ip_append_data+0x623/0xa60
 [<c02df3a0>] icmp_glue_bits+0x0/0xa0
 [<c02df286>] icmp_push_reply+0x56/0x170
 [<c02b7ea1>] ip_route_output_flow+0x21/0x90
 [<c02dfc7d>] icmp_send+0x2cd/0x3f0
 [<c013d260>] hrtimer_wakeup+0x0/0x20
 [<c02b5eec>] ipv4_link_failure+0x1c/0x50
 [<c02dd49c>] arp_error_report+0x1c/0x30
 [<c02a4158>] neigh_timer_handler+0xf8/0x2c0
 [<c012fb0b>] run_timer_softirq+0x13b/0x1f0
 [<c02a4060>] neigh_timer_handler+0x0/0x2c0
 [<c012a562>] __do_softirq+0x92/0x130
 [<c012a679>] do_softirq+0x79/0x80
 [<c0107714>] do_IRQ+0x44/0xa0
 [<c0248540>] evtchn_do_upcall+0xe0/0x1f0
 [<c0105bbd>] hypervisor_callback+0x3d/0x45
 [<c0108c7a>] raw_safe_halt+0x9a/0x120
 [<c0104709>] xen_idle+0x29/0x50
 [<c01036dd>] cpu_idle+0x6d/0xc0
Code: ff e9 f7 6f ea ff 8b 1d 80 32 41 c0 e9 ea c5 ea ff 8b 1d 80 32 41 c0 e9 ff c5 ea ff 8b 15 80 32 41 c0 e9 14 c6 ea ff 8b 5c 24 20 <c7> 03 f2 ff ff ff 8b 7c 24 14 8b 4c 24 18 31 c0 f3 aa e9 60 a7
EIP: [<c0302709>] iret_exc+0xc6a/0x105e SS:ESP 0069:ed7a3c84
 <0>Kernel panic - not syncing: Fatal exception in interrupt
(XEN) Domain 0 crashed: 'noreboot' set - not rebooting.

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel