Jeffery Kuhn
2008-Jun-18 18:14 UTC
rebooting a DomU causes my OpenSolaris Dom0 to reboot too!!
This is a bit weird. I have OpenSolaris 2008.05 running:

uname -a
SunOS atlantis 5.11 snv_86 i86pc i386 i86xpv Solaris

I have two DomUs installed: an HVM Win2k3 machine and a paravirtualized RHEL5. When I shut down either, the entire Dom0 shuts down. This first started happening after I installed the RHEL5 DomU, but it may have been present the whole time, since I never really shut down the Win2k3 box. So if I remote-desktop to the Windows machine and do a restart after, say, some updates, the entire Dom0 restarts.

Is there a known configuration that causes this, or is something just majorly screwed up in my setup? I am new to Xen and xVM, so it's possible I overlooked something. I used virt-install to install RHEL5 and the same for the Windows machine. I installed the optional virt-manager to use a GUI to manage things. I don't think I did anything custom anywhere; I tried to follow online how-tos and Sun documentation.

This message posted from opensolaris.org
Jürgen Keil
2008-Jun-19 08:55 UTC
Re: rebooting a DomU causes my OpenSolaris Dom0 to reboot too!!
> This is a bit weird. I have opensolaris 2008.05 running.
> uname -a
> SunOS atlantis 5.11 snv_86 i86pc i386 i86xpv Solaris
>
> I have two DomU's installed a HVM Win2k3 machine, and
> a RHEL5 paravirtualization. When I shutdown either,
> the entire Dom0 shutdown.

In dom0, did it save a fresh crash dump in the directory /var/crash/{system's_hostname}? If yes, try to extract the stack trace from the crash dump, like this:

$ su
# cd /var/crash/`hostname`
# ls -ltr
total 13628322
-rw-r--r-- 1 root root    2306337 Jun 12 13:24 unix.2
-rw-r--r-- 1 root root 1663844352 Jun 12 13:25 vmcore.2
-rw-r--r-- 1 root root    2798783 Jun 12 19:01 unix.3
-rw-r--r-- 1 root root 1020485632 Jun 12 19:02 vmcore.3
-rw-r--r-- 1 root root    2797098 Jun 16 19:17 unix.4
-rw-r--r-- 1 root root 2958589952 Jun 16 19:18 vmcore.4
-rw-r--r-- 1 root root    2793415 Jun 16 19:23 unix.5
-rw-r--r-- 1 root root 1320591360 Jun 16 19:23 vmcore.5
-rw-r--r-- 1 root root          2 Jun 16 19:23 bounds

(Crash dump #5, with suffix .5, is the last one saved, so I'm using 5 as the argument in the next mdb commands.)

# echo '::status' | mdb -k 5
# echo '$C' | mdb -k 5
Jeffery Kuhn
2008-Jun-19 14:42 UTC
Re: rebooting a DomU causes my OpenSolaris Dom0 to reboot too!!
Ok, there are fresh crash files in that location. Here is the output of the commands you suggested.

ssctech@atlantis:/var/crash/atlantis# ls -ltr
total 1684580
-rw-r--r-- 1 root root   2507840 2008-06-17 13:53 unix.0
-rw-r--r-- 1 root root 374861824 2008-06-17 13:53 vmcore.0
-rw-r--r-- 1 root root   2474509 2008-06-17 14:47 unix.1
-rw-r--r-- 1 root root 337444864 2008-06-17 14:47 vmcore.1
-rw-r--r-- 1 root root   2474509 2008-06-17 17:27 unix.2
-rw-r--r-- 1 root root 336556032 2008-06-17 17:27 vmcore.2
-rw-r--r-- 1 root root   2474509 2008-06-18 14:35 unix.3
-rw-r--r-- 1 root root 339410944 2008-06-18 14:35 vmcore.3
-rw-r--r-- 1 root root   2474509 2008-06-18 15:58 unix.4
-rw-r--r-- 1 root root 322670592 2008-06-18 15:58 vmcore.4
-rw-r--r-- 1 root root         2 2008-06-18 15:58 bounds

ssctech@atlantis:/var/crash/atlantis# echo '::status' | mdb -k 4
debugging crash dump vmcore.4 (64-bit) from atlantis
operating system: 5.11 snv_86 (i86pc)
panic message: mutex_enter: bad mutex, lp=ffffff0135578130 owner=ffffff00058d5c80 thread=ffffff0003aedc80
dump content: kernel pages only

ssctech@atlantis:/var/crash/atlantis# echo '$C' | mdb -k 4
ffffff0003aed710 vpanic()
ffffff0003aed740 mutex_panic+0x73(fffffffffbc0dc28, ffffff0135578130)
ffffff0003aed7b0 mutex_vector_enter+0x452(ffffff0135578130)
ffffff0003aed880 xnb_copy_to_peer+0x53(ffffff0135578000, ffffff013d342140)
ffffff0003aed8b0 xnbo_from_mac+0x1c(ffffff0135578000, 0, ffffff013d342140)
ffffff0003aed930 mac_do_rx+0xb9(ffffff0135e5b638, 0, ffffff013d342140, 0)
ffffff0003aed960 mac_rx+0x1b(ffffff0135e5b638, 0, ffffff013d342140)
ffffff0003aed9b0 vnic_rx+0x59(ffffff0133b18a88, ffffff0133b18a88, ffffff013d342140)
ffffff0003aeda60 vnic_promisc_rx+0x12e(ffffff0133b17f58, 0, ffffff01355118c0)
ffffff0003aedac0 vnic_classifier_rx+0x43(ffffff0133b17f58, 0, ffffff01355118c0)
ffffff0003aedb40 mac_do_rx+0xb9(ffffff0135e5cce8, 0, ffffff01355118c0, 0)
ffffff0003aedb70 mac_rx+0x1b(ffffff0135e5cce8, 0, ffffff01355118c0)
ffffff0003aedbc0 e1000g_intr_pciexpress+0x102(ffffff0135805000)
ffffff0003aedc20 av_dispatch_autovect+0x78(12)
ffffff0003aedc60 dispatch_hardint+0x33(12, 0)
ffffff0003ab79e0 switch_sp_and_call+0x13()
ffffff0003ab7a30 do_interrupt+0x9b(ffffff0003ab7af0, 1)
ffffff0003ab7ae0 xen_callback_handler+0x370(ffffff0003ab7af0, 1)
ffffff0003ab7af0 xen_callback+0xcd()
ffffff0003ab7bf0 HYPERVISOR_sched_op+0x29(1, 0)
ffffff0003ab7c00 HYPERVISOR_block+0x11()
ffffff0003ab7c10 mach_cpu_idle+0x12()
ffffff0003ab7c40 cpu_idle+0xcc()
ffffff0003ab7c60 idle+0x10e()
ffffff0003ab7c70 thread_start+8()
ssctech@atlantis:/var/crash/atlantis#

Any ideas?
Jürgen Keil
2008-Jun-19 15:08 UTC
Re: rebooting a DomU causes my OpenSolaris Dom0 to reboot too!!
> Ok, there are fresh crash files in that location.
> Here is the output of the commands you suggested.
>
> ssctech@atlantis:/var/crash/atlantis# ls -ltr
> total 1684580
> -rw-r--r-- 1 root root   2507840 2008-06-17 13:53 unix.0
> -rw-r--r-- 1 root root 374861824 2008-06-17 13:53 vmcore.0
> -rw-r--r-- 1 root root   2474509 2008-06-17 14:47 unix.1
> -rw-r--r-- 1 root root 337444864 2008-06-17 14:47 vmcore.1
> -rw-r--r-- 1 root root   2474509 2008-06-17 17:27 unix.2
> -rw-r--r-- 1 root root 336556032 2008-06-17 17:27 vmcore.2
> -rw-r--r-- 1 root root   2474509 2008-06-18 14:35 unix.3
> -rw-r--r-- 1 root root 339410944 2008-06-18 14:35 vmcore.3
> -rw-r--r-- 1 root root   2474509 2008-06-18 15:58 unix.4
> -rw-r--r-- 1 root root 322670592 2008-06-18 15:58 vmcore.4
> -rw-r--r-- 1 root root         2 2008-06-18 15:58 bounds

hmm, that opensolaris dom0 is getting frequent panics. What does mdb ::status and '$C' report for the other crash dumps? Do they all crash with a "bad mutex" panic? Does the stack backtrace look similar to the one for vmcore #4?

> ssctech@atlantis:/var/crash/atlantis# echo '$C' | mdb -k 4
> ffffff0003aed710 vpanic()
> ffffff0003aed740 mutex_panic+0x73(fffffffffbc0dc28, ffffff0135578130)
> ffffff0003aed7b0 mutex_vector_enter+0x452(ffffff0135578130)
> ffffff0003aed880 xnb_copy_to_peer+0x53(ffffff0135578000, ffffff013d342140)
> ffffff0003aed8b0 xnbo_from_mac+0x1c(ffffff0135578000, 0, ffffff013d342140)
> ffffff0003aed930 mac_do_rx+0xb9(ffffff0135e5b638, 0, ffffff013d342140, 0)
> ffffff0003aed960 mac_rx+0x1b(ffffff0135e5b638, 0, ffffff013d342140)
> ffffff0003aed9b0 vnic_rx+0x59(ffffff0133b18a88, ffffff0133b18a88, ffffff013d342140)
> ffffff0003aeda60 vnic_promisc_rx+0x12e(ffffff0133b17f58, 0, ffffff01355118c0)
> ffffff0003aedac0 vnic_classifier_rx+0x43(ffffff0133b17f58, 0, ffffff01355118c0)
> ffffff0003aedb40 mac_do_rx+0xb9(ffffff0135e5cce8, 0, ffffff01355118c0, 0)
> ffffff0003aedb70 mac_rx+0x1b(ffffff0135e5cce8, 0, ffffff01355118c0)
> ffffff0003aedbc0 e1000g_intr_pciexpress+0x102(ffffff0135805000)

Hmm, seems the dom0 nic backend driver did just receive a packet, and tries to forward that to some domU; you said that the dom0 crash happens during domU shutdown, so maybe the nic backend <-> frontend connection is already down...

What is printed by:

echo 'ffffff0135578000::print xnb_t' | mdb -k 4
Jürgen Keil
2008-Jun-19 15:23 UTC
Re: rebooting a DomU causes my OpenSolaris Dom0 to reboot too!!
> Ok, there are fresh crash files in that location.
> ...
> Any ideas?

Looks a bit like bug 6600374, "dom0 panic when running xmstress test: BAD TRAP: type=e (#pf Page fault)":

http://bugs.opensolaris.org/view_bug.do?bug_id=6600374

What happens when you add the following line at the end of dom0's /etc/system file and reboot into xVM (it enables kernel heap checking)?

set kmem_flags=0xf

After that, try to reproduce the problem by starting and shutting down some domUs...
Jeffery Kuhn
2008-Jun-19 18:20 UTC
Re: rebooting a DomU causes my OpenSolaris Dom0 to reboot too!!
> Do they all crash with a "bad mutex" panic?
> Does the stack backtrace look similar to the one for vmcore #4?

They do all panic with a bad mutex, and the $C stack trace looks the same for all of them too. Those times coincide with shutdown incidents of the DomUs.

> What is printed by:
> echo 'ffffff0135578000::print xnb_t' | mdb -k 4

Ok, prepare yourself, this one is long!!

ssctech@atlantis:/var/crash/atlantis# echo 'ffffff0135578000::print xnb_t' | mdb -k 4
{
    xnb_devinfo = 0xffffff013f1e1b98
    xnb_flavour = 0xffffffffc0048da0
    xnb_flavour_data = 0xffffff0133b5fcf0
    xnb_irq = 0 (B_FALSE)
    xnb_mac_addr = [ 0, 0x16, 0x3e, 0x19, 0x1b, 0xf ]
    xnb_stat_ipackets = 0xb3c
    xnb_stat_opackets = 0x856f
    xnb_stat_rbytes = 0xf318d
    xnb_stat_obytes = 0x4f77c9
    xnb_stat_intr = 0x5b3
    xnb_stat_xmit_defer = 0
    xnb_stat_tx_cksum_deferred = 0x7fe
    xnb_stat_rx_cksum_no_need = 0x416
    xnb_stat_tx_rsp_notok = 0
    xnb_stat_tx_notify_sent = 0x2844
    xnb_stat_tx_notify_deferred = 0x5d2b
    xnb_stat_rx_notify_sent = 0x68b
    xnb_stat_rx_notify_deferred = 0x498
    xnb_stat_tx_too_early = 0xa8
    xnb_stat_rx_too_early = 0
    xnb_stat_rx_allocb_failed = 0
    xnb_stat_tx_allocb_failed = 0
    xnb_stat_tx_foreign_page = 0
    xnb_stat_mac_full = 0
    xnb_stat_spurious_intr = 0x1
    xnb_stat_allocation_success = 0
    xnb_stat_allocation_failure = 0
    xnb_stat_small_allocation_success = 0
    xnb_stat_small_allocation_failure = 0
    xnb_stat_other_allocation_failure = 0
    xnb_stat_tx_pagebndry_crossed = 0x2a6
    xnb_stat_tx_cpoparea_grown = 0x1
    xnb_stat_csum_hardware = 0
    xnb_stat_csum_software = 0x416
    xnb_kstat_aux = 0xffffff013c552830
    xnb_cksum_offload = 1 (B_TRUE)
    xnb_icookie = 6
    xnb_rx_lock = { _opaque = [ 0xffffff00058d5c86 ] }
    xnb_tx_lock = { _opaque = [ 0xffffff00058d5c86 ] }
    xnb_rx_unmop_count = 0
    xnb_rx_buf_count = 0
    xnb_rx_pages_writable = 0 (B_FALSE)
    xnb_rx_ring = { rsp_prod_pvt = 0x856f  req_cons = 0x856f  nr_ents = 0x100  sring = 0xffffff013fd29000 }
    xnb_rx_ring_addr = 0
    xnb_rx_ring_ref = 0x79
    xnb_rx_ring_handle = 0xffffffff
    xnb_tx_ring = { rsp_prod_pvt = 0xb3c  req_cons = 0xb3c  nr_ents = 0x100  sring = 0xffffff0140230000 }
    xnb_tx_ring_addr = 0
    xnb_tx_ring_ref = 0x300
    xnb_tx_ring_handle = 0xffffffff
    xnb_connected = 0 (B_FALSE)
    xnb_hotplugged = 1 (B_TRUE)
    xnb_detachable = 1 (B_TRUE)
    xnb_evtchn = 0
    xnb_peer = 0x2
    xnb_rx_bufp = [ 0xffffff013fb30950, 0xffffff013fb30878, 0xffffff013fb30950, 0xffffff013fb30950, 0xffffff013fb30998,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... ]
    xnb_rx_mop = [
        { host_addr = 0xffffff013c34a000  flags = 0x6  ref = 0xab  dom = 0x2  status = 0  handle = 0xc0   dev_bus_addr = 0x12420000 }
        { host_addr = 0xffffff013ff39000  flags = 0x6  ref = 0xac  dom = 0x2  status = 0  handle = 0xe8   dev_bus_addr = 0x11b2a000 }
        { host_addr = 0xffffff013c34a000  flags = 0x6  ref = 0xab  dom = 0x2  status = 0  handle = 0xc0   dev_bus_addr = 0x11b2a000 }
        { host_addr = 0xffffff013c34a000  flags = 0x6  ref = 0xa9  dom = 0x2  status = 0  handle = 0x13b  dev_bus_addr = 0x23257000 }
        { host_addr = 0xffffff01402c4000  flags = 0x6  ref = 0x3e  dom = 0x2  status = 0  handle = 0x69   dev_bus_addr = 0x2337f000 }
        { host_addr = 0  flags = 0  ref = 0  dom = 0  status = 0  handle = 0  dev_bus_addr = 0 }
        (remaining entries all zero) ...
    ]
    xnb_rx_unmop = [
        { host_addr = 0xffffff013c34a000  dev_bus_addr = 0x12420000  handle = 0xc0   status = 0 }
        { host_addr = 0xffffff0140229000  dev_bus_addr = 0x7eaa3000  handle = 0x70   status = 0 }
        { host_addr = 0xffffff013e7d4000  dev_bus_addr = 0x11f73000  handle = 0xbe   status = 0 }
        { host_addr = 0xffffff0135632000  dev_bus_addr = 0x7eaa3000  handle = 0xc0   status = 0 }
        { host_addr = 0xffffff013571b000  dev_bus_addr = 0x7eaa3000  handle = 0x53   status = 0 }
        { host_addr = 0xffffff014022e000  dev_bus_addr = 0x7eaa3000  handle = 0x58   status = 0 }
        { host_addr = 0xffffff013572e000  dev_bus_addr = 0x7eaa3000  handle = 0xe8   status = 0 }
        { host_addr = 0xffffff013f919000  dev_bus_addr = 0x7eaa3000  handle = 0x165  status = 0 }
        { host_addr = 0xffffff0133fc7000  dev_bus_addr = 0x7eaa3000  handle = 0x62   status = 0 }
        { host_addr = 0xffffff013c554000  dev_bus_addr = 0x1219d000  handle = 0x16c  status = 0 }
        { host_addr = 0xffffff013c553000  dev_bus_addr = 0x17739000  handle = 0x13b  status = 0 }
        { host_addr = 0xffffff0135635000  dev_bus_addr = 0x121e3000  handle = 0xe3   status = 0 }
        { host_addr = 0xffffff0135636000  dev_bus_addr = 0x12244000  handle = 0x75   status = 0 }
        { host_addr = 0xffffff0135637000  dev_bus_addr = 0x12643000  handle = 0x16a  status = 0 }
        { host_addr = 0xffffff013a77c000  dev_bus_addr = 0x12244000  handle = 0xdc   status = 0 }
        { host_addr = 0xffffff013c344000  dev_bus_addr = 0x121c5000  handle = 0x9a   status = 0 }
        { host_addr = 0xffffff013c345000  dev_bus_addr = 0x11b2a000  handle = 0x56   status = 0 }
        { host_addr = 0xffffff014022c000  dev_bus_addr = 0x123d3000  handle = 0x2    status = 0 }
        { host_addr = 0xffffff013c346000  dev_bus_addr = 0x1204f000  handle = 0x20   status = 0 }
        { host_addr = 0xffffff01402c5000  dev_bus_addr = 0x123ea000  handle = 0xd5   status = 0 }
        { host_addr = 0xffffff013c349000  dev_bus_addr = 0x11b2a000  handle = 0x3a   status = 0 }
        { host_addr = 0xffffff01402c4000  dev_bus_addr = 0x12b47000  handle = 0xfa   status = 0 }
        { host_addr = 0xffffff013c34a000  dev_bus_addr = 0x123d3000  handle = 0x3f   status = 0 }
        { host_addr = 0xffffff013ff35000  dev_bus_addr = 0x129a3000  handle = 0x6a   status = 0 }
        { host_addr = 0xffffff013ff36000  dev_bus_addr = 0x17739000  handle = 0x13a  status = 0 }
        { host_addr = 0xffffff013ff39000  dev_bus_addr = 0x12115000  handle = 0x126  status = 0 }
        { host_addr = 0  dev_bus_addr = 0  handle = 0  status = 0 }
        (remaining entries all zero) ...
    ]
    xnb_rx_unmop_rxp = [ 0xffffff013fb30950, 0xffffff013fb30f38, 0xffffff013fb30ef0, 0xffffff013fb30ea8,
        0xffffff013fb30e60, 0xffffff013fb30e18, 0xffffff013fb30dd0, 0xffffff013fb30d88, 0xffffff013fb30d40,
        0xffffff013fb30cf8, 0xffffff013fb30cb0, 0xffffff013fb30c68, 0xffffff013fb30c20, 0xffffff013fb30bd8,
        0xffffff013fb30b90, 0xffffff013fb30b48, 0xffffff013fb30b00, 0xffffff013fb30ab8, 0xffffff013fb30a70,
        0xffffff013fb30a28, 0xffffff013fb309e0, 0xffffff013fb30998, 0xffffff013fb30950, 0xffffff013fb30908,
        0xffffff013fb308c0, 0xffffff013fb30878, 0, 0, 0, 0, 0, 0, ... ]
    xnb_tx_va = 0xffffff013571f000
    xnb_tx_top = [
        { mfn = 0  domid = 0  ref = 0  status = 0 }
        (all entries zero) ...
    ]
    xnb_hv_copy = 1 (B_TRUE)
    xnb_tx_cpop = 0xffffff013ea40dc0
    xnb_cpop_sz = 0x8
}
ssctech@atlantis:/var/crash/atlantis#
Jeffery Kuhn
2008-Jun-20 15:58 UTC
Re: rebooting a DomU causes my OpenSolaris Dom0 to reboot too!!
Just tried the kernel debugging flag. The machine produced the same reaction: on a DomU shutdown, the Dom0 was restarted. There is now a new unix.6 and vmcore.6 in the /var/crash/atlantis directory. I re-ran the commands from above, and here is the output. Looks the same to me.

ssctech@atlantis:/var/crash/atlantis# ls -rlt
total 2667709
-rw-r--r-- 1 root root   2507840 2008-06-17 13:53 unix.0
-rw-r--r-- 1 root root 374861824 2008-06-17 13:53 vmcore.0
-rw-r--r-- 1 root root   2474509 2008-06-17 14:47 unix.1
-rw-r--r-- 1 root root 337444864 2008-06-17 14:47 vmcore.1
-rw-r--r-- 1 root root   2474509 2008-06-17 17:27 unix.2
-rw-r--r-- 1 root root 336556032 2008-06-17 17:27 vmcore.2
-rw-r--r-- 1 root root   2474509 2008-06-18 14:35 unix.3
-rw-r--r-- 1 root root 339410944 2008-06-18 14:35 vmcore.3
-rw-r--r-- 1 root root   2474509 2008-06-18 15:58 unix.4
-rw-r--r-- 1 root root 322670592 2008-06-18 15:58 vmcore.4
-rw-r--r-- 1 root root   2474509 2008-06-19 14:36 unix.5
-rw-r--r-- 1 root root 337600512 2008-06-19 14:36 vmcore.5
-rw-r--r-- 1 root root   1546889 2008-06-20 11:42 unix.6
-rw-r--r-- 1 root root 664203264 2008-06-20 11:42 vmcore.6
-rw-r--r-- 1 root root         2 2008-06-20 11:42 bounds

ssctech@atlantis:/var/crash/atlantis# echo '::status' | mdb -k 6
debugging crash dump vmcore.6 (64-bit) from atlantis
operating system: 5.11 snv_86 (i86pc)
panic message: BAD TRAP: type=e (#pf Page fault) rp=ffffff0003aed6d0 addr=ffffff016a3551b8
dump content: kernel pages only

ssctech@atlantis:/var/crash/atlantis# echo '$C' | mdb -k 6
ffffff0003aed880 xnb_copy_to_peer+0x32(ffffff016a34f000, ffffff013a9368a0)
ffffff0003aed8b0 xnbo_from_mac+0x1c(ffffff016a34f000, 0, ffffff013a9368a0)
ffffff0003aed930 mac_do_rx+0xb9(ffffff013d35d8a8, 0, ffffff013a9368a0, 0)
ffffff0003aed960 mac_rx+0x1b(ffffff013d35d8a8, 0, ffffff013a9368a0)
ffffff0003aed9b0 vnic_rx+0x59(ffffff016a44fe90, ffffff016a44fe90, ffffff013a9368a0)
ffffff0003aeda60 vnic_promisc_rx+0x12e(ffffff016a44ef70, 0, ffffff013c0d2160)
ffffff0003aedac0 vnic_classifier_rx+0x43(ffffff016a44ef70, 0, ffffff013c0d2160)
ffffff0003aedb40 mac_do_rx+0xb9(ffffff013d35dcf8, 0, ffffff013c0d2160, 0)
ffffff0003aedb70 mac_rx+0x1b(ffffff013d35dcf8, 0, ffffff013c0d2160)
ffffff0003aedbc0 e1000g_intr_pciexpress+0x102(ffffff013d143000)
ffffff0003aedc20 av_dispatch_autovect+0x78(12)
ffffff0003aedc60 dispatch_hardint+0x33(12, 0)
ffffff0003feb5a0 switch_sp_and_call+0x13()
ffffff0003feb5f0 do_interrupt+0x9b(ffffff0003feb6b0, 67e8f)
ffffff0003feb6a0 xen_callback_handler+0x370(ffffff0003feb6b0, 67e8f)
ffffff0003feb6b0 xen_callback+0xcd()
ffffff0003feb7c0 mfn_to_pfn+0xde(1000001528f267)
ffffff0003feb850 hat_pte_unmap+0x7f(ffffff016a411960, 151, 4, 1000001528f267, 0)
ffffff0003feb9c0 hat_unload_callback+0x1f9(ffffff013608bf20, ffffff02b0d51000, 1000, 4, 0)
ffffff0003feba10 hat_unload+0x63(ffffff013608bf20, ffffff02b0d51000, 1000, 4)
ffffff0003febaa0 segkmem_free_vn+0x73(ffffff0131807000, ffffff02b0d51000, 1000, fffffffffbc3dcc0, 0)
ffffff0003febad0 segkmem_free+0x23(ffffff0131807000, ffffff02b0d51000, 1000)
ffffff0003febb30 vmem_xfree+0x10c(ffffff0131808000, ffffff02b0d51000, 1000)
ffffff0003febb60 vmem_free+0x25(ffffff0131808000, ffffff02b0d51000, 1000)
ffffff0003febba0 kmem_slab_destroy+0x88(ffffff0131831908, ffffff0166299c90)
ffffff0003febbf0 kmem_slab_free+0x22f(ffffff0131831908, ffffff02b0d51000)
ffffff0003febc50 kmem_cache_free+0x1dc(ffffff0131831908, ffffff02b0d51000)
ffffff0003febc70 kmem_free+0x142(ffffff02b0d51000, 1000)
ffffff0003febce0 evtchndrv_write+0x109(c500000001, ffffff0003febe90, ffffff014e7286d8)
ffffff0003febd10 cdev_write+0x3c(c500000001, ffffff0003febe90, ffffff014e7286d8)
ffffff0003febdd0 spec_write+0x46a(ffffff014fdbd480, ffffff0003febe90, 0, ffffff014e7286d8, 0)
ffffff0003febe40 fop_write+0x69(ffffff014fdbd480, ffffff0003febe90, 0, ffffff014e7286d8, 0)
ffffff0003febf00 write+0x2af(b, 7fffffdfbc58, 4)
ffffff0003febf10 sys_syscall+0x1c9()
Jürgen Keil
2008-Jun-20 16:20 UTC
Re: rebooting a DomU causes my OpenSolaris Dom0 to reboot too!!
> Just tried the Kernel Debugging flag. The machine
> produced the same reaction. On a DomU shutdown, the
> Dom0 was restarted.

How do you shut down the domUs? From within the domU, using the linux / unix shutdown command? Or from dom0, using "xm shutdown ..."?

> there is now a new unix.6 and vmcore.6 in the
> /var/crash/atlantis directory.

> panic message: BAD TRAP: type=e (#pf Page fault) rp=ffffff0003aed6d0 addr=ffffff016a3551b8

> ffffff0003aed880 xnb_copy_to_peer+0x32(ffffff016a34f000, ffffff013a9368a0)

Hmm, looks like the "xnb_t" structure that is passed to xnb_copy_to_peer has already been freed; that would probably explain the "bad mutex" panic when running without kernel heap debugging. And with heap debugging, we're crashing a few bytes earlier...

What does mdb know about the xnb_t pointer to address ffffff016a34f000 (the 1st argument to function xnb_copy_to_peer) in crash dump #6? Try this:

echo 'ffffff016a34f000::whatis' | mdb -k 6
Jeffery Kuhn
2008-Jun-20 16:23 UTC
Re: rebooting a DomU causes my OpenSolaris Dom0 to reboot too!!
Ok, I did a little digging in mdb. I'm just going to say up front that I've NEVER debugged a kernel, but I am the unix developer for my software company, so I have a little experience.

From the stack trace, the last call was:

ffffff0003aed880 xnb_copy_to_peer+0x32(ffffff016a34f000, ffffff013a9368a0)

The first parameter doesn't evaluate to anything, based on:

> ffffff016a34f000::dump
                  \/ 1 2 3 4  5 6 7 8  9 a b c  d e f  v123456789abcdef
mdb: failed to read data at 0xffffff016a34f000: no mapping for address

Now, the second parameter does evaluate, but to NULL, all 0's:

> ffffff013a9368a0::dump
                  \/ 1 2 3 4  5 6 7 8  9 a b c  d e f  v123456789abcdef
ffffff013a9368a0:  00000000 00000000 00000000 00000000  ................

Is the function xnb_copy_to_peer supposed to assign the second parameter to the first? If so, that may explain the problem. The second parameter was NULL, and if that wasn't checked, it could be a NULL pointer exception when it was attempted to be used.

Like I said, I've only had limited exposure to mdb, so if there are other commands you guys would like me to run to try and figure this out, let me know. Thanks
Jürgen Keil
2008-Jun-20 16:50 UTC
Re: rebooting a DomU causes my OpenSolaris Dom0 to reboot too!!
> Ok, I did a little digging in mdb. I'm just going to
> say up front that I've NEVER debugged a Kernel, but i
> am the unix developer for my software company, so I
> have a little experience.
>
> From the stack trace, the last call was:
> ffffff0003aed880 xnb_copy_to_peer+0x32(ffffff016a34f000, ffffff013a9368a0)
>
> The first parameter doesn't evaluate to anything based on:
>
> ffffff016a34f000::dump
> \/ 1 2 3 4 5 6 7 8 9 a b c d e f  v123456789abcdef
> mdb: failed to read data at 0xffffff016a34f000: no mapping for address

That first parameter is a big data structure that gets dynamically allocated / freed. Since there is "no mapping", the structure must have been freed, but some part of the kernel is still doing function calls to xnb_copy_to_peer, passing a pointer to the freed memory block as first argument.

> Is the function xnb_copy_to_peer suppose to assign
> the second parameter to the first? Is so, that may
> explain the problem. The second parameter was NULL,
> and if that wasn't check, it could be a NULL pointer
> exception when it was attempted to be used.

Source code for xnb_copy_to_peer can be found here:

http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/common/xen/io/xnb.c#926

926 mblk_t *
927 xnb_copy_to_peer(xnb_t *xnbp, mblk_t *mp)
928 {
929 	mblk_t *free = mp, *mp_prev = NULL, *saved_mp = mp;
930 	mblk_t *ml, *ml_prev;
931 	gnttab_copy_t *gop_cp;
932 	boolean_t notify;
933 	RING_IDX loop, prod;
934 	int i;
935
936 	if (!xnbp->xnb_hv_copy)
937 		return (xnb_to_peer(xnbp, mp));
938
939 	/*
940 	 * For each packet the sequence of operations is:
941 	 *
942 	 * 1. get a request slot from the ring.
943 	 * 2. set up data for hypercall (see NOTE below)
944 	 * 3. have the hypervisore copy the data
945 	 * 4. update the request slot.
946 	 * 5. kick the peer.
947 	 *
948 	 * NOTE ad 2.
949 	 * In order to reduce the number of hypercalls, we prepare
950 	 * several packets (mp->b_cont != NULL) for the peer and
951 	 * perform a single hypercall to transfer them.
952 	 * We also have to set up a seperate copy operation for
953 	 * every page.
954 	 *
955 	 * If we have more than one message (mp->b_next != NULL),
956 	 * we do this whole dance repeatedly.
957 	 */
958
959 	mutex_enter(&xnbp->xnb_tx_lock);

In vmcore.6 it is crashing at line 936, when trying to dereference an invalid (freed?) pointer. The enabled heap checking has probably removed the mmu mappings for the freed block, so that we get a page fault when trying to access the freed data.

In vmcore.5 it was crashing inside the mutex_enter call at line 959. This was without heap checking; the xnbp pointer points to mapped memory, but that memory has probably already been re-used by someone else and now contains unexpected data (==> panic: bad mutex owner).

vmcore.6 looks very similar to the issue reported as bug 6600374 / 6657428 ...
Jeffery Kuhn
2008-Jun-20 17:30 UTC
Re: rebooting a DomU causes my OpenSolaris Dom0 to reboot too!!
So it seems like xnbp should be checked for NULL before proceeding, and if it is, exit with a condition that cancels the operation, if that is possible.

I have rebooted my machine into the normal kernel, and have been using VirtualBox, which I like a bit more than the SUNWxvm packages. I'll still keep the two xVM/Xen machines set up, and keep the kernel there, so I can do any testing that you guys need to fix this problem in future releases. Thanks for the prompt response. Hopefully this information will fix the problem in a future release.

I'm curious as to why this happened in the first place. You know, I did have the network card enter a faulty state from a power outage, and used fmadm repair to fix it, and it is working again now, but I wonder if that caused any problems with the Xen configurations, since this happens with packets, and trying to send them...
David Edmondson
2008-Jun-24 09:44 UTC
Re: rebooting a DomU causes my OpenSolaris Dom0 to reboot too!!
On 20 Jun 2008, at 5:50pm, Jürgen Keil wrote:

> That first parameter is a big data structure that gets
> dynamically allocated / freed. Since there is "no mapping",
> the structure must have been freed, but some part of
> the kernel is still doing function calls to xnb_copy_to_peer
> passing a pointer to the freed memory block as first
> argument.

Agreed. That's "not supposed to happen."

Looking at the xnb_t that was provided, it does appear that the backend went through the normal close path (ring pointers are NULL, etc.). Given that, nothing should be calling up into xnbo :-(

I'll try to reproduce the failure here.
I've been working with a Sun xVM developer for the past few days, and he's been able to narrow it down and fix the problem. He'll put the correction in the code base, so there's no need to try to reproduce it.