Juergen Keil
2008-Oct-15 09:13 UTC
Re: onnv_98 domain 0 panic on xvm built from latest xvm
> This panic happened between line 729 and 730. The disassemble of this code is: > > [0]> ec_bind_virq_to_irq::dis > ec_bind_virq_to_irq+0x95: call -0x97a <alloc_irq> > ec_bind_virq_to_irq+0x9a: > movw %ax,0xfffffffffbc46ac0(%r12) <virq_info+0x200> > ec_bind_virq_to_irq+0xa3: movq %r13,%rdi > ec_bind_virq_to_irq+0xa6: call +0x16d35 <mutex_exit> > ec_bind_virq_to_irq+0xab: addb %al,(%rax) > ec_bind_virq_to_irq+0xad: addb %al,(%rax) > ec_bind_virq_to_irq+0xaf: addb %al,(%rax) > ec_bind_virq_to_irq+0xb1: addb %al,(%rax) > ec_bind_virq_to_irq+0xb3: sti > ec_bind_virq_to_irq+0xb4: popq %r14 > ec_bind_virq_to_irq+0xb6: popq %r13 > ec_bind_virq_to_irq+0xb8: popq %r12 > ec_bind_virq_to_irq+0xba: popq %rbx > ec_bind_virq_to_irq+0xbb: leave > ec_bind_virq_to_irq+0xbc: retThe code starting at ec_bind_virq_to_irq+0xab looks corrupted. Seems as if someone stored a 64-bit "0" to that location, corrupting the original code. When you boot the dom0 kernel with options "-kd" and disassemble ec_bind_virq_to_irq+0xab (before the kernel starts running), does the code look different? In case it''s already corrupted before the kernel starts running: try to check the dom0 kernel binary /platform/i86xpv/kernel/amd64/unix with "dis -F ec_bind_virq_to_irq /platform/i86xpv/kernel/amd64/unix", check if ec_bind_virq_to_irq+0xab contains reasonble code there. If case the corruption happens at kernel runtime, it might be possble to find the source of the corruption using kmdb and a write watchpoint: ec_bind_virq_to_irq+0xab::wp -w -L 8
Thanks for the quick response. On Wed, Oct 15, 2008 at 5:13 PM, Juergen Keil <jk@tools.de> wrote:>> This panic happened between line 729 and 730. The disassemble of this code is: >> >> [0]> ec_bind_virq_to_irq::dis >> ec_bind_virq_to_irq+0x95: call -0x97a <alloc_irq> >> ec_bind_virq_to_irq+0x9a: >> movw %ax,0xfffffffffbc46ac0(%r12) <virq_info+0x200> >> ec_bind_virq_to_irq+0xa3: movq %r13,%rdi >> ec_bind_virq_to_irq+0xa6: call +0x16d35 <mutex_exit> >> ec_bind_virq_to_irq+0xab: addb %al,(%rax) >> ec_bind_virq_to_irq+0xad: addb %al,(%rax) >> ec_bind_virq_to_irq+0xaf: addb %al,(%rax) >> ec_bind_virq_to_irq+0xb1: addb %al,(%rax) >> ec_bind_virq_to_irq+0xb3: sti >> ec_bind_virq_to_irq+0xb4: popq %r14 >> ec_bind_virq_to_irq+0xb6: popq %r13 >> ec_bind_virq_to_irq+0xb8: popq %r12 >> ec_bind_virq_to_irq+0xba: popq %rbx >> ec_bind_virq_to_irq+0xbb: leave >> ec_bind_virq_to_irq+0xbc: ret > > > The code starting at ec_bind_virq_to_irq+0xab looks corrupted. > > Seems as if someone stored a 64-bit "0" to that location, corrupting > the original code. > > > When you boot the dom0 kernel with options "-kd" and disassemble > ec_bind_virq_to_irq+0xab (before the kernel starts running), > does the code look different?Yes. Before the kernel starts running, the disassemble code looks like [0]> ec_bind_virq_to_irq::dis ec_bind_virq_to_irq+0x95: call -0x97a <alloc_irq> ec_bind_virq_to_irq+0x9a: movw %ax,0xfffffffffbc46ac0(%r12) <virq_info+0x200> ec_bind_virq_to_irq+0xa3: movq %r13,%rdi ec_bind_virq_to_irq+0xa6: call +0x16d35 <mutex_exit> ec_bind_virq_to_irq+0xb4: popq %r14 ec_bind_virq_to_irq+0xb6: popq %r13 ec_bind_virq_to_irq+0xb8: popq %r12 ec_bind_virq_to_irq+0xba: popq %rbx ec_bind_virq_to_irq+0xbb: leave ec_bind_virq_to_irq+0xbc: ret Above aren''t exactly copied and pasted from the real serial output since I can''t access the develop mechines when writing this email. But I am quit sure that there are no addb''s there. Actually, this function has been called twice before it runs into panic at the third time. For the first 2 calls, there seems no addb''s there and no panic''s.> > In case it''s already corrupted before the kernel starts running: > try to check the dom0 kernel binary /platform/i86xpv/kernel/amd64/unix > with "dis -F ec_bind_virq_to_irq /platform/i86xpv/kernel/amd64/unix", > check if ec_bind_virq_to_irq+0xab contains reasonble code there. > > > If case the corruption happens at kernel runtime, it might be possble > to find the source of the corruption using kmdb and a write watchpoint: > > ec_bind_virq_to_irq+0xab::wp -w -L 8Will verify this, and post the result.> > >
Hi Keil, The assemble of ec_bind_virq_to_irq() are different between the points where the kernel hasn''t yet started running and where the panic happens. I set a write watch point for the changed place, but nothing happens before panic. I posted my debug processes below for your information. I removed two cmn_err() lines in ec_bind_virq_to_irq() which I added for debugging. So the offset is different from the last post. krtld: Unused kernel arguments: `xen_processor_pmbits=0''. Loading kmdb... Welcome to kmdb kmdb: unable to determine terminal type: assuming `vt100'' Loaded modules: [ unix krtld genunix ] [0]> ec_bind_virq_to_irq+0x90::dis ec_bind_virq_to_irq+0x90: sti ec_bind_virq_to_irq+0x91: movq %r13,%rdi ec_bind_virq_to_irq+0x94: call +0x16d37 <mutex_exit> ec_bind_virq_to_irq+0x99: movswl 0xfffffffffbc46ac0(%r12),%eax <virq_info+0x200> << this line of code corrupt before panic ec_bind_virq_to_irq+0xa2: popq %r14 ec_bind_virq_to_irq+0xa4: popq %r13 ec_bind_virq_to_irq+0xa6: popq %r12 ec_bind_virq_to_irq+0xa8: popq %rbx ec_bind_virq_to_irq+0xa9: leave ec_bind_virq_to_irq+0xaa: ret ec_bind_virq_to_irq+0xab: leaq +0x110296(%rip),%rdi <0xfffffffffb956fd8> [0]> ec_bind_virq_to_irq+0x90::wp -w -L 8 [0]> ec_bind_virq_to_irq+0x98::wp -w -L 8 [0]> ec_bind_virq_to_irq+0xa0::wp -w -L 2 [0]> $b ID S TA HT LM Description Action ----- - -- -- -- ---------------------------------------- ---------------------- [ 1 ] + 0 0 stop on write of [ec_bind_virq_to_irq+0x - [ 2 ] + 0 0 stop on write of [ec_bind_virq_to_irq+0x - [ 3 ] + 0 0 stop on write of [ec_bind_virq_to_irq+0x - [0]> :c WARNING: Cannot extract revision on this hypervisor version: v-unstable-xvm, unexpected version format v3.3-unstable-xvm chgset ''Tue Oct 14 12:54:19 2008 +0800 17779:88ed7e891155'' SunOS Release 5.11 Version onnv-98 64-bit Copyright 1983-2008 Sun Microsystems, Inc. All rights reserved. Use is subject to license terms. DEBUG enabled panic[cpu0]/thread=fffffffffbc736e0: BAD TRAP: type=e (#pf Page fault) rp=fffffffffbca6090 addr=d occurred in module "unix" due to a NULL pointer dereference #pf Page fault Bad kernel fault at addr=0xd pid=0, pc=0xfffffffffb846d29, sp=0xfffffffffbca6180, eflags=0x10246 cr0: 8005003b<pg,wp,ne,et,ts,mp,pe> cr4: 2620<vmxe,xmme,fxsr,pae> cr2: d rdi: 286 rsi: 0 rdx: fffffffe rcx: 1 r8: 0 r9: 40000 rax: d rbx: 0 rbp: fffffffffbca61c0 r10: fffffffffbc74ab0 r11: ffffff012fe59000 r12: 0 r13: fffffffffbcb6dc0 r14: 1 r15: ffffff0135a8d580 fsb: 200000000 gsb: fffffffffbc74ab0 ds: 0 es: 0 fs: 0 gs: 0 trp: e err: 2 rip: fffffffffb846d29 cs: e030 rfl: 10246 rsp: fffffffffbca6180 ss: e02b cpu address timestamp type vc handler pc 0 fffffffffbc20f18 6afbf91146 trap e #pf ec_bind_virq_to_irq+99 0 fffffffffbc20d90 6afb5cd249 intr 104 cbe_fire inflate_fast+112 0 fffffffffbc20c08 6afb155e6b intr 13 uhci_intr HYPERVISOR_sched_op+29 0 fffffffffbc20a80 6afb02521b intr 13 uhci_intr HYPERVISOR_sched_op+29 0 fffffffffbc208f8 6afaeef1e9 intr 13 uhci_intr HYPERVISOR_sched_op+29 0 fffffffffbc20770 6afad1750e intr 13 uhci_intr HYPERVISOR_sched_op+29 0 fffffffffbc205e8 6afab239db intr 13 uhci_intr HYPERVISOR_sched_op+29 0 fffffffffbc20460 6afa8da042 intr 13 uhci_intr HYPERVISOR_sched_op+29 0 fffffffffbc202d8 6afa725ad3 intr 13 uhci_intr copy_pattern+1f 0 fffffffffbc20150 6afa618a15 intr 13 uhci_intr HYPERVISOR_sched_op+29 fffffffffbca5f50 unix:die+d2 () fffffffffbca6080 unix:trap+162f () fffffffffbca6090 unix:cmntrap+24d () fffffffffbca61c0 unix:ec_bind_virq_to_irq+99 () fffffffffbca61f0 xpv_psm:xen_psm_cpu_start+4b () fffffffffbca6210 unix:mach_cpu_start+4a () fffffffffbca6270 unix:start_cpu+5e () fffffffffbca62b0 unix:start_other_cpus+db () fffffffffbca62f0 genunix:main+2bf () fffffffffbca6300 unix:_locore_start+80 () panic: entering debugger (no dump device, continue to reboot) Loaded modules: [ scsi_vhci neti xpv_psm zfs uhci hook ip usba specfs sctp arp xpv_uppc ] kmdb: target stopped at: kmdb_enter+0xb: movq %rax,%rdi [0]> ec_bind_virq_to_irq+0x90::dis ec_bind_virq_to_irq+0x90: sti ec_bind_virq_to_irq+0x91: movq %r13,%rdi ec_bind_virq_to_irq+0x94: call +0x16d37 <mutex_exit> ec_bind_virq_to_irq+0x99: addb %al,(%rax) << assemble after corrupt ec_bind_virq_to_irq+0x9b: addb %al,(%rax) ec_bind_virq_to_irq+0x9d: addb %al,(%rax) ec_bind_virq_to_irq+0x9f: addb %al,(%rax) ec_bind_virq_to_irq+0xa1: sti ec_bind_virq_to_irq+0xa2: popq %r14 ec_bind_virq_to_irq+0xa4: popq %r13 ec_bind_virq_to_irq+0xa6: popq %r12 [0]> ::dis ec_bind_virq_to_irq+0xa8: popq %rbx ec_bind_virq_to_irq+0xa9: leave ec_bind_virq_to_irq+0xaa: ret ec_bind_virq_to_irq+0xab: leaq +0x110296(%rip),%rdi <0xfffffffffb956fd8> [0]> End of debugging information. Thanks, -Baolu On Wed, Oct 15, 2008 at 5:13 PM, Juergen Keil <jk@tools.de> wrote:>> This panic happened between line 729 and 730. The disassemble of this code is: >> >> [0]> ec_bind_virq_to_irq::dis >> ec_bind_virq_to_irq+0x95: call -0x97a <alloc_irq> >> ec_bind_virq_to_irq+0x9a: >> movw %ax,0xfffffffffbc46ac0(%r12) <virq_info+0x200> >> ec_bind_virq_to_irq+0xa3: movq %r13,%rdi >> ec_bind_virq_to_irq+0xa6: call +0x16d35 <mutex_exit> >> ec_bind_virq_to_irq+0xab: addb %al,(%rax) >> ec_bind_virq_to_irq+0xad: addb %al,(%rax) >> ec_bind_virq_to_irq+0xaf: addb %al,(%rax) >> ec_bind_virq_to_irq+0xb1: addb %al,(%rax) >> ec_bind_virq_to_irq+0xb3: sti >> ec_bind_virq_to_irq+0xb4: popq %r14 >> ec_bind_virq_to_irq+0xb6: popq %r13 >> ec_bind_virq_to_irq+0xb8: popq %r12 >> ec_bind_virq_to_irq+0xba: popq %rbx >> ec_bind_virq_to_irq+0xbb: leave >> ec_bind_virq_to_irq+0xbc: ret > > > The code starting at ec_bind_virq_to_irq+0xab looks corrupted. > > Seems as if someone stored a 64-bit "0" to that location, corrupting > the original code. > > > When you boot the dom0 kernel with options "-kd" and disassemble > ec_bind_virq_to_irq+0xab (before the kernel starts running), > does the code look different? > > In case it''s already corrupted before the kernel starts running: > try to check the dom0 kernel binary /platform/i86xpv/kernel/amd64/unix > with "dis -F ec_bind_virq_to_irq /platform/i86xpv/kernel/amd64/unix", > check if ec_bind_virq_to_irq+0xab contains reasonble code there. > > > If case the corruption happens at kernel runtime, it might be possble > to find the source of the corruption using kmdb and a write watchpoint: > > ec_bind_virq_to_irq+0xab::wp -w -L 8 > > >
Is it possible that the hypervisor writes 64-bits 0''s to domain 0 kernel space (ec_bind_virq_to_irq+0x99)? Below is the byte change when panic happens. Bytes before kernel start runing: [0]> fffffffffb846d29,10/B ec_bind_virq_to_irq+0x99: 41 f bf 84 24 c0 6a c4 fb 41 5e 41 5d 41 5c 5b Bytes after kernel panic: [0]> fffffffffb846d29,10/B ec_bind_virq_to_irq+0x99: 0 0 0 0 0 0 0 0 fb 41 5e 41 5d 41 5c 5b Thanks, -Baolu On Thu, Oct 16, 2008 at 11:06 AM, Lu Baolu <allen.virt@gmail.com> wrote:> Hi Keil, > > The assemble of ec_bind_virq_to_irq() are different between the points > where the kernel hasn''t yet started running and where the panic > happens. I set a write watch point for the changed place, but nothing > happens before panic. I posted my debug processes below for your > information. > > I removed two cmn_err() lines in ec_bind_virq_to_irq() which I added > for debugging. So the offset is different from the last post. > > > krtld: Unused kernel arguments: `xen_processor_pmbits=0''. > Loading kmdb... > > Welcome to kmdb > kmdb: unable to determine terminal type: assuming `vt100'' > Loaded modules: [ unix krtld genunix ] > [0]> ec_bind_virq_to_irq+0x90::dis > ec_bind_virq_to_irq+0x90: sti > ec_bind_virq_to_irq+0x91: movq %r13,%rdi > ec_bind_virq_to_irq+0x94: call +0x16d37 <mutex_exit> > ec_bind_virq_to_irq+0x99: > movswl 0xfffffffffbc46ac0(%r12),%eax <virq_info+0x200> > << this line of code corrupt before panic > ec_bind_virq_to_irq+0xa2: popq %r14 > ec_bind_virq_to_irq+0xa4: popq %r13 > ec_bind_virq_to_irq+0xa6: popq %r12 > ec_bind_virq_to_irq+0xa8: popq %rbx > ec_bind_virq_to_irq+0xa9: leave > ec_bind_virq_to_irq+0xaa: ret > ec_bind_virq_to_irq+0xab: leaq +0x110296(%rip),%rdi > <0xfffffffffb956fd8> > [0]> ec_bind_virq_to_irq+0x90::wp -w -L 8 > [0]> ec_bind_virq_to_irq+0x98::wp -w -L 8 > [0]> ec_bind_virq_to_irq+0xa0::wp -w -L 2 > [0]> $b > ID S TA HT LM Description Action > ----- - -- -- -- ---------------------------------------- ---------------------- > [ 1 ] + 0 0 stop on write of [ec_bind_virq_to_irq+0x - > [ 2 ] + 0 0 stop on write of [ec_bind_virq_to_irq+0x - > [ 3 ] + 0 0 stop on write of [ec_bind_virq_to_irq+0x - > [0]> :c > WARNING: Cannot extract revision on this hypervisor version: > v-unstable-xvm, unexpected version format > v3.3-unstable-xvm chgset ''Tue Oct 14 12:54:19 2008 +0800 17779:88ed7e891155'' > > SunOS Release 5.11 Version onnv-98 64-bit > Copyright 1983-2008 Sun Microsystems, Inc. All rights reserved. > Use is subject to license terms. > DEBUG enabled > > panic[cpu0]/thread=fffffffffbc736e0: BAD TRAP: type=e (#pf Page fault) > rp=fffffffffbca6090 addr=d occurred in module "unix" due to a NULL > pointer dereference > > #pf Page fault > Bad kernel fault at addr=0xd > pid=0, pc=0xfffffffffb846d29, sp=0xfffffffffbca6180, eflags=0x10246 > cr0: 8005003b<pg,wp,ne,et,ts,mp,pe> cr4: 2620<vmxe,xmme,fxsr,pae> > cr2: d > rdi: 286 rsi: 0 rdx: fffffffe > rcx: 1 r8: 0 r9: 40000 > rax: d rbx: 0 rbp: fffffffffbca61c0 > r10: fffffffffbc74ab0 r11: ffffff012fe59000 r12: 0 > r13: fffffffffbcb6dc0 r14: 1 r15: ffffff0135a8d580 > fsb: 200000000 gsb: fffffffffbc74ab0 ds: 0 > es: 0 fs: 0 gs: 0 > trp: e err: 2 rip: fffffffffb846d29 > cs: e030 rfl: 10246 rsp: fffffffffbca6180 > ss: e02b > > cpu address timestamp type vc handler pc > 0 fffffffffbc20f18 6afbf91146 trap e #pf ec_bind_virq_to_irq+99 > 0 fffffffffbc20d90 6afb5cd249 intr 104 cbe_fire inflate_fast+112 > 0 fffffffffbc20c08 6afb155e6b intr 13 uhci_intr HYPERVISOR_sched_op+29 > 0 fffffffffbc20a80 6afb02521b intr 13 uhci_intr HYPERVISOR_sched_op+29 > 0 fffffffffbc208f8 6afaeef1e9 intr 13 uhci_intr HYPERVISOR_sched_op+29 > 0 fffffffffbc20770 6afad1750e intr 13 uhci_intr HYPERVISOR_sched_op+29 > 0 fffffffffbc205e8 6afab239db intr 13 uhci_intr HYPERVISOR_sched_op+29 > 0 fffffffffbc20460 6afa8da042 intr 13 uhci_intr HYPERVISOR_sched_op+29 > 0 fffffffffbc202d8 6afa725ad3 intr 13 uhci_intr copy_pattern+1f > 0 fffffffffbc20150 6afa618a15 intr 13 uhci_intr HYPERVISOR_sched_op+29 > > fffffffffbca5f50 unix:die+d2 () > fffffffffbca6080 unix:trap+162f () > fffffffffbca6090 unix:cmntrap+24d () > fffffffffbca61c0 unix:ec_bind_virq_to_irq+99 () > fffffffffbca61f0 xpv_psm:xen_psm_cpu_start+4b () > fffffffffbca6210 unix:mach_cpu_start+4a () > fffffffffbca6270 unix:start_cpu+5e () > fffffffffbca62b0 unix:start_other_cpus+db () > fffffffffbca62f0 genunix:main+2bf () > fffffffffbca6300 unix:_locore_start+80 () > > panic: entering debugger (no dump device, continue to reboot) > > Loaded modules: [ scsi_vhci neti xpv_psm zfs uhci hook ip usba specfs sctp arp > xpv_uppc ] > kmdb: target stopped at: > kmdb_enter+0xb: movq %rax,%rdi > [0]> ec_bind_virq_to_irq+0x90::dis > ec_bind_virq_to_irq+0x90: sti > ec_bind_virq_to_irq+0x91: movq %r13,%rdi > ec_bind_virq_to_irq+0x94: call +0x16d37 <mutex_exit> > ec_bind_virq_to_irq+0x99: addb %al,(%rax) > << assemble after corrupt > ec_bind_virq_to_irq+0x9b: addb %al,(%rax) > ec_bind_virq_to_irq+0x9d: addb %al,(%rax) > ec_bind_virq_to_irq+0x9f: addb %al,(%rax) > ec_bind_virq_to_irq+0xa1: sti > ec_bind_virq_to_irq+0xa2: popq %r14 > ec_bind_virq_to_irq+0xa4: popq %r13 > ec_bind_virq_to_irq+0xa6: popq %r12 > [0]> ::dis > ec_bind_virq_to_irq+0xa8: popq %rbx > ec_bind_virq_to_irq+0xa9: leave > ec_bind_virq_to_irq+0xaa: ret > ec_bind_virq_to_irq+0xab: leaq +0x110296(%rip),%rdi > <0xfffffffffb956fd8> > [0]> > > End of debugging information. > > Thanks, > -Baolu > > On Wed, Oct 15, 2008 at 5:13 PM, Juergen Keil <jk@tools.de> wrote: >>> This panic happened between line 729 and 730. The disassemble of this code is: >>> >>> [0]> ec_bind_virq_to_irq::dis >>> ec_bind_virq_to_irq+0x95: call -0x97a <alloc_irq> >>> ec_bind_virq_to_irq+0x9a: >>> movw %ax,0xfffffffffbc46ac0(%r12) <virq_info+0x200> >>> ec_bind_virq_to_irq+0xa3: movq %r13,%rdi >>> ec_bind_virq_to_irq+0xa6: call +0x16d35 <mutex_exit> >>> ec_bind_virq_to_irq+0xab: addb %al,(%rax) >>> ec_bind_virq_to_irq+0xad: addb %al,(%rax) >>> ec_bind_virq_to_irq+0xaf: addb %al,(%rax) >>> ec_bind_virq_to_irq+0xb1: addb %al,(%rax) >>> ec_bind_virq_to_irq+0xb3: sti >>> ec_bind_virq_to_irq+0xb4: popq %r14 >>> ec_bind_virq_to_irq+0xb6: popq %r13 >>> ec_bind_virq_to_irq+0xb8: popq %r12 >>> ec_bind_virq_to_irq+0xba: popq %rbx >>> ec_bind_virq_to_irq+0xbb: leave >>> ec_bind_virq_to_irq+0xbc: ret >> >> >> The code starting at ec_bind_virq_to_irq+0xab looks corrupted. >> >> Seems as if someone stored a 64-bit "0" to that location, corrupting >> the original code. >> >> >> When you boot the dom0 kernel with options "-kd" and disassemble >> ec_bind_virq_to_irq+0xab (before the kernel starts running), >> does the code look different? >> >> In case it''s already corrupted before the kernel starts running: >> try to check the dom0 kernel binary /platform/i86xpv/kernel/amd64/unix >> with "dis -F ec_bind_virq_to_irq /platform/i86xpv/kernel/amd64/unix", >> check if ec_bind_virq_to_irq+0xab contains reasonble code there. >> >> >> If case the corruption happens at kernel runtime, it might be possble >> to find the source of the corruption using kmdb and a write watchpoint: >> >> ec_bind_virq_to_irq+0xab::wp -w -L 8 >> >> >> >