Hello all,

I'm running 2009.06 and I've got a "random" kernel panic that keeps killing my system under high IO loads. It happens almost every time I start loading up the writes on a pool. Memory has been tested extensively and I'm relatively certain this is not a hardware related issue. Here is the panic:

Sep 9 22:09:45 eon genunix: [ID 683410 kern.notice] BAD TRAP: type=d (#gp General protection) rp=ffffff0010362770 addr=ff7fff02fe41cc78
Sep 9 22:09:45 eon unix: [ID 100000 kern.notice]
Sep 9 22:09:45 eon unix: [ID 839527 kern.notice] sched:
Sep 9 22:09:45 eon unix: [ID 753105 kern.notice] #gp General protection
Sep 9 22:09:45 eon unix: [ID 358286 kern.notice] addr=0xff7fff02fe41cc78
Sep 9 22:09:45 eon unix: [ID 243837 kern.notice] pid=0, pc=0xfffffffff78a614f, sp=0xffffff0010362860, eflags=0x10246
Sep 9 22:09:45 eon unix: [ID 211416 kern.notice] cr0: 8005003b<pg,wp,ne,et,ts,mp,pe> cr4: 6f8<xmme,fxsr,pge,mce,pae,pse,de>
Sep 9 22:09:45 eon unix: [ID 624947 kern.notice] cr2: 816b000
Sep 9 22:09:45 eon unix: [ID 625075 kern.notice] cr3: 3c00000
Sep 9 22:09:45 eon unix: [ID 625715 kern.notice] cr8: c
Sep 9 22:09:45 eon unix: [ID 100000 kern.notice]
Sep 9 22:09:45 eon unix: [ID 592667 kern.notice] rdi: ffffff02d239d100 rsi: 2 rdx: ffffff0010362c60
Sep 9 22:09:45 eon unix: [ID 592667 kern.notice] rcx: 0 r8: fffffffff78a6120 r9: ffffff02df164a40
Sep 9 22:09:45 eon unix: [ID 592667 kern.notice] rax: 0 rbx: 0 rbp: ffffff00103628a0
Sep 9 22:09:45 eon unix: [ID 592667 kern.notice] r10: ffffff02f5046058 r11: 1 r12: ffffff02d239d100
Sep 9 22:09:45 eon unix: [ID 592667 kern.notice] r13: ffffff02d1649be0 r14: ffffff02e94b2d08 r15: ff7fff02fe41cc78
Sep 9 22:09:45 eon unix: [ID 592667 kern.notice] fsb: 0 gsb: ffffff02cfb91ac0 ds: 4b
Sep 9 22:09:45 eon unix: [ID 592667 kern.notice] es: 4b fs: 0 gs: 1c3
Sep 9 22:09:45 eon unix: [ID 592667 kern.notice] trp: d err: 0 rip: fffffffff78a614f
Sep 9 22:09:45 eon unix: [ID 592667 kern.notice] cs: 30 rfl: 10246 rsp: ffffff0010362860
Sep 9 22:09:45 eon unix: [ID 266532 kern.notice] ss: 38
Sep 9 22:09:45 eon unix: [ID 100000 kern.notice]
Sep 9 22:09:45 eon genunix: [ID 655072 kern.notice] ffffff0010362650 unix:die+10f ()
Sep 9 22:09:45 eon genunix: [ID 655072 kern.notice] ffffff0010362760 unix:trap+43e ()
Sep 9 22:09:45 eon genunix: [ID 655072 kern.notice] ffffff0010362770 unix:cmntrap+e9 ()
Sep 9 22:09:45 eon genunix: [ID 655072 kern.notice] ffffff00103628a0 zfs:arc_write_done+2f ()
Sep 9 22:09:45 eon genunix: [ID 655072 kern.notice] ffffff00103628f0 zfs:zio_done+26d ()
Sep 9 22:09:45 eon genunix: [ID 655072 kern.notice] ffffff0010362920 zfs:zio_execute+a0 ()
Sep 9 22:09:45 eon genunix: [ID 655072 kern.notice] ffffff0010362980 zfs:zio_notify_parent+a6 ()
Sep 9 22:09:45 eon genunix: [ID 655072 kern.notice] ffffff00103629d0 zfs:zio_done+2d9 ()
Sep 9 22:09:45 eon genunix: [ID 655072 kern.notice] ffffff0010362a00 zfs:zio_execute+a0 ()
Sep 9 22:09:45 eon genunix: [ID 655072 kern.notice] ffffff0010362a60 zfs:zio_notify_parent+a6 ()
Sep 9 22:09:45 eon genunix: [ID 655072 kern.notice] ffffff0010362ab0 zfs:zio_done+2d9 ()
Sep 9 22:09:45 eon genunix: [ID 655072 kern.notice] ffffff0010362ae0 zfs:zio_execute+a0 ()
Sep 9 22:09:45 eon genunix: [ID 655072 kern.notice] ffffff0010362b40 zfs:zio_notify_parent+a6 ()
Sep 9 22:09:45 eon genunix: [ID 655072 kern.notice] ffffff0010362b90 zfs:zio_done+2d9 ()
Sep 9 22:09:45 eon genunix: [ID 655072 kern.notice] ffffff0010362bc0 zfs:zio_execute+a0 ()
Sep 9 22:09:45 eon genunix: [ID 655072 kern.notice] ffffff0010362c40 genunix:taskq_thread+193 ()
Sep 9 22:09:45 eon genunix: [ID 655072 kern.notice] ffffff0010362c50 unix:thread_start+8 ()
Sep 9 22:09:45 eon unix: [ID 100000 kern.notice]
Sep 9 22:09:45 eon genunix: [ID 672855 kern.notice] syncing file systems...
Sep 9 22:09:45 eon genunix: [ID 904073 kern.notice] done
Sep 9 22:09:46 eon genunix: [ID 111219 kern.notice] dumping to /dev/zvol/dsk/rpool/dump, offset 65536, content: kernel

>Hello all, I'm running 2009.06 and I've got a "random" kernel panic
>that keeps killing my system under high IO loads. It happens almost
>every time I start loading up the writes on a pool. Memory has been
>tested extensively and I'm relatively certain this is not a hardware
>related issue. Here is the panic:
>Sep 9 22:09:45 eon genunix: [ID 683410 kern.notice] BAD TRAP: type=d
>(#gp General protection) rp=ffffff0010362770 addr=ff7fff02fe41cc78
>Sep 9 22:09:45 eon unix: [ID 100000 kern.notice]
>Sep 9 22:09:45 eon unix: [ID 839527 kern.notice] sched:
>Sep 9 22:09:45 eon unix: [ID 753105 kern.notice] #gp General protection
>Sep 9 22:09:45 eon unix: [ID 358286 kern.notice] addr=0xff7fff02fe41cc78
>Sep 9 22:09:45 eon unix: [ID 243837 kern.notice] pid=0,

"Random" panics are, unfortunately, mostly caused by bad hardware.

Do you have ECC memory in the system? Did you run memtest86 on your system?

Casper

On Thu, Sep 10, 2009 at 5:11 AM, <Casper.Dik at sun.com> wrote:
> "Random" panics are, unfortunately, mostly caused by bad hardware.
>
> Do you have ECC memory in the system? Did you run memtest86 on your
> system?

Casper,

I have run memtest86 on the machine for about 4 hours, which was enough time to complete two passes. It is not ECC memory in this machine. Perhaps if I said this isn't a random panic but more of an easily reproducible panic... :) If I do

dd if=/dev/zero of=/pool/blah bs=1024k count=10000

it will always panic and reboot. In this type of a scenario it seems less like hardware to me and more like a bug. What do you think?

Brandon

On Sep 10, 2009, at 7:07 AM, Brandon Mercer wrote:
> Casper,
> I have run memtest86 on the machine for about 4 hours which was enough
> time to complete two passes. It is not ECC memory in this machine.
> Perhaps if I said this isn't a random panic but more of an easily
> reproducible panic... :) If I do dd if=/dev/zero of=/pool/blah
> bs=1024k count=10000 it will always panic and reboot. In this type of
> a scenario it seems less like hardware to me and more like a bug.
> What do you think?

Brandon,

It looks like you have some bad RAM. The bad address (ff7fff02fe41cc78) appears to have a single-bit error (the leading ff7 should probably be fff).

-Chris
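
Chris's reading of the address can be checked with a little bit arithmetic. The sketch below is an illustration added for clarity, not something posted in the thread. It assumes the intended pointer was ffffff02fe41cc78 (the faulting address with the leading ff7 replaced by fff, as Chris suggests), and it assumes 48-bit virtual addressing, where an x86-64 address is canonical only if bits 63 through 47 are all equal. The names BAD_ADDR, EXPECTED_ADDR and is_canonical are made up for this example.

```python
# Faulting address taken from the panic message (addr= and r15).
BAD_ADDR = 0xff7fff02fe41cc78
# ASSUMPTION: the pointer the kernel meant to use, per Chris's guess
# that the leading "ff7" should have been "fff".
EXPECTED_ADDR = 0xffffff02fe41cc78


def is_canonical(addr: int) -> bool:
    """Return True if bits 63..47 are all 0 or all 1.

    ASSUMPTION: 48-bit virtual addresses (4-level paging).
    """
    top = addr >> 47  # the 17 most significant bits of a 64-bit value
    return top == 0 or top == 0x1FFFF


# XOR leaves a 1 in every bit position where the two addresses differ.
diff = BAD_ADDR ^ EXPECTED_ADDR
flipped_bits = [bit for bit in range(64) if (diff >> bit) & 1]

print(f"xor          = {diff:#018x}")
print(f"flipped bits = {flipped_bits}")
print(f"bad address canonical?      {is_canonical(BAD_ADDR)}")
print(f"expected address canonical? {is_canonical(EXPECTED_ADDR)}")
```

Under those assumptions the two values differ in exactly one bit, and the corrupted value is non-canonical, which is consistent with the trap being a general protection fault (type=d) rather than a page fault. It does not by itself prove the flip happened in RAM.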

On Thu, Sep 10, 2009 at 9:09 AM, Chris Kirby <Christopher.Kirby at sun.com> wrote:
> Brandon,
> It looks like you have some bad RAM. The bad address (ff7fff02fe41cc78)
> appears to have a single-bit error (the leading ff7 should probably be fff).

Chris,

You may well be right. I appreciate you taking the time to look at this. Just wish there were more reliable tools to find this type of thing. I guess I'll look into some options with the memory... perhaps I can add voltage or adjust the timing by hand until it goes away. Perhaps I could just buy ECC memory ;) Again, my previous emails only reflected what I know based on the tools I have. Thanks again.

Brandon