Hi all,

I've encountered a not-so-fun problem with one of our pools. The pool was built as raidz1 according to the ZFS manual; the disks were presented through an ERQ 16x750GB FC array (exported as JBOD) via QLogic FC HBAs to Solaris 10u3 (x86). Everything had worked fine and dandy until this morning, when the disk enclosure "crashed" (reason unknown) and dragged the whole system down with it. I didn't get the core dump at the time, but now that I've restarted, reattached the enclosure, and tried to import the zpool again, I get the following:

# zpool status -vx
  pool: migrated_data
 state: FAULTED
status: The pool metadata is corrupted and the pool cannot be opened.
action: Destroy and re-create the pool from a backup source.
   see: http://www.sun.com/msg/ZFS-8000-CS
 scrub: none requested
config:
...

And just a couple of seconds after zpool status -vx the machine core-dumps with:

panic[cpu0]/thread=fffffe80fcd34ba0: BAD TRAP: type=e (#pf Page fault) rp=fffffe800138cb10 addr=0 occurred in module "zfs" due to a NULL pointer dereference
zpool: #pf Page fault
Bad kernel fault at addr=0x0
pid=1116, pc=0xfffffffff0663b45, sp=0xfffffe800138cc00, eflags=0x10202
cr0: 8005003b<pg,wp,ne,et,ts,mp,pe> cr4: 6f0<xmme,fxsr,pge,mce,pae,pse>
cr2: 0 cr3: e5f2000 cr8: c
        rdi: ffffffff80039200 rsi: ffffffff89d883c0 rdx:                0
        rcx: fffffe80e3667000  r8:                1  r9:                0
        rax:                0 rbx:                1 rbp: fffffe800138cc10
        r10: ffffffff938eb920 r11:                3 r12: ffffffffb0bc4080
        r13: ffffffffb0bc42f0 r14:                1 r15:                0
        fsb: ffffffff80000000 gsb: fffffffffbc240e0  ds:               43
         es:               43  fs:                0  gs:              1c3
        trp:                e err:                0 rip: fffffffff0663b45
         cs:               28 rfl:            10202 rsp: fffffe800138cc00
         ss:               30
...

This happens a couple of seconds after the system is fully booted. I've tried several times to be fast enough to unconfigure the FC controllers, but I'm too slow :-). So I shut the path from the machine to the FC enclosure, and of course the pool is now "UNAVAIL", which is OK since my other pools work fine.

I'm curious, though: how can the metadata get corrupted like that? Why does the system panic? Can it be repaired?

I know I should have backups, but I don't, and if it's a lost cause that's fine; the data itself is not important.

--
Timh Bergström
System Operations Manager
Diino AB - www.diino.com
:wq
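For anyone hitting something similar: before trying another import, it can be worth dumping the vdev labels to see what configuration ZFS has recorded on disk. A minimal sketch, assuming the enclosure's disks show up under /dev/rdsk; the device name below is a placeholder, not one of the actual disks from this system.

# Print the vdev labels from one of the pool's disks; this reads the
# on-disk config directly and does not try to open the pool itself.
zdb -l /dev/rdsk/c2t0d0s0

# List pools that are visible for import, without importing anything yet.
zpool import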
This sounds like
http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6587723
which was fixed a long time ago. You might check that bug against your
stack trace (which was not included in this post).

You may be able to boot from a later OS release and import/export the
pool to repair.
 -- richard

Timh Bergström wrote:
> Hi all,
>
> I've encountered a not-so-fun problem with one of our pools. The pool
> was built as raidz1 according to the ZFS manual; the disks were
> presented through an ERQ 16x750GB FC array (exported as JBOD) via
> QLogic FC HBAs to Solaris 10u3 (x86). [...]
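For the record, a rough sketch of what that repair pass might look like once booted from a newer release; the pool name is the one from this thread, and there is no guarantee the import itself will succeed on badly damaged metadata.

# Check what the newer import code thinks of the pool before touching it.
zpool import

# Force the import under its original name; -f is needed because the pool
# was last in use by another boot environment.
zpool import -f migrated_data

# Scrub so ZFS can repair what it can from the raidz1 redundancy,
# then check the result.
zpool scrub migrated_data
zpool status -v migrated_data

# Export cleanly so the pool can be re-imported on the original system.
zpool export migrated_data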
Hi,

It indeed does. I am running a really old ZFS version (3?), so I figured a newer release would at least not panic, but the bug report shows exactly what I saw.

I'll give it a shot, thanks.

//Timh

On 11 June 2009 at 17:35, Richard Elling <richard.elling at gmail.com> wrote:
> This sounds like
> http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6587723
> which was fixed a long time ago. You might check that bug against your
> stack trace (which was not included in this post).
>
> You may be able to boot from a later OS release and import/export the
> pool to repair.
> -- richard [...]

--
Timh Bergström
System Operations Manager
Diino AB - www.diino.com
:wq
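As a side note on the version question: once a pool can actually be opened (one of the healthy pools, or migrated_data after a successful import on the newer release), something like the following should show the on-disk format version. The pool name here is just the one from this thread.

# Show the on-disk format version of the pool (the "3?" guess above).
zpool get version migrated_data

# List the format versions this ZFS release understands and what each adds,
# then show which pools are still running an older version.
zpool upgrade -v
zpool upgrade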
Timh Bergström wrote:
> Hi,
>
> It indeed does. I am running a really old ZFS version (3?), so I figured
> a newer release would at least not panic, but the bug report shows
> exactly what I saw.

A newer release should not panic, or at least not at the same place.
If it does, then we might be seeing a regression, which would need a new
bug to be filed against it.
 -- richard

> I'll give it a shot, thanks.
>
> //Timh [...]
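If it does panic again, the stack trace for a new bug report can be pulled out of the saved crash dump roughly like this; the dump directory and the unit number 0 are assumptions, so use whatever savecore actually wrote on the affected box.

# If the dump was not already extracted at boot, save it first
# (the directory below is the Solaris default).
savecore /var/crash/`hostname`

# Open the saved kernel image and core, then print the panic summary
# and the kernel stack at the time of the trap.
cd /var/crash/`hostname`
mdb unix.0 vmcore.0 <<'EOF'
::status
::stack
$C
EOF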