Hi all,

I've encountered a not-so-fun problem with one of our pools. The pool was built as raidz1 according to the ZFS manual; the disks were presented through an ERQ 16x750GB FC array (exported as JBOD) via QLogic FC HBAs to Solaris 10u3 (x86). Everything had worked fine and dandy until this morning, when the disk enclosure "crashed" (reason unknown) and dragged the whole system down with it. I didn't get the core dump at the time, but now that I've restarted, reattached the enclosure, and tried to import the zpool again, I get the following:

# zpool status -vx
  pool: migrated_data
 state: FAULTED
status: The pool metadata is corrupted and the pool cannot be opened.
action: Destroy and re-create the pool from a backup source.
   see: http://www.sun.com/msg/ZFS-8000-CS
 scrub: none requested
config:
...

And just a couple of seconds after zpool status -vx the machine core-dumps with:

panic[cpu0]/thread=fffffe80fcd34ba0: BAD TRAP: type=e (#pf Page fault) rp=fffffe800138cb10 addr=0 occurred in module "zfs" due to a NULL pointer dereference
zpool: #pf Page fault
Bad kernel fault at addr=0x0
pid=1116, pc=0xfffffffff0663b45, sp=0xfffffe800138cc00, eflags=0x10202
cr0: 8005003b<pg,wp,ne,et,ts,mp,pe> cr4: 6f0<xmme,fxsr,pge,mce,pae,pse>
cr2: 0 cr3: e5f2000 cr8: c
        rdi: ffffffff80039200 rsi: ffffffff89d883c0 rdx:                0
        rcx: fffffe80e3667000  r8:                1  r9:                0
        rax:                0 rbx:                1 rbp: fffffe800138cc10
        r10: ffffffff938eb920 r11:                3 r12: ffffffffb0bc4080
        r13: ffffffffb0bc42f0 r14:                1 r15:                0
        fsb: ffffffff80000000 gsb: fffffffffbc240e0  ds:               43
         es:               43  fs:                0  gs:              1c3
        trp:                e err:                0 rip: fffffffff0663b45
         cs:               28 rfl:            10202 rsp: fffffe800138cc00
         ss:               30
...

This happens a couple of seconds after the system is fully booted. I've tried several times to be fast enough to unconfigure the FC controllers, but I'm too slow :-). So I shut the path from the machine to the FC enclosure, and of course the pool is now "UNAVAIL", which is OK since my other pools work fine.

I'm curious, though: how can the metadata get corrupted like that? Why does the system panic? Can it be repaired?

I know I should have backups, but I don't, and if it's a lost cause that's fine; the data itself is not important.

--
Timh Bergström
System Operations Manager
Diino AB - www.diino.com
:wq
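For anyone hitting something similar: before trying another import, it can be worth dumping the vdev labels to see what configuration ZFS has recorded on disk. A minimal sketch, assuming the enclosure's disks show up under /dev/rdsk; the device name below is a placeholder, not one of the actual disks from this system.

# Print the vdev labels from one of the pool's disks; this reads the
# on-disk config directly and does not try to open the pool itself.
zdb -l /dev/rdsk/c2t0d0s0

# List pools that are visible for import, without importing anything yet.
zpool import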
This sounds like
http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6587723
which was fixed a long time ago. You might check that bug against your
stack trace (which was not included in this post).

You may be able to boot from a later OS release and import/export the
pool to repair.
 -- richard

Timh Bergström wrote:
> Hi all,
>
> I've encountered a not-so-fun problem with one of our pools. The pool
> was built as raidz1 according to the ZFS manual; the disks were
> presented through an ERQ 16x750GB FC array (exported as JBOD) via
> QLogic FC HBAs to Solaris 10u3 (x86). [...]
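For the record, a rough sketch of what that repair pass might look like once booted from a newer release; the pool name is the one from this thread, and there is no guarantee the import itself will succeed on badly damaged metadata.

# Check what the newer import code thinks of the pool before touching it.
zpool import

# Force the import under its original name; -f is needed because the pool
# was last in use by another boot environment.
zpool import -f migrated_data

# Scrub so ZFS can repair what it can from the raidz1 redundancy,
# then check the result.
zpool scrub migrated_data
zpool status -v migrated_data

# Export cleanly so the pool can be re-imported on the original system.
zpool export migrated_data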
Hi,

It indeed does. I am running a really old ZFS version (3?), so I figured a newer release would at least not panic, but the bug report shows exactly what I saw.

I'll give it a shot, thanks.

//Timh

On 11 June 2009 at 17:35, Richard Elling <richard.elling at gmail.com> wrote:
> This sounds like
> http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6587723
> which was fixed a long time ago. You might check that bug against your
> stack trace (which was not included in this post).
>
> You may be able to boot from a later OS release and import/export the
> pool to repair.
> -- richard [...]

--
Timh Bergström
System Operations Manager
Diino AB - www.diino.com
:wq
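As a side note on the version question: once a pool can actually be opened (one of the healthy pools, or migrated_data after a successful import on the newer release), something like the following should show the on-disk format version. The pool name here is just the one from this thread.

# Show the on-disk format version of the pool (the "3?" guess above).
zpool get version migrated_data

# List the format versions this ZFS release understands and what each adds,
# then show which pools are still running an older version.
zpool upgrade -v
zpool upgrade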
Timh Bergström wrote:
> Hi,
>
> It indeed does. I am running a really old ZFS version (3?), so I figured
> a newer release would at least not panic, but the bug report shows
> exactly what I saw.

A newer release should not panic, or at least not at the same place.
If it does, then we might be seeing a regression, which would need a new
bug to be filed against it.
 -- richard

> I'll give it a shot, thanks.
>
> //Timh [...]
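If it does panic again, the stack trace for a new bug report can be pulled out of the saved crash dump roughly like this; the dump directory and the unit number 0 are assumptions, so use whatever savecore actually wrote on the affected box.

# If the dump was not already extracted at boot, save it first
# (the directory below is the Solaris default).
savecore /var/crash/`hostname`

# Open the saved kernel image and core, then print the panic summary
# and the kernel stack at the time of the trap.
cd /var/crash/`hostname`
mdb unix.0 vmcore.0 <<'EOF'
::status
::stack
$C
EOF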