Hi,

Our ZFS NFS build server running snv_73 (pool created back before ZFS
integrated into ON) panicked, I guess from ZFS, the first time, and now
panics on every attempted boot, as below. Is this a known issue and, more
importantly (there is 2TB of data in the pool), are there any suggestions
on how to recover (other than from backup)?

panic[cpu0]/thread=ffffff003cc8dc80: zfs: allocating allocated segment(offset=24872013824 size=4096)

ffffff003cc8d3c0 genunix:vcmn_err+28 ()
ffffff003cc8d4b0 zfs:zfs_panic_recover+b6 ()
ffffff003cc8d540 zfs:space_map_add+db ()
ffffff003cc8d5e0 zfs:space_map_load+1f4 ()
ffffff003cc8d620 zfs:metaslab_activate+66 ()
ffffff003cc8d6e0 zfs:metaslab_group_alloc+24e ()
ffffff003cc8d7b0 zfs:metaslab_alloc_dva+192 ()
ffffff003cc8d850 zfs:metaslab_alloc+82 ()
ffffff003cc8d8a0 zfs:zio_dva_allocate+68 ()
ffffff003cc8d8c0 zfs:zio_next_stage+b3 ()
ffffff003cc8d8f0 zfs:zio_checksum_generate+6e ()
ffffff003cc8d910 zfs:zio_next_stage+b3 ()
ffffff003cc8d980 zfs:zio_write_compress+239 ()
ffffff003cc8d9a0 zfs:zio_next_stage+b3 ()
ffffff003cc8d9f0 zfs:zio_wait_for_children+5d ()
ffffff003cc8da10 zfs:zio_wait_children_ready+20 ()
ffffff003cc8da30 zfs:zio_next_stage_async+bb ()
ffffff003cc8da50 zfs:zio_nowait+11 ()
ffffff003cc8dad0 zfs:dmu_objset_sync+172 ()
ffffff003cc8db40 zfs:dsl_pool_sync+199 ()
ffffff003cc8dbd0 zfs:spa_sync+1c5 ()
ffffff003cc8dc60 zfs:txg_sync_thread+19a ()
ffffff003cc8dc70 unix:thread_start+8 ()

In case it matters, this is an X4600 M2. There is about 1.5TB in use out
of a 2TB pool. The IO devices are nothing exciting but adequate for
building - 2 x T3b. The pool was created under SPARC on the old NFS
server.

Thanks

Gavin
Hi,

On 09/29/07 22:00, Gavin Maltby wrote:
> Hi,
>
> Our ZFS NFS build server running snv_73 (pool created back before ZFS
> integrated into ON) panicked, I guess from ZFS, the first time, and now
> panics on every attempted boot, as below. Is this a known issue and, more
> importantly (there is 2TB of data in the pool), are there any suggestions
> on how to recover (other than from backup)?
>
> panic[cpu0]/thread=ffffff003cc8dc80: zfs: allocating allocated
> segment(offset=24872013824 size=4096)

So in desperation I set 'zfs_recover', which just produced an assertion
failure moments after the original panic location; but also setting 'aok'
to blast through assertions has allowed me to import the pool again (I
had booted with -m milestone=none and blown away /etc/zfs/zpool.cache to
be able to boot at all).

Luckily just the single corruption is apparent at the moment, i.e. just a
single assertion caught after running for half a day like this:

Sep 30 17:01:53 tb3 genunix: [ID 415322 kern.warning] WARNING: zfs: allocating allocated segment(offset=24872013824 size=4096)
Sep 30 17:01:53 tb3 genunix: [ID 411747 kern.notice] ASSERTION CAUGHT: sm->sm_space == space (0xc4896c00 == 0xc4897c00), file: ../../common/fs/zfs/space_map.c, line: 355

What I'd really like to know is whether/how I can map from that assertion
at the pool level back down to a single filesystem, or even to the file
using this segment - perhaps I can recycle that file to free the segment
and set the world straight again?

A scrub is only 20% complete but has found no errors thus far. I checked
the T3 pair and there are no complaints there either - I did reboot them
just for luck (the last reboot was 2 years ago, apparently!).

Gavin
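PS: for anyone else stuck in the same panic-on-boot loop, the steps were
roughly as below. Treat this as a sketch, not a recipe: zfs_recover and
aok are unstable private tunables, aok in particular blinds the whole
kernel to assertion failures, and 'tank' below stands in for your real
pool name.

  1. Boot without starting services so ZFS never attempts the import -
     append '-m milestone=none' to the GRUB kernel line (x86), or use
     'boot -m milestone=none' at the ok prompt (SPARC).

  2. Move the cached pool list aside so a normal boot can succeed:

     # mount -o rw,remount /
     # mv /etc/zfs/zpool.cache /etc/zfs/zpool.cache.bad

  3. Set the tunables for the next boot by adding to /etc/system:

     set zfs:zfs_recover = 1   * downgrade some ZFS panics to warnings
     set aok = 1               * carry on past ASSERT failures (dangerous)

  4. Reboot normally and retry the import:

     # zpool import tank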
T3 comment below...

Gavin Maltby wrote:
[cut]
> A scrub is only 20% complete but has found no errors thus far. I checked
> the T3 pair and there are no complaints there either - I did reboot them
> just for luck (the last reboot was 2 years ago, apparently!).

Living on the edge...
The T3 has a 2-year battery life (the time is counted). When it decides
the batteries are too old, it will shut down the nonvolatile write cache.
You'll want to make sure you have fresh batteries soon.
 -- richard
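PS: from memory (so verify against the T3 docs for your firmware rev),
the battery state is visible from the T3's own CLI:

  t3:/:<1> fru stat      # FRU health, including battery hold time
  t3:/:<2> refresh -s    # status of the scheduled battery refresh cycle

If fru stat reports the batteries as anything other than normal, treat
the write cache as already suspect.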
On 10/01/07 17:01, Richard Elling wrote:
> T3 comment below...
[cut]
>> A scrub is only 20% complete but has found no errors thus far. I checked
>> the T3 pair and there are no complaints there either - I did reboot them
>> just for luck (the last reboot was 2 years ago, apparently!).
>
> Living on the edge...
> The T3 has a 2-year battery life (the time is counted). When it decides
> the batteries are too old, it will shut down the nonvolatile write cache.
> You'll want to make sure you have fresh batteries soon.

Thanks - we have replaced the batteries in that time - there is no need
to shut down the array during battery replacement.

Gavin
Richard.Elling at Sun.COM said:
> Living on the edge...
> The T3 has a 2-year battery life (the time is counted). When it decides
> the batteries are too old, it will shut down the nonvolatile write cache.
> You'll want to make sure you have fresh batteries soon.

Hmm, doesn't the array put the cache into "write-through" mode when the
batteries go out? You can do so manually, as well.

My understanding was that when this happens, you're not going to suffer
the corruption that would occur when the power goes out, because there
won't be any uncommitted writes living in the (no-longer NV) cache. Even
if we tell ZFS to not flush/sync the cache, there shouldn't be anything
in the cache that the array hasn't already flushed.

Am I wrong? Is there some danger, other than slower writes, in running
without working batteries?

Regards,

Marion
Marion Hakanson wrote:
[cut]
> Am I wrong? Is there some danger, other than slower writes, in running
> without working batteries?

You are correct. This is what I meant by "shut down the nonvolatile
cache".
 -- richard
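PS: for reference, the cache mode is visible and settable from the T3
CLI - command names from memory, so check the docs for your firmware:

  t3:/:<1> sys list                 # shows the current cache setting
  t3:/:<2> sys cache writethrough   # force write-through manually
  t3:/:<3> sys cache auto           # default: write-behind only while
                                    # the batteries are healthy

In auto mode the array drops to write-through on its own once it stops
trusting the batteries, which is why no uncommitted data should be
stranded in cache.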
I'm seeing this too. Nothing unusual happened before the panic - just a
shutdown (init 5) and a later startup. I have the crash dump and a copy
of the problem zpool (on swan). Here's the stack trace:

> $C
ffffff0004463680 vpanic()
ffffff00044636b0 vcmn_err+0x28(3, fffffffff792ecf0, ffffff0004463778)
ffffff00044637a0 zfs_panic_recover+0xb6()
ffffff0004463830 space_map_add+0xdb(ffffff014c1a21b8, 472785000, 1000)
ffffff00044638e0 space_map_load+0x1fc(ffffff014c1a21b8, fffffffffbd52568, 1, ffffff014c1a1e88, ffffff0149c88c30)
ffffff0004463920 metaslab_activate+0x66(ffffff014c1a1e80, 4000000000000000)
ffffff00044639e0 metaslab_group_alloc+0x24e(ffffff014bdeb000, 4000, 3a6734, 1435b0000, ffffff014baa9840, 2)
ffffff0004463ab0 metaslab_alloc_dva+0x1da(ffffff01477880c0, ffffff014beefa70, 4000, ffffff014baa9840, 2, 0, 3a6734, 0)
ffffff0004463b50 metaslab_alloc+0x82(ffffff01477880c0, ffffff014beefa70, 4000, ffffff014baa9840, 3, 3a6734, 0, 0)
ffffff0004463ba0 zio_dva_allocate+0x62(ffffff014934c458)
ffffff0004463bd0 zio_execute+0x7f(ffffff014934c458)
ffffff0004463c60 taskq_thread+0x1a7(ffffff014bfb77a0)
ffffff0004463c70 thread_start+8()

This is on a Ferrari laptop (AMD x64) running snv_79. I'd love to rescue
my zpool. Any suggestions?

Thanks,
Gordon
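PS: in case anyone wants to pull the same information from their own
dump - assuming savecore put it in the default /var/crash/<hostname>
directory:

  # cd /var/crash/`hostname`
  # mdb unix.0 vmcore.0
  > $C

mdb opens a crash dump with the panicking thread as the current target,
so $C prints that thread's stack with arguments; ::status and ::msgbuf
are also worth a look for the panic string and recent console messages.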
> space_map_add+0xdb(ffffff014c1a21b8, 472785000, 1000)
> space_map_load+0x1fc(ffffff014c1a21b8, fffffffffbd52568, 1, ffffff014c1a1e88, ffffff0149c88c30)
> running snv_79.

Hmm... did you spend any time on snv_74 or snv_75, which might have
picked up http://bugs.opensolaris.org/view_bug.do?bug_id=6603147 ?

  zdb -e <name_of_pool_that_crashes_on_import>

would be interesting, but the damage might have been done.

Rob
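PS: something like the following, assuming the pool is named 'tank' and
is not currently imported (zdb is a private, version-dependent tool, so
these flags are from memory and may need adjusting):

  # default consistency pass over the exported pool's on-disk state
  zdb -e tank

  # dump the metaslabs/space maps, to inspect the range around 0x472785000
  zdb -e -m tank

  # traverse all block pointers and cross-check the allocated space;
  # a double-allocated segment should surface in the leak/duplicate report
  zdb -e -bb tank

zdb only reads, so it can't make things worse; but if the space maps
were rewritten while the pool was imported, it may only show the
post-damage state.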