Hi,

Our ZFS NFS build server running snv_73 (pool created back before ZFS
integrated into ON) panicked, I guess from ZFS, the first time, and now
panics on every attempted boot, as below. Is this a known issue and, more
importantly (there is 2TB of data in the pool), are there any suggestions
on how to recover (other than from backup)?

panic[cpu0]/thread=ffffff003cc8dc80: zfs: allocating allocated segment(offset=24872013824 size=4096)

ffffff003cc8d3c0 genunix:vcmn_err+28 ()
ffffff003cc8d4b0 zfs:zfs_panic_recover+b6 ()
ffffff003cc8d540 zfs:space_map_add+db ()
ffffff003cc8d5e0 zfs:space_map_load+1f4 ()
ffffff003cc8d620 zfs:metaslab_activate+66 ()
ffffff003cc8d6e0 zfs:metaslab_group_alloc+24e ()
ffffff003cc8d7b0 zfs:metaslab_alloc_dva+192 ()
ffffff003cc8d850 zfs:metaslab_alloc+82 ()
ffffff003cc8d8a0 zfs:zio_dva_allocate+68 ()
ffffff003cc8d8c0 zfs:zio_next_stage+b3 ()
ffffff003cc8d8f0 zfs:zio_checksum_generate+6e ()
ffffff003cc8d910 zfs:zio_next_stage+b3 ()
ffffff003cc8d980 zfs:zio_write_compress+239 ()
ffffff003cc8d9a0 zfs:zio_next_stage+b3 ()
ffffff003cc8d9f0 zfs:zio_wait_for_children+5d ()
ffffff003cc8da10 zfs:zio_wait_children_ready+20 ()
ffffff003cc8da30 zfs:zio_next_stage_async+bb ()
ffffff003cc8da50 zfs:zio_nowait+11 ()
ffffff003cc8dad0 zfs:dmu_objset_sync+172 ()
ffffff003cc8db40 zfs:dsl_pool_sync+199 ()
ffffff003cc8dbd0 zfs:spa_sync+1c5 ()
ffffff003cc8dc60 zfs:txg_sync_thread+19a ()
ffffff003cc8dc70 unix:thread_start+8 ()

In case it matters, this is an X4600 M2. There is about 1.5TB in use out
of a 2TB pool. The IO devices are nothing exciting but adequate for
building - 2 x T3b. The pool was created under SPARC on the old NFS
server.

Thanks

Gavin
Hi,

On 09/29/07 22:00, Gavin Maltby wrote:
> Hi,
>
> Our ZFS NFS build server running snv_73 (pool created back before ZFS
> integrated into ON) panicked, I guess from ZFS, the first time, and now
> panics on every attempted boot, as below. Is this a known issue and, more
> importantly (there is 2TB of data in the pool), are there any suggestions
> on how to recover (other than from backup)?
>
> panic[cpu0]/thread=ffffff003cc8dc80: zfs: allocating allocated
> segment(offset=24872013824 size=4096)

So in desperation I set 'zfs_recover', which just produced an assertion
failure moments after the original panic location; but also setting 'aok'
to blast through assertions has allowed me to import the pool again (I
had booted with -m milestone=none and blown away /etc/zfs/zpool.cache to
be able to boot at all).

Luckily just the single corruption is apparent at the moment, i.e. just a
single assertion caught after running for half a day like this:

Sep 30 17:01:53 tb3 genunix: [ID 415322 kern.warning] WARNING: zfs: allocating allocated segment(offset=24872013824 size=4096)
Sep 30 17:01:53 tb3 genunix: [ID 411747 kern.notice] ASSERTION CAUGHT: sm->sm_space == space (0xc4896c00 == 0xc4897c00), file: ../../common/fs/zfs/space_map.c, line: 355

What I'd really like to know is whether/how I can map from that assertion
at the pool level back down to a single filesystem, or even to the file
using this segment - perhaps I can recycle that file to free the segment
and set the world straight again?

A scrub is only 20% complete but has found no errors thus far. I checked
the T3 pair and there are no complaints there either - I did reboot them
just for luck (the last reboot was 2 years ago, apparently!).

Gavin
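PS: for anyone else stuck in the same panic-on-boot loop, the steps were
roughly as below. Treat this as a sketch, not a recipe: zfs_recover and
aok are unstable private tunables, aok in particular blinds the whole
kernel to assertion failures, and 'tank' below stands in for your real
pool name.

  1. Boot without starting services so ZFS never attempts the import -
     append '-m milestone=none' to the GRUB kernel line (x86), or use
     'boot -m milestone=none' at the ok prompt (SPARC).

  2. Move the cached pool list aside so a normal boot can succeed:

     # mount -o rw,remount /
     # mv /etc/zfs/zpool.cache /etc/zfs/zpool.cache.bad

  3. Set the tunables for the next boot by adding to /etc/system:

     set zfs:zfs_recover = 1   * downgrade some ZFS panics to warnings
     set aok = 1               * carry on past ASSERT failures (dangerous)

  4. Reboot normally and retry the import:

     # zpool import tank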
T3 comment below...

Gavin Maltby wrote:
[cut]
> A scrub is only 20% complete but has found no errors thus far. I checked
> the T3 pair and there are no complaints there either - I did reboot them
> just for luck (the last reboot was 2 years ago, apparently!).

Living on the edge...
The T3 has a 2-year battery life (the time is counted). When it decides
the batteries are too old, it will shut down the nonvolatile write cache.
You'll want to make sure you have fresh batteries soon.
 -- richard
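PS: from memory (so verify against the T3 docs for your firmware rev),
the battery state is visible from the T3's own CLI:

  t3:/:<1> fru stat      # FRU health, including battery hold time
  t3:/:<2> refresh -s    # status of the scheduled battery refresh cycle

If fru stat reports the batteries as anything other than normal, treat
the write cache as already suspect.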
On 10/01/07 17:01, Richard Elling wrote:
> T3 comment below...
[cut]
>> A scrub is only 20% complete but has found no errors thus far. I checked
>> the T3 pair and there are no complaints there either - I did reboot them
>> just for luck (the last reboot was 2 years ago, apparently!).
>
> Living on the edge...
> The T3 has a 2-year battery life (the time is counted). When it decides
> the batteries are too old, it will shut down the nonvolatile write cache.
> You'll want to make sure you have fresh batteries soon.

Thanks - we have replaced the batteries in that time - there is no need
to shut down the array during battery replacement.

Gavin
Richard.Elling at Sun.COM said:
> Living on the edge...
> The T3 has a 2-year battery life (the time is counted). When it decides
> the batteries are too old, it will shut down the nonvolatile write cache.
> You'll want to make sure you have fresh batteries soon.

Hmm, doesn't the array put the cache into "write-through" mode when the
batteries go out? You can do so manually, as well.

My understanding was that when this happens, you're not going to suffer
the corruption that would occur when the power goes out, because there
won't be any uncommitted writes living in the (no-longer NV) cache. Even
if we tell ZFS to not flush/sync the cache, there shouldn't be anything
in the cache that the array hasn't already flushed.

Am I wrong? Is there some danger, other than slower writes, in running
without working batteries?

Regards,

Marion
Marion Hakanson wrote:
[cut]
> Am I wrong? Is there some danger, other than slower writes, in running
> without working batteries?

You are correct. This is what I meant by "shut down the nonvolatile
cache".
 -- richard
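PS: for reference, the cache mode is visible and settable from the T3
CLI - command names from memory, so check the docs for your firmware:

  t3:/:<1> sys list                 # shows the current cache setting
  t3:/:<2> sys cache writethrough   # force write-through manually
  t3:/:<3> sys cache auto           # default: write-behind only while
                                    # the batteries are healthy

In auto mode the array drops to write-through on its own once it stops
trusting the batteries, which is why no uncommitted data should be
stranded in cache.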
I'm seeing this too. Nothing unusual happened before the panic - just a
shutdown (init 5) and a later startup. I have the crash dump and a copy
of the problem zpool (on swan). Here's the stack trace:

> $C
ffffff0004463680 vpanic()
ffffff00044636b0 vcmn_err+0x28(3, fffffffff792ecf0, ffffff0004463778)
ffffff00044637a0 zfs_panic_recover+0xb6()
ffffff0004463830 space_map_add+0xdb(ffffff014c1a21b8, 472785000, 1000)
ffffff00044638e0 space_map_load+0x1fc(ffffff014c1a21b8, fffffffffbd52568, 1, ffffff014c1a1e88, ffffff0149c88c30)
ffffff0004463920 metaslab_activate+0x66(ffffff014c1a1e80, 4000000000000000)
ffffff00044639e0 metaslab_group_alloc+0x24e(ffffff014bdeb000, 4000, 3a6734, 1435b0000, ffffff014baa9840, 2)
ffffff0004463ab0 metaslab_alloc_dva+0x1da(ffffff01477880c0, ffffff014beefa70, 4000, ffffff014baa9840, 2, 0, 3a6734, 0)
ffffff0004463b50 metaslab_alloc+0x82(ffffff01477880c0, ffffff014beefa70, 4000, ffffff014baa9840, 3, 3a6734, 0, 0)
ffffff0004463ba0 zio_dva_allocate+0x62(ffffff014934c458)
ffffff0004463bd0 zio_execute+0x7f(ffffff014934c458)
ffffff0004463c60 taskq_thread+0x1a7(ffffff014bfb77a0)
ffffff0004463c70 thread_start+8()

This is on a Ferrari laptop (AMD x64) running snv_79. I'd love to rescue
my zpool. Any suggestions?

Thanks,
Gordon
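PS: in case anyone wants to pull the same information from their own
dump - assuming savecore put it in the default /var/crash/<hostname>
directory:

  # cd /var/crash/`hostname`
  # mdb unix.0 vmcore.0
  > $C

mdb opens a crash dump with the panicking thread as the current target,
so $C prints that thread's stack with arguments; ::status and ::msgbuf
are also worth a look for the panic string and recent console messages.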
> space_map_add+0xdb(ffffff014c1a21b8, 472785000, 1000)
> space_map_load+0x1fc(ffffff014c1a21b8, fffffffffbd52568, 1, ffffff014c1a1e88, ffffff0149c88c30)
> running snv_79.

Hmm... did you spend any time on snv_74 or snv_75, which might have
picked up http://bugs.opensolaris.org/view_bug.do?bug_id=6603147 ?

  zdb -e <name_of_pool_that_crashes_on_import>

would be interesting, but the damage might have been done.

Rob
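PS: something like the following, assuming the pool is named 'tank' and
is not currently imported (zdb is a private, version-dependent tool, so
these flags are from memory and may need adjusting):

  # default consistency pass over the exported pool's on-disk state
  zdb -e tank

  # dump the metaslabs/space maps, to inspect the range around 0x472785000
  zdb -e -m tank

  # traverse all block pointers and cross-check the allocated space;
  # a double-allocated segment should surface in the leak/duplicate report
  zdb -e -bb tank

zdb only reads, so it can't make things worse; but if the space maps
were rewritten while the pool was imported, it may only show the
post-damage state.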