Darren J Moffat
2006-Sep-13 12:20 UTC
[zfs-code] ASSERT failed dn->dn_nlevels > level (0x0 > 0x0) dbuf.c, line: 1523
Using the ZFS crypto bits, see [1] for webrev, which are in sync with ON
as of 2006-09-12 (i.e. they include the BrandZ stuff and the changes
that Eric putback on the 12th).

[1] http://cr.grommit.com/~darrenm/zfs-crypto/

I created a new pool:

# zpool create -f tank c0t1d0

I then created four new file systems:

# zfs create -o encryption=aes256 tank/cipher-aes256
# zfs create -o encryption=aes128 tank/cipher-aes128
# zfs create -o encryption=aes192 tank/cipher-aes192
# zfs create -o encryption=off tank/clear

I listed the encryption property, then I exported the pool.

When I did so the machine panic'd thus:

panic[cpu1]/thread=ffffffffbe728880: assertion failed: dn->dn_nlevels >
level (0x0 > 0x0), file: ../../common/fs/zfs/dbuf.c, line: 1523

fffffe8000bb4730 genunix:assfail3+b9 ()
fffffe8000bb47d0 zfs:dbuf_hold_impl+329 ()
fffffe8000bb4810 zfs:dbuf_hold+2b ()
fffffe8000bb48a0 zfs:dnode_hold_impl+bd ()
fffffe8000bb48d0 zfs:dnode_hold+2b ()
fffffe8000bb4950 zfs:dmu_buf_hold+45 ()
fffffe8000bb4a20 zfs:zap_lockdir+58 ()
fffffe8000bb4aa0 zfs:zap_lookup+4d ()
fffffe8000bb4b10 zfs:dsl_pool_open+94 ()
fffffe8000bb4bb0 zfs:spa_load+566 ()
fffffe8000bb4c00 zfs:spa_tryimport+90 ()
fffffe8000bb4c50 zfs:zfs_ioc_pool_tryimport+31 ()
fffffe8000bb4cd0 zfs:zfsdev_ioctl+115 ()
fffffe8000bb4d10 genunix:cdev_ioctl+48 ()
fffffe8000bb4d50 specfs:spec_ioctl+86 ()
fffffe8000bb4db0 genunix:fop_ioctl+37 ()
fffffe8000bb4eb0 genunix:ioctl+16b ()
fffffe8000bb4f00 unix:brand_sys_syscall32+2a1 ()

For some reason savecore didn't grab the dump so I tried again.

This time I only got the first two filesystems created and I got
a different panic:

panic[cpu1]/thread=fffffe800036bc80: assertion failed: (((bp)->blk_birth
== 0) ? 0 : ((((((bp)->blk_prop[0]) >> (0)) & ((1ULL << (16)) - 1)) +
(1)) << (9))) == db->db_level == 1 ? dn->dn_datablksz :
(1<<dn->dn_phys->dn_indblkshift) (0x200 == 0x4000), file:
../../common/fs/zfs/dbuf.c, line: 2186

fffffe800036b4d0 genunix:assfail3+b9 ()
fffffe800036b820 zfs:dbuf_write_done+920 ()
fffffe800036b880 zfs:arc_write_done+1d3 ()
fffffe800036ba10 zfs:zio_done+2e4 ()
fffffe800036ba40 zfs:zio_next_stage+112 ()
fffffe800036ba90 zfs:zio_wait_for_children+56 ()
fffffe800036bab0 zfs:zio_wait_children_done+20 ()
fffffe800036bae0 zfs:zio_next_stage+112 ()
fffffe800036bb30 zfs:zio_vdev_io_assess+140 ()
fffffe800036bb60 zfs:zio_next_stage+112 ()
fffffe800036bbb0 zfs:vdev_mirror_io_done+377 ()
fffffe800036bbd0 zfs:zio_vdev_io_done+26 ()
fffffe800036bc60 genunix:taskq_thread+1dc ()
fffffe800036bc70 unix:thread_start+8 ()

Then on reboot from that panic we see this one:

panic[cpu0]/thread=ffffffff870a0c00: assertion failed: dn->dn_nlevels >
level (0x0 > 0x0), file: ../../common/fs/zfs/dbuf.c, line: 1523

fffffe80005806a0 genunix:assfail3+b9 ()
fffffe8000580740 zfs:dbuf_hold_impl+329 ()
fffffe8000580780 zfs:dbuf_hold+2b ()
fffffe8000580810 zfs:dnode_hold_impl+bd ()
fffffe8000580840 zfs:dnode_hold+2b ()
fffffe80005808c0 zfs:dmu_buf_hold+45 ()
fffffe8000580990 zfs:zap_lockdir+58 ()
fffffe8000580a10 zfs:zap_lookup+4d ()
fffffe8000580a80 zfs:dsl_pool_open+94 ()
fffffe8000580b20 zfs:spa_load+566 ()
fffffe8000580b90 zfs:spa_open_common+c5 ()
fffffe8000580c00 zfs:spa_get_stats+4a ()
fffffe8000580c50 zfs:zfs_ioc_pool_stats+32 ()
fffffe8000580cd0 zfs:zfsdev_ioctl+115 ()
fffffe8000580d10 genunix:cdev_ioctl+48 ()
fffffe8000580d50 specfs:spec_ioctl+86 ()
fffffe8000580db0 genunix:fop_ioctl+37 ()
fffffe8000580eb0 genunix:ioctl+16b ()
fffffe8000580f00 unix:brand_sys_syscall32+2a1 ()

i.e. the same as the panic from the first attempt.

So I rebooted into failsafe and cleared the zpool.cache file so I could
come back up and get the dump (which I did this time).

I then bfu'd the machine to the base ON nightly that I'm in sync with,
to check everything is okay in the base and to create the pool there.

So I did this:

# zpool create -f tank c0t1d0
# zfs create tank/clear
# zpool export tank

Then bfu'd back to the zfs-crypto bits again and rebooted and attempted
to import the pool which was created with the base ON bits:

banking# zpool import

panic[cpu2]/thread=ffffffff82b9c3e0: assertion failed:
dn->dn_indblkshift <= 17 (0xb1 <= 0x11), file:
../../common/fs/zfs/dnode.c, line: 136

fffffe8000b41850 genunix:assfail3+b9 ()
fffffe8000b41890 zfs:dnode_verify+320 ()
fffffe8000b418d0 zfs:dnode_special_open+2c ()
fffffe8000b41aa0 zfs:dmu_objset_open_impl+3aa ()
fffffe8000b41b10 zfs:dsl_pool_open+59 ()
fffffe8000b41bb0 zfs:spa_load+566 ()
fffffe8000b41c00 zfs:spa_tryimport+90 ()
fffffe8000b41c50 zfs:zfs_ioc_pool_tryimport+31 ()
fffffe8000b41cd0 zfs:zfsdev_ioctl+115 ()
fffffe8000b41d10 genunix:cdev_ioctl+48 ()
fffffe8000b41d50 specfs:spec_ioctl+86 ()
fffffe8000b41db0 genunix:fop_ioctl+37 ()
fffffe8000b41eb0 genunix:ioctl+16b ()
fffffe8000b41f00 unix:brand_sys_syscall32+2a1 ()

The dumps are available on SWAN from:
/net/borg.sfbay/cube/builds/darrenm/zfs-crypto-dumps

[note borg is sparcv9 but the dumps are from an amd64 kernel]

I think I must have missed something with the merging in of the crypto
pipeline changes but I can't see what it is. These panics are in very
strange places.

I need help, the ZFS crypto project is halted until this is resolved.

Thanks in advance.

--
Darren J Moffat
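The second assertion in the message above compares the logical size decoded from the block pointer with the size the dnode expects. The following is a minimal sketch of that arithmetic, assuming only the bit layout spelled out in the expanded expression in the panic text (low 16 bits of blk_prop[0], plus one, shifted left by 9). The function names are hypothetical and are not the ZFS macros themselves.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper mirroring the expanded expression in the panic:
 * 0 if blk_birth is 0, otherwise ((low 16 bits of blk_prop[0]) + 1) << 9.
 */
static uint64_t
sketch_bp_lsize(uint64_t blk_birth, uint64_t blk_prop0)
{
	if (blk_birth == 0)
		return (0);
	return ((((blk_prop0 >> 0) & ((1ULL << 16) - 1)) + 1) << 9);
}

/* Size implied by a block shift: 1 << shift bytes. */
static uint64_t
sketch_indirect_size(unsigned int indblkshift)
{
	assert(indblkshift < 64);	/* shifting by 64 or more is undefined */
	return (1ULL << indblkshift);
}
```

With a size field of 0 and a nonzero birth time the helper yields 0x200 (512 bytes), while a shift of 14 yields 0x4000 (16K); that is the `0x200 == 0x4000` pair reported in the panic. The 0x4000 could equally come from the dn_datablksz branch of the assertion, so only the dump can say which branch was taken and why the size field decoded so small.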
Darren J Moffat
2006-Sep-19 17:04 UTC
[zfs-code] Re: ASSERT failed dn->dn_nlevels > level (0x0 > 0x0) dbuf.c, line: 1523
I really need some help on this. Without help the ZFS crypto project is
stalled.

I've updated my bits to the ON gate as of last night. The way I
recreate this is slightly different but the assert is still the same.

I can create a pool with my bits and export it; when I import it I
get the dn_nlevels assert.

Please, I really need help.

--
Darren J Moffat
johansen
2006-Sep-19 17:38 UTC
[zfs-code] Re: ASSERT failed dn->dn_nlevels > level (0x0 > 0x0) dbuf.c, line: 1523
Hi Darren,

I took a look at your webrev and have a couple of questions:

1. When you built this source did you clobber all of the kernel portion
and rebuild? I ask this because in your sdiff for
uts/common/fs/zfs/sys/dnode.h you've inserted the dnode structure member
dn_crypt above dn_nlevels. If you have code that is trying to manipulate
dn_crypt but some old code still references dn_nlevels at the offset of
dn_crypt, you're going to run into very strange problems.

2. If you've performed a clobber build, this could potentially indicate
a corrupted dnode_phys_t or corrupted objset_impl_t. The dnode gets
dn_nlevels set from dmu_objset_create_impl where it pulls the value out
of the objset os_meta_dnode. In dnode_create, it uses the dnode_phys_t's
dn_nlevels to set the value. If the value is getting set to zero by one
of these routines you may have corrupted something in the objset or the
dnode_phys. Have you investigated what is going on with these
structures?

I hope this helps,

-j
--
This message posted from opensolaris.org
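The stale-object concern in point 1 can be illustrated with a small sketch. The two structures below are hypothetical, cut-down stand-ins (not the real dnode definitions); they only assume what the message states, namely that dn_crypt was inserted above dn_nlevels.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical "old" layout: what stale object code was compiled against. */
struct old_dnode {
	uint8_t dn_type;
	uint8_t dn_nlevels;
	uint8_t dn_indblkshift;
};

/* Hypothetical "new" layout: dn_crypt inserted above dn_nlevels. */
struct new_dnode {
	uint8_t dn_type;
	uint8_t dn_crypt;
	uint8_t dn_nlevels;
	uint8_t dn_indblkshift;
};

/* Read dn_nlevels the way stale code would: at the old byte offset. */
static uint8_t
stale_read_nlevels(const struct new_dnode *dn)
{
	uint8_t v;

	/* The old dn_nlevels offset now lands on dn_crypt. */
	assert(offsetof(struct old_dnode, dn_nlevels) ==
	    offsetof(struct new_dnode, dn_crypt));
	memcpy(&v, (const uint8_t *)dn + offsetof(struct old_dnode, dn_nlevels),
	    sizeof (v));
	return (v);
}
```

In this toy layout, a new-style dnode with dn_crypt = 0 and dn_nlevels = 1 reads back as 0 levels through the stale offset, which is the kind of symptom a partial rebuild could produce. A full clobber build rules this particular cause out.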
Darren J Moffat
2006-Sep-22 13:49 UTC
[zfs-code] Re: ASSERT failed dn->dn_nlevels > level (0x0 > 0x0) dbuf.c, line: 1523
johansen wrote:
> Hi Darren,
> I took a look at your webrev and have a couple of questions:
>
> 1. When you built this source did you clobber all of the kernel
> portion and rebuild?

I've done a full clobber build and it still happens.

> I ask this because in your sdiff for uts/common/fs/zfs/sys/dnode.h
> you've inserted the dnode structure member dn_crypt above dn_nlevels.
> If you have code that is trying to manipulate dn_crypt but some old
> code still references dn_nlevels at the offset of dn_crypt, you're
> going to run into very strange problems.
>
> 2. If you've performed a clobber build, this could potentially
> indicate a corrupted dnode_phys_t or corrupted objset_impl_t. The
> dnode gets dn_nlevels set from dmu_objset_create_impl where it pulls
> the value out of the objset os_meta_dnode. In dnode_create, it uses
> the dnode_phys_t's dn_nlevels to set the value. If the value is
> getting set to zero by one of these routines you may have corrupted
> something in the objset or the dnode_phys. Have you investigated what
> is going on with these structures?

That's what I thought too, but I've looked at all the places where a
dnode_t is translated into something else and none of them appear to be
wrong.

If I operate on an existing Version 3 pool I don't have any of these
problems. This only happens when using a Version 4 (crypto is version 4
in my workspace) pool.

Thanks for your suggestions, but I'm still stuck!

--
Darren J Moffat
eric kustarz
2006-Sep-22 16:36 UTC
[zfs-code] Re: ASSERT failed dn->dn_nlevels > level (0x0 > 0x0) dbuf.c, line: 1523
Darren J Moffat wrote:
> Thanks for your suggestions, but I'm still stuck!

Not sure if you've already posted this or not, but do you have a webrev?

eric
Darren J Moffat
2006-Sep-25 10:09 UTC
[zfs-code] Re: ASSERT failed dn->dn_nlevels > level (0x0 > 0x0) dbuf.c, line: 1523
eric kustarz wrote:
> Not sure if you've already posted this or not, but do you have a webrev?

Yes, it was in the original message:

http://cr.grommit.com/~darrenm/zfs-crypto/

--
Darren J Moffat
eric kustarz
2006-Sep-28 02:38 UTC
[zfs-code] Re: ASSERT failed dn->dn_nlevels > level (0x0 > 0x0) dbuf.c, line: 1523
Darren J Moffat wrote:
> Yes it was in the original message:
>
> http://cr.grommit.com/~darrenm/zfs-crypto/

I didn't see anything obvious in the webrev, but have you tried to
temporarily make zio_decrypt_data() and zio_encrypt_data() no-ops to
rule out that new code and focus on the existing framework?

If it isn't that then I'd ping Bill to make sure the zio pipeline
changes are correct and Matt to make sure the DMU changes are right (if
you haven't already).

eric
Darren J Moffat
2006-Sep-28 09:06 UTC
[zfs-code] Re: ASSERT failed dn->dn_nlevels > level (0x0 > 0x0) dbuf.c, line: 1523
eric kustarz wrote:
> I didn't see anything obvious in the webrev, but have you tried to
> temporarily make zio_decrypt_data() and zio_encrypt_data() no-ops to
> rule out that new code and focus on the existing framework?

I haven't recently but I can try that again.

> If it isn't that then I'd ping Bill to make sure the zio pipeline
> changes are correct and Matt to make sure the DMU changes are right (if
> you haven't already).

Bill ping
Matt ping

I know they both hang out here :-)

--
Darren J Moffat
Darren J Moffat
2006-Sep-28 11:13 UTC
[zfs-code] Re: ASSERT failed dn->dn_nlevels > level (0x0 > 0x0) dbuf.c, line: 1523
eric kustarz wrote:
> I didn't see anything obvious in the webrev, but have you tried to
> temporarily make zio_decrypt_data() and zio_encrypt_data() no-ops to
> rule out that new code and focus on the existing framework?

Silly me, I should have noticed this myself :-)

If you look at the source you will see that, other than some of the
types, those functions are all #ifdef _KERNEL. Given that I can make
exactly the same assert trip with ztest/zdb, which use the
libzfs/libzpool builds of the uts/common/fs/zfs code, I guess I've
already done that.

--
Darren J Moffat
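The reasoning above relies on the function bodies being compiled out in userland. A minimal sketch of that pattern follows; the function name, signature and the placeholder transform are all hypothetical, and only the `#ifdef _KERNEL` shape is taken from the message.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical stand-in for an encrypt hook. The real zio_encrypt_data()
 * in the webrev may look quite different.
 */
static int
sketch_encrypt_data(unsigned char *buf, size_t len)
{
	assert(buf != NULL || len == 0);
#ifdef _KERNEL
	/* Placeholder transform standing in for kernel-only crypto work. */
	for (size_t i = 0; i < len; i++)
		buf[i] ^= 0xff;
#else
	/* Userland build (ztest/zdb style): the data is left untouched. */
	(void) buf;
	(void) len;
#endif
	return (0);
}

/* Returns 1 if a call leaves the buffer byte-for-byte unchanged. */
static int
sketch_is_noop(void)
{
	unsigned char before[4] = { 1, 2, 3, 4 };
	unsigned char after[4];

	memcpy(after, before, sizeof (after));
	(void) sketch_encrypt_data(after, sizeof (after));
	return (memcmp(before, after, sizeof (after)) == 0);
}
```

Built without _KERNEL defined, sketch_is_noop() returns 1, so a failure reproduced through a userland build of the same sources cannot be blamed on the compiled-out body. That matches the conclusion drawn in the message, under the assumption that nothing else in those functions runs in userland.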
Matthew Ahrens
2006-Sep-28 17:36 UTC
[zfs-code] Re: ASSERT failed dn->dn_nlevels > level (0x0 > 0x0) dbuf.c, line: 1523
Darren J Moffat wrote:
> Bill ping
> Matt ping
>
> I know they both hang out here :-)

I just looked at your code and didn't see anything that would be causing
your bugs. I can't speak for the SPA stuff, but the rest looks good.
Here are some general comments:

In dmu_object_info_t, you need to decrease the size of doi_pad by 1
byte. This is so that all structure members will have the same offset
on 32 and 64 bits, so that it can be passed between a 64-bit kernel and
a 32-bit app.

In dbuf_sync(), the comment says "ZFS metadata is in the clear but ZPL
metadata should be encrypted". I assume that "ZFS metadata" means
"DMU/DSL/SPA metadata"? But it looks like you are encrypting DMU
metadata (indirect blocks -- db_level > 0), and you are not encrypting
any metadata, not even ZPL metadata like directories (from ot_metadata).

Eventually, the zil's blocks will need to be encrypted as well.

--matt
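The doi_pad comment can be sketched as follows. The structures are hypothetical, simplified stand-ins (not the real dmu_object_info_t); they only assume that a new 1-byte member was added ahead of an explicit pad array that precedes a 64-bit member.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout before the change: explicit pad up to offset 8. */
struct info_before {
	uint8_t  a;
	uint8_t  b;
	uint8_t  c;
	uint8_t  pad[5];
	uint64_t big;
};

/* After adding a 1-byte member, the pad shrinks by exactly one byte. */
struct info_after {
	uint8_t  a;
	uint8_t  b;
	uint8_t  c;
	uint8_t  crypt;
	uint8_t  pad[4];
	uint64_t big;
};

/* Checks that the change moved nothing; returns the offset of 'big'. */
static size_t
sketch_check_layout(void)
{
	assert(offsetof(struct info_before, big) ==
	    offsetof(struct info_after, big));
	assert(sizeof (struct info_before) == sizeof (struct info_after));
	return (offsetof(struct info_after, big));
}
```

Because the padding is explicit, the 64-bit member sits at offset 8 whether the compiler aligns 64-bit integers to 4 bytes (as 32-bit x86 does inside structures) or to 8 bytes. Leaving the pad at its old size after adding a member would push later members to offsets that depend on the ABI.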
Darren J Moffat
2006-Sep-29 11:28 UTC
[zfs-code] Re: ASSERT failed dn->dn_nlevels > level (0x0 > 0x0) dbuf.c, line: 1523
Matthew Ahrens wrote:
> I just looked at your code and didn't see anything that would be causing
> your bugs. I can't speak for the SPA stuff, but the rest looks good.
> Here are some general comments:

Thanks for looking.

> in dmu_object_info_t, you need to decrease the size of doi_pad by 1
> byte. This is so that all structure members will have the same offset
> on 32 and 64 bits, so that it can be passed between a 64-bit kernel and
> a 32-bit app.

Fixed, thanks.

> in dbuf_sync(), the comment says "ZFS metadata is in the clear but ZPL
> metadata should be encrypted". I assume that "ZFS metadata" means
> "DMU/DSL/SPA metadata"? But it looks like you are encrypting DMU
> metadata (indirect blocks -- db_level > 0), and you are not encrypting
> any metadata, not even ZPL metadata like directories (from ot_metadata).

Yeah, I know about that, but I wanted to get over the current hurdle
first.

> eventually, the zil's blocks will need to be encrypted as well.

Yep, I think I have some XXX darrenm's in the code about that :-)

Thanks again for trying. Now what do I do ;-(

--
Darren J Moffat
Matthew Ahrens
2006-Sep-29 15:20 UTC
[zfs-code] Re: ASSERT failed dn->dn_nlevels > level (0x0 > 0x0) dbuf.c, line: 1523
Darren J Moffat wrote:
> Thanks again for trying. Now what do I do ;-(

The bug doesn't exist in nevada, right? So try removing your code until
you find what's broken.

--matt
Darren J Moffat
2006-Sep-29 15:22 UTC
[zfs-code] Re: ASSERT failed dn->dn_nlevels > level (0x0 > 0x0) dbuf.c, line: 1523
Matthew Ahrens wrote:
> Darren J Moffat wrote:
>> Thanks again for trying. Now what do I do ;-(
>
> The bug doesn't exist in nevada, right? So try removing your code until
> you find what's broken.

Yep. What's more, if I use my code to operate on a version 3 pool
(crypto bumps the version to 4 in my workspace) it doesn't happen
either.

--
Darren J Moffat
Mark Maybee
2006-Sep-29 23:57 UTC
[zfs-code] Re: ASSERT failed dn->dn_nlevels > level (0x0 > 0x0) dbuf.c, line: 1523
Darren,

I looked a bit at your dumps... in both cases, the problem is that the
os_phys block that we read from the disk is garbage:

> 0xffffffff9377b000::print objset_phys_t
{
    os_meta_dnode = {
        dn_type = 0
        dn_indblkshift = 0
        dn_nlevels = 0
        dn_nblkptr = 0
        dn_bonustype = 0
        dn_checksum = 0
        dn_compress = 0
        dn_flags = 0
        dn_datablkszsec = 0x3
        dn_bonuslen = 0
        dn_pad2 = [ 0, 0, 0 ]
        dn_crypt = 0
        dn_maxblkid = 0
        dn_used = 0
        dn_pad3 = [ 0, 0, 0, 0 ]
        dn_blkptr = [
            {
                blk_dva = [
                    {
                        dva_word = [ 0, 0 ]
                    }
...

I checked the actual arc buf this came from, and it looks the same. So
the buf was successfully read, and it checksummed, but it doesn't have
good data. This pretty much says that the problem is on the write side.
When we wrote out the root block in dmu_objset_sync(), we must have
written garbage. I'm not yet quite sure how this happened... perhaps
something is messed up in your write path changes
(arc_write->zio_write->...), but it's not obvious. I'll investigate some
more when I get a chance....

-Mark
Mark Maybee
2006-Oct-04 21:58 UTC
[zfs-code] Re: ASSERT failed dn->dn_nlevels > level (0x0 > 0x0) dbuf.c, line: 1523
Darren,

It looks like you modeled your ENCRYPT/DECRYPT stages on the
COMPRESS/UNCOMPRESS stages. This is all well and good... however your
emulation is not quite accurate: the WRITE_COMPRESS stage is *always*
included in the write pipeline, so the zio_write() func only adds it to
the async_stages variable when compression is turned on. The
WRITE_ENCRYPT stage, on the other hand, is explicitly in the
ASYNC_STAGES, but *not* always part of the write pipeline. In
zio_write() you are setting it in the async_stages variable... which it
is already in. I think you need to add it to the pipeline rather than
the async_stages.

It's still not clear to me how this can result in your problems, but
then I don't yet understand how the SPA io pipeline works in all
circumstances.

-Mark
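The distinction Mark draws between the pipeline and async_stages can be shown with a toy bitmask model. Everything below is hypothetical: the stage bits, the structure and the function names are illustrative only, and the model assumes that a stage executes only if its bit is set in the pipeline mask, with async_stages merely selecting which of those stages are dispatched asynchronously. The thread does not confirm that the real zio code behaves this way.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stage bits; the real ZIO stage numbering is not shown here. */
#define	SK_STAGE_OPEN		(1U << 0)
#define	SK_STAGE_WRITE_COMPRESS	(1U << 1)
#define	SK_STAGE_WRITE_ENCRYPT	(1U << 2)
#define	SK_STAGE_DONE		(1U << 3)

/* Base write pipeline: compress is always present, encrypt is not. */
#define	SK_WRITE_PIPELINE \
	(SK_STAGE_OPEN | SK_STAGE_WRITE_COMPRESS | SK_STAGE_DONE)

struct sk_zio {
	uint32_t pipeline;	/* stages this I/O will execute */
	uint32_t async_stages;	/* stages to dispatch asynchronously */
};

/* Returns nonzero if the stage would actually run for this I/O. */
static int
sk_stage_runs(const struct sk_zio *zio, uint32_t stage)
{
	return ((zio->pipeline & stage) != 0);
}

/* Variant A: the stage is only marked async, as the message describes. */
static void
sk_write_async_only(struct sk_zio *zio)
{
	zio->pipeline = SK_WRITE_PIPELINE;
	zio->async_stages = SK_STAGE_WRITE_ENCRYPT;
}

/* Variant B: the stage is added to the pipeline as well. */
static void
sk_write_in_pipeline(struct sk_zio *zio)
{
	zio->pipeline = SK_WRITE_PIPELINE | SK_STAGE_WRITE_ENCRYPT;
	zio->async_stages = SK_STAGE_WRITE_ENCRYPT;
	/* Every async stage is now also a stage that will run. */
	assert((zio->async_stages & ~zio->pipeline) == 0);
}
```

In this model, variant A never runs the encrypt stage even though it is flagged async, while variant B does. That is the change Mark suggests, though as he notes it does not by itself explain the garbage root block.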