Jürgen Keil
2006-May-04 14:33 UTC
[zfs-discuss] zfs snapshot: unexpected error 16 at line 2322 of ../common/libzfs_data....
Yesterday my snv_39 32-bit x86 test box had a strange issue with "zfs snapshot" failing. The strange state lasted for ~5-10 minutes, but eventually the problem disappeared. Unfortunately I can't reproduce the behaviour.

What happened was this: zfs snapshot failed with an "unexpected error 16 at line 2322 of ../common/libzfs_dataset.c" message:

tiger2 106# zfs snapshot data/usr@b40
internal error: unexpected error 16 at line 2322 of ../common/libzfs_dataset.c
tiger2 107# mount -F zfs data/usr /mnt
tiger2 108# zfs snapshot data/usr@b40
internal error: unexpected error 16 at line 2322 of ../common/libzfs_dataset.c
tiger2 109# uname -a
SunOS tiger2 5.11 wos_b39_2 i86pc i386 i86pc
tiger2 110# find /data/opensolaris-20060501/usr/src/uts/ -name "*.h" -exec grep zh_claim_txg {} +
/data/opensolaris-20060501/usr/src/uts/common/fs/zfs/sys/zil.h:	uint64_t zh_claim_txg;	/* txg in which log blocks were claimed */
tiger2 111# zfs snapshot data/usr@b40		[*]
tiger2 112#

A dozen attempts to create the snapshot (before and after a reboot) didn't clear the EBUSY 16 error. Snapshots of other zfs filesystems could be created.

data/usr has several snapshots and clones; one of the clones is mounted as the system's /usr filesystem:

data/usr              2,05G   146G  2,05G  legacy
data/usr@b37          1,58M      -  2,05G  -
data/usr@b38          25,5K      -  2,05G  -
data/usr-b37           104M   146G  2,04G  legacy
data/usr-b37-debug     103M   146G  2,04G  legacy
data/usr-b38           114M   146G  2,00G  legacy
data/usr-b38-debug     101M   146G  2,04G  legacy

Existing snapshots and clones of "data/usr" could be destroyed, but "zfs snapshot data/usr@b40" continued to return the EBUSY error 16.

[*] But: the problem disappeared for some reason.

=======================================================================

The EBUSY error 16 seems to come from zil_suspend(). I tried to trace function calls in the kernel for the failing ZFS_IOC_CREATE ioctl, and narrowed it down to the function zil_suspend(), which was the first one that returned with "16".
Unfortunately the problem disappeared half way through my dtrace experiments, and the only dtrace output that survived is the following part, copy&pasted from a gnome terminal:

...
0  <- dnode_cons                    0
0  -> dnode_setdblksz
0  <- dnode_setdblksz              14
0  -> dmu_zfetch_init
0  -> list_create
0  <- list_create          3734548404
0  -> rw_init
0  <- rw_init              3734548400
0  <- dmu_zfetch_init      3734548400
0  -> list_insert_head
0  <- list_insert_head     3734548052
0  <- dnode_create         3734548048
0  <- dnode_special_open   3734548048
0  -> dsl_dataset_set_user_ptr
0  <- dsl_dataset_set_user_ptr      0
0  <- dmu_objset_open_impl          0
0  <- dmu_objset_open               0
0  -> dmu_objset_zil
0  <- dmu_objset_zil       3700903200
0  -> zil_suspend
0  |  zil_suspend:entry  zh_claim_txg: 83432
0  <- zil_suspend                  16
0  -> dmu_objset_close
0  -> dsl_dataset_close
0  -> dbuf_rele
0  -> dbuf_evict_user
0  -> dsl_dataset_evict
0  -> unique_remove
...

So it seems we're failing here:

http://cvs.opensolaris.org/source/xref/on/usr/src/uts/common/fs/zfs/zil.c#zil_suspend

/*
 * Suspend an intent log.  While in suspended mode, we still honor
 * synchronous semantics, but we rely on txg_wait_synced() to do it.
 * We suspend the log briefly when taking a snapshot so that the snapshot
 * contains all the data it's supposed to, and has an empty intent log.
 */
int
zil_suspend(zilog_t *zilog)
{
	const zil_header_t *zh = zilog->zl_header;
	lwb_t *lwb;

	mutex_enter(&zilog->zl_lock);
	if (zh->zh_claim_txg != 0) {		/* unplayed log */
		mutex_exit(&zilog->zl_lock);
		return (EBUSY);
	}
...

Is this a known problem?
I found several existing bugs in bugs.opensolaris.org, but they are all closed as fixed in snv_20 or snv_24:

Bug ID: 6326215  http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6326215
Bug ID: 6296292  http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6296292
Bug ID: 6299882  http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6299882
Bug ID: 6264708  http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6264708

This message posted from opensolaris.org
Jürgen Keil
2006-Jun-15 15:08 UTC
[zfs-discuss] zfs snapshot: cannot snapshot, dataset is busy
> http://www.opensolaris.org/jive/thread.jspa?messageID=36229#36229

The problem is back, on a different system: a laptop running on-20060605 bits. Compared to snv_29, the error message has improved, though:

# zfs snapshot hdd/usr-base@b43
cannot snapshot 'hdd/usr-base@b43': dataset is busy
# zfs snapshot hdd/export@test
cannot snapshot 'hdd/export@test': dataset is busy

Both the hdd/usr-base and hdd/export zfs are using a legacy mountpoint and are not mounted. zdb reports this:

# zdb -ivvvv hdd/export
Dataset hdd/export [ZPL], ID 313, cr_txg 378637, last_txg 598821, 1.71G, 147592 objects

	ZIL header: claim_txg 598821, seq 0

	first block: [L0 ZIL intent log] 2000L/2000P DVA[0]=<0:51506000:2000> zilog uncompressed LE contiguous birth=519227 fill=0 cksum=ed322d1e7ca741fc:194d26d6f2582165:139:1
	Block seqno 1, already claimed, [L0 ZIL intent log] 2000L/2000P DVA[0]=<0:51506000:2000> zilog uncompressed LE contiguous birth=519227 fill=0 cksum=ed322d1e7ca741fc:194d26d6f2582165:139:1

Note the "claim_txg" value != 0. For the hdd/usr-base zfs the output was similar, claim_txg != 0.

By manually mounting and unmounting that filesystem, the claim_txg was reset to 0, and the zfs snapshot command didn't fail with "dataset is busy" any more.