Jürgen Keil
2006-May-04 14:33 UTC
[zfs-discuss] zfs snapshot: unexpected error 16 at line 2322 of ../common/libzfs_data....
Yesterday my snv_39 32-bit x86 test box had a strange issue with "zfs snapshot" failing. The strange state lasted for ~5-10 minutes, but eventually the problem disappeared. Unfortunately I can't reproduce the behaviour.

What happened was this: zfs snapshot failed with an "unexpected error 16 at line 2322 of ../common/libzfs_dataset.c" message:

tiger2 106# zfs snapshot data/usr@b40
internal error: unexpected error 16 at line 2322 of ../common/libzfs_dataset.c
tiger2 107# mount -F zfs data/usr /mnt
tiger2 108# zfs snapshot data/usr@b40
internal error: unexpected error 16 at line 2322 of ../common/libzfs_dataset.c
tiger2 109# uname -a
SunOS tiger2 5.11 wos_b39_2 i86pc i386 i86pc
tiger2 110# find /data/opensolaris-20060501/usr/src/uts/ -name "*.h" -exec grep zh_claim_txg {} +
/data/opensolaris-20060501/usr/src/uts/common/fs/zfs/sys/zil.h:	uint64_t zh_claim_txg;	/* txg in which log blocks were claimed */
tiger2 111# zfs snapshot data/usr@b40		[*]
tiger2 112#

A dozen attempts to create the snapshot (before and after a reboot) didn't clear the EBUSY 16 error. Snapshots of other zfs filesystems could be created.

data/usr has several snapshots and clones; one of the clones is mounted as the system's /usr filesystem:

data/usr              2,05G   146G  2,05G  legacy
data/usr@b37          1,58M      -  2,05G  -
data/usr@b38          25,5K      -  2,05G  -
data/usr-b37           104M   146G  2,04G  legacy
data/usr-b37-debug     103M   146G  2,04G  legacy
data/usr-b38           114M   146G  2,00G  legacy
data/usr-b38-debug     101M   146G  2,04G  legacy

Existing snapshots and clones of "data/usr" could be destroyed, but "zfs snapshot data/usr@b40" continued to return the EBUSY error 16.

[*] But: the problem disappeared for some reason.

=======================================================================

The EBUSY error 16 seems to come from zil_suspend(). I tried to trace function calls in the kernel for the failing ZFS_IOC_CREATE ioctl, and narrowed it down to the function zil_suspend(), which was the first one that returned with "16".
Unfortunately the problem disappeared half way through my dtrace experiments, and the only dtrace output that survived is the following part, copy&pasted from a gnome terminal:

...
0  <- dnode_cons                    0
0  -> dnode_setdblksz
0  <- dnode_setdblksz              14
0  -> dmu_zfetch_init
0  -> list_create
0  <- list_create          3734548404
0  -> rw_init
0  <- rw_init              3734548400
0  <- dmu_zfetch_init      3734548400
0  -> list_insert_head
0  <- list_insert_head     3734548052
0  <- dnode_create         3734548048
0  <- dnode_special_open   3734548048
0  -> dsl_dataset_set_user_ptr
0  <- dsl_dataset_set_user_ptr      0
0  <- dmu_objset_open_impl          0
0  <- dmu_objset_open               0
0  -> dmu_objset_zil
0  <- dmu_objset_zil       3700903200
0  -> zil_suspend
0  |  zil_suspend:entry  zh_claim_txg: 83432
0  <- zil_suspend                  16
0  -> dmu_objset_close
0  -> dsl_dataset_close
0  -> dbuf_rele
0  -> dbuf_evict_user
0  -> dsl_dataset_evict
0  -> unique_remove
...

So it seems we're failing here:

http://cvs.opensolaris.org/source/xref/on/usr/src/uts/common/fs/zfs/zil.c#zil_suspend

/*
 * Suspend an intent log.  While in suspended mode, we still honor
 * synchronous semantics, but we rely on txg_wait_synced() to do it.
 * We suspend the log briefly when taking a snapshot so that the snapshot
 * contains all the data it's supposed to, and has an empty intent log.
 */
int
zil_suspend(zilog_t *zilog)
{
	const zil_header_t *zh = zilog->zl_header;
	lwb_t *lwb;

	mutex_enter(&zilog->zl_lock);
	if (zh->zh_claim_txg != 0) {		/* unplayed log */
		mutex_exit(&zilog->zl_lock);
		return (EBUSY);
	}
...

Is this a known problem?
I found several existing bugs in bugs.opensolaris.org, but they are all closed as fixed in snv_20 or snv_24:

Bug ID: 6326215  http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6326215
Bug ID: 6296292  http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6296292
Bug ID: 6299882  http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6299882
Bug ID: 6264708  http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6264708

This message posted from opensolaris.org
Jürgen Keil
2006-Jun-15 15:08 UTC
[zfs-discuss] zfs snapshot: cannot snapshot, dataset is busy
> http://www.opensolaris.org/jive/thread.jspa?messageID=36229#36229

The problem is back, on a different system: a laptop running on-20060605 bits. Compared to snv_29, the error message has improved, though:

# zfs snapshot hdd/usr-base@b43
cannot snapshot 'hdd/usr-base@b43': dataset is busy
# zfs snapshot hdd/export@test
cannot snapshot 'hdd/export@test': dataset is busy

Both the hdd/usr-base and hdd/export zfs are using a legacy mountpoint and are not mounted. zdb reports this:

# zdb -ivvvv hdd/export
Dataset hdd/export [ZPL], ID 313, cr_txg 378637, last_txg 598821, 1.71G, 147592 objects

	ZIL header: claim_txg 598821, seq 0

	first block: [L0 ZIL intent log] 2000L/2000P DVA[0]=<0:51506000:2000> zilog uncompressed LE contiguous birth=519227 fill=0 cksum=ed322d1e7ca741fc:194d26d6f2582165:139:1
	Block seqno 1, already claimed, [L0 ZIL intent log] 2000L/2000P DVA[0]=<0:51506000:2000> zilog uncompressed LE contiguous birth=519227 fill=0 cksum=ed322d1e7ca741fc:194d26d6f2582165:139:1

Note the "claim_txg" value != 0. For the hdd/usr-base zfs the output was similar, claim_txg != 0.

By manually mounting and unmounting that filesystem, the claim_txg was reset to 0, and the zfs snapshot command didn't fail with "dataset is busy" any more.