Hi!

During a copy (zfs send/recv) of a ~1TB dataset from one zpool to
another, my system seems to run into some issues. A simultaneous "find"
on the source dataset deadlocks. This is the kernel stack:

$ procstat -kk 1786
  PID    TID COMM             TDNAME           KSTACK
 1786 101344 find             -                mi_switch+0x194 sleepq_wait+0x42 _cv_wait+0x112 zio_wait+0x61 dbuf_read+0x619 dmu_buf_hold+0xe0 zap_get_leaf_byblk+0x4a zap_deref_leaf+0x68 fzap_cursor_retrieve+0xe7 zap_cursor_retrieve+0x155 zfs_freebsd_readdir+0x2d8 VOP_READDIR_APV+0x78 kern_getdirentries+0x212 sys_getdirentries+0x23 amd64_syscall+0x5ea Xfast_syscall+0xf7

The zfs send/recv has become very slow, although it still seems to be
making some progress (the copy is from p0 to p2):

p0          15.9T  2.20T    318      0  10.2M      0
p1          11.1T  7.00T      0      0      0      0
p2          2.55T  41.0T      0      0      0      0
----------  -----  -----  -----  -----  -----  -----
p0          15.9T  2.20T    294      0  9.29M      0
p1          11.1T  7.00T      0      0      0      0
p2          2.55T  41.0T      0      0      0      0
----------  -----  -----  -----  -----  -----  -----
p0          15.9T  2.20T    307      0  9.12M      0
p1          11.1T  7.00T      0      0      0      0
p2          2.55T  41.0T      0      0      0      0
----------  -----  -----  -----  -----  -----  -----
p0          15.9T  2.20T    293      0  8.69M      0
p1          11.1T  7.00T      0      0      0      0
p2          2.55T  41.0T      0     58      0  1.61M
----------  -----  -----  -----  -----  -----  -----
p0          15.9T  2.20T    301      0  10.9M      0
p1          11.1T  7.00T      0      0      0      0
p2          2.55T  41.0T      0  1.62K      0  49.6M
----------  -----  -----  -----  -----  -----  -----

The machine is otherwise quite idle. When the copy started, I got
around 200MB/s; now it's around 10MB/s.

The ARC has grown large, but that is likely normal:

last pid:  1863;  load averages:  0.20,  0.33,  0.63    up 0+02:27:44  16:31:52
50 processes:  1 running, 49 sleeping
CPU:  0.0% user,  0.0% nice,  0.2% system,  0.0% interrupt, 99.8% idle
Mem: 1688M Active, 61M Inact, 107G Wired, 3288K Cache, 126M Buf, 15G Free
ARC: 99G Total, 2483M MFU, 89G MRU, 33M Anon, 888M Header, 7427M Other
Swap: 128G Total, 128G Free

  PID USERNAME    THR PRI NICE   SIZE    RES STATE   C   TIME   WCPU COMMAND
 1229 root          1  20    0 39700K  3292K piperd  7  24:27  1.07% zfs
 1228 root          2  20    0 39832K  3420K nanslp  5  17:02  0.39% zfs
...

The source pool is fairly full. Can that be an issue?

$ zpool list
NAME   SIZE  ALLOC   FREE    CAP  DEDUP  HEALTH  ALTROOT
p0    18.1T  15.9T  2.20T    87%  1.00x  ONLINE  -
p1    18.1T  11.1T  7.00T    61%  1.00x  ONLINE  -
p2    43.5T  2.53T  41.0T     5%  1.00x  ONLINE  -

The machine is running 9.3-REL and has two mps controllers.

Any ideas?

Bengt
Bengt Ahlgren <bengta at sics.se> writes:

> During a copy (zfs send/recv) of a ~1TB dataset from one zpool to
> another, my system seems to run into some issues. A simultaneous "find"
> on the source data set deadlocks.
> [...]
> Any ideas?

Just for the record: there was no deadlock after all. The apparent hang
turned out to be caused by a directory with ~4.5M entries.

Bengt
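For anyone who runs into a similar apparent hang in readdir: below is a
minimal sketch, in Python, of one way to look for directories with an
unusually large number of entries. It is not what was used in this thread.
The function name large_dirs, the threshold value and the example path are
all made up for illustration. It streams entries with os.scandir rather
than building a sorted list, but on a busy pool it will still take a long
time on a directory of this size, since it has to read the same directory
blocks that find was waiting on.

```python
import os


def large_dirs(root, threshold):
    """Yield (path, entry_count) for every directory under root that
    holds at least `threshold` entries. Hypothetical helper, not part
    of any ZFS or FreeBSD tooling."""
    stack = [root]
    while stack:
        path = stack.pop()
        count = 0
        try:
            # scandir streams entries one at a time, so a directory with
            # millions of names is never held in memory or sorted.
            with os.scandir(path) as entries:
                for entry in entries:
                    count += 1
                    # Descend into real subdirectories only, not symlinks.
                    if entry.is_dir(follow_symlinks=False):
                        stack.append(entry.path)
        except OSError:
            # Unreadable or vanished directory: skip it without reporting.
            continue
        if count >= threshold:
            yield path, count


# Example use (the path is an assumption, substitute your own mountpoint):
#   for path, n in large_dirs("/p0/somedataset", 100000):
#       print(n, path)
```

The explicit stack is used instead of recursion so that a very deep tree
cannot hit Python's recursion limit.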