Pete Bentley
2007-Sep-04 17:01 UTC
[zfs-discuss] ZFS file access hanging processes under Solaris 10u3
Hi,

I have a Solaris 10u3/x86 box with a single mirrored zpool, patched with 10_Recommended as of mid-May, which has been running with no obvious problems from then until today.

Today processes accessing certain ZFS files started hanging (sleeping in an unkillable state). This seems to have started when someone tried a simple "cat file | gunzip > otherfile" pipeline where "file" happens to be quite large (23G).

There are no obvious errors in the system logs, dmesg, "iostat -e" or "zpool status", and a precautionary "zpool scrub" *also* appears to have hung without triggering the actual scrub. I also managed to create a further hung cat process with "truss cat file > /dev/null"; it hung after the file was mmap64()'d and before the first read() returned.

The wchan and kernel stack were the same for all the "cat" processes, so it would appear they were sleeping in the loop at
http://cvs.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/common/fs/zfs/dmu.c#240
(or its 10u3 equivalent).

# ps -o pid,wchan,args -p 9335,9561,9427,9511,9532,9564
  PID    WCHAN COMMAND
 9335 d3dfe0b2 ls -lart
 9427 dce69114 cat administration-2007-01-10.tgz
 9511 dce69114 cat administration-2007-01-10.tgz
 9532 dce69114 cat administration-2007-01-10.tgz
 9561 d8bad584 ls scratch/
 9564 fec63e46 zpool scrub data

# mdb -k
[...]
> ::pgrep cat | ::walk thread | ::findstack
stack pointer for thread d7b43e00: d50c3d6c
  d50c3d84 swtch+0x13e()
  d50c3d90 cv_wait+0x4b()
  d50c3dd8 dmu_buf_hold_array_by_dnode+0x236()
  d50c3e04 dmu_buf_hold_array_by_bonus+0x27()
  d50c3e74 zfs_read+0x182()
  d50c3eac fop_read+0x2a()
  d50c3f84 read+0x1f9()
  d50c3fac sys_sysenter+0x100()
[...]

Rebooting the system appears to have solved the problem (i.e. the truss, cat and gunzip commands above work just fine). I do have a crash dump if anyone is interested, but this particular server is not under support.

Pete.