Anantha N. Srirama
2006-Sep-12 13:21 UTC
[zfs-discuss] Bizarre problem with ZFS filesystem
I'm experiencing a bizarre write performance problem while using a ZFS filesystem. Here are the relevant facts:

[b]# zpool list[/b]
NAME      SIZE   USED  AVAIL  CAP  HEALTH  ALTROOT
mtdc     3.27T   502G  2.78T  14%  ONLINE  -
zfspool  68.5G  30.8G  37.7G  44%  ONLINE  -

[b]# zfs list[/b]
NAME                        USED  AVAIL  REFER  MOUNTPOINT
mtdc                        503G  2.73T  24.5K  /mtdc
mtdc/sasmeta                397M   627M   397M  /sasmeta
mtdc/u001                  30.5G   226G  30.5G  /u001
mtdc/u002                  29.5G   227G  29.5G  /u002
mtdc/u003                  29.5G   226G  29.5G  /u003
mtdc/u004                  28.4G   228G  28.4G  /u004
mtdc/u005                  28.3G   228G  28.3G  /u005
mtdc/u006                  29.8G   226G  29.8G  /u006
mtdc/u007                  30.1G   226G  30.1G  /u007
mtdc/u008                  30.6G   225G  30.6G  /u008
mtdc/u099                   266G   502G   266G  /u099
zfspool                    30.8G  36.6G  24.5K  /zfspool
zfspool/apps               30.8G  33.2G  28.5G  /apps
zfspool/apps@presqldriver  2.28G      -  29.8G  -
zfspool/home               15.4M  2.98G  15.4M  /home

[b]# zfs get all mtdc/u099[/b]
NAME       PROPERTY       VALUE                  SOURCE
mtdc/u099  type           filesystem             -
mtdc/u099  creation       Thu Aug 17 10:21 2006  -
mtdc/u099  used           267G                   -
mtdc/u099  available      501G                   -
mtdc/u099  referenced     267G                   -
mtdc/u099  compressratio  3.10x                  -
mtdc/u099  mounted        yes                    -
mtdc/u099  quota          768G                   local
mtdc/u099  reservation    none                   default
mtdc/u099  recordsize     128K                   default
mtdc/u099  mountpoint     /u099                  local
mtdc/u099  sharenfs       off                    default
mtdc/u099  checksum       on                     default
mtdc/u099  compression    on                     local
mtdc/u099  atime          off                    local
mtdc/u099  devices        on                     default
mtdc/u099  exec           on                     default
mtdc/u099  setuid         on                     default
mtdc/u099  readonly       off                    default
mtdc/u099  zoned          off                    default
mtdc/u099  snapdir        hidden                 default
mtdc/u099  aclmode        groupmask              default
mtdc/u099  aclinherit     secure                 default

[b]No error messages listed by zpool or in /var/adm/messages.[/b]

When I try to save a file the operation takes an inordinate amount of time, in the 30+ second range! I truss'd the vi session to see the hangup, and it waits at the write system call.

# truss -p <pid>
read(0, 0xFFBFD0AF, 1)          (sleeping...)
read(0, " w", 1)                                = 1
write(1, " w", 1)                               = 1
read(0, " q", 1)                                = 1
write(1, " q", 1)                               = 1
read(0, 0xFFBFD00F, 1)          (sleeping...)
read(0, "\r", 1)                                = 1
ioctl(0, I_STR, 0x000579F8)                     Err#22 EINVAL
write(1, "\r", 1)                               = 1
write(1, " " d e l e t e m e "", 10)            = 10
stat64("deleteme", 0xFFBFCFA0)                  = 0
creat("deleteme", 0666)                         = 4
ioctl(2, TCSETSW, 0x00060C10)                   = 0
[b]write(4, " l f f j d\n", 6)                  = 6[/b]   <---- still waiting while I type this message!!

This problem manifests itself only on this filesystem and not on the other ZFS filesystems on the same server, built from the same ZFS pool. While I was awaiting completion of the above write, I was able to start a new vi session in another window and save a file to the /u001 filesystem without any problem. System loads are very low. Can anybody comment on this bizarre behavior?

This message posted from opensolaris.org
Anantha N. Srirama wrote:
> I'm experiencing a bizarre write performance problem while using a
> ZFS filesystem. Here are the relevant facts:
>
> [b]No error messages listed by zpool or in /var/adm/messages.[/b] When I
> try to save a file the operation takes an inordinate amount of time,
> in the 30+ second range! I truss'd the vi session to see the hangup
> and it waits at the write system call.
>
> This problem manifests itself only on this filesystem and not on the
> other ZFS filesystems on the same server built from the same ZFS
> pool. While I was awaiting completion of the above write I was able
> to start a new vi session in another window and saved a file to the
> /u001 filesystem without any problem. System loads are very low. Can
> anybody comment on this bizarre behavior?

Wow, you're right, that is really bizarre!

If you're familiar with mdb, could you get a kernel stack trace of that process? Or if not, take a crash dump while it is hung and send it to us?

Are there any interesting messages printed to the console (eg. SCSI reset timeouts)?

If you do an administrative command (eg. 'zfs set') while the write is "hung", does it take a long time too?

Can you send us the output of 'zpool status', so we can see if there are any I/O errors?

--matt
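(A minimal sketch of what I mean, assuming the stuck process is the vi itself -- untested, and savecore -L assumes a dump device is already configured with dumpadm:)

  # kernel stack of every thread belonging to the hung vi
  echo "::pgrep vi | ::walk thread | ::findstack" | mdb -k

  # or grab a live crash dump for later analysis
  savecore -L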
Anantha N. Srirama
2006-Sep-13 01:08 UTC
[zfs-discuss] Re: Bizarre problem with ZFS filesystem
Here's the information you requested.

Script started on Tue Sep 12 16:46:46 2006

# uname -a
SunOS umt1a-bio-srv2 5.10 Generic_118833-18 sun4u sparc SUNW,Netra-T12

# prtdiag
System Configuration:  Sun Microsystems  sun4u Sun Fire E2900
System clock frequency: 150 MHZ
Memory size: 96GB

======================================= CPUs ======================================
                   E$          CPU                  CPU
CPU      Freq      Size        Implementation       Mask   Status    Location
-------  --------  ----------  -------------------  -----  --------  --------
0,512    1500 MHz  32MB        SUNW,UltraSPARC-IV+  2.1    on-line   SB0/P0
1,513    1500 MHz  32MB        SUNW,UltraSPARC-IV+  2.1    on-line   SB0/P1
2,514    1500 MHz  32MB        SUNW,UltraSPARC-IV+  2.1    on-line   SB0/P2
3,515    1500 MHz  32MB        SUNW,UltraSPARC-IV+  2.1    on-line   SB0/P3
8,520    1500 MHz  32MB        SUNW,UltraSPARC-IV+  2.1    on-line   SB2/P0
9,521    1500 MHz  32MB        SUNW,UltraSPARC-IV+  2.1    on-line   SB2/P1
10,522   1500 MHz  32MB        SUNW,UltraSPARC-IV+  2.1    on-line   SB2/P2
11,523   1500 MHz  32MB        SUNW,UltraSPARC-IV+  2.1    on-line   SB2/P3
16,528   1500 MHz  32MB        SUNW,UltraSPARC-IV+  2.1    on-line   SB4/P0
17,529   1500 MHz  32MB        SUNW,UltraSPARC-IV+  2.1    on-line   SB4/P1
18,530   1500 MHz  32MB        SUNW,UltraSPARC-IV+  2.1    on-line   SB4/P2
19,531   1500 MHz  32MB        SUNW,UltraSPARC-IV+  2.1    on-line   SB4/P3

# mdb -k
Loading modules: [ unix krtld genunix dtrace specfs ufs sd sgsbbc md sgenv ip sctp usba fcp fctl qlc nca ssd lofs zfs random crypto ptm nfs ipc logindmux cpc sppp fcip wrsmd ]
> arc::stat print
{
    anon = ARC_anon
    mru = ARC_mru
    mru_ghost = ARC_mru_ghost
    mfu = ARC_mfu
    mfu_ghost = ARC_mfu_ghost
    size = 0x11917e1200
    p = 0x116e8a1a40
    c = 0x11917cf428
    c_min = 0xbf77c800
    c_max = 0x17aef90000
    hits = 0x489737a8
    misses = 0x8869917
    deleted = 0xc832650
    skipped = 0x15b29b2
    hash_elements = 0x1273d0
    hash_elements_max = 0x17576f
    hash_collisions = 0x4e0ceee
    hash_chains = 0x3a9b2
Segmentation Fault - core dumped

# mdb -k
Loading modules: [ unix krtld genunix dtrace specfs ufs sd sgsbbc md sgenv ip sctp usba fcp fctl qlc nca ssd lofs zfs random crypto ptm nfs ipc logindmux cpc sppp fcip wrsmd ]
> ::kmastat
> ::pgrep vi | ::walk thread
3086600f660
> 3086600f660::findstack
stack pointer for thread 3086600f660: 2a104598d91
[ 000002a104598d91 cv_wait_sig+0x114() ]
  000002a104598e41 str_cv_wait+0x28()
  000002a104598f01 strwaitq+0x238()
  000002a104598fc1 strread+0x174()
  000002a1045990a1 fop_read+0x20()
  000002a104599161 read+0x274()
  000002a1045992e1 syscall_trap32+0xcc()
> 3086600f660::findstack
stack pointer for thread 3086600f660: 2a104598e61
> 3086600f660::findstack
stack pointer for thread 3086600f660: 2a104598e61
  000002a104598f61 zil_lwb_commit+0x1ac()
  000002a104599011 zil_commit+0x1b0()
  000002a1045990c1 zfs_fsync+0xa8()
  000002a104599171 fop_fsync+0x14()
  000002a104599231 fdsync+0x20()
  000002a1045992e1 syscall_trap32+0xcc()
> 3086600f660::findstack
stack pointer for thread 3086600f660: 2a104598c71
> 3086600f660::findstack
stack pointer for thread 3086600f660: 2a104598e61
  000002a104598f61 zil_lwb_commit+0x1ac()
  000002a104599011 zil_commit+0x1b0()
  000002a1045990c1 zfs_fsync+0xa8()
  000002a104599171 fop_fsync+0x14()
  000002a104599231 fdsync+0x20()
  000002a1045992e1 syscall_trap32+0xcc()
> 3086600f660::findstack
stack pointer for thread 3086600f660: 2a104598e61
  000002a104598f61 zil_lwb_commit+0x1ac()
  000002a104599011 zil_commit+0x1b0()
  000002a1045990c1 zfs_fsync+0xa8()
  000002a104599171 fop_fsync+0x14()
  000002a104599231 fdsync+0x20()
  000002a1045992e1 syscall_trap32+0xcc()
> 3086600f660::findstack
stack pointer for thread 3086600f660: 2a104598e61
> 3086600f660::findstack
stack pointer for thread 3086600f660: 2a104598bb1
> 3086600f660::findstack
stack pointer for thread 3086600f660: 2a104598e61
  000002a104598f61 zil_lwb_commit+0x1ac()
  000002a104599011 zil_commit+0x1b0()
  000002a1045990c1 zfs_fsync+0xa8()
  000002a104599171 fop_fsync+0x14()
  000002a104599231 fdsync+0x20()
  000002a1045992e1 syscall_trap32+0xcc()
> 3086600f660::findstack
stack pointer for thread 3086600f660 (TS_FREE): 2a104598ba1
  000002a104598fe1 segvn_unmap+0x1b8()
  000002a1045990d1 as_free+0xf4()
  000002a104599181 proc_exit+0x46c()
  000002a104599231 exit+8()
  000002a1045992e1 syscall_trap32+0xcc()

# df -h
Filesystem             size   used  avail capacity  Mounted on
/dev/md/dsk/d10         32G   6.7G    25G    22%    /
/devices                 0K     0K     0K     0%    /devices
ctfs                     0K     0K     0K     0%    /system/contract
proc                     0K     0K     0K     0%    /proc
mnttab                   0K     0K     0K     0%    /etc/mnttab
swap                    13G  1000K    13G     1%    /etc/svc/volatile
objfs                    0K     0K     0K     0%    /system/object
/platform/sun4u-us3/lib/libc_psr/libc_psr_hwcap2.so.1
                        32G   6.7G    25G    22%    /platform/sun4u-us3/lib/libc_psr.so.1
/platform/sun4u-us3/lib/sparcv9/libc_psr/libc_psr_hwcap2.so.1
                        32G   6.7G    25G    22%    /platform/sun4u-us3/lib/sparcv9/libc_psr.so.1
fd                       0K     0K     0K     0%    /dev/fd
/dev/md/dsk/d30         20G   956M    19G     5%    /var
swap                   512M   952K   511M     1%    /tmp
swap                    13G    40K    13G     1%    /var/run
zfspool/apps            64G    29G    33G    47%    /apps
zfspool/home           3.0G    15M   3.0G     1%    /home
zfspool                 67G    24K    37G     1%    /zfspool
mtdc                   3.2T    24K   2.8T     1%    /mtdc
mtdc/u001              256G    30G   226G    12%    /u001
mtdc/u002              256G    29G   227G    12%    /u002
mtdc/u003              256G    29G   227G    12%    /u003
mtdc/u004              256G    28G   228G    12%    /u004
mtdc/u005              256G    28G   228G    12%    /u005
mtdc/u006              256G    30G   226G    12%    /u006
mtdc/u007              256G    30G   226G    12%    /u007
mtdc/u008              256G    31G   225G    12%    /u008
mtdc/u099              768G   196G   572G    26%    /u099
mtdc/sasmeta           1.0G   398M   626M    39%    /sasmeta

# zfs list
NAME                        USED  AVAIL  REFER  MOUNTPOINT
mtdc                        434G  2.80T  24.5K  /mtdc
mtdc/sasmeta                398M   626M   398M  /sasmeta
mtdc/u001                  30.5G   226G  30.5G  /u001
mtdc/u002                  29.5G   227G  29.5G  /u002
mtdc/u003                  29.5G   227G  29.5G  /u003
mtdc/u004                  28.3G   228G  28.3G  /u004
mtdc/u005                  28.4G   228G  28.4G  /u005
mtdc/u006                  29.9G   226G  29.9G  /u006
mtdc/u007                  30.2G   226G  30.2G  /u007
mtdc/u008                  30.6G   225G  30.6G  /u008
mtdc/u099                   196G   572G   196G  /u099
zfspool                    30.8G  36.6G  24.5K  /zfspool
zfspool/apps               30.8G  33.2G  28.5G  /apps
zfspool/apps@presqldriver  2.28G      -  29.8G  -
zfspool/home               15.4M  2.98G  15.4M  /home

# zpool list
NAME      SIZE   USED  AVAIL  CAP  HEALTH  ALTROOT
mtdc     3.27T   434G  2.85T  12%  ONLINE  -
zfspool  68.5G  30.8G  37.7G  44%  ONLINE  -

# tail -10 /var/adm/messages
Sep 12 16:01:21 umt1a-bio-srv2 su: [ID 366847 auth.info] 'su root' succeeded for erw4 on /dev/pts/11
Sep 12 16:02:26 umt1a-bio-srv2 su: [ID 366847 auth.info] 'su adm' succeeded for erw4 on /dev/pts/11
Sep 12 16:19:33 umt1a-bio-srv2 sshd[3210]: [ID 800047 auth.info] Connection closed by 158.111.74.113
Sep 12 16:36:38 umt1a-bio-srv2 genunix: [ID 603404 kern.notice] NOTICE: core_log: mdb[3641] core dumped: /var/core/core_umt1a-bio-srv2_mdb_0_0_1158093395_3641
Sep 12 16:36:55 umt1a-bio-srv2 login: [ID 644210 auth.notice] ROOT LOGIN /dev/console
Sep 12 16:38:59 umt1a-bio-srv2 sshd[3710]: [ID 800047 auth.info] Accepted keyboard-interactive for cix4 from 158.111.30.168 port 1943 ssh2
Sep 12 16:38:59 umt1a-bio-srv2 sshd[3714]: [ID 800047 auth.info] subsystem request for sftp
Sep 12 16:41:49 umt1a-bio-srv2 genunix: [ID 603404 kern.notice] NOTICE: core_log: mdb[3770] core dumped: /var/core/core_umt1a-bio-srv2_mdb_0_0_1158093706_3770
Sep 12 16:42:10 umt1a-bio-srv2 genunix: [ID 603404 kern.notice] NOTICE: core_log: mdb[3778] core dumped: /var/core/core_umt1a-bio-srv2_mdb_0_0_1158093727_3778
Sep 12 16:47:28 umt1a-bio-srv2 genunix: [ID 603404 kern.notice] NOTICE: core_log: mdb[3964] core dumped: /var/core/core_umt1a-bio-srv2_mdb_0_0_1158094046_3964

# rm /var/c
^D

This message posted from opensolaris.org
Anantha,

How's the output of:

dtrace -F -n 'fbt:zfs::/pid==<pid of stuck process>/{trace(timestamp)}'

--
Just me,
Wire ...

On 9/13/06, Anantha N. Srirama <anantha.srirama at cdc.hhs.gov> wrote:
> Here's the information you requested.
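(If watching that live turns out to be unmanageable, the same probes can be written to a file for later digging -- just a sketch, with 1234 standing in for the real pid of the stuck vi:)

  dtrace -F -o /var/tmp/zfs-flow.out \
      -n 'fbt:zfs:: /pid == 1234/ { trace(timestamp); }'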
Anantha N. Srirama
2006-Sep-13 15:42 UTC
[zfs-discuss] Re: Re: Bizarre problem with ZFS filesystem
I ran the DTrace script and the resulting output is rather large (1 million lines and 65MB), so I won't burden this forum with that much data. Here are the top 100 lines from the DTrace output. Let me know if you need the full output and I'll figure out a way for the group to get it.

dtrace: description 'fbt:zfs::' matched 2404 probes
CPU  FUNCTION
520  -> zfs_lookup                   2929705866442880
520  -> zfs_zaccess                  2929705866448160
520  -> zfs_zaccess_common           2929705866451840
520  -> zfs_acl_node_read            2929705866455040
520  -> zfs_acl_node_read_internal   2929705866458400
520  -> zfs_acl_alloc                2929705866461040
520  <- zfs_acl_alloc                2929705866462880
520  <- zfs_acl_node_read_internal   2929705866464080
520  <- zfs_acl_node_read            2929705866465600
520  -> zfs_ace_access               2929705866467760
520  <- zfs_ace_access               2929705866468880
520  -> zfs_ace_access               2929705866469520
520  <- zfs_ace_access               2929705866470320
520  -> zfs_acl_free                 2929705866471920
520  <- zfs_acl_free                 2929705866472960
520  <- zfs_zaccess_common           2929705866474720
520  <- zfs_zaccess                  2929705866476320
520  -> zfs_dirlook                  2929705866478320
520  -> zfs_dirent_lock              2929705866480880
520  <- zfs_dirent_lock              2929705866486560
520  -> zfs_dirent_unlock            2929705866489840
520  <- zfs_dirent_unlock            2929705866491600
520  <- zfs_dirlook                  2929705866492560
520  <- zfs_lookup                   2929705866494080
520  -> zfs_getattr                  2929705866499360
520  -> dmu_object_size_from_db      2929705866503520
520  <- dmu_object_size_from_db      2929705866507920
520  <- zfs_getattr                  2929705866509280
520  -> zfs_lookup                   2929705866520400
520  -> zfs_zaccess                  2929705866521200
520  -> zfs_zaccess_common           2929705866521920
520  -> zfs_acl_node_read            2929705866523280
520  -> zfs_acl_node_read_internal   2929705866524800
520  -> zfs_acl_alloc                2929705866526000
520  <- zfs_acl_alloc                2929705866526800
520  <- zfs_acl_node_read_internal   2929705866527280
520  <- zfs_acl_node_read            2929705866528160
520  -> zfs_ace_access               2929705866528720
520  <- zfs_ace_access               2929705866529280
520  -> zfs_ace_access               2929705866529920
520  <- zfs_ace_access               2929705866530800
520  -> zfs_acl_free                 2929705866531360
520  <- zfs_acl_free                 2929705866531920
520  <- zfs_zaccess_common           2929705866532560
520  <- zfs_zaccess                  2929705866533440
520  -> zfs_dirlook                  2929705866534000
520  -> zfs_dirent_lock              2929705866534640
520  <- zfs_dirent_lock              2929705866535600
520  -> zfs_dirent_unlock            2929705866536480
520  <- zfs_dirent_unlock            2929705866537120
520  <- zfs_dirlook                  2929705866537760
520  <- zfs_lookup                   2929705866538400
520  -> zfs_getsecattr               2929705866543600
520  -> zfs_getacl                   2929705866546240
520  -> zfs_zaccess                  2929705866546960
520  -> zfs_zaccess_common           2929705866547680
520  -> zfs_acl_node_read            2929705866548720
520  -> zfs_acl_node_read_internal   2929705866549440
520  -> zfs_acl_alloc                2929705866550080
520  <- zfs_acl_alloc                2929705866550720
520  <- zfs_acl_node_read_internal   2929705866551600
520  <- zfs_acl_node_read            2929705866552160
520  -> zfs_ace_access               2929705866552720
520  <- zfs_ace_access               2929705866553280
520  -> zfs_ace_access               2929705866554160
520  <- zfs_ace_access               2929705866554720
520  -> zfs_ace_access               2929705866555600
520  <- zfs_ace_access               2929705866556160
520  -> zfs_ace_access               2929705866557040
520  <- zfs_ace_access               2929705866557600
520  -> zfs_ace_access               2929705866558160
520  <- zfs_ace_access               2929705866558720
520  -> zfs_ace_access               2929705866559760
520  <- zfs_ace_access               2929705866560320
520  -> zfs_acl_free                 2929705866560880
520  <- zfs_acl_free                 2929705866561440
520  <- zfs_zaccess_common           2929705866562400
520  <- zfs_zaccess                  2929705866562960
520  <- zfs_getacl                   2929705866564560
520  <- zfs_getsecattr               2929705866566000
520  -> zfs_getattr                  2929705866568880
520  -> dmu_object_size_from_db      2929705866569680
520  <- dmu_object_size_from_db      2929705866570240
520  <- zfs_getattr                  2929705866570880
520  -> zfs_create                   2929705866573120
520  -> zfs_dirent_lock              2929705866574320
520  <- zfs_dirent_lock              2929705866575200
520  -> zfs_zaccess_rwx              2929705866576960

This message posted from opensolaris.org
Anantha N. Srirama
2006-Sep-13 21:58 UTC
[zfs-discuss] Re: Bizarre problem with ZFS filesystem
One more piece of information: I was able to ascertain that the slowdown happens only when ZFS is used heavily, meaning lots of in-flight I/O. This morning, when the system was quiet, my writes to the /u099 filesystem were fast; performance has since gone south again, as I reported earlier.

I am currently awaiting the completion of a write to /u099, now well over 60 seconds. At the same time I was able to create/save files in /u001 without any problems. The only difference between /u001 and /u099 is the size of the filesystem (256GB vs. 768GB).

Per your suggestion I ran a 'zfs set' command; it completed after a wait of around 20 seconds while my file save from vi against /u099 is still pending!

This message posted from opensolaris.org
It is highly likely you are seeing a duplicate of:

    6413510 zfs: writing to ZFS filesystem slows down fsync() on other files in the same FS

which was fixed recently in build 48 of Nevada. The symptoms are very similar: prior to the fix, an fsync() from vi would have to force out all other pending data in the same filesystem through the intent log.

Neil.

Anantha N. Srirama wrote on 09/13/06 15:58:
> One more piece of information: I was able to ascertain that the slowdown happens only when ZFS is used heavily, meaning lots of in-flight I/O. This morning, when the system was quiet, my writes to the /u099 filesystem were fast; performance has since gone south again, as I reported earlier.
>
> I am currently awaiting the completion of a write to /u099, now well over 60 seconds. At the same time I was able to create/save files in /u001 without any problems. The only difference between /u001 and /u099 is the size of the filesystem (256GB vs. 768GB).
>
> Per your suggestion I ran a 'zfs set' command; it completed after a wait of around 20 seconds while my file save from vi against /u099 is still pending!
>
> This message posted from opensolaris.org
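(For anyone who wants to confirm they are hitting this before moving to build 48: the mdb stacks earlier in the thread sit in zil_commit(), so a rough DTrace sketch along these lines -- untested, and assuming the fbt provider on your build exposes zil_commit -- should show the fsync() stalls lining up with long ZIL commits while the heavy writer is running:)

  dtrace -n '
  fbt::zil_commit:entry  { self->ts = timestamp; }
  fbt::zil_commit:return /self->ts/
  {
          @lat["zil_commit latency (ns)"] = quantize(timestamp - self->ts);
          self->ts = 0;
  }
  tick-10s { printa(@lat); trunc(@lat); }'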
Wee Yeh Tan
2006-Sep-16 17:38 UTC
[zfs-discuss] Re: Re: Bizarre problem with ZFS filesystem
Anantha,

I was hoping to see a lot fewer trace records than that. Was DTrace running the whole time, or did you start it just before you saw the problem?

Can you sieve through the trace to see if there are any subsequent firings whose timestamp differences are big (e.g. > 1s)? You can try this awk script:

awk '{ if ($4 - lastts > 1000000000) { print last, ($4 - lastts); print $0 } last = $0; lastts = $4 }' < trace_file

This should zoom in on the exact function that is stalling the write. It does seem from your mdb traces that the issue is something to do with the intent log.

On 9/13/06, Anantha N. Srirama <anantha.srirama at cdc.hhs.gov> wrote:
> I ran the DTrace script and the resulting output is rather large (1 million lines and 65MB), so I won't burden this forum with that much data. Here are the top 100 lines from the DTrace output. Let me know if you need the full output and I'll figure out a way for the group to get it.
>
> dtrace: description 'fbt:zfs::' matched 2404 probes
> CPU  FUNCTION
> 520  -> zfs_lookup           2929705866442880
> 520  -> zfs_zaccess          2929705866448160
> 520  -> zfs_zaccess_common   2929705866451840
> 520  -> zfs_acl_node_read    2929705866455040
...

--
Just me,
Wire ...
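(An alternative to post-processing a 65MB trace file -- just a sketch, and it assumes you can reproduce the stall while it runs -- is to let DTrace keep per-function latency distributions itself and only print them when you stop it:)

  dtrace -n '
  fbt:zfs::entry  { self->ts[probefunc] = timestamp; }
  fbt:zfs::return /self->ts[probefunc]/
  {
          @[probefunc] = quantize(timestamp - self->ts[probefunc]);
          self->ts[probefunc] = 0;
  }'

Any function whose distribution shows outliers in the seconds range is the one stalling the write.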