I've got a Thumper doing nothing but serving NFS. It's using B43 with zil_disabled. The system is being consumed in waves, but by what I don't know. Notice vmstat:

 3 0 0 25693580 2586268 0 0 0 0 0 0 0 0 0 0 0 926 91 703 0 25 75
21 0 0 25693580 2586268 0 0 0 0 0 0 0 0 0 13 14 1720 21 1105 0 92 8
20 0 0 25693580 2586268 0 0 0 0 0 0 0 0 0 17 18 2538 70 834 0 100 0
25 0 0 25693580 2586268 0 0 0 0 0 0 0 0 0 0 0 745 18 179 0 100 0
37 0 0 25693552 2586240 0 0 0 0 0 0 0 0 0 7 7 1152 52 313 0 100 0
16 0 0 25693592 2586280 0 0 0 0 0 0 0 0 0 15 13 1543 52 767 0 100 0
17 0 0 25693592 2586280 0 0 0 0 0 0 0 0 0 2 2 890 72 192 0 100 0
27 0 0 25693572 2586260 0 0 0 0 0 0 0 0 0 15 15 3271 19 3103 0 98 2
 0 0 0 25693456 2586144 0 11 0 0 0 0 0 0 0 281 249 34335 242 37289 0 46 54
 0 0 0 25693448 2586136 0 2 0 0 0 0 0 0 0 0 0 2470 103 2900 0 27 73
 0 0 0 25693448 2586136 0 0 0 0 0 0 0 0 0 0 0 1062 105 822 0 26 74
 0 0 0 25693448 2586136 0 0 0 0 0 0 0 0 0 0 0 1076 91 857 0 25 75
 0 0 0 25693448 2586136 0 0 0 0 0 0 0 0 0 0 0 917 126 674 0 25 75

These spikes of sys load come in waves like this. While there are close to a hundred systems mounting NFS shares on the Thumper, the amount of traffic is really low. Nothing to justify this. We're talking less than 10MB/s.

NFS is pathetically slow. We're using NFSv3 over TCP, shared via ZFS sharenfs, on a 3Gbps aggregation (3*1Gbps).

I've been slamming my head against this problem for days and can't make headway. I'll post some of my notes below. Any thoughts or ideas are welcome!

benr.

==

Step 1 was to disable any ZFS features that might consume large amounts of CPU:

# zfs set compression=off joyous
# zfs set atime=off joyous
# zfs set checksum=off joyous

These changes had no effect.

Next was to consider that perhaps NFS was doing name lookups when it shouldn't. Indeed "dns" was specified in /etc/nsswitch.conf, which won't work given that no DNS servers are accessible from the storage or private networks, but again, no improvement. In this process I removed dns from nsswitch.conf, deleted /etc/resolv.conf, and disabled the dns/client service in SMF.

Turning back to CPU usage, we can see the activity is almost all system (sys) time and comes in waves:

[private:/tmp] root# sar 1 100

SunOS private.thumper1 5.11 snv_43 i86pc 12/07/2006

10:38:05  %usr %sys %wio %idle
10:38:06     0   27    0    73
10:38:07     0   27    0    73
10:38:09     0   27    0    73
10:38:10     1   26    0    73
10:38:11     0   26    0    74
10:38:12     0   26    0    74
10:38:13     0   24    0    76
10:38:14     0    6    0    94
10:38:15     0    7    0    93
10:38:22     0   99    0     1  <--
10:38:23     0   94    0     6  <--
10:38:24     0   28    0    72
10:38:25     0   27    0    73
10:38:26     0   27    0    73
10:38:27     0   27    0    73
10:38:28     0   27    0    73
10:38:29     1   30    0    69
10:38:30     0   27    0    73

And so we consider whether or not there is a pattern to the frequency.
The following is sar output from any lines in which sys is above 90%:

10:40:04  %usr %sys %wio %idle   Delta
10:40:11     0   97    0     3
10:40:45     0   98    0     2   34 seconds
10:41:02     0   94    0     6   17 seconds
10:41:26     0  100    0     0   24 seconds
10:42:00     0  100    0     0   34 seconds
10:42:25  (end of sample)        >25 seconds

Looking at the congestion in the run queue:

[private:/tmp] root# sar -q 5 100

10:45:43  runq-sz %runocc swpq-sz %swpocc
10:45:51     27.0      85     0.0       0
10:45:57      1.0      20     0.0       0
10:46:02      2.0      60     0.0       0
10:46:13     19.8      99     0.0       0
10:46:23     17.7      99     0.0       0
10:46:34     24.4      99     0.0       0
10:46:41     22.1      97     0.0       0
10:46:48     13.0      96     0.0       0
10:46:55     25.3     102     0.0       0

Looking at the per-CPU breakdown (mpstat):

CPU minf mjf xcal intr ithr csw icsw migr smtx srw syscl usr sys wt idl
0 0 0 0 324 224 0 0 0 154 0 0 0 100 0 0
1 0 0 0 114 0 226 0 10 13 0 86 0 1 0 99
2 0 0 0 162 138 149 0 5 4 0 0 0 1 0 99
3 0 0 0 55 6 46 0 4 3 0 0 0 1 0 99
CPU minf mjf xcal intr ithr csw icsw migr smtx srw syscl usr sys wt idl
0 0 0 0 310 210 34 0 17 171 7 5 0 100 0 0
1 0 0 0 152 1 200 0 17 26 5 59 1 65 0 34
2 0 0 0 271 197 175 1 13 20 2 0 0 66 0 34
3 0 0 0 120 6 76 0 6 14 2 0 0 66 0 34
CPU minf mjf xcal intr ithr csw icsw migr smtx srw syscl usr sys wt idl
0 0 0 1 318 204 46 1 19 22 5 18 0 100 0 0
1 0 0 0 112 0 40 0 12 15 3 4 0 100 0 0
2 0 0 0 195 93 31 1 11 27 3 0 0 100 0 0
3 0 0 0 108 1 20 0 9 31 3 2 0 100 0 0
CPU minf mjf xcal intr ithr csw icsw migr smtx srw syscl usr sys wt idl
0 0 0 0 569 204 809 3 133 432 1 21 0 99 0 1
1 0 0 0 502 0 1159 1 201 453 1 8 0 97 0 3
2 0 0 0 2305 2197 147 6 51 279 0 0 0 100 0 0
3 0 0 0 242 2 433 10 75 301 1 8 0 100 0 0
CPU minf mjf xcal intr ithr csw icsw migr smtx srw syscl usr sys wt idl
0 0 0 0 371 227 110 0 31 1470 0 80 0 99 0 1
1 0 0 3 1034 0 2079 0 102 241 0 9 0 10 0 90
2 0 0 0 1892 1718 1558 3 129 214 0 36 0 11 0 89
3 0 0 0 303 6 549 0 50 329 0 5 0 8 0 92
CPU minf mjf xcal intr ithr csw icsw migr smtx srw syscl usr sys wt idl
0 0 0 1 312 205 24 0 10 39 4 4 0 100 0 0
1 0 0 1 134 0 99 0 10 20 3 6 0 90 0 10
2 0 0 0 187 90 40 0 8 15 4 10 0 90 0 10
3 0 0 1 100 1 19 0 8 9 4 6 0 90 0 10
CPU minf mjf xcal intr ithr csw icsw migr smtx srw syscl usr sys wt idl
0 0 0 2 528 206 916 8 161 312 1 39 0 99 0 1
1 0 0 1 206 1 350 6 120 147 1 68 0 100 0 0
2 83 0 0 1929 1816 192 9 87 176 0 50 0 100 0 0
3 0 0 0 142 4 66 21 12 460 1 17 1 99 0 0
CPU minf mjf xcal intr ithr csw icsw migr smtx srw syscl usr sys wt idl
0 0 0 0 360 228 163 11 25 743 0 13 0 100 0 0
1 0 0 0 1587 0 4126 13 520 1435 0 122 0 24 0 76
2 3 0 0 7785 7669 1499 29 225 933 0 48 0 27 0 73
3 1 0 0 1856 6 4683 0 467 1912 0 37 0 23 0 77
CPU minf mjf xcal intr ithr csw icsw migr smtx srw syscl usr sys wt idl
0 0 0 0 323 223 0 0 0 244 0 0 0 100 0 0
1 0 0 0 141 0 277 0 14 11 0 1 0 0 0 100
2 0 0 0 307 259 312 0 16 15 0 3 0 1 0 99
3 0 0 0 113 6 165 0 5 5 0 91 0 0 0 100
CPU minf mjf xcal intr ithr csw icsw migr smtx srw syscl usr sys wt idl
0 0 0 0 324 224 0 0 0 203 0 0 0 100 0 0
1 0 0 0 143 0 283 0 11 2 0 13 0 1 0 99
2 0 0 0 312 278 298 0 8 11 0 3 0 1 0 99
3 0 0 0 119 6 175 0 3 5 0 69 0 1 0 99

Looking at prstat we can see the number of nfsd threads is low, much lower than the 1024 limit:

[private:/tmp] root# prstat
  PID USERNAME  SIZE   RSS STATE  PRI NICE      TIME  CPU PROCESS/NLWP
22643 daemon   2228K 1776K sleep   60  -20   2:55:52  55% nfsd/18

Interestingly, using prstat -mL to monitor thread latency, we see that a handful of threads are the culprits consuming most of the CPU:

[private:/tmp] root# prstat -mL
  PID USERNAME USR SYS TRP TFL DFL LCK SLP LAT VCX ICX SCL SIG PROCESS/LWPID
22643 daemon   0.0  75 0.0 0.0 0.0 0.0  25 0.0 416   1   0   0 nfsd/1506
22643 daemon   0.0  75 0.0 0.0 0.0 0.0  25 0.0 415   0   0   0 nfsd/1563
22643 daemon   0.0  74 0.0 0.0 0.0 0.0  26 0.0 417   0   0   0 nfsd/1554
22643 daemon   0.0  74 0.0 0.0 0.0 0.0  26 0.0 419   0   0   0 nfsd/1551
22643 daemon   0.0 0.2 0.0 0.0 0.0 0.0  26  74 418   0   0   0 nfsd/1553
22643 daemon   0.0 0.2 0.0 0.0 0.0 0.0 100 0.0 417   0   0   0 nfsd/1536
22643 daemon   0.0 0.2 0.0 0.0 0.0 0.0 100 0.0 417   0   0   0 nfsd/1555
22643 daemon   0.0 0.2 0.0 0.0 0.0 0.0 100 0.0 418   0   0   0 nfsd/1539
22643 daemon   0.0 0.2 0.0 0.0 0.0 0.0 100 0.0 417   0   0   0 nfsd/1562
22643 daemon   0.0 0.2 0.0 0.0 0.0 0.0 100 0.0 418   0   0   0 nfsd/1545
22643 daemon   0.0 0.2 0.0 0.0 0.0 0.0 100 0.0 417   0   0   0 nfsd/1559
22643 daemon   0.0 0.2 0.0 0.0 0.0 0.0 100 0.0 419   1   0   0 nfsd/1541
22643 daemon   0.0 0.2 0.0 0.0 0.0 0.0 100 0.0 417   0   0   0 nfsd/1546
22643 daemon   0.0 0.2 0.0 0.0 0.0 0.0 100 0.0 417   0   0   0 nfsd/1543
22643 daemon   0.0 0.2 0.0 0.0 0.0 0.0 100 0.0 418   0   0   0 nfsd/1560
Total: 33 processes, 218 lwps, load averages: 4.64, 6.20, 5.86

Above, SLP is "the percentage of time the process has spent sleeping", LAT is the "percentage of time the process has spent waiting for CPU", VCX is the "number of voluntary context switches", and ICX is the number of involuntary context switches.

Using pstack to try and determine what these threads are doing, we see:

[private:/tmp] root# pstack 22643/1693
22643:  /usr/lib/nfs/nfsd
----------------- lwp# 1693 / thread# 1693 --------------------
 fee589b7 nfssys   (e, feba0fc8)
 fee5839f _thr_setup (fed4d400) + 51
 fee585f0 _lwp_start (fed4d400, 0, 0, 0, 0, 0)
[private:/tmp] root# pstack 22643/1708
22643:  /usr/lib/nfs/nfsd
----------------- lwp# 1708 --------------------------------
 fee589b7 nfssys   (e, febb0fc8)
 fee5839f _thr_setup (fed44c00) + 51
 fee585f0 _lwp_start (fed44c00, 0, 0, 0, 0, 0)
[private:/tmp] root# pstack 22643/1664
22643:  /usr/lib/nfs/nfsd
----------------- lwp# 1664 / thread# 1664 --------------------
 fee589b7 nfssys   (e, fec90fc8)
 fee5839f _thr_setup (fed43800) + 51
 fee585f0 _lwp_start (fed43800, 0, 0, 0, 0, 0)

Running zvop_times.d (http://blogs.sun.com/erickustarz/resource/zvop_times.d) I get an idea of what ZFS is doing:

[private:/] root# dtrace -s zvop_times.d
dtrace: script 'zvop_times.d' matched 66 probes
^C
CPU     ID                    FUNCTION:NAME
  1      2                             :END
ZFS COUNT

zfs_getsecattr 4
zfs_space 70
zfs_rename 111
zfs_readdir 284
zfs_read 367
zfs_mkdir 670
zfs_setattr 1054
zfs_frlock 1562
zfs_putpage 3916
zfs_write 4110
zfs_create 4178
zfs_remove 7794
zfs_fid 14960
zfs_inactive 17637
zfs_access 20809
zfs_fsync 29668
zfs_lookup 31273
zfs_getattr 175457

ZFS AVG TIME

zfs_fsync 2337
zfs_getattr 2735
zfs_access 2774
zfs_fid 2948
zfs_inactive 4081
zfs_lookup 6067
zfs_frlock 6750
zfs_putpage 6840
zfs_mkdir 9259
zfs_getsecattr 10940
zfs_setattr 15364
zfs_space 25214
zfs_readdir 25726
zfs_remove 26117
zfs_write 31942
zfs_rename 60504
zfs_read 144449
zfs_create 71215587

ZFS SUM TIME

zfs_getsecattr 43762
zfs_space 1765026
zfs_mkdir 6204101
zfs_rename 6715987
zfs_readdir 7306233
zfs_frlock 10544902
zfs_setattr 16194261
zfs_putpage 26787465
zfs_fid 44111892
zfs_read 53013073
zfs_access 57740468
zfs_fsync 69354886
zfs_inactive 71990292
zfs_write 131282275
zfs_lookup 189748320
zfs_remove 203558835
zfs_getattr 479998059
zfs_create 297538724997

[private:/] root# zfs get all joyous
NAME    PROPERTY       VALUE                  SOURCE
joyous  type           filesystem             -
joyous  creation       Fri Jul 28 17:08 2006  -
joyous  used           255G                   -
joyous  available      15.7T                  -
joyous  referenced     81.7K                  -
joyous  compressratio  1.21x                  -
joyous  mounted        yes                    -
joyous  quota          none                   default
joyous  reservation    none                   default
joyous  recordsize     128K                   default
joyous  mountpoint     /joyous                default
joyous  sharenfs       off                    default
joyous  checksum       off                    local
joyous  compression    off                    local
joyous  atime          off                    local
joyous  devices        on                     default
joyous  exec           on                     default
joyous  setuid         on                     default
joyous  readonly       off                    default
joyous  zoned          off                    default
joyous  snapdir        hidden                 default
joyous  aclmode        groupmask              default
joyous  aclinherit     secure                 default

So what exactly is nfsd doing???

[private:/] root# time dtrace -n 'syscall:::entry /execname == "nfsd"/ { @[probefunc] = count(); }'
dtrace: description 'syscall:::entry ' matched 225 probes
^C

open 3
close 5
lwp_mutex_timedlock 7
lwp_mutex_wakeup 7
pollsys 7
putmsg 8
getmsg 11
fcntl 12
lwp_park 30
mmap 41
munmap 42
fxstat 47
ioctl 58
lwp_continue 100
lwp_create 100
setcontext 100
lwp_kill 101
lwp_exit 110
schedctl 110
lwp_sigmask 138
nfs 203

real    3m22.464s
user    0m0.206s
sys     0m0.222s

And the answer is: not much of anything. Where is this CPU disappearing to?! Looking at the stacks:

[private:/] root# time dtrace -n 'syscall::nfs:entry /execname == "nfsd"/ { @[ustack()] = count(); }'
dtrace: description 'syscall::nfs:entry ' matched 1 probe
^C

libc.so.1`_nfssys+0x7
libc.so.1`_thr_setup+0x51
libc.so.1`_lwp_start
22

Not very helpful... let's try this:

[private:/] root# truss -c -p 22643
^C
syscall        seconds   calls  errors
close             .000       1
ioctl             .000       2
lwp_park          .000       1
lwp_unpark        .000       1
getmsg            .000       1
putmsg            .000       1
setustack         .000      30
nfssys         106.146      56
mmap              .000      20
munmap            .000       3
fxstat            .000       4
lwp_create        .001      30
lwp_exit          .000      12
lwp_continue      .000      30
lwp_kill          .000      13      13
lwp_sigmask       .000      14
pollsys           .000       1
schedctl          .000      12
               --------  ------   ----
sys totals:     106.149     232     13
usr time:          .001
elapsed:         47.230

Better. Clearly nfssys is a problem, but why?

[private:/] root# dtrace -n 'io:genunix::start { @[pid, execname] = count(); }'
dtrace: description 'io:genunix::start ' matched 3 probes
^C

    3  fsflush      1
  344  syslogd      2
22643  nfsd        59
    0  sched    30346

The fucking scheduler is doing more IO than nfsd!

Let's try iotop (http://www.brendangregg.com/DTrace/iotop). Below are 10 second intervals; DISKTIME is in microseconds (1000000 = 1s):

2006 Dec  7 12:06:47,  load: 8.08,  disk_r:   1408 KB,  disk_w:  14482 KB

UID   PID  PPID CMD      DEVICE  MAJ  MIN D  DISKTIME
  0     3     0 fsflush  sd6      54  389 W       242
  0     3     0 fsflush  sd6      54  384 W       251
  1 22643     1 nfsd     sd17     54 1088 R      2733
  1 22643     1 nfsd     sd14     54  896 R      4420
  1 22643     1 nfsd     sd13     54  832 R      5296
  1 22643     1 nfsd     sd18     54 1152 R      7103
  1 22643     1 nfsd     sd47     54 3008 R      7240
  1 22643     1 nfsd     sd31     54 1984 R     14827
  1 22643     1 nfsd     sd32     54 2048 R     16653
  1 22643     1 nfsd     sd45     54 2880 R     17768
  1 22643     1 nfsd     sd30     54 1920 R     17962
  1 22643     1 nfsd     sd22     54 1408 R     18687
  1 22643     1 nfsd     sd8      54  512 R     19693
  1 22643     1 nfsd     sd33     54 2112 R     20659
  1 22643     1 nfsd     sd19     54 1216 R     22933
  1 22643     1 nfsd     sd43     54 2752 R     23602
  1 22643     1 nfsd     sd7      54  448 R     27458
  1 22643     1 nfsd     sd10     54  640 R     28715
  1 22643     1 nfsd     sd49     54 3136 R     29640
  1 22643     1 nfsd     sd5      54  320 R     46201
  1 22643     1 nfsd     sd11     54  704 R     46775
  1 22643     1 nfsd     sd24     54 1536 R     60011  <-- 0.06s
  0     0     0 sched    sd16     54 1024 W    140682
  0     0     0 sched    sd47     54 3008 W    142082
  0     0     0 sched    sd18     54 1152 W    151720
  0     0     0 sched    sd13     54  832 W    157107
  0     0     0 sched    sd14     54  896 W    159044
  0     0     0 sched    sd17     54 1088 W    164386
  0     0     0 sched    sd35     54 2240 W    167119
  0     0     0 sched    sd34     54 2176 W    168552
  0     0     0 sched    sd37     54 2368 W    171872
  0     0     0 sched    sd40     54 2560 W    193907
  0     0     0 sched    sd25     54 1600 W    195662
  0     0     0 sched    sd33     54 2112 W    196480

Disk response times seem really good actually.
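(Illustrative sketch, not a script from the original thread.) One way to confirm that the writes attributed to sched are the ZFS transaction-group flush rather than something else is to sum I/O bytes by process and direction with the io provider; the b_flags/b_bcount fields come from the standard io provider bufinfo_t:

#!/usr/sbin/dtrace -s
/* Sum I/O bytes by process and direction, reported every 10 seconds. */
io:::start
{
        @bytes[execname, args[0]->b_flags & B_READ ? "R" : "W"] =
            sum(args[0]->b_bcount);
}

tick-10s
{
        printa("  %-12s %s %@d bytes\n", @bytes);
        trunc(@bytes);
}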
iosnoop (http://www.brendangregg.com/DTrace/iosnoop) shows that sched is doing a lot of writing, mostly 1 KB writes. Testing on aeon suggests this is normal.

truss -D also points to this nfssys slowness:

[private:/tmp] root# truss -D -p `pgrep nfsd`
...
/2:      0.0001 lwp_create(0xFED71D20, LWP_DETACHED|LWP_SUSPENDED, 0xFED71F48) = 3266
/2:      0.0001 lwp_continue(3266) = 0
/3266:  260128.5978     lwp_create()    (returning as new lwp ...) = 0
/3266:   0.0034 setustack(0xFED4BC60)
/3266:          nfssys(14, 0xFEC40FC8)  (sleeping...)
/3266:  43.1653 nfssys(14, 0xFEC40FC8) = 0
/3266:   0.0074 schedctl() = 0xFEF9F020
/3266:   0.0002 lwp_sigmask(SIG_SETMASK, 0xFFBFFEFF, 0x0000FFF7) = 0xFFBFFEFF [0x0000FFFF]
/3266:   0.0002 lwp_kill(3268, SIG#0) Err#3 ESRCH
/3266:   0.0002 munmap(0xFEC20000, 4096) = 0
/3266:   0.0002 lwp_exit()

Locks...

[private:/tmp] root# lockstat -P sleep 10

Adaptive mutex spin: 20406 events in 10.026 seconds (2035 events/sec)

Count indv cuml rcnt     spin Lock                   Caller
-------------------------------------------------------------------------------
   32 100% 100% 0.00 59963092 0xffffffffb282a1c8     dmu_object_alloc+0x75
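(zvop_times.d itself is linked above; the following is only a rough sketch of the technique it uses, trimmed to three VOPs, not the actual script.) The idea is to pair fbt entry/return probes and aggregate per-call latency:

#!/usr/sbin/dtrace -s
/* Per-VOP call counts and average latency, in the spirit of zvop_times.d. */
fbt::zfs_create:entry,
fbt::zfs_lookup:entry,
fbt::zfs_getattr:entry
{
        self->ts[probefunc] = timestamp;
}

fbt::zfs_create:return,
fbt::zfs_lookup:return,
fbt::zfs_getattr:return
/self->ts[probefunc]/
{
        @calls[probefunc]  = count();
        @avg_ns[probefunc] = avg(timestamp - self->ts[probefunc]);
        self->ts[probefunc] = 0;
}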
Hey Ben - I need more time to look at this and connect some dots, but real quick...

Some nfsstat data that we could use to correlate with the local server activity would be interesting. zfs_create() seems to be the heavy hitter, but a periodic kernel profile (especially if we can catch a 97% sys period) would help:

# lockstat -i997 -Ik -s 10 sleep 60

Alternatively:

# dtrace -n 'profile-997hz / arg0 != 0 / { @s[stack()]=count(); }'

It would also be interesting to see what the zfs_create()'s are doing. Perhaps a quick:

# dtrace -n 'zfs_create:entry { printf("ZFS Create: %s\n", stringof(args[0]->v_path)); }'

It would also be interesting to see the network stats. Grab Brendan's nicstat and collect some samples...

Your reference to low traffic is in bandwidth, which, as you indicate, is really, really low. But the data, at least up to this point, suggests the workload is not data/bandwidth intensive but rather attribute intensive. Note again that zfs_create() is the heavy ZFS function, along with zfs_getattr. Perhaps it's the attribute-intensive nature of the load that is at the root of this.

I can spend more time on this tomorrow (traveling today).

Thanks,
/jim

Ben Rockwood wrote:
> [...]
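(A sketch building on Jim's zfs_create one-liner, not from the original thread; the args[0]->v_path / args[1] layout is assumed from Jim's example, and v_path can be NULL for some vnodes.) Aggregating instead of printing each call keeps the output manageable:

#!/usr/sbin/dtrace -s
/* Count zfs_create() calls by parent directory path and new file name. */
fbt::zfs_create:entry
/args[0]->v_path != NULL/
{
        @creates[stringof(args[0]->v_path), stringof(args[1])] = count();
}

END
{
        trunc(@creates, 20);    /* keep only the 20 busiest directory/name pairs */
        printa(@creates);
}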
Hi Ben,

Your sar output shows one core pegged pretty much constantly! From the Solaris Performance and Tools book, the SLP state value covers "the remainder of important events such as disk and network waits... along with other kernel wait events... kernel locks or condition variables also accumulate time in this state."

ZFS COUNT     zfs_create 4178
ZFS AVG TIME  zfs_create 71215587
ZFS SUM TIME  zfs_create 297538724997

I think it looks like the system must be spinning in zfs_create(). Looking in usr/src/uts/common/fs/zfs/zfs_vnops.c, there are a couple of places it could loop:

1129    /*
1130     * Create a new file object and update the directory
1131     * to reference it.
1132     */
....
1154    error = dmu_tx_assign(tx, zfsvfs->z_assign);
1155    if (error) {
1156            zfs_dirent_unlock(dl);
1157            if (error == ERESTART &&
1158                zfsvfs->z_assign == TXG_NOWAIT) {
1159                    dmu_tx_wait(tx);
1160                    dmu_tx_abort(tx);
1161                    goto top;
1162            }

and

1201    /*
1202     * Truncate regular files if requested.
1203     */
1204    if ((ZTOV(zp)->v_type == VREG) &&
1205        (zp->z_phys->zp_size != 0) &&
1206        (vap->va_mask & AT_SIZE) && (vap->va_size == 0)) {
1207            error = zfs_freesp(zp, 0, 0, mode, TRUE);
1208            if (error == ERESTART &&
1209                zfsvfs->z_assign == TXG_NOWAIT) {
1210                    /* NB: we already did dmu_tx_wait() */
1211                    zfs_dirent_unlock(dl);
1212                    VN_RELE(ZTOV(zp));
1213                    goto top;

I think the snoop would be very useful to pore over.

Cheers,

Alan
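(A sketch, not from the thread, of one way to test this retry-loop theory.) On fbt return probes arg1 is the function's return value, so watching dmu_tx_assign() return codes and dmu_tx_wait() callers would show whether zfs_create() is going around the goto-top loop:

#!/usr/sbin/dtrace -s
/* Distribution of dmu_tx_assign() return codes and dmu_tx_wait() callers. */
fbt::dmu_tx_assign:return
{
        @rc[arg1] = count();            /* non-zero (e.g. ERESTART) means the tx was not assigned */
}

fbt::dmu_tx_wait:entry
{
        @waiters[stack(10)] = count();
}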
Ben Rockwood wrote:
> I've got a Thumper doing nothing but serving NFS. It's using B43 with
> zil_disabled. The system is being consumed in waves, but by what I
> don't know. Notice vmstat:

We made several performance fixes in the NFS/ZFS area in recent builds, so if possible it would be great to upgrade you from snv_43. That said, there might be something else going on that we haven't accounted for.

> Step 1 was to disable any ZFS features that might consume large amounts of CPU:
>
> # zfs set compression=off joyous
> # zfs set atime=off joyous
> # zfs set checksum=off joyous

In our performance testing, we haven't found checksums to be anywhere near a large consumer of CPU, so I would recommend leaving that on (due to its benefits). I suspect your apps/clients don't depend on atime, so I think it's a good idea to turn that off; we've gotten better NFS performance with it off. More of a heads up, as it sounds like compression on/off isn't your problem. If you are not getting good I/O bandwidth with compression turned on, it's most likely due to:

6460622 zio_nowait() doesn't live up to its name

As Jim mentioned, using lockstat to figure out where your CPU is being spent is the first step. I've been using 'lockstat -kgIW -D 60 sleep 60'. That collects data for the top 60 callers over a 1 minute period. If you see 'mutex_enter' high up in the results, then we have at least mutex lock contention.

> Interestingly, using prstat -mL to monitor thread latency, we see that a handful of threads are the culprits consuming most of the CPU:
> [...]

The high SYS times being charged to the userland nfsd threads are representative of what the kernel threads are doing (which is most likely going to be NFS and ZFS).

> Running zvop_times.d (http://blogs.sun.com/erickustarz/resource/zvop_times.d) I get an idea of what ZFS is doing:
> [...]
> ZFS AVG TIME
> [...]
> zfs_create 71215587

An average of 71 msecs per create seems high. Doing a quantize instead of just an average on zfs_create() would be interesting (should have included that in my script)...

> [private:/tmp] root# lockstat -P sleep 10
>
> Adaptive mutex spin: 20406 events in 10.026 seconds (2035 events/sec)
>
> Count indv cuml rcnt     spin Lock                   Caller
> -------------------------------------------------------------------------------
>    32 100% 100% 0.00 59963092 0xffffffffb282a1c8     dmu_object_alloc+0x75

That's interesting. This is build snv_43, right? It doesn't exactly line up with current bits, but I would guess this is "mutex_enter(&osi->os_obj_lock);". Can you show us the output for a longer sampling period, such as 'lockstat -P -D 30 sleep 60'? The "Adaptive mutex spin" category is usually the most/only interesting one in my experience.

I would be curious to know your memory usage. What is the physical memory on the machine? And can you run:

# echo 'arc::print -d' | mdb -k
# echo '::kmastat' | mdb -k | sort -n -k 5 | tail -40

Matt has a nice little dscript that figures out how long txgs are taking (I'll attach that). I believe it will work with old bits (I defined "OLD" for that to work). Can you try running that?

Can you run 'zpool list' and 'zpool status'? I'd like to know how full your pool is.

Let's start with that info,
eric

[attachment: zfs_sync.d - http://mail.opensolaris.org/pipermail/zfs-discuss/attachments/20061207/9b11f986/attachment.ksh]
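(A sketch of the quantize Eric suggests, plus a rough stand-in for the txg-timing idea; the actual zfs_sync.d attachment is not reproduced in the archive, and spa_sync() is assumed here as a proxy for txg sync time.)

#!/usr/sbin/dtrace -s
/* Latency distribution for zfs_create() and time spent in each txg sync. */
fbt::zfs_create:entry  { self->cts = timestamp; }
fbt::zfs_create:return
/self->cts/
{
        @["zfs_create (ns)"] = quantize(timestamp - self->cts);
        self->cts = 0;
}

fbt::spa_sync:entry    { self->sts = timestamp; }
fbt::spa_sync:return
/self->sts/
{
        @["spa_sync, i.e. txg sync (ns)"] = quantize(timestamp - self->sts);
        self->sts = 0;
}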
Ben,

The attached dscript might help in determining the zfs_create issue. It prints:
- a count of all functions called from zfs_create
- average wall clock time of the 30 highest functions
- average cpu time of the 30 highest functions

Note, please ignore warnings of the following type:

dtrace: 1346 dynamic variable drops with non-empty dirty list

Neil.

Ben Rockwood wrote on 12/07/06 06:01:
> [...]
[attachment: zfs_create.d - http://mail.opensolaris.org/pipermail/zfs-discuss/attachments/20061207/34b9112f/attachment.ksh]
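(The attachment itself was scrubbed by the archive; the following is only a guess at the idea behind a script producing the three outputs Neil describes, not his actual code. The thread-local arrays are also the kind of thing that triggers the "dynamic variable drops" warnings he mentions.)

#!/usr/sbin/dtrace -s
/* Count, average wall time, and average CPU time of zfs functions called under zfs_create(). */

/* Clause order keeps zfs_create itself out of the per-callee stats. */
fbt:zfs::entry
/self->in/
{
        @calls[probefunc] = count();
        self->wts[probefunc] = timestamp;
        self->cts[probefunc] = vtimestamp;
}

fbt::zfs_create:entry  { self->in = 1; }

fbt:zfs::return
/self->in && self->wts[probefunc]/
{
        @wall[probefunc] = avg(timestamp - self->wts[probefunc]);
        @cpu[probefunc]  = avg(vtimestamp - self->cts[probefunc]);
        self->wts[probefunc] = 0;
        self->cts[probefunc] = 0;
}

fbt::zfs_create:return { self->in = 0; }

END
{
        trunc(@wall, 30);
        trunc(@cpu, 30);
        printa(@calls);
        printa(@wall);
        printa(@cpu);
}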
0.4 seconds of CPU on average to do a dmu_object_alloc is a wee sluggish! I suspect, however, it''s contention on osi->os_obj_lock as we don''t seem to be looping in dmu_object_alloc. I''m still confused though, I believe that locking an adaptive mutex will spin for a short period then context switch and so they shouldn''t be burning CPU - at least not .4s worth! EricK suggested running "lockstat -kgIW -D 60 sleep 60" which I think is the next step. Looking back at your original email, I see you had observed the same contention on os_obj_lock. I don''t enough much about the dmu to know if anything has been fixed in that area. Neil. Ben Rockwood wrote On 12/07/06 13:53,:> Neil Perrin wrote: > >> Ben, >> >> The attached dscript might help determining the zfs_create issue. >> It prints: >> - a count of all functions called from zfs_create >> - average wall count time of the 30 highest functions >> - average cpu time of the 30 highest functions >> >> Note, please ignore warnings of the following type: >> >> dtrace: 1346 dynamic variable drops with non-empty dirty list > > > The results are below. I keep bumping into dmu_object_alloc, but I''m > uncertain what is or isn''t normal in that reguard. > > benr. > > > [private:/tmp] root# dtrace -s zfs_create.d > dtrace: script ''zfs_create.d'' matched 2524 probes > dtrace: 508 dynamic variable drops with non-empty dirty list > ^C > CPU ID FUNCTION:NAME > 1 2 :END > counts > arc_read 1 > arc_set_callback 1 > buf_hash_find 1 > dbuf_read_done 1 > dmu_object_next 2 > dmu_zfetch_dofetch 2 > dmu_zfetch_fetch 2 > dmu_zfetch_fetchsz 2 > dnode_next_offset 2 > dnode_next_offset_level 2 > dbuf_hash_insert 3 > dbuf_set_data 3 > dmu_zfetch 3 > dmu_zfetch_find 3 > arc_buf_alloc 6 > zfs_v4_to_unix 6 > dbuf_prefetch 10 > mzap_addent 11 > mze_find 11 > mze_find_unused_cd 11 > mze_insert 11 > dbuf_findbp 13 > dmu_tx_check_ioerr 13 > dnode_block_freed 13 > spa_get_dsl 13 > buf_hash_remove 36 > arc_change_state 37 > arc_release 40 > dbuf_fix_old_data 40 > fzap_add 41 > fzap_add_cd 41 > zap_entry_create 41 > zap_increment_num_entries 41 > zap_put_leaf_maybe_grow_ptrtbl 41 > arc_tempreserve_clear 52 > dmu_tx_commit 52 > dsl_dir_tempreserve_clear 52 > txg_rele_to_sync 52 > zap_add 52 > zfs_dirent_unlock 52 > zfs_link_create 52 > zfs_log_create 52 > zil_commit 52 > zfs_mknode 55 > zfs_acl_chmod 56 > zfs_acl_fixup_canonical_six 56 > zfs_aclset_common 56 > zfs_mode_compute 56 > zfs_perm_init 56 > zfs_znode_dmu_init 56 > dbuf_create_bonus 57 > dmu_bonus_hold 57 > dmu_object_alloc 57 > zfs_acl_append 57 > zfs_znode_alloc 57 > dmu_tx_add_new_object 58 > dnode_allocate 58 > dmu_buf_set_user 59 > dmu_buf_update_user 59 > dbuf_create 60 > dbuf_read_impl 60 > dmu_zfetch_init 61 > dnode_create 61 > dnode_new_blkid 61 > dnode_setdblksz 61 > dnode_clear_range 62 > zio_buf_alloc 67 > zap_table_load 68 > dsl_dataset_dirty 69 > dbuf_fix_old_bonus_data 71 > fzap_lookup 74 > zap_lookup 74 > zap_leaf_array_create 82 > dmu_tx_hold_zap 85 > arc_tempreserve_space 86 > dmu_tx_assign 86 > dmu_tx_count_write 86 > dmu_tx_get_txg 86 > dmu_tx_try_assign 86 > dsl_dir_tempreserve_impl 86 > dsl_dir_tempreserve_space 86 > dsl_pool_adjustedsize 86 > spa_get_dspace 86 > txg_hold_open 86 > txg_rele_to_quiesce 86 > dmu_tx_create 95 > dmu_tx_create_dd 95 > zfs_acl_node_read 95 > zfs_acl_node_read_internal 95 > zfs_dirent_lock 95 > zfs_zaccess 95 > zfs_zaccess_common 95 > zfs_time_stamper_locked 109 > dbuf_hold_level 112 > bp_get_dasize 115 > dbuf_new_block 115 > dmu_tx_is_syncing 115 > fzap_checksize 115 > 
zap_deref_leaf 115 > zap_get_leaf_byblk 115 > zap_idx_to_blk 115 > zap_leaf_lookup 115 > zap_put_leaf 115 > dnode_cons 122 > zap_hash 126 > zap_lockdir 126 > zap_unlockdir 126 > dnode_add_ref 130 > zio_next_stage_async 133 > zio_pop_transform 133 > zio_ready 133 > zio_wait 133 > zio_wait_children_ready 133 > spa_config_enter 134 > spa_config_exit 134 > zio_clear_transform_stack 134 > zio_create 134 > zio_done 134 > zio_null 134 > zio_push_transform 134 > zio_root 134 > zio_wait_children_done 134 > zfs_acl_free 151 > zfs_acl_alloc 152 > zap_leaf_chunk_alloc 165 > mze_compare 168 > arc_buf_remove_ref 170 > remove_reference 170 > arc_released 173 > zfs_ace_access 184 > dmu_tx_hold_bonus 187 > arc_buf_add_ref 206 > add_reference 207 > arc_access_and_exit 207 > dmu_tx_willuse_space 229 > dnode_willuse_space 229 > dsl_dir_willuse_space 229 > dnode_setdirty 233 > dbuf_add_ref 256 > zio_wait_for_children 267 > dmu_tx_count_dnode 276 > spa_get_asize 287 > dsl_dataset_block_freeable 296 > dmu_buf_hold 309 > dmu_tx_hold_object_impl 333 > zfs_set_ace 336 > vdev_lookup_top 345 > zio_next_stage 401 > dbuf_will_dirty 424 > buf_hash 453 > dsl_dataset_prev_snap_txg 477 > dnode_rele 580 > dbuf_dirty 588 > dnode_hold 604 > dnode_hold_impl 664 > dsl_dir_dirty 817 > txg_list_add 886 > dmu_buf_get_user 907 > dbuf_hold 975 > dbuf_whichblock 978 > dbuf_rele 1099 > dbuf_hold_impl 1103 > dbuf_find 1113 > dbuf_hash 1116 > parent_delta 1119 > dsl_dir_estimated_space 1120 > dbuf_update_data 1222 > ldv 1301 > dbuf_read 1487 > > av wall clock > dnode_allocate 185006 > zap_table_load 193232 > dbuf_will_dirty 231433 > zap_entry_create 255027 > zap_lockdir 258613 > dmu_zfetch_find 259608 > dmu_zfetch 269971 > dnode_setdirty 309299 > dmu_zfetch_fetch 367224 > dmu_buf_hold 368589 > dmu_zfetch_dofetch 376191 > dbuf_dirty 386128 > dnode_hold_impl 535814 > dnode_hold 566143 > dmu_tx_hold_bonus 617734 > dmu_tx_hold_object_impl 663582 > dmu_tx_add_new_object 673663 > dmu_tx_hold_zap 706378 > dnode_new_blkid 761775 > fzap_lookup 782656 > dmu_bonus_hold 791209 > zap_get_leaf_byblk 841362 > zap_deref_leaf 973677 > zap_lookup 1085355 > zap_add 1837000 > fzap_add_cd 1916822 > fzap_add 1932422 > zfs_link_create 2353198 > dmu_object_alloc 403070558 > zfs_mknode 417633571 > > av cpu > dmu_tx_add_new_object 14083 > dbuf_will_dirty 14796 > dmu_buf_hold 17393 > dmu_tx_count_write 19129 > zfs_acl_chmod 19416 > dbuf_dirty 19525 > dmu_tx_try_assign 20067 > zap_lockdir 20951 > zap_table_load 21232 > dmu_tx_assign 21946 > zfs_perm_init 23304 > zap_increment_num_entries 25355 > dnode_allocate 25526 > zap_entry_create 28102 > dmu_zfetch_find 28629 > dmu_zfetch 30387 > dmu_bonus_hold 37614 > dmu_zfetch_fetch 38222 > dmu_zfetch_dofetch 39842 > zap_get_leaf_byblk 40255 > fzap_lookup 40518 > zap_deref_leaf 54850 > zap_lookup 65315 > dmu_tx_hold_zap 86567 > zap_add 157876 > fzap_add_cd 158310 > fzap_add 160400 > zfs_link_create 191034 > dmu_object_alloc 401750326 > zfs_mknode 416439036 >
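Neil's numbers above are averages; a distribution would show whether every dmu_object_alloc() is uniformly slow or whether a few badly contended calls are dragging the mean up (Eric makes the same point about zfs_create() further down in the thread). A minimal quantizing sketch - untested here, and assuming snv_43 still exposes entry/return probes for these functions under these names - might look like:

#!/usr/sbin/dtrace -s

zfs_create:entry,
dmu_object_alloc:entry
{
        self->ts[probefunc] = timestamp;
}

zfs_create:return,
dmu_object_alloc:return
/self->ts[probefunc]/
{
        @lat[probefunc] = quantize(timestamp - self->ts[probefunc]);
        self->ts[probefunc] = 0;
}

tick-60s
{
        exit(0);
}

A histogram with most calls in the tens of microseconds plus a long tail out to hundreds of milliseconds would point at lock contention rather than a uniformly slow allocator.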
Anton B. Rang
2006-Dec-07 23:11 UTC
[zfs-discuss] Re: A Plea for Help: Thumper/ZFS/NFS/B43
> I'm still confused though, I believe that locking an adaptive mutex will spin for a short
> period then context switch and so they shouldn't be burning CPU - at least not .4s worth!

An adaptive mutex will spin as long as the thread which holds the mutex is on CPU. If the lock is moderately contended, you can wind up with threads spinning for quite a while as ownership of the lock passes from thread to thread across CPUs.

Mutexes in Solaris tend to be most useful when they're held for very short periods of time; they also work pretty well if the owning thread blocks. If somebody is computing for quite a while while holding them (e.g. if dnode_next_offset is repeatedly called and is slow), they can waste a lot of time on other CPUs. In this case an rwlock usually works better.

Anton
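lockstat -P already names the hottest caller, but the lockstat DTrace provider can show the same spinning with full stacks, which helps distinguish Anton's scenario (ownership bouncing between CPUs) from a single long hold. A rough sketch - it assumes only that the adaptive-spin probe is available on this build; a predicate on the specific lock address would have to come from a fresh lockstat run, since 0xffffffffb282a1c8 will change across boots:

#!/usr/sbin/dtrace -s

lockstat:::adaptive-spin
{
        /*
         * Optionally add a predicate on arg0 (the lock address) to watch
         * just the suspected os_obj_lock reported by lockstat -P.
         */
        @spins[stack(5)] = count();
}

tick-60s
{
        trunc(@spins, 10);
        exit(0);
}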
Ben Rockwood wrote:> Eric Kustarz wrote: > >> Ben Rockwood wrote: >> > I''ve got a Thumper doing nothing but serving NFS. Its using B43 with >> > zil_disabled. The system is being consumed in waves, but by what I >> > don''t know. Notice vmstat: >> >> We made several performance fixes in the NFS/ZFS area in recent >> builds, so if possible it would be great to upgrade you from snv_43. >> That said, there might be something else going on that we haven''t >> accounted for. >> ... >> >>> >>> Step 1 was to disable any ZFS features that might consume large >>> amounts of CPU: >>> >>> # zfs set compression=off joyous >>> # zfs set atime=off joyous >>> # zfs set checksum=off joyous >> >> >> In our performance testing, we haven''t found checksums to be anywhere >> near a large consumer of CPU, so i would recommend leaving that on >> (due to its benefits). >> >> I suspect your apps/clients don''t depend on atime, so i think its a >> good idea to turn that off. We''ve gotten better NFS performance with >> this off. >> >> More of a heads up as it sounds like compression on/off isn''t your >> problem. If you are not getting good I/O BW with compression turned >> on, its most likely due to: >> 6460622 zio_nowait() doesn''t live up to its name >> >> As Jim, mentioned, using lockstat to figure out where your CPU is >> being spent is the first step. I''ve been using ''lockstat -kgIW -D 60 >> sleep 60''. That collects data for the top 60 callers for a 1 minute >> period. If you see ''mutex_enter'' high up in the results, then we have >> at least mutex lock contention. >> >> >> ... >> >> >>> Interestingly, using prstat -mL to monitor thread latency, we see >>> that a handful of threads are the culprates for consuming mass CPU: >>> >>> [private:/tmp] root# prstat -mL >>> PID USERNAME USR SYS TRP TFL DFL LCK SLP LAT VCX ICX SCL SIG >>> PROCESS/LWPID 22643 daemon 0.0 75 0.0 0.0 0.0 0.0 25 0.0 416 >>> 1 0 0 nfsd/1506 >>> 22643 daemon 0.0 75 0.0 0.0 0.0 0.0 25 0.0 415 0 0 0 >>> nfsd/1563 >>> 22643 daemon 0.0 74 0.0 0.0 0.0 0.0 26 0.0 417 0 0 0 >>> nfsd/1554 >>> 22643 daemon 0.0 74 0.0 0.0 0.0 0.0 26 0.0 419 0 0 0 >>> nfsd/1551 >>> 22643 daemon 0.0 0.2 0.0 0.0 0.0 0.0 26 74 418 0 0 0 >>> nfsd/1553 >>> 22643 daemon 0.0 0.2 0.0 0.0 0.0 0.0 100 0.0 417 0 0 0 >>> nfsd/1536 >>> 22643 daemon 0.0 0.2 0.0 0.0 0.0 0.0 100 0.0 417 0 0 0 >>> nfsd/1555 >>> 22643 daemon 0.0 0.2 0.0 0.0 0.0 0.0 100 0.0 418 0 0 0 >>> nfsd/1539 >>> 22643 daemon 0.0 0.2 0.0 0.0 0.0 0.0 100 0.0 417 0 0 0 >>> nfsd/1562 >>> 22643 daemon 0.0 0.2 0.0 0.0 0.0 0.0 100 0.0 418 0 0 0 >>> nfsd/1545 >>> 22643 daemon 0.0 0.2 0.0 0.0 0.0 0.0 100 0.0 417 0 0 0 >>> nfsd/1559 >>> 22643 daemon 0.0 0.2 0.0 0.0 0.0 0.0 100 0.0 419 1 0 0 >>> nfsd/1541 >>> 22643 daemon 0.0 0.2 0.0 0.0 0.0 0.0 100 0.0 417 0 0 0 >>> nfsd/1546 >>> 22643 daemon 0.0 0.2 0.0 0.0 0.0 0.0 100 0.0 417 0 0 0 >>> nfsd/1543 >>> 22643 daemon 0.0 0.2 0.0 0.0 0.0 0.0 100 0.0 418 0 0 0 >>> nfsd/1560 >>> Total: 33 processes, 218 lwps, load averages: 4.64, 6.20, 5.86 >> >> >> The high SYS times being charged to the userland nfsd threads is >> representative of what the kernel threads are doing (which is most >> likely going to be NFS and ZFS). 
>> >> >>> Running zvop_times.d >>> (http://blogs.sun.com/erickustarz/resource/zvop_times.d) I get an >>> idea of what ZFS is doing: >>> >>> [private:/] root# dtrace -s zvop_times.d >>> dtrace: script ''zvop_times.d'' matched 66 probes >>> ^C >>> CPU ID FUNCTION:NAME >>> 1 2 :END ZFS COUNT >>> >>> zfs_getsecattr 4 >>> zfs_space 70 >>> zfs_rename 111 >>> zfs_readdir 284 >>> zfs_read 367 >>> zfs_mkdir 670 >>> zfs_setattr 1054 >>> zfs_frlock 1562 >>> zfs_putpage 3916 >>> zfs_write 4110 >>> zfs_create 4178 >>> zfs_remove 7794 >>> zfs_fid 14960 >>> zfs_inactive 17637 >>> zfs_access 20809 >>> zfs_fsync 29668 >>> zfs_lookup 31273 >>> zfs_getattr 175457 >>> >>> >>> >>> ZFS AVG TIME >>> >>> zfs_fsync 2337 >>> zfs_getattr 2735 >>> zfs_access 2774 >>> zfs_fid 2948 >>> zfs_inactive 4081 >>> zfs_lookup 6067 >>> zfs_frlock 6750 >>> zfs_putpage 6840 >>> zfs_mkdir 9259 >>> zfs_getsecattr 10940 >>> zfs_setattr 15364 >>> zfs_space 25214 >>> zfs_readdir 25726 >>> zfs_remove 26117 >>> zfs_write 31942 >>> zfs_rename 60504 >>> zfs_read 144449 >>> zfs_create 71215587 >> >> >> An average of 71 msecs per create seems high. Doing a quantize >> instead of just an average on zfs_create() would be interesting >> (should have included that in my script)... >> >> ... >> >> >>> Locks... >>> >>> [private:/tmp] root# lockstat -P sleep 10 >>> >>> Adaptive mutex spin: 20406 events in 10.026 seconds (2035 events/sec) >>> >>> Count indv cuml rcnt spin Lock >>> Caller >>> ------------------------------------------------------------------------------- >>> >>> 32 100% 100% 0.00 59963092 0xffffffffb282a1c8 >>> dmu_object_alloc+0x75 \ >> >> >> That''s interesting. This is build snv_43, right? Doesn''t exactly >> line up with current bits, but i would guess this is " >> mutex_enter(&osi->os_obj_lock);". >> >> Can you show us the output for a longer sampling period, such as >> ''lockstat -P -D 30 sleep 60''? The "Adaptive mutex spin" category is >> usually the most/only interesting one in my experience. >> >> I would be curious to know your memory usage. What is the physical >> memory on the machine? >> >> And can you run: >> # echo ''arc::print -d'' | mdb -k >> # echo ''::kmastat'' | mdb -k | sort -n -k 5 | tail -40 >> >> Matt has a nice little dscript that figures out how long txgs are >> taking (i''ll attach that). I believe it will work with old bits (i >> defined "OLD" the for that to work). Can you try running that? >> >> Can you run ''zpool list'' and ''zpool status''? I''d like to know how >> full your pool is. >> >> Let''s start with that info, >> eric > > > > Here is a bunch of the data you asked for. I can''t tell you how much I > appreciate the help from everyone. > > Lockstat: > > [private:/tmp] root# lockstat -kgIW -D 60 sleep 60Hmm, looks like we caught the wrong 60 seconds :/ ... from the below, we have 78% idle CPU, 20% at and "below" nfssys. Though we still have good data here to go on. Most of the time of what the NFS server is doing CPU-wise is in zfs_mknode()/dmu_object_alloc() - 17%. Also, seeing 8% of the CPU in mutex_vector_enter() suggests lock contention (which from your previous posts is probably ''osi->os_obj_lock''). dmu_object_alloc() is being a pig at 17% because each of the three main functions called under the ''osi->os_obj_lock'' are being mini-pigs: each at 4%. Those three functions are: dnode_next_offset(), dnode_hold_impl(), dmu_object_next(). So i''m guessing there''s lots of files being created over NFS in one particular dataset? 
We should figure out how many creates/second you are doing over NFS (i should have put a timeout on the script). Here''s a real simple one (from your snoop it looked like you''re only doing NFSv3, so i''m not tracking NFSv4): " #!/usr/sbin/dtrace -s rfs3_create:entry, zfs_create:entry { @creates[probefunc] = count(); } tick-60s { exit(0); } " If i''m reading the code right and if you are just creating a lot of files in just one dataset, perhaps you could create another dataset or two in place of just the one - not sure if that is possible for you. It would be interesting data to see if this alleviates pain.> > Profiling interrupt: 23296 events in 60.046 seconds (388 events/sec) > > Count genr cuml rcnt nsec Hottest CPU+PIL > Caller > ------------------------------------------------------------------------------- > > 18290 79% ---- 0.00 21994 cpu[3] > thread_start 18102 78% ---- 0.00 21988 > cpu[3] idle 18024 77% ---- 0.00 > 21983 cpu[3] cpu_halt 4626 20% ---- > 0.00 25530 cpu[2] sys_syscall32 4626 20% > ---- 0.00 25530 cpu[2] nfssys 4626 > 20% ---- 0.00 25530 cpu[2] svc_do_run > 4624 20% ---- 0.00 25539 cpu[2] > svc_run 4522 19% ---- 0.00 25642 > cpu[2] svc_getreq 4502 19% ---- 0.00 > 25662 cpu[2] rfs_dispatch 4495 19% ---- > 0.00 25660 cpu[2] common_dispatch 3951 17% > ---- 0.00 26143 cpu[2] rfs3_create 3942 > 17% ---- 0.00 26156 cpu[2] fop_create > 3941 17% ---- 0.00 26154 cpu[2] > zfs_create 3922 17% ---- 0.00 26150 > cpu[2] zfs_mknode 3913 17% ---- 0.00 > 26163 cpu[2] dmu_object_alloc 1845 8% ---- > 0.00 20288 cpu[0] mutex_vector_enter 970 4% > ---- 0.00 31591 cpu[2] dmu_object_next 917 > 4% ---- 0.00 31673 cpu[2] dnode_next_offset > 899 4% ---- 0.00 30573 cpu[2] > dnode_hold_impl 883 4% ---- 0.00 31639 > cpu[2] dnode_next_offset_level > 725 3% ---- 0.00 30896 cpu[2] > dbuf_hold_impl 627 3% ---- 0.00 28908 > cpu[2] mutex_enter 361 2% ---- 0.00 > 31135 cpu[2] dmu_buf_rele 350 2% ---- > 0.00 29945 cpu[2] dbuf_hold 350 2% > ---- 0.00 31147 cpu[2]+6 mac_rx 341 > 1% ---- 0.00 31344 cpu[2] arc_buf_add_ref > 303 1% ---- 0.00 31154 cpu[2]+6 > intr_thread 290 1% ---- 0.00 31074 > cpu[2]+6 av_dispatch_autovect 259 1% ---- 0.00 > 24534 cpu[1] zio_next_stage 257 1% ---- > 0.00 31007 cpu[2] arc_buf_remove_ref 244 1% > ---- 0.00 31165 cpu[2]+6 e1000g_intr 243 > 1% ---- 0.00 31507 cpu[2] buf_hash > 218 1% ---- 0.00 25154 cpu[0] > putnext 186 1% ---- 0.00 21154 > cpu[0] svc_sendreply 184 1% ---- 0.00 > 21194 cpu[0] svc_cots_ksend 175 1% ---- > 0.00 31147 cpu[2]+6 aggr_recv_cb 172 1% > ---- 0.00 31151 cpu[2]+6 i_dls_link_ether_rx 159 > 1% ---- 0.00 31155 cpu[2]+6 ip_input > 158 1% ---- 0.00 21275 cpu[0] > put 155 1% ---- 0.00 30127 > cpu[2] dbuf_find 152 1% ---- 0.00 > 21597 cpu[0] rmm_wput 147 1% ---- > 0.00 21688 cpu[0] mir_wput 141 1% > ---- 0.00 31153 cpu[2]+6 squeue_enter_chain 138 > 1% ---- 0.00 26152 cpu[3] zio_next_stage_async > 137 1% ---- 0.00 31165 cpu[2]+6 > tcp_rput_data 134 1% ---- 0.00 21688 > cpu[0] tcp_wput 128 1% ---- 0.00 > 21804 cpu[0] squeue_enter 123 1% ---- > 0.00 21659 cpu[0] tcp_output 104 0% > ---- 0.00 25306 cpu[1] zio_nowait 102 > 0% ---- 0.00 30574 cpu[2] atomic_add_ptr_nv > 91 0% ---- 0.00 22605 cpu[0] tcp_send_data > 89 0% ---- 0.00 21472 cpu[1] > nfs3_fhtovp 87 0% ---- 0.00 27427 > cpu[2] rw_enter 86 0% ---- 0.00 > 21176 cpu[1] fsop_vget 84 0% ---- > 0.00 24945 cpu[1] zio_vdev_io_start 81 0% > ---- 0.00 21110 cpu[1] zfs_vget 81 > 0% ---- 0.00 25006 cpu[3] zio_wait_for_children > 78 0% ---- 0.00 31423 cpu[2] atomic_add_ptr > 74 0% ---- 0.00 31148 cpu[2]+11 > rmm_rput 73 
0% ---- 0.00 22876 > cpu[0] dls_tx > ------------------------------------------------------------------------------- > > > > NFS Stats: > > [private:/tmp] root# nfsstat -s -v 3 > > Server rpc: > Connection oriented: > calls badcalls nullrecv badlen xdrcall dupchecks > dupreqs 154161041 0 0 0 0 > 35669015 0 Connectionless: > calls badcalls nullrecv badlen xdrcall dupchecks > dupreqs 0 0 0 0 0 > 0 0 > Server NFSv3: > calls badcalls 153760861 0 Version 3: (153482035 calls) > null getattr setattr lookup access > readlink 1 0% 68265095 44% 2334522 1% 33186002 21% 9373979 > 6% 4430 0% read write create mkdir > symlink mknod 776093 0% 6496524 4% 6353313 4% 1883254 > 1% 814 0% 2 0% remove rmdir rename > link readdir readdirplus 12084296 7% 2692 0% 208683 > 0% 5969808 3% 23751 0% 335079 0% fsstat fsinfo > pathconf commit 6 0% 1 0% 2 0% 6183688 4% > Server nfs_acl: > Version 3: (278383 calls) > null getacl setacl getxattrdir > 0 0% 32060 11% 5 0% 246318 88% > > Memory: > > [private:/tmp] root# echo ''arc::print -d'' | mdb -k > { > anon = ARC_anon > mru = ARC_mru > mru_ghost = ARC_mru_ghost > mfu = ARC_mfu > mfu_ghost = ARC_mfu_ghost > size = 0t73377331206.8GB in the ARC out of a possible 16GB (i think thats how much you have) is pretty decent.> p = 0t3673915285 > c = 0t7338496468 > c_min = 0t533448320 > c_max = 0t15996604416 > hits = 0t6864679166 > misses = 0t518711A 99% cache hit ratio - impressive!> deleted = 0t295199 > skipped = 0t1701 > hash_elements = 0t500010 > hash_elements_max = 0t511362 > hash_collisions = 0t1820702 > hash_chains = 0t149161 > hash_chain_max = 0t10 > no_grow = 0 > } > [private:/tmp] root# echo ''::kmastat'' | mdb -k | sort -n -k 5 | tail -40 > zio_buf_10240 10240 1524 1642 16814080 46367 0 > zio_buf_3072 3072 4663 5604 17215488 354215 0 > streams_dblk_1936 2048 43 8418 17240064 159700970 0 > zio_buf_77824 77824 221 226 17588224 1464 0 > zio_buf_28672 28672 660 664 19038208 9233 0 > zio_buf_32768 32768 578 595 19496960 74273 0 > zio_buf_3584 3584 4492 5952 21331968 186767 0 > zio_buf_24576 24576 876 885 21749760 15378 0 > zio_buf_14336 14336 1008 1556 22306816 2101355 0 > kmem_va_8192 8192 2361 2752 22544384 4067 0 > kmem_alloc_128 128 161894 173042 22863872 41686966 0 > zio_buf_4096 4096 6002 6016 24641536 193926 0 > kmem_magazine_15 128 207407 212939 28135424 237076 0 > zio_buf_5120 5120 5031 6144 31457280 126567 0 > zio_buf_20480 20480 1557 1567 32092160 39694 0 > kmem_alloc_8 8 4064087 4064240 32964608 15892691 0 > kmem_va_32768 32768 595 1028 33685504 1731 0 > kmem_va_24576 24576 885 1560 40894464 2278 0 > kmem_alloc_112 112 289900 366192 41664512 7743803 0 > kmem_alloc_40 40 806400 1048077 42504192 14060780 0 > kmem_alloc_256 256 187308 191970 52420608 33326285 0 > arc_buf_hdr_t 128 500219 512089 67661824 14104876 0 > kmem_slab_cache 56 1239939 1250640 71147520 1612006 0 > kmem_va_12288 12288 4609 5690 74579968 8137 0 > zfs_znode_cache 192 544134 545643 106426368 7771071 0ok, cool, the ''znode_cache'' is not a problem (we''re actually using most of the buffers allocated to the cache). 
That''s good news.> kmem_va_28672 28672 2651 3292 107872256 6270 0 > rootnex_dmahdl 2504 42668 42832 109649920 32118262 0 > kmem_io_64G_4096 4096 32768 32768 134217728 32768 0 > vn_cache 240 552163 552255 150802432 634861 0 > kmem_bufctl_cache 24 7085442 7085568 172752896 8534442 0 > kmem_va_20480 20480 9939 11568 252706816 14227 0 > zio_buf_512 512 578348 604080 309288960 29468142 0 > dmu_buf_impl_t 328 874033 908160 309985280 21869642 0Same thing for the ''dbuf_cache''.> kmem_alloc_96 96 3879869 3879918 378384384 21906309 0 > zio_buf_65536 65536 7269 7273 476643328 1368534 0 > kmem_va_4096 4096 1178724 1178752 533200896 1485372 0 > dnode_t 640 4641005 4641006 3168260096 8165695 0And the ''dnode_cache''.> zio_buf_131072 131072 25251 25256 3310354432 245308 0 > zio_buf_16384 16384 214643 214667 3517104128 747095 0 > kmem_va_16384 16384 214783 217944 3570794496 375400 0 > ZPool (I''m omit''ing zfs list output because the filesystems are named by > customer numbers):Ooops, i think the ''zpool list'' command got cut out. Mostly curious at what percentage of capacity you''re at.> > [private:/tmp] root# zpool status > pool: joyous > state: ONLINE > scrub: none requested > config: > > NAME STATE READ WRITE CKSUM > joyous ONLINE 0 0 0 > raidz2 ONLINE 0 0 0 > c5t4d0 ONLINE 0 0 0 > c4t4d0 ONLINE 0 0 0 > c7t4d0 ONLINE 0 0 0 > c6t4d0 ONLINE 0 0 0 > c1t4d0 ONLINE 0 0 0 > c0t4d0 ONLINE 0 0 0 > c4t0d0 ONLINE 0 0 0 > c7t0d0 ONLINE 0 0 0 > c6t0d0 ONLINE 0 0 0 > c1t0d0 ONLINE 0 0 0 > c0t0d0 ONLINE 0 0 0 > raidz2 ONLINE 0 0 0 > c5t5d0 ONLINE 0 0 0 > c4t5d0 ONLINE 0 0 0 > c7t5d0 ONLINE 0 0 0 > c6t5d0 ONLINE 0 0 0 > c1t5d0 ONLINE 0 0 0 > c0t5d0 ONLINE 0 0 0 > c4t1d0 ONLINE 0 0 0 > c7t1d0 ONLINE 0 0 0 > c6t1d0 ONLINE 0 0 0 > c1t1d0 ONLINE 0 0 0 > c0t1d0 ONLINE 0 0 0 > raidz2 ONLINE 0 0 0 > c5t6d0 ONLINE 0 0 0 > c4t6d0 ONLINE 0 0 0 > c7t6d0 ONLINE 0 0 0 > c6t6d0 ONLINE 0 0 0 > c1t6d0 ONLINE 0 0 0 > c0t6d0 ONLINE 0 0 0 > c4t2d0 ONLINE 0 0 0 > c7t2d0 ONLINE 0 0 0 > c6t2d0 ONLINE 0 0 0 > c1t2d0 ONLINE 0 0 0 > c0t2d0 ONLINE 0 0 0 > raidz2 ONLINE 0 0 0 > c5t7d0 ONLINE 0 0 0 > c4t7d0 ONLINE 0 0 0 > c7t7d0 ONLINE 0 0 0 > c6t7d0 ONLINE 0 0 0 > c1t7d0 ONLINE 0 0 0 > c0t7d0 ONLINE 0 0 0 > c4t3d0 ONLINE 0 0 0 > c7t3d0 ONLINE 0 0 0 > c6t3d0 ONLINE 0 0 0 > c1t3d0 ONLINE 0 0 0 > c0t3d0 ONLINE 0 0 0 > spares > c5t1d0 AVAIL c5t2d0 AVAIL c5t3d0 > AVAIL > errors: No known data errors > > > ZFS_Sync Dtrace: > > [private:/tmp] root# ./zfs_sync.d > txg time sync-write async-write sync-read > async-read > 0 5206ms. 0/s 0MB/s; 0/s 0MB/s; 10/s 0MB/s; 0/s > 0MB/s > 2234681 281ms. 156/s 0MB/s; 32249/s 29MB/s; 14/s 0MB/s; > 0/s 0MB/s > 2234682 249ms. 176/s 0MB/s; 23216/s 23MB/s; 40/s 2MB/s; > 0/s 0MB/s > 0 5840ms. 0/s 0MB/s; 0/s 0MB/s; 24/s 1MB/s; 0/s > 0MB/s > 0 6239ms. 0/s 0MB/s; 0/s 0MB/s; 67/s 4MB/s; 0/s > 0MB/s > 2234683 427ms. 103/s 0MB/s; 22346/s 24MB/s; 117/s 7MB/s; > 0/s 0MB/s > 0 6729ms. 0/s 0MB/s; 0/s 0MB/s; 6/s 0MB/s; 0/s > 0MB/s > 2234684 228ms. 192/s 0MB/s; 24052/s 23MB/s; 17/s 1MB/s; > 0/s 0MB/s > 0 4770ms. 0/s 0MB/s; 0/s 0MB/s; 7/s 0MB/s; 0/s > 0MB/s > 2234685 212ms. 207/s 0MB/s; 25382/s 20MB/s; 0/s 0MB/s; > 0/s 0MB/s > 0 4787ms. 0/s 0MB/s; 0/s 0MB/s; 31/s 1MB/s; 0/s > 0MB/s > 2234686 215ms. 200/s 0MB/s; 19544/s 17MB/s; 18/s 1MB/s; > 0/s 0MB/s > 2234687 197ms. 223/s 0MB/s; 25253/s 23MB/s; 15/s 0MB/s; > 0/s 0MB/s > 0 8010ms. 0/s 0MB/s; 0/s 0MB/s; 3/s 0MB/s; 0/s > 0MB/s > 0 5611ms. 0/s 0MB/s; 0/s 0MB/s; 4/s 0MB/s; 0/s > 0MB/s > 2234688 255ms. 172/s 0MB/s; 23250/s 21MB/s; 19/s 1MB/s; > 0/s 0MB/s > 0 5357ms. 
0/s 0MB/s; 0/s 0MB/s; 8/s 0MB/s; 0/s > 0MB/s > 2234689 249ms. 176/s 0MB/s; 21678/s 27MB/s; 100/s 6MB/s; > 0/s 0MB/s > 0 6465ms. 0/s 0MB/s; 0/s 0MB/s; 1/s 0MB/s; 0/s > 0MB/s > 2234690 240ms. 183/s 0MB/s; 20162/s 16MB/s; 25/s 1MB/s; > 0/s 0MB/sOk, the txgs look really good. Its taking them about 1/4 of a second to push out all the dirty data. We like txgs to take 5 seconds or under. In summary, at this data collection point everything looked good except dmu_object_alloc(). If you can grab the same lockstat data as before (''lockstat -kgIW -D 30 sleep 60'' and ''lockstat -P -D 30 sleep 60'') at the sametime ''vmstat 1'' shows no idle CPU, that would be great as we could verify. Perhaps the sleep times for the lockstat collection need to be bumped, or a couple successive runs grabbed, as i believe you were seeing spikey CPU usage - so we might have grabbed data in between the spikes. eric> > > And finally, 30 seconds of nicstat: > > Time Int rKB/s wKB/s rPk/s wPk/s rAvs wAvs > %Util Sat > 21:00:58 aggr0/1 130.3 136.2 657.5 629.0 203.0 221.8 > 0.22 0.00 > 21:00:58 e1000g1/0 19.00 17.87 112.6 103.4 172.7 177.0 > 0.03 0.00 > 21:00:58 aggr1 130.3 136.2 657.4 629.0 203.0 221.8 > 0.22 0.00 > 21:00:58 e1000g3/0 51.48 27.27 272.0 156.1 193.8 178.9 > 0.06 0.00 > 21:00:58 e1000g0 0.16 0.47 2.39 3.99 67.50 120.6 > 0.00 0.00 > 21:00:58 e1000g2/0 59.83 91.11 272.8 369.6 224.6 252.4 > 0.12 0.00 > 21:00:58 e1000g0/0 0.16 0.47 2.39 3.99 67.50 120.5 > 0.00 0.00 > Time Int rKB/s wKB/s rPk/s wPk/s rAvs wAvs > %Util Sat > 21:01:08 aggr0/1 1158.0 1213.2 6497.7 5838.5 182.5 212.8 > 1.94 0.00 > 21:01:08 e1000g1/0 506.0 158.4 2851.2 368.6 181.7 440.0 > 0.54 0.00 > 21:01:08 aggr1 1158.0 1213.1 6497.3 5838.0 182.5 212.8 > 1.94 0.00 > 21:01:08 e1000g3/0 567.9 31.65 3296.7 175.4 176.4 184.8 > 0.49 0.00 > 21:01:08 e1000g0 0.12 0.42 1.99 3.49 64.00 123.4 > 0.00 0.00 > 21:01:08 e1000g2/0 83.36 1023.5 345.1 5296.2 247.4 197.9 > 0.91 0.00 > 21:01:08 e1000g0/0 0.12 0.42 1.99 3.49 64.00 123.4 > 0.00 0.00 > Time Int rKB/s wKB/s rPk/s wPk/s rAvs wAvs > %Util Sat > 21:01:18 aggr0/1 762.1 876.4 4133.2 3775.1 188.8 237.7 > 1.34 0.00 > 21:01:18 e1000g1/0 492.2 171.6 2733.4 671.8 184.4 261.6 > 0.54 0.00 > 21:01:18 aggr1 762.1 876.5 4133.6 3775.6 188.8 237.7 > 1.34 0.00 > 21:01:18 e1000g3/0 191.2 25.95 1081.8 141.3 181.0 188.1 > 0.18 0.00 > 21:01:18 e1000g0 0.14 0.41 2.19 3.39 64.00 123.9 > 0.00 0.00 > 21:01:18 e1000g2/0 79.32 678.5 322.3 2960.3 252.0 234.7 > 0.62 0.00 > 21:01:18 e1000g0/0 0.14 0.41 2.19 3.39 64.00 123.9 > 0.00 0.00 > > > > We know that there are things that can be improved in our network and > we''re rushing to fix those things, but more and more I''ve become less > certain that this is a network issue at its heart. > > Again, all the help is greatly appreciated. > > benr. > >
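Catching one of the bursts inside a fixed lockstat window is hit-or-miss, as Eric notes. A complementary approach - a sketch only, not a replacement for the lockstat runs he asked for - is to leave a profile-provider script running across several waves and let it accumulate on-CPU kernel stacks; the sampling rate and run length below are arbitrary choices:

#!/usr/sbin/dtrace -s

profile-997
/arg0/
{
        /* arg0 != 0 means the sampled thread was running in the kernel */
        @stacks[stack()] = count();
}

tick-300s
{
        trunc(@stacks, 20);     /* keep only the 20 hottest stacks */
        exit(0);
}

If the waves are real, zfs_mknode()/dmu_object_alloc() and mutex_vector_enter() should dominate the surviving stacks, just as they do in the profiling-interrupt data above.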
eric kustarz wrote:
> So i'm guessing there's lots of files being created over NFS in one
> particular dataset?
>
> We should figure out how many creates/second you are doing over NFS (i
> should have put a timeout on the script). Here's a real simple one
> (from your snoop it looked like you're only doing NFSv3, so i'm not
> tracking NFSv4):
> "
> #!/usr/sbin/dtrace -s
>
> rfs3_create:entry,
> zfs_create:entry
> {
>         @creates[probefunc] = count();
> }
>
> tick-60s
> {
>         exit(0);
> }
> "

Eric, I love you.

Running this bit of DTrace revealed more than 4,000 files being created in almost any given 60 second window. And I've only got one system that would fit that sort of mass file creation: our Joyent Connector product's Courier IMAP server, which uses Maildir. As a test I simply shut down Courier and unmounted the mail NFS share for good measure, and sure enough the problem vanished and could not be reproduced. 10 minutes later I re-enabled Courier and our problem came back.

Clearly ZFS file creation is just amazingly heavy even with the ZIL disabled. If creating 4,000 files in a minute squashes 4 2.6GHz Opteron cores we're in big trouble in the longer term. In the meantime I'm going to find a new home for our IMAP mail so that the other things served from that NFS server at least aren't affected.

You asked for the zpool and zfs info, which I don't want to share because it's confidential (if you want it privately I'll do so, but not on a public list), but I will say that it's a single massive zpool in which we're using less than 2% of the capacity. But in thinking about this problem, even if we used 2 or more pools, the CPU consumption still would have choked the system, right? This leaves me really nervous about what we'll do when it's not an internal mail server that's creating all those files but a customer.

Oddly enough, this might be a very good reason to use iSCSI instead of NFS on the Thumper.

Eric, I owe you a couple cases of beer for sure. I can't tell you how much I appreciate your help. Thanks to everyone else who chimed in with ideas and suggestions, all of you guys are the best!

benr.
On Fri, Ben Rockwood wrote:> eric kustarz wrote: > >So i''m guessing there''s lots of files being created over NFS in one > >particular dataset? > > > >We should figure out how many creates/second you are doing over NFS (i > >should have put a timeout on the script). Here''s a real simple one > >(from your snoop it looked like you''re only doing NFSv3, so i''m not > >tracking NFSv4): > >" > >#!/usr/sbin/dtrace -s > > > >rfs3_create:entry, > >zfs_create:entry > >{ > > @creates[probefunc] = count(); > >} > > > >tick-60s > >{ > > exit(0); > >} > >" > > > Eric, I love you. > > Running this bit of DTrace reveled more than 4,000 files being created > in almost any given 60 second window. And I''ve only got one system that > would fit that sort of mass file creation: our Joyent Connector products > Courier IMAP server which uses Maildir. As a test I simply shutdown > Courier and unmounted the mail NFS share for good measure and sure > enough the problem vanished and could not be reproduced. 10 minutes > later I re-enabled Courier and our problem came back. > > Clearly ZFS file creation is just amazingly heavy even with ZIL > disabled. If creating 4,000 files in a minute squashes 4 2.6Ghz Opteron > cores we''re in big trouble in the longer term. In the meantime I''m > going to find a new home for our IMAP Mail so that the other things > served from that NFS server at least aren''t effected. > > You asked for the zpool and zfs info, which I don''t want to share > because its confidential (if you want it privately I''ll do so, but not > on a public list), but I will say that its a single massive Zpool in > which we''re using less than 2% of the capacity. But in thinking about > this problem, even if we used 2 or more pools, the CPU consumption still > would have choked the system, right? This leaves me really nervous > about what we''ll do when its not an internal mail server thats creating > all those files but a customer. > > Oddly enough, this might be a very good reason to use iSCSI instead of > NFS on the Thumper. > > Eric, I owe you a couple cases of beer for sure. I can''t tell you how > much I appreciate your help. Thanks to everyone else who chimed in with > ideas and suggestions, all of you guys are the best!Good to hear that you have figured out what is happening, Ben. For future reference, there are two commands that you may want to make use of in observing the behavior of the NFS server and individual filesystems. There is the trusty, nfsstat command. In this case, you would have been able to do something like: nfsstat -s -v3 60 This will provide all of the server side NFSv3 statistics on 60 second intervals. Then there is a new command fsstat that will provide vnode level activity on a per filesystem basis. Therefore, if the NFS server has multiple filesystems active and you want ot look at just one something like this can be helpful: fsstat /export/foo 60 Fsstat has a ''full'' option that will list all of the vnode operations or just certain types. It also will watch a filesystem type (e.g. zfs, nfs). Very useful. Spencer
On Fri, Dec 08, 2006 at 12:15:27AM -0800, Ben Rockwood wrote:
> Clearly ZFS file creation is just amazingly heavy even with the ZIL
> disabled. If creating 4,000 files in a minute squashes 4 2.6GHz Opteron
> cores we're in big trouble in the longer term. In the meantime I'm
> going to find a new home for our IMAP mail so that the other things
> served from that NFS server at least aren't affected.

For local tests, this is not true of ZFS. It seems that file creation only swamps us when coming over NFS. We can do thousands of files a second on a Thumper with room to spare if NFS isn't involved.

Next step is to figure out why NFS kills us.


--Bill
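One way to start on that question - a sketch only, assuming local and NFS-driven creates both funnel through zfs_create() and that the NFS server's service threads show up with execname nfsd, as they do in the prstat output earlier in the thread - is to compare create latency by caller:

#!/usr/sbin/dtrace -s

zfs_create:entry
{
        self->ts = timestamp;
}

zfs_create:return
/self->ts/
{
        /* nfsd vs. local tar/touch/etc. should separate cleanly here */
        @lat[execname] = quantize(timestamp - self->ts);
        self->ts = 0;
}

tick-60s
{
        exit(0);
}

If the nfsd histogram sits far to the right of a local mass-create run (a tar extract, say), the extra cost is in what the NFS path adds per create rather than in ZFS itself.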
Spencer Shepler wrote:
> Good to hear that you have figured out what is happening, Ben.
>
> For future reference, there are two commands that you may want to
> make use of in observing the behavior of the NFS server and individual
> filesystems.
>
> There is the trusty nfsstat command. In this case, you would have been
> able to do something like:
>       nfsstat -s -v3 60
>
> This will provide all of the server side NFSv3 statistics on 60 second
> intervals.
>
> Then there is a new command, fsstat, that will provide vnode level
> activity on a per filesystem basis. Therefore, if the NFS server
> has multiple filesystems active and you want to look at just one,
> something like this can be helpful:
>
>       fsstat /export/foo 60
>
> Fsstat has a 'full' option that will list all of the vnode operations
> or just certain types. It also will watch a filesystem type (e.g. zfs, nfs).
> Very useful.

nfsstat I've been using, but fsstat I was unaware of. Wish I'd used it rather than duplicating most of its functionality with a D script. :)

Thanks for the tip.

benr.
Bill Moore wrote:
> On Fri, Dec 08, 2006 at 12:15:27AM -0800, Ben Rockwood wrote:
>
>> Clearly ZFS file creation is just amazingly heavy even with the ZIL
>> disabled. If creating 4,000 files in a minute squashes 4 2.6GHz Opteron
>> cores we're in big trouble in the longer term. In the meantime I'm
>> going to find a new home for our IMAP mail so that the other things
>> served from that NFS server at least aren't affected.
>
> For local tests, this is not true of ZFS. It seems that file creation
> only swamps us when coming over NFS. We can do thousands of files a
> second on a Thumper with room to spare if NFS isn't involved.
>
> Next step is to figure out why NFS kills us.

Agreed. If mass file creation were a problem locally I'd think that we'd have people beating down the doors with complaints.

One thought I had as a workaround was to move all my mail on NFS to an iSCSI LUN and then put a zpool on that. I'm willing to bet that'd work fine. Hopefully I can try it.

To round out the discussion, the root cause of this whole mess was Courier IMAP locking. After isolating the problem last night and writing a little D script to find out what files were being created, it was obviously lock files; turning off locking dropped file creations to a reasonable level and our problem vanished.

If I can help at all with testing or analysis please let me know.

benr.
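Ben's script isn't included in the thread; a minimal sketch of that kind of "what is being created" tracer - assuming the second argument to zfs_create() is still the component name, as it is in the OpenSolaris source of this era - might be:

#!/usr/sbin/dtrace -s

zfs_create:entry
{
        /* arg1 is assumed to be the component name being created */
        @names[stringof((char *)arg1)] = count();
}

tick-60s
{
        trunc(@names, 30);      /* top 30 names is plenty to spot lock-file churn */
        exit(0);
}

A minute dominated by Maildir-style dot-lock names would confirm the Courier locking diagnosis.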
Ben Rockwood wrote:
> Bill Moore wrote:
>
>> On Fri, Dec 08, 2006 at 12:15:27AM -0800, Ben Rockwood wrote:
>>
>>> Clearly ZFS file creation is just amazingly heavy even with the ZIL
>>> disabled. If creating 4,000 files in a minute squashes 4 2.6GHz
>>> Opteron cores we're in big trouble in the longer term. In the
>>> meantime I'm going to find a new home for our IMAP mail so that the
>>> other things served from that NFS server at least aren't affected.
>>
>> For local tests, this is not true of ZFS. It seems that file creation
>> only swamps us when coming over NFS. We can do thousands of files a
>> second on a Thumper with room to spare if NFS isn't involved.
>>
>> Next step is to figure out why NFS kills us.
>
> Agreed. If mass file creation were a problem locally I'd think that we'd
> have people beating down the doors with complaints.
>
> One thought I had as a workaround was to move all my mail on NFS to an
> iSCSI LUN and then put a zpool on that. I'm willing to bet that'd work
> fine. Hopefully I can try it.

Could be NFS synchronous semantics on file create (followed by repeated flushing of the write cache). What kind of storage are you using (feel free to send privately if you need to) - is it a thumper?

> To round out the discussion, the root cause of this whole mess was
> Courier IMAP locking. After isolating the problem last night and
> writing a little D script to find out what files were being created, it
> was obviously lock files; turning off locking dropped file creations to
> a reasonable level and our problem vanished.

Gotta love that D.

> If I can help at all with testing or analysis please let me know.

I want to create a test case in-house first and see if it's reproducible.

eric

ps: i like lots of different types of beers
> Could be NFS synchronous semantics on file create (followed by
> repeated flushing of the write cache). What kind of storage are you
> using (feel free to send privately if you need to) - is it a thumper?

It's not clear why NFS-enforced synchronous semantics would induce different behavior than the same load to a local ZFS.

File creates are metadata intensive, right? And these operations need to be synchronous to guarantee file system consistency (yes, I am familiar with the ZFS COW model).

Anyway... I'm feeling rather naive here, but I've seen the "NFS enforced synchronous semantics" phrase kicked around many times as the explanation for suboptimal performance for metadata-intensive operations when ZFS is the underlying file system, but I never really understood what's "unsynchronous" about doing the same thing to a local ZFS.

And yes, there is certainly a network latency component to the NFS configuration, so for any synchronous operation, I would expect things to be slower when done over NFS.

Awaiting enlightenment....

:^)

/jim
Jim Mauro wrote:
>> Could be NFS synchronous semantics on file create (followed by
>> repeated flushing of the write cache). What kind of storage are you
>> using (feel free to send privately if you need to) - is it a thumper?
>
> It's not clear why NFS-enforced synchronous semantics would induce
> different behavior than the same load to a local ZFS.

Actually i forgot he had 'zil_disable' turned on, so it won't matter in this case.

The big difference is that the NFS server does a VOP_FSYNC on each file create as it needs to guarantee that the changes are on stable storage. Locally, we don't need that same guarantee, so you don't have to wait for a disk I/O to complete on each create.

eric

> File creates are metadata intensive, right? And these operations need to
> be synchronous to guarantee file system consistency (yes, I am familiar
> with the ZFS COW model).
>
> Anyway... I'm feeling rather naive here, but I've seen the "NFS
> enforced synchronous semantics" phrase kicked around many times as the
> explanation for suboptimal performance for metadata-intensive
> operations when ZFS is the underlying file system, but I never really
> understood what's "unsynchronous" about doing the same thing to a
> local ZFS.
>
> And yes, there is certainly a network latency component to the NFS
> configuration, so for any synchronous operation, I would expect things
> to be slower when done over NFS.
>
> Awaiting enlightenment....
>
> :^)
>
> /jim
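To see that per-create sync directly - and to check whether the ~29,000 zfs_fsync calls in the earlier zvop_times output really come from the NFS create path - something like the following sketch (fbt probe availability on this build assumed) aggregates the kernel stacks issuing the fsyncs:

#!/usr/sbin/dtrace -s

zfs_fsync:entry
{
        /* attribute each fsync to the kernel code path that issued it */
        @callers[stack(6)] = count();
}

tick-60s
{
        trunc(@callers, 10);
        exit(0);
}

If rfs3_create and the common NFS dispatch path dominate the stacks, that is the NFS-only cost being discussed; with zil_disable set those calls should at least be cheap, which matches the low zfs_fsync average time in the zvop_times data.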
On Dec 9, 2006, at 8:59, Jim Mauro wrote:
> Anyway... I'm feeling rather naive here, but I've seen the "NFS
> enforced synchronous semantics" phrase kicked around many times as the
> explanation for suboptimal performance for metadata-intensive
> operations when ZFS is the underlying file system, but I never really
> understood what's "unsynchronous" about doing the same thing to a
> local ZFS.

If I remember correctly, the difference is that NFS requires that the operation be committed to stable storage before the return to the client. This is definitely a heavier operation than the local case, where the return to the caller may happen as soon as the operation is cached. If there's a crash, the local system does not guarantee to the caller that the operation is on disk, but NFS does.

Both guarantee consistency, but NFS makes stronger guarantees of completeness.

--Ed
Casper.Dik at Sun.COM
2006-Dec-09 19:57 UTC
[zfs-discuss] A Plea for Help: Thumper/ZFS/NFS/B43
>On Dec 9, 2006, at 8:59, Jim Mauro wrote:
>> Anyway... I'm feeling rather naive here, but I've seen the "NFS
>> enforced synchronous semantics" phrase kicked around many times as the
>> explanation for suboptimal performance for metadata-intensive
>> operations when ZFS is the underlying file system, but I never really
>> understood what's "unsynchronous" about doing the same thing to a
>> local ZFS
>
>If I remember correctly, the difference is that NFS requires that the
>operation be committed to stable storage before the return to the
>client. This is definitely a heavier operation than the local case,
>where the return to the caller may happen as soon as the operation is
>cached. If there's a crash, the local system does not guarantee to
>the caller that the operation is on disk, but NFS does.
>
>Both guarantee consistency but NFS makes stronger guarantees of
>completeness.

The current way to solve this from the NFS server's perspective is to call something equivalent to "sync()". It was my understanding that disabling the ZIL would make such operations more or less "asynchronous", but it still seems that NFS is doing a sync()-like thing.

Perhaps there should be a different mechanism for making VOP_* calls: calls which do not return until the specific call was committed to stable storage. (Interested parties get notified when such events happen.)

Unfortunately, that would cause the NFS client system calls to sleep until the calls are completed (as they do now), but there doesn't seem to be a need for that; writes currently need a specific commit and it would perhaps be worthwhile to investigate more asynchronous behaviour. As long as the client remembers the non-committed calls, there is no "data loss" issue; when the client crashes, you expect data loss; when the server crashes, the client resends.

Casper
Robert Milkowski
2006-Dec-11 00:33 UTC
[nfs-discuss] Re: [zfs-discuss] A Plea for Help: Thumper/ZFS/NFS/B43
Hello eric,

Saturday, December 9, 2006, 7:07:49 PM, you wrote:

ek> Jim Mauro wrote:
>>> Could be NFS synchronous semantics on file create (followed by
>>> repeated flushing of the write cache). What kind of storage are you
>>> using (feel free to send privately if you need to) - is it a thumper?
>>
>> It's not clear why NFS-enforced synchronous semantics would induce
>> different behavior than the same load to a local ZFS.

ek> Actually i forgot he had 'zil_disable' turned on, so it won't matter in
ek> this case.

Ben, are you sure zil_disable was set to 1 BEFORE the pool was imported?

--
Best regards,
 Robert                          mailto:rmilkowski at task.gda.pl
                                 http://milek.blogspot.com
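A quick way to check what the running kernel actually has - a sketch, assuming the tunable is the zfs module's global zil_disable variable that "set zfs:zil_disable = 1" in /etc/system sets - is to read the variable directly:

#!/usr/sbin/dtrace -s

BEGIN
{
        /* zfs`zil_disable is assumed to be the module global the tunable sets */
        printf("zil_disable is currently %d\n", zfs`zil_disable);
        exit(0);
}

A value of 1 only confirms the current setting; whether it was in place before the pool was imported, which is what Robert is asking, still has to come from the /etc/system and reboot history.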
Ben Rockwood
2006-Dec-11 20:34 UTC
[nfs-discuss] Re: [zfs-discuss] A Plea for Help: Thumper/ZFS/NFS/B43
Robert Milkowski wrote:
> Hello eric,
>
> Saturday, December 9, 2006, 7:07:49 PM, you wrote:
>
> ek> Jim Mauro wrote:
>
>>>> Could be NFS synchronous semantics on file create (followed by
>>>> repeated flushing of the write cache). What kind of storage are you
>>>> using (feel free to send privately if you need to) - is it a thumper?
>>>
>>> It's not clear why NFS-enforced synchronous semantics would induce
>>> different behavior than the same load to a local ZFS.
>
> ek> Actually i forgot he had 'zil_disable' turned on, so it won't matter in
> ek> this case.
>
> Ben, are you sure zil_disable was set to 1 BEFORE the pool was imported?

Yes, absolutely. Set the var in /etc/system, reboot, system comes up. That happened almost 2 months ago, long before this lock insanity problem popped up.

To be clear, the ZIL issue was a problem for creation of a handful of files of any size. Untar'ing a file was a massive performance drain. This issue, on the other hand, deals with thousands of little files being created all the time (IMAP locks). These are separate issues from my point of view. With ZIL slowness NFS performance was just slow, but we didn't see massive CPU usage; with this issue, on the other hand, we were seeing waves in 10-second-ish cycles where the run queue would go sky high with 0 idle. Please see the earlier mails for examples of the symptoms.

benr.
Robert Milkowski
2006-Dec-12 01:56 UTC
[nfs-discuss] Re: [zfs-discuss] A Plea for Help: Thumper/ZFS/NFS/B43
Hello Ben, Monday, December 11, 2006, 9:34:18 PM, you wrote: BR> Robert Milkowski wrote:>> Hello eric, >> >> Saturday, December 9, 2006, 7:07:49 PM, you wrote: >> >> ek> Jim Mauro wrote: >> >>>>> Could be NFS synchronous semantics on file create (followed by >>>>> repeated flushing of the write cache). What kind of storage are you >>>>> using (feel free to send privately if you need to) - is it a thumper? >>>>> >>>> It''s not clear why NFS-enforced synchronous semantics would induce >>>> different behavior than the same >>>> load to a local ZFS. >>>> >> >> ek> Actually i forgot he had ''zil_disable'' turned on, so it won''t matter in >> ek> this case. >> >> >> Ben, are you sure zil_disable was set to 1 BEFORE pool was imported? >>BR> Yes, absolutely. Set var in /etc/system, reboot, system come up. That BR> happened almost 2 months ago, long before this lock insanity problem BR> popped up. BR> To be clear, the ZIL issue was a problem for creation of a handful of BR> files of any size. Untar''ing a file was a massive performance drain. BR> This issue, other the other hand, deals with thousands of little files BR> being created all the time (IMAP Locks). These are separate issues from BR> my point of view. With ZIL slowness NFS performance was just slow but BR> we didn''t see massive CPU usage, with this issue on the other hand we BR> were seeing waves in 10 second-ish cycles where the run queue would go BR> sky high with 0 idle. Please see the earlier mails for examples of the BR> symptoms. Ok. And just another question - is nfs file system mounted with noac options on imap server and application is doing chdir() to nfs directories? I''m not sure if it''s fixed - if not your nfs client on every chdir() will generate lot of small traffic to nfs server starving it of cpu. -- Best regards, Robert mailto:rmilkowski at task.gda.pl http://milek.blogspot.com
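If noac-driven attribute chatter were the culprit, it should be visible server-side as a flood of getattr/access calls tracking the CPU waves. A small sketch (assuming the NFSv3 server entry points keep the rfs3_* names already used in this thread) that prints per-interval call counts:

#!/usr/sbin/dtrace -s

rfs3_getattr:entry,
rfs3_access:entry,
rfs3_lookup:entry,
rfs3_create:entry
{
        @calls[probefunc] = count();
}

tick-10s
{
        printa(@calls);
        trunc(@calls);          /* reset for the next interval */
}

nfsstat -s already exposes the same counters (getattr was 44% of calls in Ben's output); this just makes it easy to line the rate up against the 10-second waves.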
Richard Elling
2006-Dec-12 05:15 UTC
[nfs-discuss] Re: [zfs-discuss] A Plea for Help: Thumper/ZFS/NFS/B43
BR> Yes, absolutely. Set the var in /etc/system, reboot, system comes up. That
BR> happened almost 2 months ago, long before this lock insanity problem
BR> popped up.

For the archives, a high level of lock activity can always be a problem. The worst cases I've experienced were with record locking over NFS. The worst offenders were programs written with a local file system in mind, especially PC-based applications. This has become one of the first things I check in such instances, and the network traffic will be consistently full of lock activity. Just one of those things you need to watch out for in a networked world :-)

 -- richard
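In this thread the "locks" turned out to be dot-lock file creates rather than NLM record locks, but both kinds are easy to spot from the server side. A hedged sketch, using only VOP-level function names that already appear in the zvop_times output earlier in the thread:

#!/usr/sbin/dtrace -s

/*
 * zfs_frlock            - record/advisory locking reaching ZFS
 * zfs_create/zfs_remove - dot-lock style lock files show up here
 */
zfs_frlock:entry,
zfs_create:entry,
zfs_remove:entry
{
        @ops[probefunc, execname] = count();
}

tick-60s
{
        exit(0);
}

A window where zfs_create/zfs_remove counts for nfsd dwarf everything else looks like dot-lock churn; a pile of zfs_frlock calls points at record locking instead.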