Robert Milkowski
2007-Feb-12 11:45 UTC
[zfs-discuss] NFS/ZFS performance problems - txg_wait_open() deadlocks?
Hi.

System is snv_56, sun4u sparc SUNW,Sun-Fire-V440, zil_disable=1.

We see many operations on NFS clients of that server being really slow
(like 90 seconds for unlink()). It's not a problem with the network, and
there's plenty of CPU available. Storage isn't saturated either.

First strange thing - normally on that server nfsd has about 1500-2500
threads. I did 'svcadm disable nfs/server' while watching nfsd with
prstat. It quickly exited many threads, but 651 of them were still there.
Then it took almost 3 minutes to completely exit nfsd - during all those
3 minutes all 651 threads were still there, and then suddenly, in seconds,
it exited all of them. While it was hanging I did:

> ::ps!grep nfsd
> 000006000a729068::walk thread|::findstack -v

with output logged to a file. I attached the output as nfsd.txt.gz.

You can see that one thread was in:

stack pointer for thread 30003621660: 2a1038c7021
[ 000002a1038c7021 cv_wait+0x3c() ]
  000002a1038c70d1 exitlwps+0x10c(0, 100000, 42100002, 6000a729068, 6000a72912e, 42000002)
  000002a1038c7181 proc_exit+0x1c(1, 0, ff000000, 1, 0, e965)
  000002a1038c7231 exit+8(1, 0, ffbff9d0, 1, ff164770, 16419)
  000002a1038c72e1 syscall_trap32+0xcc(0, 0, ffbff9d0, 1, ff164770, 16419)
stack pointer for thread 3000351b620: 2a101f864b1

which is ok. However all the other threads (650 of them) were in cv_wait:

bash-3.00# cat nfsd.txt | grep -c "nfssys+"
650
bash-3.00# cat nfsd.txt | grep -c "cv_wait+"
651
bash-3.00#
bash-3.00# cat nfsd.txt | grep -c "txg_wait_open+"
360
bash-3.00# cat nfsd.txt | grep "txg_wait_open+" | grep -c "337d1e"
360

So all those txg_wait_open() calls are for the same transaction group -
txg_wait_open+0x58(6000ad0c910, 337d1e,

All the other threads were in:

bash-3.00# cat nfsd.txt | grep -c "zfs_range_lock_reader+"
289

Several different dnodes (though several threads were waiting on the same
dnode).

Also, using 'zpool iostat 1' while nfsd was shutting down, I could see
mostly reads and some writes only every 40-60s.

Why was nfsd shutting down for almost 3 minutes? Because we were sitting
in txg_wait_open() for so long - but why?

Now, during "normal" operation, when the server is serving data via nfsd,
'zpool iostat 1' looks like:

bash-3.00# zpool iostat 1
               capacity     operations    bandwidth
pool         used  avail   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
f3-2         466G   758G    562    148  31.9M  1.68M
f3-2         466G   758G  1.38K      0  82.1M      0
f3-2         466G   758G  1.27K      0  77.4M      0
f3-2         466G   758G  1.26K      0  76.7M      0
f3-2         466G   758G  1.23K      0  68.7M      0
f3-2         466G   758G  1.06K      0  52.2M      0
f3-2         466G   758G    931      0  31.2M      0
f3-2         466G   758G   1013      0  30.6M      0
f3-2         466G   758G    938      0  32.7M      0
f3-2         466G   758G    942      0  16.5M      0
f3-2         466G   758G  1.13K      0  41.7M      0
f3-2         466G   758G    818   1005  3.86M  10.7M
f3-2         466G   758G  1.02K     26  35.6M   489K
f3-2         466G   758G  1.06K      0  41.9M      0
f3-2         466G   758G    555      0  13.7M      0
f3-2         466G   758G    943      0  44.3M      0
f3-2         466G   758G     1K      0  26.3M      0
f3-2         466G   758G    928      0  16.1M      0
f3-2         466G   758G    854      0  31.3M      0
f3-2         466G   758G    903      0  29.7M      0
f3-2         466G   758G    917      0  26.2M      0
f3-2         466G   758G  1.01K      0  33.4M      0
f3-2         466G   758G    918      0  20.2M      0
f3-2         466G   758G  1.07K      0  41.6M      0
f3-2         466G   758G    967      0  12.6M      0
f3-2         466G   758G    973      0  33.8M      0
f3-2         466G   758G    871      0  13.6M      0
f3-2         466G   758G  1.01K      0  28.7M      0
f3-2         466G   758G    894      0  21.3M      0
f3-2         466G   758G    902      0  25.3M      0
f3-2         466G   758G  1.11K      0  33.3M      0
f3-2         466G   758G    935      0  27.4M      0
f3-2         466G   758G    930      0  28.5M      0
f3-2         466G   758G    774      0  12.9M      0
f3-2         466G   758G    900      0  31.7M      0
f3-2         466G   758G    902      0  23.0M      0
f3-2         466G   758G    867      0  20.3M      0
f3-2         466G   758G    999      0  39.1M      0
f3-2         466G   758G    774      0  12.9M      0
^C

Why don't I see writes every 5s?
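One rough way to tell whether write zios are being created at all during
those long quiet stretches, or whether everything is held up before that
point, is a counting one-liner like the following. This is an untested
sketch: it assumes zio_write() is the function that creates write zios on
this build, and it counts zio creations, not actual disk I/O.

bash-3.00# dtrace -n 'fbt::zio_write:entry{@writes = count();}
    tick-5s{printa("write zios created in the last 5s: %@d\n", @writes); clear(@writes);}'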
There''s definitely something to write all the time. I''ve migrated (zfs send|recv) one of a file systems in f3-2 to another server (x4500) and there''re no problems so far and I can see regular writes every 5s. Maybe there''s a problem with multiple file systems in a pool or maybe data became so fragmented that is causing the problem? bash-3.00# dtrace -q -n BEGIN''{gt=timestamp;}'' -n fbt::txg_wait_open:entry''/execname=="nfsd"/{txg_c++;self->t1=timestamp;self->arg1=arg1;}'' -n fbt::txg_wait_open:return''/self->t1/{txg_c--;@[self->arg1]=quantize((timestamp-self->t1)/1000000000);self->t1=0;self->arg1=0;}'' -n tick-10s''{printa(@);printf("txg_count: %d\ntotal time in seconds: %d\n",txg_c,(timestamp-gt)/1000000000);}'' txg_count: 0 total time in seconds: 9 txg_count: 0 total time in seconds: 19 txg_count: 0 total time in seconds: 29 txg_count: 0 total time in seconds: 39 txg_count: 0 total time in seconds: 49 txg_count: 487 total time in seconds: 59 txg_count: 568 total time in seconds: 69 txg_count: 595 total time in seconds: 79 txg_count: 821 total time in seconds: 89 3374426 value ------------- Distribution ------------- count 2 | 0 4 |@@@@@@@@@@@ 226 8 | 0 16 |@ 27 32 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 568 64 | 0 txg_count: 742 total time in seconds: 99 3374427 value ------------- Distribution ------------- count 4 | 0 8 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 744 16 | 0 3374426 value ------------- Distribution ------------- count 2 | 0 4 |@@@@@@@@@@@ 226 8 | 0 16 |@ 27 32 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 568 64 | 0 txg_count: 843 total time in seconds: 109 3374427 value ------------- Distribution ------------- count 4 | 0 8 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 744 16 | 0 3374426 value ------------- Distribution ------------- count 2 | 0 4 |@@@@@@@@@@@ 226 8 | 0 16 |@ 27 32 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 568 64 | 0 txg_count: 890 total time in seconds: 119 3374427 value ------------- Distribution ------------- count 4 | 0 8 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 744 16 | 0 3374426 value ------------- Distribution ------------- count 2 | 0 4 |@@@@@@@@@@@ 226 8 | 0 16 |@ 27 32 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 568 64 | 0 txg_count: 890 total time in seconds: 129 3374427 value ------------- Distribution ------------- count 4 | 0 8 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 744 16 | 0 3374426 value ------------- Distribution ------------- count 2 | 0 4 |@@@@@@@@@@@ 226 8 | 0 16 |@ 27 32 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 568 64 | 0 txg_count: 890 total time in seconds: 139 3374427 value ------------- Distribution ------------- count 4 | 0 8 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 744 16 | 0 3374426 value ------------- Distribution ------------- count 2 | 0 4 |@@@@@@@@@@@ 226 8 | 0 16 |@ 27 32 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 568 64 | 0 txg_count: 890 total time in seconds: 149 3374427 value ------------- Distribution ------------- count 4 | 0 8 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 744 16 | 0 3374426 value ------------- Distribution ------------- count 2 | 0 4 |@@@@@@@@@@@ 226 8 | 0 16 |@ 27 32 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 568 64 | 0 txg_count: 890 total time in seconds: 159 3374427 value ------------- Distribution ------------- count 4 | 0 8 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 744 16 | 0 3374426 value ------------- Distribution ------------- count 2 | 0 4 |@@@@@@@@@@@ 226 8 | 0 16 |@ 27 32 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 568 64 | 0 3374428 value ------------- Distribution ------------- count 16 | 0 32 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 895 64 | 0 txg_count: 804 total time in seconds: 
169

  3374427
           value  ------------- Distribution ------------- count
               4 |                                         0
               8 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 744
              16 |                                         0

  3374426
           value  ------------- Distribution ------------- count
               2 |                                         0
               4 |@@@@@@@@@@@                              226
               8 |                                         0
              16 |@                                        27
              32 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@             568
              64 |                                         0

  3374428
           value  ------------- Distribution ------------- count
              16 |                                         0
              32 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 895
              64 |                                         0

txg_count: 804
total time in seconds: 179
^C

We're stuck in txg_wait_open() even for over 64 seconds, with hundreds of
threads waiting for the same transaction group. Why is that?

Also note, in the pstack output for nfsd I attached, that no thread is in
zio_wait(), so we're not waiting for disks.

bash-3.00# cat zfs_check.d
#!/usr/sbin/dtrace -s

fbt::zfs_range_lock:entry
/execname=="nfsd"/
{
        self->t=timestamp;
}

fbt::zfs_range_lock:return
/self->t/
{
        @range=quantize((timestamp-self->t)/1000000);
}

fbt::txg_wait_open:entry
/execname=="nfsd"/
{
        self->t2=timestamp;
}

fbt::txg_wait_open:return
/self->t2/
{
        @txg = quantize((timestamp-self->t2)/1000000);
}

fbt::zio_wait:entry
/execname=="nfsd"/
{
        self->t3=timestamp;
}

fbt::zio_wait:return
/self->t3/
{
        @zio = quantize((timestamp-self->t3)/1000000);
}

tick-30s
{
        printf("\nrange_lock +++++++++++++++++++");
        printa(@range);
        printf("\ntxg wait +++++++++++++++++++");
        printa(@txg);
        printf("\nzio wait +++++++++++++++++++");
        printa(@zio);
}
bash-3.00#
bash-3.00# ./zfs_check.d
[...]
  3  51875                       :tick-30s
range_lock +++++++++++++++++++
           value  ------------- Distribution ------------- count
              -1 |                                         0
               0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 3974
               1 |                                         2
               2 |                                         4
               4 |                                         2
               8 |                                         1
              16 |                                         1
              32 |                                         1
              64 |                                         5
             128 |                                         2
             256 |                                         0
             512 |                                         0
            1024 |                                         0
            2048 |                                         0
            4096 |                                         2
            8192 |                                         1
           16384 |                                         2
           32768 |                                         5
           65536 |                                         9
          131072 |                                         0

txg wait +++++++++++++++++++
           value  ------------- Distribution ------------- count
           32768 |                                         0
           65536 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 808
          131072 |                                         0

zio wait +++++++++++++++++++
           value  ------------- Distribution ------------- count
              -1 |                                         0
               0 |@@@@@@@@@@@@@@@@@@@@@@@@                 24494
               1 |                                         88
               2 |                                         112
               4 |@                                        1431
               8 |@@@@@@                                   5805
              16 |@@                                       2211
              32 |@@                                       2419
              64 |@@@                                      2692
             128 |@                                        1357
             256 |@                                        952
             512 |                                         7
            1024 |                                         0
^C

So, again - 64 seconds in txg_wait_open(), but I/Os were serviced quite
quickly. Trying to dd to a file system locally also shows the same
performance problem.

We're experiencing big performance problems due to this - any help would
be appreciated.

ps. txg_wait_open() of course doesn't deadlock, but it looks suspicious.

[Attachment: nfsd.txt.gz, application/x-gzip, 92507 bytes -
http://mail.opensolaris.org/pipermail/zfs-discuss/attachments/20070212/cf1dc5d6/attachment.bin]
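As an aside, the inline dtrace one-liner used earlier in this message is
hard to read in archived form; the same measurement can be written as a
standalone script roughly like the untested sketch below. It uses the same
probes, and, as in the one-liner, it takes arg1 of txg_wait_open() to be
the transaction group being waited for.

#!/usr/sbin/dtrace -qs

fbt::txg_wait_open:entry
/execname == "nfsd"/
{
        self->ts = timestamp;
        self->txg = arg1;
}

fbt::txg_wait_open:return
/self->ts/
{
        /* seconds spent waiting, keyed by transaction group */
        @wait[self->txg] = quantize((timestamp - self->ts) / 1000000000);
        self->ts = 0;
        self->txg = 0;
}

tick-10s
{
        printa(@wait);
}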
Robert Milkowski
2007-Feb-12 13:16 UTC
[zfs-discuss] Re: NFS/ZFS performance problems - txg_wait_open() deadlocks?
bash-3.00# dtrace -n fbt::txg_quiesce:return'{printf("%Y ",walltimestamp);}'
dtrace: description 'fbt::txg_quiesce:return' matched 1 probe
CPU     ID                    FUNCTION:NAME
  3  38168            txg_quiesce:return 2007 Feb 12 14:08:15
  0  38168            txg_quiesce:return 2007 Feb 12 14:12:14
  3  38168            txg_quiesce:return 2007 Feb 12 14:15:05
^C

Why do I not see it exactly every 5s?
On another server I get output exactly every 5s.
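A small variation on the same one-liner (untested) also prints how long it
has been since the previous txg_quiesce return, which makes the gaps
easier to see at a glance:

bash-3.00# dtrace -qn 'fbt::txg_quiesce:return
{
        this->now = timestamp;
        printf("%Y  %d s since previous quiesce\n", walltimestamp,
            last ? (this->now - last) / 1000000000 : 0);
        last = this->now;
}'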
Roch - PAE
2007-Feb-12 14:19 UTC
[zfs-discuss] Re: NFS/ZFS performance problems - txg_wait_open() deadlocks?
Robert Milkowski writes:
 > bash-3.00# dtrace -n fbt::txg_quiesce:return'{printf("%Y ",walltimestamp);}'
 > dtrace: description 'fbt::txg_quiesce:return' matched 1 probe
 > CPU     ID                    FUNCTION:NAME
 >   3  38168            txg_quiesce:return 2007 Feb 12 14:08:15
 >   0  38168            txg_quiesce:return 2007 Feb 12 14:12:14
 >   3  38168            txg_quiesce:return 2007 Feb 12 14:15:05
 > ^C
 >
 > Why do I not see it exactly every 5s?
 > On another server I get output exactly every 5s.
 >

I am not sure about this specific function, but if the question is the
same as "why is the pool syncing more often than every 5 sec", then that
can be because of a low-memory condition (if we have too much dirty memory
to sync, we don't wait the 5 seconds). See arc_tempreserve_space, around
the ERESTART return.

-r
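If anyone wants to check whether that throttle is actually firing, a
rough, untested one-liner along these lines should do. On an fbt return
probe, arg1 is the function's return value, so non-zero keys showing up
here would suggest arc_tempreserve_space is refusing reservations:

bash-3.00# dtrace -n 'fbt::arc_tempreserve_space:return{@[arg1] = count();}'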
Robert Milkowski
2007-Feb-12 14:37 UTC
[zfs-discuss] Re: NFS/ZFS performance problems - txg_wait_open() deadlocks?
Hello Roch,

Monday, February 12, 2007, 3:19:23 PM, you wrote:

RP> Robert Milkowski writes:
>> bash-3.00# dtrace -n fbt::txg_quiesce:return'{printf("%Y ",walltimestamp);}'
>> dtrace: description 'fbt::txg_quiesce:return' matched 1 probe
>> CPU     ID                    FUNCTION:NAME
>>   3  38168            txg_quiesce:return 2007 Feb 12 14:08:15
>>   0  38168            txg_quiesce:return 2007 Feb 12 14:12:14
>>   3  38168            txg_quiesce:return 2007 Feb 12 14:15:05
>> ^C
>>
>> Why do I not see it exactly every 5s?
>> On another server I get output exactly every 5s.

RP> I am not sure about this specific function but if the
RP> question is the same as why is the pool syncing more often
RP> than 5sec, then that can be because of low memory condition
RP> (if we have too much dirty memory to sync we don't wait the
RP> 5 seconds). See arc_tempreserve_space around (ERESTART).

It's the opposite - the question is why it's not syncing every 5s, but
rather every few minutes, on that server.

--
Best regards,
Robert                        mailto:rmilkowski at task.gda.pl
                              http://milek.blogspot.com
Roch - PAE
2007-Feb-12 14:54 UTC
[zfs-discuss] Re: NFS/ZFS performance problems - txg_wait_open() deadlocks?
Duh!

Long syncs (which delay the next sync) are also possible on
write-intensive workloads. Throttling heavy writers, I think, is the key
to fixing this.

Robert Milkowski writes:
 > Hello Roch,
 >
 > Monday, February 12, 2007, 3:19:23 PM, you wrote:
 >
 > RP> Robert Milkowski writes:
 > >> bash-3.00# dtrace -n fbt::txg_quiesce:return'{printf("%Y ",walltimestamp);}'
 > >> dtrace: description 'fbt::txg_quiesce:return' matched 1 probe
 > >> CPU     ID                    FUNCTION:NAME
 > >>   3  38168            txg_quiesce:return 2007 Feb 12 14:08:15
 > >>   0  38168            txg_quiesce:return 2007 Feb 12 14:12:14
 > >>   3  38168            txg_quiesce:return 2007 Feb 12 14:15:05
 > >> ^C
 > >>
 > >> Why do I not see it exactly every 5s?
 > >> On another server I get output exactly every 5s.
 >
 > RP> I am not sure about this specific function but if the
 > RP> question is the same as why is the pool syncing more often
 > RP> than 5sec, then that can be because of low memory condition
 > RP> (if we have too much dirty memory to sync we don't wait the
 > RP> 5 seconds). See arc_tempreserve_space around (ERESTART).
 >
 > It's the opposite - the question is why it's not syncing every 5s, but
 > rather every few minutes, on that server.
 >
 > --
 > Best regards,
 >  Robert                       mailto:rmilkowski at task.gda.pl
 >                               http://milek.blogspot.com
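One way to see whether long syncs really are what is delaying the next one
would be to timestamp spa_sync() directly. The following is an untested
sketch; it assumes spa_sync's second argument is the txg being synced, and
since spa_sync runs in the txg_sync_thread, thread-local variables are
fine here:

#!/usr/sbin/dtrace -qs

fbt::spa_sync:entry
{
        self->ts = timestamp;
        printf("%Y  spa_sync(txg %d) start\n", walltimestamp, arg1);
}

fbt::spa_sync:return
/self->ts/
{
        printf("%Y  spa_sync done in %d ms\n", walltimestamp,
            (timestamp - self->ts) / 1000000);
        self->ts = 0;
}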
Robert Milkowski
2007-Feb-12 15:52 UTC
[zfs-discuss] Re: NFS/ZFS performance problems - txg_wait_open() deadlocks?
Hello Roch,

Monday, February 12, 2007, 3:54:30 PM, you wrote:

RP> Duh!

RP> Long syncs (which delay the next sync) are also possible on
RP> write-intensive workloads. Throttling heavy writers, I
RP> think, is the key to fixing this.

Well, then maybe that's not the cause of our problems.
Nevertheless, 60-90s for unlink() is just plain wrong, especially when
you've got <10ms I/Os to the array, almost zero writes, plenty of CPU
free, etc.

Definitely something is wrong here.

--
Best regards,
Robert                        mailto:rmilkowski at task.gda.pl
                              http://milek.blogspot.com
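For what it's worth, the unlink latency could also be measured on the
server side with something like the untested sketch below. It assumes
zfs_remove() is the ZFS vnode op that ends up servicing these unlinks on
this build:

bash-3.00# dtrace -n '
fbt::zfs_remove:entry /execname == "nfsd"/ { self->ts = timestamp; }
fbt::zfs_remove:return /self->ts/
{
        @["zfs_remove (ms)"] = quantize((timestamp - self->ts) / 1000000);
        self->ts = 0;
}'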
eric kustarz
2007-Feb-12 18:08 UTC
[zfs-discuss] Re: NFS/ZFS performance problems - txg_wait_open() deadlocks?
On Feb 12, 2007, at 7:52 AM, Robert Milkowski wrote:

> Hello Roch,
>
> Monday, February 12, 2007, 3:54:30 PM, you wrote:
>
> RP> Duh!
>
> RP> Long syncs (which delay the next sync) are also possible on
> RP> write-intensive workloads. Throttling heavy writers, I
> RP> think, is the key to fixing this.
>
> Well, then maybe that's not the cause of our problems.
> Nevertheless, 60-90s for unlink() is just plain wrong, especially when
> you've got <10ms I/Os to the array, almost zero writes, plenty of CPU
> free, etc.
>
> Definitely something is wrong here.

Looks like spa_sync() via the txg_sync_thread thread is taking way too
long, which is causing new (NFS) requests to be delayed (such as unlink).

Is this just an NFS server, or is there local activity as well?

A complete threadlist would be interesting, as would memory usage.

Have you increased the load on this machine?  I have seen a similar
situation (new requests being blocked waiting for the sync thread to
finish), but that's only been when either 1) the hardware is broken and
taking too long or 2) the server is way overloaded.

eric
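For reference, one way to gather what is being asked for here is with the
standard tools below (which of them turns out to be most useful is a
judgment call):

bash-3.00# echo "::threadlist -v" | mdb -k > /var/tmp/threadlist.txt
bash-3.00# echo "::memstat" | mdb -k
bash-3.00# vmstat 5 5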
Robert Milkowski
2007-Feb-13 14:21 UTC
[zfs-discuss] Re: NFS/ZFS performance problems - txg_wait_open() deadlocks?
Hello eric,

Monday, February 12, 2007, 7:08:20 PM, you wrote:

ek> On Feb 12, 2007, at 7:52 AM, Robert Milkowski wrote:

>> Hello Roch,
>>
>> Monday, February 12, 2007, 3:54:30 PM, you wrote:
>>
>> RP> Duh!
>>
>> RP> Long syncs (which delay the next sync) are also possible on
>> RP> write-intensive workloads. Throttling heavy writers, I
>> RP> think, is the key to fixing this.
>>
>> Well, then maybe that's not the cause of our problems.
>> Nevertheless, 60-90s for unlink() is just plain wrong, especially when
>> you've got <10ms I/Os to the array, almost zero writes, plenty of CPU
>> free, etc.
>>
>> Definitely something is wrong here.

ek> Looks like spa_sync() via the txg_sync_thread thread is taking way
ek> too long, which is causing new (NFS) requests to be delayed (such as
ek> unlink).

ek> Is this just an NFS server, or is there local activity as well?

No local activity, no other applications - just the NFS server.

ek> A complete threadlist would be interesting, as would memory usage.

The system is not paging at all, and free memory reported by vmstat stays
at about 1GB.

All processes in the system:

bash-3.00# ps -ef
     UID   PID  PPID  C    STIME TTY      TIME CMD
    root     0     0  0   Feb 09 ?        0:44 sched
    root     1     0  0   Feb 09 ?        0:12 /sbin/init -v
    root     2     0  0   Feb 09 ?        0:00 pageout
    root     3     0  0   Feb 09 ?       42:09 fsflush
    root   245     1  0   Feb 09 ?        0:00 /usr/sbin/mdmonitord
    root     7     1  0   Feb 09 ?        0:18 /lib/svc/bin/svc.startd
    root     9     1  0   Feb 09 ?        0:49 /lib/svc/bin/svc.configd
  daemon   296     1  0   Feb 09 ?       14:20 /usr/lib/nfs/lockd
  daemon   283     1  0   Feb 09 ?        0:00 /usr/lib/nfs/statd
    root   403     1  0   Feb 09 ?        0:00 /usr/sbin/nsrexecd
    root   286     7  0   Feb 09 ?        0:00 /usr/lib/saf/sac -t 300
  daemon   282     1  0   Feb 09 ?        0:02 /usr/lib/nfs/nfsmapid
    root   159     1  0   Feb 09 ?        0:46 /usr/lib/picl/picld
    root   161     1  0   Feb 09 ?        0:30 /usr/sbin/nscd
    root 24699   380  0 16:41:19 ?        0:00 /usr/lib/ssh/sshd
  daemon   278     1  0   Feb 09 ?        0:17 /usr/sbin/rpcbind
  daemon   173     1  0   Feb 09 ?        0:01 /usr/lib/crypto/kcfd
    root   156     1  0   Feb 09 ?        0:00 /usr/lib/sysevent/syseventd
    root   295     1  0   Feb 09 ?        0:01 /usr/lib/utmpd
    root   233     1  0   Feb 09 ?        0:42 /usr/lib/inet/xntpd
    root   274     1  0   Feb 09 ?        0:03 /usr/sbin/cron
    root   328     1  0   Feb 09 ?        0:01 /usr/sbin/syslogd
    root   404   403  0   Feb 09 ?        0:01 /usr/sbin/nsrexecd
    root   187     1  0   Feb 09 ?        3:14 /usr/lib/inet/in.mpathd -a
    root   179     1  0   Feb 09 ?        0:00 devfsadmd
    root   298     1  0   Feb 09 ?        0:07 /usr/lib/inet/inetd start
    root   299     7  0   Feb 09 console  0:00 /usr/lib/saf/ttymon -g -d /dev/console -l console -T sun -m ldterm,ttcompat -h
    root   301   286  0   Feb 09 ?        0:00 /usr/lib/saf/ttymon
    root   378     1  0   Feb 09 ?        0:07 /usr/lib/fm/fmd/fmd
    root   380     1  0   Feb 09 ?        0:00 /usr/lib/ssh/sshd
    root   588     1  0   Feb 09 ?       23:46 /usr/jdk/jdk1.5.0_09/bin/java -Xms4M -Xmx64M -classpath /usr/share/lib/jdmk/jdm
    root   801     1  0   Feb 09 ?        0:01 /usr/lib/nfs/mountd
    root 24702 24699  0 16:41:20 ?        0:00 /usr/lib/ssh/sshd
    root 19793     1  0   Feb 10 ?        4:36 /usr/sfw/sbin/snmpd
    root  8598   380  0 14:49:02 ?        0:00 /usr/lib/ssh/sshd
    root 22686 22680  0   Feb 12 pts/1    0:00 bash
    root 22678 22671  0   Feb 12 ?        0:02 /usr/lib/ssh/sshd
    root 24704 24702  0 16:41:20 pts/2    0:00 -bash
    root  8638  8636  0 14:49:05 pts/3    0:00 -sh
    root  8647  8645  0 14:49:11 pts/3    0:00 ps -ef
    root  8645  8638  0 14:49:06 pts/3    0:00 bash
    root 22680 22678  0   Feb 12 pts/1    0:00 -sh
    root  8636  8598  0 14:49:05 ?        0:00 /usr/lib/ssh/sshd
    root 22671   380  0   Feb 12 ?        0:00 /usr/lib/ssh/sshd
  daemon 28297     1  0 22:17:32 ?      173:25 /usr/lib/nfs/nfsd
bash-3.00#

ek> Have you increased the load on this machine?  I have seen a similar
ek> situation (new requests being blocked waiting for the sync thread to
ek> finish), but that's only been when either 1) the hardware is broken
ek> and taking too long or 2) the server is way overloaded.

I don't think the HW is broken - it's the same situation on two v440
servers and a T2000 server. iostat doesn't show any problems accessing
the disks (no queues, short service times, etc.).

We moved the workload from a v440 with 8GB of RAM to a T2000 with 32GB of
RAM, and for many hours it was working just great. We did try to stop
nfsd on the T2000 and it exited within a second or two - it looked almost
like everything was working great. But then the next day (today) we
started experiencing the same problems - long I/Os (dozens of seconds) -
so we decided to stop nfsd; this time it took over 20 minutes for all
threads to complete. Then we did 'zpool export f3-2' and it took 59
minutes to complete!! See the other thread here I've just started
("[zfs-discuss] zpool export consumes whole CPU and takes more than 30
minutes to complete").

Looks like for some reason ZFS isn't able to complete all writes to
disks. More memory just delayed the problem, and zil_disable set to 1
mitigates the problem for some time, until ZFS has filled up all memory
and has to wait for data to be written to disk; then NFS operations start
to take 30-90s, sometimes even much more. Then you've got a problem with
stopping nfsd, or exporting a pool (a different question is why, during
export, an entire CPU is consumed by ZFS, which is the limiting factor).

The same on S10U3 and snv_54.

Something is really broken here. Maybe the hardware...

--
Best regards,
Robert                        mailto:rmilkowski at task.gda.pl
                              http://milek.blogspot.com
eric kustarz
2007-Feb-14 08:50 UTC
[zfs-discuss] Re: NFS/ZFS performance problems - txg_wait_open() deadlocks?
> ek> Have you increased the load on this machine?  I have seen a similar
> ek> situation (new requests being blocked waiting for the sync thread to
> ek> finish), but that's only been when either 1) the hardware is broken
> ek> and taking too long or 2) the server is way overloaded.
>
> I don't think the HW is broken - it's the same situation on two v440
> servers and a T2000 server. iostat doesn't show any problems accessing
> the disks (no queues, short service times, etc.).
>
> We moved the workload from a v440 with 8GB of RAM to a T2000 with 32GB
> of RAM, and for many hours it was working just great. We did try to stop
> nfsd on the T2000 and it exited within a second or two - it looked
> almost like everything was working great. But then the next day (today)
> we started experiencing the same problems - long I/Os (dozens of
> seconds) - so we decided to stop nfsd; this time it took over 20 minutes
> for all threads to complete. Then we did 'zpool export f3-2' and it took
> 59 minutes to complete!! See the other thread here I've just started
> ("[zfs-discuss] zpool export consumes whole CPU and takes more than 30
> minutes to complete").
>
> Looks like for some reason ZFS isn't able to complete all writes to
> disks. More memory just delayed the problem, and zil_disable set to 1
> mitigates the problem for some time, until ZFS has filled up all memory
> and has to wait for data to be written to disk; then NFS operations
> start to take 30-90s, sometimes even much more. Then you've got a
> problem with stopping nfsd, or exporting a pool (a different question is
> why, during export, an entire CPU is consumed by ZFS, which is the
> limiting factor).
>
> The same on S10U3 and snv_54.
>

So dtracing metaslab_ff_alloc() would be a good way to know if you're
hitting:

6495013 Loops and recursion in metaslab_ff_alloc can kill performance,
even on a pool with lots of free data

A dscript mentioned in the bug report (i believe the non-public part) is:
"
#!/usr/sbin/dtrace -s

#pragma D option quiet

BEGIN
{
        self->in_metaslab = 0;
}

fbt::metaslab_ff_alloc:entry
/self->in_metaslab == 0/
{
        self->in_metaslab = 1;
        self->loopcount = 0;
}

fbt::avl_walk:entry
/self->in_metaslab/
{
        self->loopcount++;
}

fbt::metaslab_ff_alloc:return
/self->in_metaslab/
{
        self->in_metaslab = 0;
        @loops["Loops count"] = quantize(self->loopcount);
        self->loopcount = 0;
}
"

Now, note, this dscript isn't perfect as it doesn't take recursion into
account, but feel free to tweak it if/as you like. If you're seeing lots
of avl_walk() calls per metaslab_ff_alloc() call then it's the above bug.

eric
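If recursion does matter here, one possible tweak (an untested sketch, not
the script from the bug report) is to track call depth, so that a
recursive metaslab_ff_alloc() call tree gets counted as a single
allocation:

#!/usr/sbin/dtrace -s

#pragma D option quiet

fbt::metaslab_ff_alloc:entry
{
        self->depth++;
}

fbt::metaslab_ff_alloc:entry
/self->depth == 1/
{
        /* outermost call: start a fresh count */
        self->loopcount = 0;
}

fbt::avl_walk:entry
/self->depth/
{
        self->loopcount++;
}

fbt::metaslab_ff_alloc:return
/self->depth == 1/
{
        /* outermost return: record avl_walk calls for the whole tree */
        @loops["Loops count (incl. recursion)"] = quantize(self->loopcount);
        self->loopcount = 0;
}

fbt::metaslab_ff_alloc:return
/self->depth/
{
        self->depth--;
}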
Robert Milkowski
2007-Feb-14 09:57 UTC
[zfs-discuss] Re: NFS/ZFS performance problems - txg_wait_open() deadlocks?
Hello eric,

Wednesday, February 14, 2007, 9:50:29 AM, you wrote:

>> ek> Have you increased the load on this machine?  I have seen a similar
>> ek> situation (new requests being blocked waiting for the sync thread
>> ek> to finish), but that's only been when either 1) the hardware is
>> ek> broken and taking too long or 2) the server is way overloaded.
>>
>> I don't think the HW is broken - it's the same situation on two v440
>> servers and a T2000 server. iostat doesn't show any problems accessing
>> the disks (no queues, short service times, etc.).
>>
>> We moved the workload from a v440 with 8GB of RAM to a T2000 with 32GB
>> of RAM, and for many hours it was working just great. We did try to
>> stop nfsd on the T2000 and it exited within a second or two - it looked
>> almost like everything was working great. But then the next day (today)
>> we started experiencing the same problems - long I/Os (dozens of
>> seconds) - so we decided to stop nfsd; this time it took over 20
>> minutes for all threads to complete. Then we did 'zpool export f3-2'
>> and it took 59 minutes to complete!! See the other thread here I've
>> just started ("[zfs-discuss] zpool export consumes whole CPU and takes
>> more than 30 minutes to complete").
>>
>> Looks like for some reason ZFS isn't able to complete all writes to
>> disks. More memory just delayed the problem, and zil_disable set to 1
>> mitigates the problem for some time, until ZFS has filled up all memory
>> and has to wait for data to be written to disk; then NFS operations
>> start to take 30-90s, sometimes even much more. Then you've got a
>> problem with stopping nfsd, or exporting a pool (a different question
>> is why, during export, an entire CPU is consumed by ZFS, which is the
>> limiting factor).
>>
>> The same on S10U3 and snv_54.

ek> So dtracing metaslab_ff_alloc() would be a good way to know if you're
ek> hitting:

ek> 6495013 Loops and recursion in metaslab_ff_alloc can kill
ek> performance, even on a pool with lots of free data

ek> A dscript mentioned in the bug report (i believe the non-public part)
ek> is:

ek> "
ek> #!/usr/sbin/dtrace -s
ek>
ek> #pragma D option quiet
ek>
ek> BEGIN
ek> {
ek>         self->in_metaslab = 0;
ek> }
ek>
ek> fbt::metaslab_ff_alloc:entry
ek> /self->in_metaslab == 0/
ek> {
ek>         self->in_metaslab = 1;
ek>         self->loopcount = 0;
ek> }
ek>
ek> fbt::avl_walk:entry
ek> /self->in_metaslab/
ek> {
ek>         self->loopcount++;
ek> }
ek>
ek> fbt::metaslab_ff_alloc:return
ek> /self->in_metaslab/
ek> {
ek>         self->in_metaslab = 0;
ek>         @loops["Loops count"] = quantize(self->loopcount);
ek>         self->loopcount = 0;
ek> }
ek> "

ek> Now, note, this dscript isn't perfect as it doesn't take recursion
ek> into account, but feel free to tweak it if/as you like. If you're
ek> seeing lots of avl_walk() calls per metaslab_ff_alloc() call then
ek> it's the above bug.

I've been using it in another CR where destroying one of the snapshots
was helping the performance. Nevertheless, here it is on that server.

Short period of time:

bash-3.00# ./metaslab-6495013.d
^C

  Loops count
           value  ------------- Distribution ------------- count
              -1 |                                         0
               0 |@@@@@@@@@@@@@@@@@@@@@@@@@@               17674
               1 |@@@@@@@                                  4418
               2 |@@@                                      2123
               4 |@@                                       1257
               8 |@                                        753
              16 |@                                        416
              32 |                                         220
              64 |                                         103
             128 |                                         58
             256 |                                         38
             512 |                                         21
            1024 |                                         13
            2048 |                                         10
            4096 |                                         8
            8192 |                                         3
           16384 |                                         3
           32768 |                                         2
           65536 |                                         1
          131072 |                                         26
          262144 |                                         7
          524288 |                                         0

bash-3.00#

Looks like that's it.

Here's another server with a similar problem:

bash-3.00# ./metaslab-6495013.d
^C

  Loops count
           value  ------------- Distribution ------------- count
              -1 |                                         0
               0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@             1648
               1 |@@@                                      197
               2 |@@                                       93
               4 |@                                        56
               8 |@                                        45
              16 |@                                        36
              32 |                                         28
              64 |@                                        32
             128 |                                         18
             256 |                                         28
             512 |                                         21
            1024 |                                         20
            2048 |                                         10
            4096 |                                         25
            8192 |                                         9
           16384 |                                         9
           32768 |                                         10
           65536 |                                         4
          131072 |                                         11
          262144 |@                                        47
          524288 |                                         0

bash-3.00#

Also, these results are not from peak hours, so it will get much worse
later... :((

IIRC you've written before that someone is actively working on it right
now, right? Any update? Any approx. ETA? I would like to test it ASAP,
even before putback.

--
Best regards,
Robert                        mailto:rmilkowski at task.gda.pl
                              http://milek.blogspot.com
eric kustarz
2007-Feb-14 16:04 UTC
[zfs-discuss] Re: NFS/ZFS performance problems - txg_wait_open() deadlocks?
> I've been using it in another CR where destroying one of the snapshots
> was helping the performance. Nevertheless, here it is on that server.
>
> Short period of time:
>
> bash-3.00# ./metaslab-6495013.d
> ^C
>
>   Loops count
>            value  ------------- Distribution ------------- count
>               -1 |                                         0
>                0 |@@@@@@@@@@@@@@@@@@@@@@@@@@               17674
>                1 |@@@@@@@                                  4418
>                2 |@@@                                      2123
>                4 |@@                                       1257
>                8 |@                                        753
>               16 |@                                        416
>               32 |                                         220
>               64 |                                         103
>              128 |                                         58
>              256 |                                         38
>              512 |                                         21
>             1024 |                                         13
>             2048 |                                         10
>             4096 |                                         8
>             8192 |                                         3
>            16384 |                                         3
>            32768 |                                         2
>            65536 |                                         1
>           131072 |                                         26
>           262144 |                                         7
>           524288 |                                         0
>
> bash-3.00#
>
> Looks like that's it.

Yeah, sometimes doing over 200,000 avl_walks isn't going to be good.

...

> IIRC you've written before that someone is actively working on it right
> now, right? Any update? Any approx. ETA? I would like to test it ASAP,
> even before putback.

I believe this is next on George's list (he's got a couple of fixes he
needs to putback first).

So since you're in this state, would you mind seeing the 'size' (arg1)
distribution instead of avl_walk()s for metaslab_ff_claim()? Something
like:

"
#!/usr/sbin/dtrace -s

#pragma D option quiet

BEGIN
{
        self->in_metaslab = 0;
}

fbt::metaslab_ff_alloc:entry
/self->in_metaslab == 0/
{
        self->in_metaslab = 1;
        @sizes["metaslab sizes"] = quantize(arg1);
}

fbt::metaslab_ff_alloc:return
/self->in_metaslab/
{
        self->in_metaslab = 0;
}
"

I'm wondering if we can just lower the amount of space we're trying to
alloc as the pool becomes more fragmented - we'll lose a little I/O
performance, but it should limit this bug.

eric
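If the full script feels like overkill, the same distribution can probably
be grabbed with a one-liner (untested; it relies on the same assumption as
the script above, namely that arg1 of metaslab_ff_alloc() is the requested
allocation size in bytes):

bash-3.00# dtrace -n 'fbt::metaslab_ff_alloc:entry{@["alloc size (bytes)"] = quantize(arg1);}'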
Robert Milkowski
2007-Feb-15 13:47 UTC
[zfs-discuss] Re: Re[8]: Re: NFS/ZFS performance problems - txg_wait_open() deadlocks?
bash-3.00# ./metaslab-6495013-2.d
^C

  metaslab sizes
           value  ------------- Distribution ------------- count
             256 |                                         0
             512 |@@@@@@@@@@@@                             9031
            1024 |@@@@@@@@@@                               7462
            2048 |@@@@@@@@@                                6832
            4096 |@@@                                      2242
            8192 |@                                        783
           16384 |@                                        934
           32768 |@                                        422
           65536 |@@                                       1463
          131072 |@@                                       1467
          262144 |                                         0

bash-3.00#
Robert Milkowski
2007-Feb-15 14:08 UTC
[zfs-discuss] Re: NFS/ZFS performance problems - txg_wait_open() deadlocks?
Hello eric,

Wednesday, February 14, 2007, 5:04:01 PM, you wrote:

ek> I'm wondering if we can just lower the amount of space we're trying
ek> to alloc as the pool becomes more fragmented - we'll lose a little
ek> I/O performance, but it should limit this bug.

Do you think that zfs send|recv for such file systems could help for a
while? (Would it defragment the data?)

--
Best regards,
Robert                        mailto:rmilkowski at task.gda.pl
                              http://milek.blogspot.com
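For anyone following along, the kind of round trip being discussed here
looks roughly like the sketch below; the dataset name "somefs", the target
pool "backup" and the host name "x4500" are only placeholders, and in
practice this would be scripted per file system:

# take a snapshot and copy the file system to another box
zfs snapshot f3-2/somefs@migrate
zfs send f3-2/somefs@migrate | ssh x4500 zfs recv backup/somefs

# once clients are off it, destroy the original copy (and its snapshots)...
zfs destroy -r f3-2/somefs

# ...and send it back, so the data gets rewritten (and, hopefully, less fragmented)
ssh x4500 zfs send backup/somefs@migrate | zfs recv f3-2/somefs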
Robert Milkowski
2007-Feb-15 14:25 UTC
[zfs-discuss] Re: Re[8]: Re: NFS/ZFS performance problems - txg_wait_open() deadlocks?
To confirm again if it''s avl_walk I stopped nfsd (which took a while) then issued sync() which is still running after 10 minutes. I observer that for many seconds zpool iostta (or iostta) do not show any or almost any disk traffic, then some megabytes, etc. However all the time one CPU is 100% in SYS. Roght now cpu id=24 is 100% in SYS so: bash-3.00# dtrace -n profile-1003''/cpu == 24/{@[cpu,stack()]=count();}'' -n tick-10s''{trunc(@,33);printa(@);exit(0);}'' dtrace: description ''profile-1003'' matched 1 probe dtrace: description ''tick-10s'' matched 1 probe CPU ID FUNCTION:NAME 30 49714 :tick-10s 24 zfs`metaslab_ff_alloc+0xac zfs`space_map_alloc+0x10 zfs`metaslab_group_alloc+0x1e0 zfs`metaslab_alloc_dva+0x114 zfs`metaslab_alloc+0x2c zfs`zio_dva_allocate+0x4c zfs`zio_write_compress+0x1ec zfs`arc_write+0xe4 zfs`dbuf_sync+0x6c0 zfs`dnode_sync+0x35c zfs`dmu_objset_sync_dnodes+0x6c zfs`dmu_objset_sync+0x54 zfs`dsl_dataset_sync+0xc zfs`dsl_pool_sync+0x64 zfs`spa_sync+0x1b0 zfs`txg_sync_thread+0x134 unix`thread_start+0x4 11 24 genunix`avl_walk+0x8 zfs`metaslab_ff_alloc+0x9c zfs`space_map_alloc+0x10 zfs`metaslab_group_alloc+0x1e0 zfs`metaslab_alloc_dva+0x114 zfs`metaslab_alloc+0x2c zfs`zio_dva_allocate+0x4c zfs`zio_write_compress+0x1ec zfs`arc_write+0xe4 zfs`dbuf_sync+0x6c0 zfs`dnode_sync+0x35c zfs`dmu_objset_sync_dnodes+0x6c zfs`dmu_objset_sync+0x54 zfs`dsl_dataset_sync+0xc zfs`dsl_pool_sync+0x64 zfs`spa_sync+0x1b0 zfs`txg_sync_thread+0x134 unix`thread_start+0x4 12 24 genunix`avl_walk+0x3c zfs`metaslab_ff_alloc+0x9c zfs`space_map_alloc+0x10 zfs`metaslab_group_alloc+0x1e0 zfs`metaslab_alloc_dva+0x114 zfs`metaslab_alloc+0x2c zfs`zio_dva_allocate+0x4c zfs`zio_write_compress+0x1ec zfs`arc_write+0xe4 zfs`dbuf_sync+0x6c0 zfs`dnode_sync+0x35c zfs`dmu_objset_sync_dnodes+0x6c zfs`dmu_objset_sync+0x54 zfs`dsl_dataset_sync+0xc zfs`dsl_pool_sync+0x64 zfs`spa_sync+0x1b0 zfs`txg_sync_thread+0x134 unix`thread_start+0x4 12 24 genunix`avl_walk zfs`metaslab_ff_alloc+0x9c zfs`space_map_alloc+0x10 zfs`metaslab_group_alloc+0x1e0 zfs`metaslab_alloc_dva+0x114 zfs`metaslab_alloc+0x2c zfs`zio_dva_allocate+0x4c zfs`zio_write_compress+0x1ec zfs`arc_write+0xe4 zfs`dbuf_sync+0x6c0 zfs`dnode_sync+0x35c zfs`dmu_objset_sync_dnodes+0x6c zfs`dmu_objset_sync+0x54 zfs`dsl_dataset_sync+0xc zfs`dsl_pool_sync+0x64 zfs`spa_sync+0x1b0 zfs`txg_sync_thread+0x134 unix`thread_start+0x4 14 24 genunix`avl_walk+0x70 zfs`metaslab_ff_alloc+0x9c zfs`space_map_alloc+0x10 zfs`metaslab_group_alloc+0x1e0 zfs`metaslab_alloc_dva+0x114 zfs`metaslab_alloc+0x2c zfs`zio_dva_allocate+0x4c zfs`zio_write_compress+0x1ec zfs`arc_write+0xe4 zfs`dbuf_sync+0x6c0 zfs`dnode_sync+0x35c zfs`dmu_objset_sync_dnodes+0x6c zfs`dmu_objset_sync+0x54 zfs`dsl_dataset_sync+0xc zfs`dsl_pool_sync+0x64 zfs`spa_sync+0x1b0 zfs`txg_sync_thread+0x134 unix`thread_start+0x4 14 24 genunix`avl_walk+0x88 zfs`metaslab_ff_alloc+0x9c zfs`space_map_alloc+0x10 zfs`metaslab_group_alloc+0x1e0 zfs`metaslab_alloc_dva+0x114 zfs`metaslab_alloc+0x2c zfs`zio_dva_allocate+0x4c zfs`zio_write_compress+0x1ec zfs`arc_write+0xe4 zfs`dbuf_sync+0x6c0 zfs`dnode_sync+0x35c zfs`dmu_objset_sync_dnodes+0x6c zfs`dmu_objset_sync+0x54 zfs`dsl_dataset_sync+0xc zfs`dsl_pool_sync+0x64 zfs`spa_sync+0x1b0 zfs`txg_sync_thread+0x134 unix`thread_start+0x4 14 24 genunix`avl_walk+0x98 zfs`metaslab_ff_alloc+0x9c zfs`space_map_alloc+0x10 zfs`metaslab_group_alloc+0x1e0 zfs`metaslab_alloc_dva+0x114 zfs`metaslab_alloc+0x2c zfs`zio_dva_allocate+0x4c zfs`zio_write_compress+0x1ec zfs`arc_write+0xe4 zfs`dbuf_sync+0x6c0 
zfs`dnode_sync+0x35c zfs`dmu_objset_sync_dnodes+0x6c zfs`dmu_objset_sync+0x54 zfs`dsl_dataset_sync+0xc zfs`dsl_pool_sync+0x64 zfs`spa_sync+0x1b0 zfs`txg_sync_thread+0x134 unix`thread_start+0x4 16 24 genunix`avl_walk+0x1c zfs`metaslab_ff_alloc+0x9c zfs`space_map_alloc+0x10 zfs`metaslab_group_alloc+0x1e0 zfs`metaslab_alloc_dva+0x114 zfs`metaslab_alloc+0x2c zfs`zio_dva_allocate+0x4c zfs`zio_write_compress+0x1ec zfs`arc_write+0xe4 zfs`dbuf_sync+0x6c0 zfs`dnode_sync+0x35c zfs`dmu_objset_sync_dnodes+0x6c zfs`dmu_objset_sync+0x54 zfs`dsl_dataset_sync+0xc zfs`dsl_pool_sync+0x64 zfs`spa_sync+0x1b0 zfs`txg_sync_thread+0x134 unix`thread_start+0x4 17 24 genunix`avl_walk+0x68 zfs`metaslab_ff_alloc+0x9c zfs`space_map_alloc+0x10 zfs`metaslab_group_alloc+0x1e0 zfs`metaslab_alloc_dva+0x114 zfs`metaslab_alloc+0x2c zfs`zio_dva_allocate+0x4c zfs`zio_write_compress+0x1ec zfs`arc_write+0xe4 zfs`dbuf_sync+0x6c0 zfs`dnode_sync+0x35c zfs`dmu_objset_sync_dnodes+0x6c zfs`dmu_objset_sync+0x54 zfs`dsl_dataset_sync+0xc zfs`dsl_pool_sync+0x64 zfs`spa_sync+0x1b0 zfs`txg_sync_thread+0x134 unix`thread_start+0x4 17 24 zfs`metaslab_ff_alloc+0xa8 zfs`space_map_alloc+0x10 zfs`metaslab_group_alloc+0x1e0 zfs`metaslab_alloc_dva+0x114 zfs`metaslab_alloc+0x2c zfs`zio_dva_allocate+0x4c zfs`zio_write_compress+0x1ec zfs`arc_write+0xe4 zfs`dbuf_sync+0x6c0 zfs`dnode_sync+0x35c zfs`dmu_objset_sync_dnodes+0x6c zfs`dmu_objset_sync+0x54 zfs`dsl_dataset_sync+0xc zfs`dsl_pool_sync+0x64 zfs`spa_sync+0x1b0 zfs`txg_sync_thread+0x134 unix`thread_start+0x4 17 24 genunix`avl_walk+0x54 zfs`metaslab_ff_alloc+0x9c zfs`space_map_alloc+0x10 zfs`metaslab_group_alloc+0x1e0 zfs`metaslab_alloc_dva+0x114 zfs`metaslab_alloc+0x2c zfs`zio_dva_allocate+0x4c zfs`zio_write_compress+0x1ec zfs`arc_write+0xe4 zfs`dbuf_sync+0x6c0 zfs`dnode_sync+0x35c zfs`dmu_objset_sync_dnodes+0x6c zfs`dmu_objset_sync+0x54 zfs`dsl_dataset_sync+0xc zfs`dsl_pool_sync+0x64 zfs`spa_sync+0x1b0 zfs`txg_sync_thread+0x134 unix`thread_start+0x4 18 24 genunix`avl_walk+0xac zfs`metaslab_ff_alloc+0x9c zfs`space_map_alloc+0x10 zfs`metaslab_group_alloc+0x1e0 zfs`metaslab_alloc_dva+0x114 zfs`metaslab_alloc+0x2c zfs`zio_dva_allocate+0x4c zfs`zio_write_compress+0x1ec zfs`arc_write+0xe4 zfs`dbuf_sync+0x6c0 zfs`dnode_sync+0x35c zfs`dmu_objset_sync_dnodes+0x6c zfs`dmu_objset_sync+0x54 zfs`dsl_dataset_sync+0xc zfs`dsl_pool_sync+0x64 zfs`spa_sync+0x1b0 zfs`txg_sync_thread+0x134 unix`thread_start+0x4 18 24 zfs`metaslab_ff_alloc+0xa4 zfs`space_map_alloc+0x10 zfs`metaslab_group_alloc+0x1e0 zfs`metaslab_alloc_dva+0x114 zfs`metaslab_alloc+0x2c zfs`zio_dva_allocate+0x4c zfs`zio_write_compress+0x1ec zfs`arc_write+0xe4 zfs`dbuf_sync+0x6c0 zfs`dnode_sync+0x35c zfs`dmu_objset_sync_dnodes+0x6c zfs`dmu_objset_sync+0x54 zfs`dsl_dataset_sync+0xc zfs`dsl_pool_sync+0x64 zfs`spa_sync+0x1b0 zfs`txg_sync_thread+0x134 unix`thread_start+0x4 21 24 genunix`avl_walk+0x78 zfs`metaslab_ff_alloc+0x9c zfs`space_map_alloc+0x10 zfs`metaslab_group_alloc+0x1e0 zfs`metaslab_alloc_dva+0x114 zfs`metaslab_alloc+0x2c zfs`zio_dva_allocate+0x4c zfs`zio_write_compress+0x1ec zfs`arc_write+0xe4 zfs`dbuf_sync+0x6c0 zfs`dnode_sync+0x35c zfs`dmu_objset_sync_dnodes+0x6c zfs`dmu_objset_sync+0x54 zfs`dsl_dataset_sync+0xc zfs`dsl_pool_sync+0x64 zfs`spa_sync+0x1b0 zfs`txg_sync_thread+0x134 unix`thread_start+0x4 22 24 genunix`avl_walk+0x84 zfs`metaslab_ff_alloc+0x9c zfs`space_map_alloc+0x10 zfs`metaslab_group_alloc+0x1e0 zfs`metaslab_alloc_dva+0x114 zfs`metaslab_alloc+0x2c zfs`zio_dva_allocate+0x4c zfs`zio_write_compress+0x1ec zfs`arc_write+0xe4 
zfs`dbuf_sync+0x6c0 zfs`dnode_sync+0x35c zfs`dmu_objset_sync_dnodes+0x6c zfs`dmu_objset_sync+0x54 zfs`dsl_dataset_sync+0xc zfs`dsl_pool_sync+0x64 zfs`spa_sync+0x1b0 zfs`txg_sync_thread+0x134 unix`thread_start+0x4 22 24 zfs`metaslab_ff_alloc+0x8c zfs`space_map_alloc+0x10 zfs`metaslab_group_alloc+0x1e0 zfs`metaslab_alloc_dva+0x114 zfs`metaslab_alloc+0x2c zfs`zio_dva_allocate+0x4c zfs`zio_write_compress+0x1ec zfs`arc_write+0xe4 zfs`dbuf_sync+0x6c0 zfs`dnode_sync+0x35c zfs`dmu_objset_sync_dnodes+0x6c zfs`dmu_objset_sync+0x54 zfs`dsl_dataset_sync+0xc zfs`dsl_pool_sync+0x64 zfs`spa_sync+0x1b0 zfs`txg_sync_thread+0x134 unix`thread_start+0x4 24 24 genunix`avl_walk+0x80 zfs`metaslab_ff_alloc+0x9c zfs`space_map_alloc+0x10 zfs`metaslab_group_alloc+0x1e0 zfs`metaslab_alloc_dva+0x114 zfs`metaslab_alloc+0x2c zfs`zio_dva_allocate+0x4c zfs`zio_write_compress+0x1ec zfs`arc_write+0xe4 zfs`dbuf_sync+0x6c0 zfs`dnode_sync+0x35c zfs`dmu_objset_sync_dnodes+0x6c zfs`dmu_objset_sync+0x54 zfs`dsl_dataset_sync+0xc zfs`dsl_pool_sync+0x64 zfs`spa_sync+0x1b0 zfs`txg_sync_thread+0x134 unix`thread_start+0x4 26 24 zfs`metaslab_ff_alloc+0xa0 zfs`space_map_alloc+0x10 zfs`metaslab_group_alloc+0x1e0 zfs`metaslab_alloc_dva+0x114 zfs`metaslab_alloc+0x2c zfs`zio_dva_allocate+0x4c zfs`zio_write_compress+0x1ec zfs`arc_write+0xe4 zfs`dbuf_sync+0x6c0 zfs`dnode_sync+0x35c zfs`dmu_objset_sync_dnodes+0x6c zfs`dmu_objset_sync+0x54 zfs`dsl_dataset_sync+0xc zfs`dsl_pool_sync+0x64 zfs`spa_sync+0x1b0 zfs`txg_sync_thread+0x134 unix`thread_start+0x4 37 24 genunix`avl_walk+0x2c zfs`metaslab_ff_alloc+0x9c zfs`space_map_alloc+0x10 zfs`metaslab_group_alloc+0x1e0 zfs`metaslab_alloc_dva+0x114 zfs`metaslab_alloc+0x2c zfs`zio_dva_allocate+0x4c zfs`zio_write_compress+0x1ec zfs`arc_write+0xe4 zfs`dbuf_sync+0x6c0 zfs`dnode_sync+0x35c zfs`dmu_objset_sync_dnodes+0x6c zfs`dmu_objset_sync+0x54 zfs`dsl_dataset_sync+0xc zfs`dsl_pool_sync+0x64 zfs`spa_sync+0x1b0 zfs`txg_sync_thread+0x134 unix`thread_start+0x4 39 24 zfs`metaslab_ff_alloc+0x98 zfs`space_map_alloc+0x10 zfs`metaslab_group_alloc+0x1e0 zfs`metaslab_alloc_dva+0x114 zfs`metaslab_alloc+0x2c zfs`zio_dva_allocate+0x4c zfs`zio_write_compress+0x1ec zfs`arc_write+0xe4 zfs`dbuf_sync+0x6c0 zfs`dnode_sync+0x35c zfs`dmu_objset_sync_dnodes+0x6c zfs`dmu_objset_sync+0x54 zfs`dsl_dataset_sync+0xc zfs`dsl_pool_sync+0x64 zfs`spa_sync+0x1b0 zfs`txg_sync_thread+0x134 unix`thread_start+0x4 40 24 genunix`avl_walk+0x4 zfs`metaslab_ff_alloc+0x9c zfs`space_map_alloc+0x10 zfs`metaslab_group_alloc+0x1e0 zfs`metaslab_alloc_dva+0x114 zfs`metaslab_alloc+0x2c zfs`zio_dva_allocate+0x4c zfs`zio_write_compress+0x1ec zfs`arc_write+0xe4 zfs`dbuf_sync+0x6c0 zfs`dnode_sync+0x35c zfs`dmu_objset_sync_dnodes+0x6c zfs`dmu_objset_sync+0x54 zfs`dsl_dataset_sync+0xc zfs`dsl_pool_sync+0x64 zfs`spa_sync+0x1b0 zfs`txg_sync_thread+0x134 unix`thread_start+0x4 41 24 genunix`avl_walk+0xc zfs`metaslab_ff_alloc+0x9c zfs`space_map_alloc+0x10 zfs`metaslab_group_alloc+0x1e0 zfs`metaslab_alloc_dva+0x114 zfs`metaslab_alloc+0x2c zfs`zio_dva_allocate+0x4c zfs`zio_write_compress+0x1ec zfs`arc_write+0xe4 zfs`dbuf_sync+0x6c0 zfs`dnode_sync+0x35c zfs`dmu_objset_sync_dnodes+0x6c zfs`dmu_objset_sync+0x54 zfs`dsl_dataset_sync+0xc zfs`dsl_pool_sync+0x64 zfs`spa_sync+0x1b0 zfs`txg_sync_thread+0x134 unix`thread_start+0x4 43 24 genunix`avl_walk+0x90 zfs`metaslab_ff_alloc+0x9c zfs`space_map_alloc+0x10 zfs`metaslab_group_alloc+0x1e0 zfs`metaslab_alloc_dva+0x114 zfs`metaslab_alloc+0x2c zfs`zio_dva_allocate+0x4c zfs`zio_write_compress+0x1ec zfs`arc_write+0xe4 
zfs`dbuf_sync+0x6c0 zfs`dnode_sync+0x35c zfs`dmu_objset_sync_dnodes+0x6c zfs`dmu_objset_sync+0x54 zfs`dsl_dataset_sync+0xc zfs`dsl_pool_sync+0x64 zfs`spa_sync+0x1b0 zfs`txg_sync_thread+0x134 unix`thread_start+0x4 48 24 genunix`avl_walk+0xa4 zfs`metaslab_ff_alloc+0x9c zfs`space_map_alloc+0x10 zfs`metaslab_group_alloc+0x1e0 zfs`metaslab_alloc_dva+0x114 zfs`metaslab_alloc+0x2c zfs`zio_dva_allocate+0x4c zfs`zio_write_compress+0x1ec zfs`arc_write+0xe4 zfs`dbuf_sync+0x6c0 zfs`dnode_sync+0x35c zfs`dmu_objset_sync_dnodes+0x6c zfs`dmu_objset_sync+0x54 zfs`dsl_dataset_sync+0xc zfs`dsl_pool_sync+0x64 zfs`spa_sync+0x1b0 zfs`txg_sync_thread+0x134 unix`thread_start+0x4 50 24 zfs`metaslab_ff_alloc+0xb0 zfs`space_map_alloc+0x10 zfs`metaslab_group_alloc+0x1e0 zfs`metaslab_alloc_dva+0x114 zfs`metaslab_alloc+0x2c zfs`zio_dva_allocate+0x4c zfs`zio_write_compress+0x1ec zfs`arc_write+0xe4 zfs`dbuf_sync+0x6c0 zfs`dnode_sync+0x35c zfs`dmu_objset_sync_dnodes+0x6c zfs`dmu_objset_sync+0x54 zfs`dsl_dataset_sync+0xc zfs`dsl_pool_sync+0x64 zfs`spa_sync+0x1b0 zfs`txg_sync_thread+0x134 unix`thread_start+0x4 69 24 genunix`avl_walk+0xa0 zfs`metaslab_ff_alloc+0x9c zfs`space_map_alloc+0x10 zfs`metaslab_group_alloc+0x1e0 zfs`metaslab_alloc_dva+0x114 zfs`metaslab_alloc+0x2c zfs`zio_dva_allocate+0x4c zfs`zio_write_compress+0x1ec zfs`arc_write+0xe4 zfs`dbuf_sync+0x6c0 zfs`dnode_sync+0x35c zfs`dmu_objset_sync_dnodes+0x6c zfs`dmu_objset_sync+0x54 zfs`dsl_dataset_sync+0xc zfs`dsl_pool_sync+0x64 zfs`spa_sync+0x1b0 zfs`txg_sync_thread+0x134 unix`thread_start+0x4 81 24 genunix`avl_walk+0x24 zfs`metaslab_ff_alloc+0x9c zfs`space_map_alloc+0x10 zfs`metaslab_group_alloc+0x1e0 zfs`metaslab_alloc_dva+0x114 zfs`metaslab_alloc+0x2c zfs`zio_dva_allocate+0x4c zfs`zio_write_compress+0x1ec zfs`arc_write+0xe4 zfs`dbuf_sync+0x6c0 zfs`dnode_sync+0x35c zfs`dmu_objset_sync_dnodes+0x6c zfs`dmu_objset_sync+0x54 zfs`dsl_dataset_sync+0xc zfs`dsl_pool_sync+0x64 zfs`spa_sync+0x1b0 zfs`txg_sync_thread+0x134 unix`thread_start+0x4 172 24 zfs`metaslab_ff_alloc+0x84 zfs`space_map_alloc+0x10 zfs`metaslab_group_alloc+0x1e0 zfs`metaslab_alloc_dva+0x114 zfs`metaslab_alloc+0x2c zfs`zio_dva_allocate+0x4c zfs`zio_write_compress+0x1ec zfs`arc_write+0xe4 zfs`dbuf_sync+0x6c0 zfs`dnode_sync+0x35c zfs`dmu_objset_sync_dnodes+0x6c zfs`dmu_objset_sync+0x54 zfs`dsl_dataset_sync+0xc zfs`dsl_pool_sync+0x64 zfs`spa_sync+0x1b0 zfs`txg_sync_thread+0x134 unix`thread_start+0x4 322 24 zfs`metaslab_ff_alloc+0x80 zfs`space_map_alloc+0x10 zfs`metaslab_group_alloc+0x1e0 zfs`metaslab_alloc_dva+0x114 zfs`metaslab_alloc+0x2c zfs`zio_dva_allocate+0x4c zfs`zio_write_compress+0x1ec zfs`arc_write+0xe4 zfs`dbuf_sync+0x6c0 zfs`dnode_sync+0x35c zfs`dmu_objset_sync_dnodes+0x6c zfs`dmu_objset_sync+0x54 zfs`dsl_dataset_sync+0xc zfs`dsl_pool_sync+0x64 zfs`spa_sync+0x1b0 zfs`txg_sync_thread+0x134 unix`thread_start+0x4 749 24 genunix`avl_walk+0x64 zfs`metaslab_ff_alloc+0x9c zfs`space_map_alloc+0x10 zfs`metaslab_group_alloc+0x1e0 zfs`metaslab_alloc_dva+0x114 zfs`metaslab_alloc+0x2c zfs`zio_dva_allocate+0x4c zfs`zio_write_compress+0x1ec zfs`arc_write+0xe4 zfs`dbuf_sync+0x6c0 zfs`dnode_sync+0x35c zfs`dmu_objset_sync_dnodes+0x6c zfs`dmu_objset_sync+0x54 zfs`dsl_dataset_sync+0xc zfs`dsl_pool_sync+0x64 zfs`spa_sync+0x1b0 zfs`txg_sync_thread+0x134 unix`thread_start+0x4 758 24 genunix`avl_walk+0x48 zfs`metaslab_ff_alloc+0x9c zfs`space_map_alloc+0x10 zfs`metaslab_group_alloc+0x1e0 zfs`metaslab_alloc_dva+0x114 zfs`metaslab_alloc+0x2c zfs`zio_dva_allocate+0x4c zfs`zio_write_compress+0x1ec zfs`arc_write+0xe4 
zfs`dbuf_sync+0x6c0 zfs`dnode_sync+0x35c zfs`dmu_objset_sync_dnodes+0x6c zfs`dmu_objset_sync+0x54 zfs`dsl_dataset_sync+0xc zfs`dsl_pool_sync+0x64 zfs`spa_sync+0x1b0 zfs`txg_sync_thread+0x134 unix`thread_start+0x4 761 24 genunix`avl_walk+0x60 zfs`metaslab_ff_alloc+0x9c zfs`space_map_alloc+0x10 zfs`metaslab_group_alloc+0x1e0 zfs`metaslab_alloc_dva+0x114 zfs`metaslab_alloc+0x2c zfs`zio_dva_allocate+0x4c zfs`zio_write_compress+0x1ec zfs`arc_write+0xe4 zfs`dbuf_sync+0x6c0 zfs`dnode_sync+0x35c zfs`dmu_objset_sync_dnodes+0x6c zfs`dmu_objset_sync+0x54 zfs`dsl_dataset_sync+0xc zfs`dsl_pool_sync+0x64 zfs`spa_sync+0x1b0 zfs`txg_sync_thread+0x134 unix`thread_start+0x4 2746 24 genunix`avl_walk+0x44 zfs`metaslab_ff_alloc+0x9c zfs`space_map_alloc+0x10 zfs`metaslab_group_alloc+0x1e0 zfs`metaslab_alloc_dva+0x114 zfs`metaslab_alloc+0x2c zfs`zio_dva_allocate+0x4c zfs`zio_write_compress+0x1ec zfs`arc_write+0xe4 zfs`dbuf_sync+0x6c0 zfs`dnode_sync+0x35c zfs`dmu_objset_sync_dnodes+0x6c zfs`dmu_objset_sync+0x54 zfs`dsl_dataset_sync+0xc zfs`dsl_pool_sync+0x64 zfs`spa_sync+0x1b0 zfs`txg_sync_thread+0x134 unix`thread_start+0x4 2796 bash-3.00# dtrace -n profile-1003''/cpu == 24/{@[cpu,stack()]=count();}'' -n tick-10s''{trunc(@,33);printa(@);exit(0);}'' bash-3.00# This message posted from opensolaris.org
eric kustarz
2007-Feb-20 16:55 UTC
[zfs-discuss] Re: NFS/ZFS performance problems - txg_wait_open() deadlocks?
On Feb 15, 2007, at 6:08 AM, Robert Milkowski wrote:

> Hello eric,
>
> Wednesday, February 14, 2007, 5:04:01 PM, you wrote:
>
> ek> I'm wondering if we can just lower the amount of space we're trying
> ek> to alloc as the pool becomes more fragmented - we'll lose a little
> ek> I/O performance, but it should limit this bug.
>
> Do you think that zfs send|recv for such file systems could help for a
> while? (Would it defragment the data?)

If you were able to send over your complete pool, destroy the existing
one and re-create a new one using recv, then that should help with
fragmentation.  That said, that's a very poor man's defragger.  The
defragmentation should happen automatically, or at least while the pool
is online.

In the absence of a built-in defragger and without a fix for 6495013, i
think the best thing you could do is either add more storage or remove
some data (such as removing some old snapshots or moving some unneeded
data to another system/backup).  Not sure if either of those is
applicable to you.

Should note, ZFS isn't the only COW filesystem that hits this problem.
In WAFL's case, i believe they hit this same problem around 90% capacity,
but they are very nice about detecting it and reporting back to the
admin.

eric
Robert Milkowski
2007-Feb-20 21:12 UTC
[zfs-discuss] Re: NFS/ZFS performance problems - txg_wait_open() deadlocks?
Hello eric,

Tuesday, February 20, 2007, 5:55:47 PM, you wrote:

ek> On Feb 15, 2007, at 6:08 AM, Robert Milkowski wrote:

>> Hello eric,
>>
>> Wednesday, February 14, 2007, 5:04:01 PM, you wrote:
>>
>> ek> I'm wondering if we can just lower the amount of space we're trying
>> ek> to alloc as the pool becomes more fragmented - we'll lose a little
>> ek> I/O performance, but it should limit this bug.
>>
>> Do you think that zfs send|recv for such file systems could help for a
>> while? (Would it defragment the data?)

ek> If you were able to send over your complete pool, destroy the
ek> existing one and re-create a new one using recv, then that should
ek> help with fragmentation.  That said, that's a very poor man's
ek> defragger.  The defragmentation should happen automatically, or at
ek> least while the pool is online.

I was rather thinking about sending all file systems to another
server/pool (I'm in the middle of that process), then deleting the source
file systems and sending the file systems back. Of course there's no
problem with destroying the pool, but I wonder why you think it's needed.
Wouldn't just deleting the file systems be enough?

btw: I've already migrated three file systems that way to an x4500 and so
far they are working great - no CPU usage, much fewer read I/Os (both in
number of I/Os and in volume), and everything is committed exactly every
5s. So I guess there's a high degree of probability it will stay that way
once I migrate them back to the cluster.

ek> In the absence of a built-in defragger and without a fix for 6495013,
ek> i think the best thing you could do is either add more storage or
ek> remove some data (such as removing some old snapshots or moving some
ek> unneeded data to another system/backup).  Not sure if either of those
ek> is applicable to you.

Some time ago removing snapshots helped. Then we stopped creating them,
and now there's nothing really left to remove, so I'm doing the above.

btw: 6495013 - what is it exactly? (bugs.opensolaris.org doesn't show it,
nor does sunsolve.)

ek> Should note, ZFS isn't the only COW filesystem that hits this
ek> problem.  In WAFL's case, i believe they hit this same problem around
ek> 90% capacity, but they are very nice about detecting it and reporting
ek> back to the admin.

Yeah, but with ZFS I hit a really big problem while the pool still has
more than 50% free space (which is almost 700GB of free space). And it's
not only that writes are slow; it's also that during the problem one
whole core is eaten up to 100% all the time. Which raises the suspicion
that perhaps some of this work could be done in parallel and things sped
up a little on a T2000 or v440. A crude workaround, but still: most of
the other cores just do nothing while one core is 100% in sys doing all
those avl_walks (a long journey).

--
Best regards,
Robert                        mailto:rmilkowski at task.gda.pl
                              http://milek.blogspot.com
eric kustarz
2007-Feb-20 22:29 UTC
[zfs-discuss] Re: NFS/ZFS performance problems - txg_wait_open() deadlocks?
> ek> If you were able to send over your complete pool, destroy the
> ek> existing one and re-create a new one using recv, then that should
> ek> help with fragmentation.  That said, that's a very poor man's
> ek> defragger.  The defragmentation should happen automatically, or at
> ek> least while the pool is online.
>
> I was rather thinking about sending all file systems to another
> server/pool (I'm in the middle of that process), then deleting the
> source file systems and sending the file systems back. Of course there's
> no problem with destroying the pool, but I wonder why you think it's
> needed. Wouldn't just deleting the file systems be enough?

Yeah, destroying the filesystems should be enough (i was equating
destroying all filesystems to a zpool destroy).

> btw: I've already migrated three file systems that way to an x4500 and
> so far they are working great - no CPU usage, much fewer read I/Os (both
> in number of I/Os and in volume), and everything is committed exactly
> every 5s. So I guess there's a high degree of probability it will stay
> that way once I migrate them back to the cluster.
>
> ek> In the absence of a built-in defragger and without a fix for
> ek> 6495013, i think the best thing you could do is either add more
> ek> storage or remove some data (such as removing some old snapshots or
> ek> moving some unneeded data to another system/backup).  Not sure if
> ek> either of those is applicable to you.
>
> Some time ago removing snapshots helped. Then we stopped creating them,
> and now there's nothing really left to remove, so I'm doing the above.
>
> btw: 6495013 - what is it exactly? (bugs.opensolaris.org doesn't show
> it, nor does sunsolve.)

Hmm, should be working now:
http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6495013

6495013 Loops and recursion in metaslab_ff_alloc can kill performance,
even on a pool with lots of free data

eric
Robert Milkowski
2007-Feb-20 22:38 UTC
[zfs-discuss] Re: NFS/ZFS performance problems - txg_wait_open() deadlocks?
Hello eric, Tuesday, February 20, 2007, 11:29:41 PM, you wrote:>> >> ek> If you were able to send over your complete pool, destroy the >> ek> existing one and re-create a new one using recv, then that should >> ek> help with fragmentation. That said, that''s a very poor man''s >> ek> defragger. The defragmentation should happen automatically or at >> ek> least while the pool is online. >> >> I was rather thinking about sending all file systems to another >> server/pool (I''m in a middle of the process), then deleting source >> file systems and send the file systems back. Of course no problem with >> destroying the pool but I wonder why do you think it''s needed? Just >> deleting file systems won''t be enough?ek> Yeah, destroying the filesystems should be enough (i was equating ek> destroying all filesystems to a zpool destroy).>> >> btw: I''ve already migrated three file systems that way to x4500 and >> so far they are working great - no cpu usage, much less read IOs >> (both in # of IO and in volume) and everything is committed >> exactly every 5s. So I guess there''s high degree of probability >> it will stay the same once I migrate them back to cluster. >> >> >> ek> In the absence of a built-in defragger and without a fix for >> 6495013, >> ek> i think the best thing you could do is either add more storage or >> ek> remove some data (such as removing some old snapshots or move some >> ek> unneeded storage to another system/backup). Not sure if either of >> ek> those are applicable to you. >> >> Some time ago removing snapshots helped. Then we stopped creating them >> and not there''s nothing really to remove left so I''m doing above. >> >> btw: 6495013 - what is it exactly (bugs.opensolaris.org doesn''t show >> it either sunsolve).ek> Hmm, should be working now: ek> http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6495013 ek> 6495013 Loops and recursion in metaslab_ff_alloc can kill ek> performance, even on a pool with lots of free data thanks I''ll ask my case engineer to add a call record for us. -- Best regards, Robert mailto:rmilkowski at task.gda.pl http://milek.blogspot.com