Martin, Marcia R
2006-Jun-21 20:14 UTC
[zfs-discuss] 15 minute fdsync problem and ZFS: Solved
Did I miss something on this thread? Was the root cause of the 15-minute fsync ever actually determined?

-----Original Message-----
From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-bounces at opensolaris.org] On Behalf Of eric kustarz
Sent: Wednesday, June 21, 2006 2:12 PM
To: Neil.Perrin at sun.com
Cc: zfs-discuss at opensolaris.org; Torrey McMahon
Subject: Re: [zfs-discuss] 15 minute fdsync problem and ZFS: Solved

Neil Perrin wrote:
> Robert Milkowski wrote On 06/21/06 11:09,:
>> Hello Neil,
>>
>>>> Why is this option available then? (Yes, that's a loaded question.)
>>
>> NP> I wouldn't call it an option, but an internal debugging switch that I
>> NP> originally added to allow progress when initially integrating the ZIL.
>> NP> As Roch says it really shouldn't ever be set (as it does negate POSIX
>> NP> synchronous semantics). Nor should it be mentioned to a customer.
>> NP> In fact I'm inclined to now remove it - however it does still have a use,
>> NP> as it helped root cause this problem.
>>
>> Isn't it similar to the unsupported fastfs for ufs?
>
> It is similar in the sense that it speeds up the file system.
> Using fastfs can be much more dangerous though, as it can lead to a
> badly corrupted file system because metadata writes are delayed and
> written out of order, whereas disabling the ZIL does not affect the
> integrity of the fs. The transaction group model of ZFS gives
> consistency in the event of a crash/power failure. However, any data that
> was promised to be on stable storage may not be unless the transaction
> group committed (an operation that is started every 5s).
>
> We once had plans to add a mount option to allow the admin to control
> the ZIL. Here's a brief section of the RFE (6280630):
>
>     sync={deferred,standard,forced}
>
>         Controls synchronous semantics for the dataset.
>
>         When set to 'standard' (the default), synchronous operations
>         such as fsync(3C) behave precisely as defined in fcntl.h(3HEAD).
>
>         When set to 'deferred', requests for synchronous semantics
>         are ignored. However, ZFS still guarantees that ordering
>         is preserved -- that is, consecutive operations reach stable
>         storage in order. (If a thread performs operation A followed
>         by operation B, then the moment that B reaches stable storage,
>         A is guaranteed to be on stable storage as well.) ZFS also
>         guarantees that all operations will be scheduled for write to
>         stable storage within a few seconds, so that an unexpected
>         power loss only takes the last few seconds of changes with it.
>
>         When set to 'forced', all operations become synchronous.
>         No operation will return until all previous operations
>         have been committed to stable storage. This option can be
>         useful if an application is found to depend on synchronous
>         semantics without actually requesting them; otherwise, it
>         will just make everything slow, and is not recommended.
>
> Of course we would need to stress the dangers of setting 'deferred'.
> What do you guys think?
>
> Neil.

Scares me, and it seems we should wait until people are demanding it and we *have* to do it (if that time ever comes) - that is, until we can't squeeze any more performance gain out of the 'standard' method.

If problems do occur because of 'deferred' mode, then once I wrap up zpool history we'll have the fact that they set it logged to disk.

eric
Some more info. Yesterday when I did a "zpool iostat 1" I got the following:

# here is sample of zpool iostat command
# canary      39.1G  14.9G    509      0  24.0M      0
# canary      39.1G  14.9G    477     41  20.3M   169K
# canary      39.1G  14.9G    496      1  22.7M  10.9K
# canary      39.1G  14.9G    477      0  21.6M   1013
# canary      39.1G  14.9G    489      0  20.8M   1012
# canary      39.1G  14.9G    467      0  20.1M   1015
# canary      39.1G  14.9G    478      0  21.7M   1013
# canary      39.1G  14.9G    454      0  22.9M   1014
# canary      39.1G  14.9G    458     26  18.3M   529K
# canary      39.1G  14.9G    384     71  15.9M   837K
# canary      39.1G  14.9G    439     21  18.5M   110K
# canary      39.1G  14.9G    510      2  21.7M  14.8K
# canary      39.1G  14.9G    466      0  20.9M  1.98K
# canary      39.1G  14.9G    492     39  20.2M   181K
# canary      39.1G  14.9G    489      7  20.5M  63.4K
# canary      39.1G  14.9G    500      0  21.1M  1.96K
# canary      39.1G  14.9G    499      1  22.5M  1.96K
# canary      39.1G  14.9G    523      0  23.8M   1012
# canary      39.1G  14.9G    519      0  21.9M   1013
# canary      39.1G  14.9G    480      0  20.0M    506
# canary      39.1G  14.9G    508      0  21.6M      0
# canary      39.1G  14.9G    488      0  20.5M      0

What seemed strange is the constant reads that were occurring. Today (after last night's reboot) we see this:

# zpool iostat 1
               capacity     operations    bandwidth
pool         used  avail   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
canary      40.0G  14.0G     96    189  5.30M  1.60M
canary      40.0G  14.0G      0      0      0      0
canary      40.0G  14.0G      9     60   761K   828K
canary      40.0G  14.0G     41    124  2.91M  1.55M
canary      40.0G  14.0G      3      0   253K      0
canary      40.0G  14.0G      8      0   570K      0
canary      40.0G  14.0G     11     25   760K   358K
canary      40.0G  14.0G     29    275  2.10M  3.70M
canary      40.0G  14.0G     82    273  5.45M  3.47M
canary      40.0G  14.0G     43    296  2.78M  3.90M
canary      40.0G  14.0G      9    297   634K  3.64M
canary      40.0G  14.0G     12    324   887K  4.24M
canary      40.0G  14.0G     15    271  1.18M  3.56M
canary      40.0G  14.0G      5    295   444K  3.75M
canary      40.0G  14.0G     14    319  1014K  4.14M
canary      40.0G  14.0G     10    321   950K  3.34M
canary      40.0G  14.0G     21    258  1.55M  2.46M
canary      40.0G  14.0G    349     29  23.3M  62.4K
canary      40.0G  14.0G    253     44  16.6M   541K
canary      40.0G  14.0G     48    324  3.03M  3.21M
canary      40.0G  14.0G     11    372   887K  3.74M
canary      40.0G  14.0G     35    268  2.29M  2.64M
canary      40.0G  14.0G     12    327   824K  3.20M
canary      40.0G  14.0G     21    298  1.42M  2.85M
canary      40.0G  14.0G     60    276  3.96M  2.07M
canary      40.0G  14.0G    203    126  13.0M   673K
canary      40.0G  14.0G     39    370  2.48M  1.84M
canary      40.0G  14.0G      6    559   444K  2.58M
canary      40.0G  14.0G      0    653      0  2.97M
canary      40.0G  14.0G      0    582  63.4K  2.43M
canary      40.0G  14.0G      0    697      0  3.30M
canary      40.0G  14.0G      8    528   570K  2.22M
canary      40.0G  14.0G      4    524   317K  2.48M
canary      40.0G  14.0G      0    656  63.4K  3.64M
canary      40.0G  14.0G      0    661      0  2.50M
canary      40.0G  14.0G      0    658      0  2.99M
canary      40.0G  14.0G      0    630      0  2.68M
canary      40.0G  14.0G      0    647      0  2.83M
canary      40.0G  14.0G      0    622      0  2.85M
canary      40.0G  14.0G      0    717      0  3.41M

With 110,000 file transfers a day (each 50kb tar file is untarred and produces 8 files), I did not expect the read rate seen yesterday. Could there have been something constantly reading the disk?

More info: when the problem was occurring, I ran this dtrace script (per Jim Fiori):

# cat prof.d
#!/usr/sbin/dtrace -s

BEGIN
{
        total = 0;
}

profile-1007
/arg0/
{
        @[stack(5)] = count();
        total++;
}

END
{
        printf("\nTOTAL %d\n", total);
        trunc(@, 30);
        printa(@);
}

Here was the output.
: genunix`anon_map_getpages+0x984 genunix`segvn_fault_anonpages+0x310 72 unix`cpu_halt+0x10 unix`idle+0x128 unix`thread_start+0x4 74 unix`init_mondo+0x20 unix`xt_some+0xbc unix`sfmmu_tlb_range_demap+0x170 unix`hat_unload_callback+0x7f0 unix`ppmapout+0x94 74 unix`idle+0xdc unix`thread_start+0x4 76 unix`shipit+0x28 unix`send_mondo_set+0x5c unix`xt_some+0xc4 unix`xt_sync+0x84 unix`sfmmu_chgattr+0x380 79 zfs`metaslab_pick_block+0x94 zfs`metaslab_getblock+0x3c zfs`metaslab_group_alloc+0xd0 zfs`metaslab_alloc+0x50 zfs`zio_dva_allocate+0x64 81 zfs`lzjb_decompress+0x98 zfs`zio_read_decompress+0x38 genunix`taskq_thread+0x1a4 unix`thread_start+0x4 84 zfs`mze_compare genunix`avl_find+0x2c zfs`mze_find+0x34 zfs`zap_lookup+0x70 zfs`zfs_dirent_lock+0x2bc 87 unix`disp_getwork+0xd8 unix`idle+0xd4 unix`thread_start+0x4 89 unix`disp_getwork+0x10c unix`idle+0xd4 unix`thread_start+0x4 95 unix`mutex_vector_enter+0x130 genunix`anon_dup_fill_holes+0x158 genunix`segvn_dup+0x3ac genunix`as_dup+0xd0 genunix`cfork+0x168 100 SUNW,UltraSPARC-T1`bcopy+0x2b4 unix`ppcopy+0x34 genunix`anon_map_privatepages+0x524 genunix`anon_map_getpages+0x984 genunix`segvn_fault_anonpages+0x310 100 zfs`lzjb_decompress+0x80 zfs`zio_read_decompress+0x38 genunix`taskq_thread+0x1a4 unix`thread_start+0x4 100 unix`disp_getwork+0xb8 unix`idle+0xd4 unix`thread_start+0x4 101 unix`mutex_vector_enter+0x144 genunix`anon_dup_fill_holes+0x158 genunix`segvn_dup+0x3ac genunix`as_dup+0xd0 genunix`cfork+0x168 102 zfs`metaslab_pick_block+0x9c zfs`metaslab_getblock+0x3c zfs`metaslab_group_alloc+0xd0 zfs`metaslab_alloc+0x50 zfs`zio_dva_allocate+0x64 103 zfs`lzjb_decompress+0x78 zfs`zio_read_decompress+0x38 genunix`taskq_thread+0x1a4 unix`thread_start+0x4 106 unix`shipit+0x28 unix`send_mondo_set+0x5c unix`xt_some+0xc4 unix`sfmmu_tlb_demap+0x100 unix`sfmmu_hblk_chgattr+0x2f8 112 unix`disp_getwork+0x8 unix`idle+0xd4 unix`thread_start+0x4 113 genunix`avl_walk+0xc zfs`metaslab_pick_block+0x98 zfs`metaslab_getblock+0x3c zfs`metaslab_group_alloc+0xd0 zfs`metaslab_alloc+0x50 116 zfs`metaslab_pick_block+0x80 zfs`metaslab_getblock+0x3c zfs`metaslab_group_alloc+0xd0 zfs`metaslab_alloc+0x50 zfs`zio_write_allocate_gang_members+0xf0 119 genunix`avl_walk+0x90 zfs`metaslab_pick_block+0x98 zfs`metaslab_getblock+0x3c zfs`metaslab_group_alloc+0xd0 zfs`metaslab_alloc+0x50 120 genunix`avl_walk+0x84 zfs`metaslab_pick_block+0x98 zfs`metaslab_getblock+0x3c zfs`metaslab_group_alloc+0xd0 zfs`metaslab_alloc+0x50 122 genunix`avl_walk+0x4 zfs`metaslab_pick_block+0x98 zfs`metaslab_getblock+0x3c zfs`metaslab_group_alloc+0xd0 zfs`metaslab_alloc+0x50 124 genunix`avl_walk+0xa4 zfs`metaslab_pick_block+0x98 zfs`metaslab_getblock+0x3c zfs`metaslab_group_alloc+0xd0 zfs`metaslab_alloc+0x50 124 unix`idle+0xec unix`thread_start+0x4 127 unix`page_exists+0x7c unix`free_vp_pages+0x48 unix`hat_unload_callback+0x854 genunix`segvn_unmap+0x1b8 genunix`as_free+0xf4 131 unix`idle+0x78 unix`thread_start+0x4 132 SUNW,UltraSPARC-T1`bcopy+0x2bc unix`ppcopy+0x34 genunix`anon_map_privatepages+0x524 genunix`anon_map_getpages+0x984 genunix`segvn_fault_anonpages+0x310 138 zfs`lzjb_decompress+0x88 zfs`zio_read_decompress+0x38 genunix`taskq_thread+0x1a4 unix`thread_start+0x4 138 genunix`avl_walk+0x2c zfs`metaslab_pick_block+0x98 zfs`metaslab_getblock+0x3c zfs`metaslab_group_alloc+0xd0 zfs`metaslab_alloc+0x50 139 SUNW,UltraSPARC-T1`bcopy+0x2c8 unix`ppcopy+0x34 genunix`anon_map_privatepages+0x524 genunix`anon_map_getpages+0x984 genunix`segvn_fault_anonpages+0x310 139 SUNW,UltraSPARC-T1`bcopy+0x2b8 unix`ppcopy+0x34 
genunix`anon_map_privatepages+0x524 genunix`anon_map_getpages+0x984 genunix`segvn_fault_anonpages+0x310 143 unix`mutex_vector_enter+0x130 genunix`anon_map_privatepages+0xf8 genunix`anon_map_getpages+0x984 genunix`segvn_fault_anonpages+0x310 genunix`segvn_fault+0x438 150 SUNW,UltraSPARC-T1`bcopy+0x2cc unix`ppcopy+0x34 genunix`anon_map_privatepages+0x524 genunix`anon_map_getpages+0x984 genunix`segvn_fault_anonpages+0x310 151 unix`cpu_halt+0xd8 unix`idle+0x128 unix`thread_start+0x4 161 genunix`kmem_cache_kstat_update+0x1fc kstat`read_kstat_data+0x14c genunix`fop_ioctl+0x20 genunix`ioctl+0x184 unix`syscall_trap32+0xcc 164 unix`atomic_and_64+0x4 unix`idle+0x128 unix`thread_start+0x4 174 unix`mutex_vector_enter+0x144 genunix`anon_map_privatepages+0xf8 genunix`anon_map_getpages+0x984 genunix`segvn_fault_anonpages+0x310 genunix`segvn_fault+0x438 184 unix`idle+0x9c unix`thread_start+0x4 187 unix`mutex_vector_enter+0x134 genunix`anon_dup_fill_holes+0x158 genunix`segvn_dup+0x3ac genunix`as_dup+0xd0 genunix`cfork+0x168 191 genunix`avl_walk+0xa0 zfs`metaslab_pick_block+0x98 zfs`metaslab_getblock+0x3c zfs`metaslab_group_alloc+0xd0 zfs`metaslab_alloc+0x50 191 unix`mutex_vector_enter+0x148 genunix`anon_dup_fill_holes+0x158 genunix`segvn_dup+0x3ac genunix`as_dup+0xd0 genunix`cfork+0x168 199 zfs`metaslab_pick_block+0xac zfs`metaslab_getblock+0x3c zfs`metaslab_group_alloc+0xd0 zfs`metaslab_alloc+0x50 zfs`zio_dva_allocate+0x64 206 unix`disp_getwork+0xd0 unix`idle+0xd4 unix`thread_start+0x4 210 unix`disp_getwork+0xf0 unix`idle+0xd4 unix`thread_start+0x4 220 unix`idle+0x128 unix`thread_start+0x4 244 unix`disp_getwork+0x9c unix`idle+0xd4 unix`thread_start+0x4 247 unix`mutex_vector_enter+0x130 genunix`anon_decref_pages+0x78 genunix`anon_free_pages+0x80 genunix`segvn_free+0xb0 genunix`seg_free+0x2c 260 unix`disp_getwork+0x80 unix`idle+0xd4 unix`thread_start+0x4 278 unix`disp_getwork+0x78 unix`idle+0xd4 unix`thread_start+0x4 281 unix`mutex_vector_enter+0x144 genunix`anon_decref_pages+0x78 genunix`anon_free_pages+0x80 genunix`segvn_free+0xb0 genunix`seg_free+0x2c 286 unix`disp_getwork+0xe8 unix`idle+0xd4 unix`thread_start+0x4 293 unix`disp_getwork+0x94 unix`idle+0xd4 unix`thread_start+0x4 295 zfs`metaslab_pick_block+0x7c zfs`metaslab_getblock+0x3c zfs`metaslab_group_alloc+0xd0 zfs`metaslab_alloc+0x50 zfs`zio_write_allocate_gang_members+0xf0 336 unix`atomic_and_64+0xc unix`idle+0x128 unix`thread_start+0x4 357 unix`disp_getwork+0xa0 unix`idle+0xd4 unix`thread_start+0x4 357 unix`mutex_vector_enter+0x134 genunix`anon_map_privatepages+0xf8 genunix`anon_map_getpages+0x984 genunix`segvn_fault_anonpages+0x310 genunix`segvn_fault+0x438 358 unix`mutex_vector_enter+0x148 genunix`anon_map_privatepages+0xf8 genunix`anon_map_getpages+0x984 genunix`segvn_fault_anonpages+0x310 genunix`segvn_fault+0x438 390 unix`utl0+0x4c 395 genunix`avl_walk+0x24 zfs`metaslab_pick_block+0x98 zfs`metaslab_getblock+0x3c zfs`metaslab_group_alloc+0xd0 zfs`metaslab_alloc+0x50 458 unix`disp_getwork+0x70 unix`idle+0xd4 unix`thread_start+0x4 484 unix`disp_getwork+0xf4 unix`idle+0xd4 unix`thread_start+0x4 487 unix`disp_getwork+0xb0 unix`idle+0xd4 unix`thread_start+0x4 496 unix`mutex_vector_enter+0x148 genunix`anon_decref_pages+0x78 genunix`anon_free_pages+0x80 genunix`segvn_free+0xb0 genunix`seg_free+0x2c 555 unix`disp_getwork+0x98 unix`idle+0xd4 unix`thread_start+0x4 562 unix`mutex_vector_enter+0x134 genunix`anon_decref_pages+0x78 genunix`anon_free_pages+0x80 genunix`segvn_free+0xb0 genunix`seg_free+0x2c 567 unix`disp_getwork+0x84 unix`idle+0xd4 
unix`thread_start+0x4 569 zfs`metaslab_pick_block+0x80 zfs`metaslab_getblock+0x3c zfs`metaslab_group_alloc+0xd0 zfs`metaslab_alloc+0x50 zfs`zio_dva_allocate+0x64 577 unix`disp_getwork+0x90 unix`idle+0xd4 unix`thread_start+0x4 600 unix`disp_getwork+0x7c unix`idle+0xd4 unix`thread_start+0x4 618 zfs`metaslab_pick_block+0x7c zfs`metaslab_getblock+0x3c zfs`metaslab_group_alloc+0xd0 zfs`metaslab_alloc+0x50 zfs`zio_dva_allocate+0x64 1338 unix`i_ddi_splx+0x1c unix`disp_getwork+0x160 unix`idle+0xd4 unix`thread_start+0x4 1564 genunix`avl_walk+0x64 zfs`metaslab_pick_block+0x98 zfs`metaslab_getblock+0x3c zfs`metaslab_group_alloc+0xd0 zfs`metaslab_alloc+0x50 1629 genunix`avl_walk+0x48 zfs`metaslab_pick_block+0x98 zfs`metaslab_getblock+0x3c zfs`metaslab_group_alloc+0xd0 zfs`metaslab_alloc+0x50 1639 genunix`avl_walk+0x60 zfs`metaslab_pick_block+0x98 zfs`metaslab_getblock+0x3c zfs`metaslab_group_alloc+0xd0 zfs`metaslab_alloc+0x50 2458 genunix`avl_walk+0x44 zfs`metaslab_pick_block+0x98 zfs`metaslab_getblock+0x3c zfs`metaslab_group_alloc+0xd0 zfs`metaslab_alloc+0x50 2479 unix`shipit+0x28 unix`send_mondo_set+0x5c unix`xt_some+0xc4 unix`sfmmu_tlb_range_demap+0x170 unix`hat_unload_callback+0x7f0 2585 unix`shipit+0x28 unix`send_mondo_set+0x5c unix`xt_some+0xc4 unix`xt_sync+0x84 unix`hat_unload_callback+0x808 2587 unix`disp_getwork+0xec unix`idle+0xd4 unix`thread_start+0x4 2933 unix`disp_getwork+0x8c unix`idle+0xd4 unix`thread_start+0x4 3122 unix`disp_getwork+0x88 unix`idle+0xd4 unix`thread_start+0x4 7751 unix`cpu_halt+0xc8 unix`idle+0x128 unix`thread_start+0x4 554827

Lastly, the reboot last night (with the /etc/system change) has ended the high fdsync issue. Since no one has said they know what caused the issue, I am not confident that it will not come back. The only suggestion made in the last week to fix this was to change the /etc/system value. This is a production box that hosts data from around the world (750 Sun Ray servers, 180 campuses); we needed to get the box back to normal service levels.

Can someone suggest a little program that we could run periodically to test fdsync? We could set an alarm to send an email if it degrades again.

Sean

Roch wrote On 06/21/06 07:31,:
> 15 minutes to do a fdsync is way outside the slowdown usually seen.
> The footprint for 6413510 is that when a huge amount of
> data is being written non-synchronously and a fsync comes in for the
> same filesystem, then all the non-synchronous data is also forced out
> synchronously. So is there a lot of data being written during the vi?
>
> vi will write the whole file in 4K chunks and fsync it
> (based on a single experiment).

Neil Perrin wrote:
Sean kindly gave me access to the system and so far I have reproduced the problem. It just requires a fsync on a file with 1 byte, and for me it takes 10 minutes to fsync! I have run a few D scripts but have yet to make much more progress. I do see that the zfs version is fairly old, so we may be chasing an old bug, or perhaps this really is an extreme version of 6413510, as there is often 3MB of data being collected and written to the pool, and probably to the same fs.

Neil

--
<http://www.sun.com>
Sean Meighan
Mgr ITSM Engineering
Sun Microsystems, Inc.
US Phone x32329 / +1 408 850-9537
Mobile 303-520-2024
Fax 408 850-9537
Email Sean.Meighan at Sun.COM

------------------------------------------------------------------------
NOTICE: This email message is for the sole use of the intended recipient(s) and may contain confidential and privileged information.
Any unauthorized review, use, disclosure or distribution is prohibited. If you are not the intended recipient, please contact the sender by reply email and destroy all copies of the original message.
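Sean asks above for a little program to run periodically to test fdsync. A minimal sketch of such a tester is below; it is not from the original thread, the file path and the one-second alarm threshold are arbitrary placeholders, and it only times a single synchronous write, but it should compile with cc or gcc and could be run from cron with an alert sent on a non-zero exit status.

/* fsynctest.c - time a single fsync() on a small file.
 * Illustrative sketch only; the path and threshold are placeholders.
 * Build: cc -o fsynctest fsynctest.c
 */
#include <stdio.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/time.h>

int
main(int argc, char **argv)
{
        const char *path = (argc > 1) ? argv[1] : "/tank/fsynctest.tmp";
        char buf[8192];
        struct timeval t0, t1;
        double ms;
        int fd;

        memset(buf, 'x', sizeof (buf));

        fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0600);
        if (fd < 0) {
                perror("open");
                return (1);
        }
        if (write(fd, buf, sizeof (buf)) != sizeof (buf)) {
                perror("write");
                return (1);
        }

        gettimeofday(&t0, NULL);
        if (fsync(fd) != 0) {   /* the call that was taking minutes */
                perror("fsync");
                return (1);
        }
        gettimeofday(&t1, NULL);

        ms = (t1.tv_sec - t0.tv_sec) * 1000.0 +
            (t1.tv_usec - t0.tv_usec) / 1000.0;
        printf("fsync of %d bytes took %.2f ms\n", (int)sizeof (buf), ms);

        (void) close(fd);
        (void) unlink(path);

        /* non-zero exit if fsync took over 1 second; handy for cron alerting */
        return (ms > 1000.0 ? 2 : 0);
}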
Martin, Marcia R writes:
> Did I miss something on this thread? Was the root cause of the
> 15-minute fsync ever actually determined?

I think so ;-)

-r
> a test against the same iscsi targets using linux and XFS and the
> NFS server implementation there gave me 1.25MB/sec writes. I was about
> to throw in the towel and deem ZFS/NFS as unusable until B41 came
> along and at least gave me 1.25MB/sec.

That's still super slow -- is this over a 10Mb link or something?

Jeff
On 6/22/06, Jeff Bonwick <bonwick at zion.eng.sun.com> wrote:
> > a test against the same iscsi targets using linux and XFS and the
> > NFS server implementation there gave me 1.25MB/sec writes. I was about
> > to throw in the towel and deem ZFS/NFS as unusable until B41 came
> > along and at least gave me 1.25MB/sec.
>
> That's still super slow -- is this over a 10Mb link or something?
>
> Jeff

Nope, a gig-e link (single e1000g, or aggregate, doesn't matter) to the iscsi target, and a single gig-e link (nge) to the NFS clients, who are gig-e. Sun Ultra20 or AMD quad Opteron, again with no difference.

Again, the issue is the multiple fsyncs that NFS requires, and likely the serialization of those iscsi requests. Apparently there is a basic latency in iscsi that one could improve upon with FC, but we are definitely in the all-ethernet/iscsi camp for multi-building storage pool growth and don't have interest in an FC-based SAN.
Nicolas Williams
2006-Jun-22 23:48 UTC
[zfs-discuss] 15 minute fdsync problem and ZFS: Solved
On Thu, Jun 22, 2006 at 04:22:22PM -0700, Joe Little wrote:
> Again, the issue is the multiple fsyncs that NFS requires, and likely
> the serialization of those iscsi requests. Apparently, there is a
> basic latency in iscsi that one could improve upon with FC, but we are
> definitely in the all ethernet/iscsi camp for multi-building storage
> pool growth and don't have interest in a FC-based SAN.

There may be two things going on here.

One is the fsync()s. So many fsync()s, required by the semantics of NFS, serialize I/O too much (which, with a client doing single-threaded things, kills), and, looking at Roch's Dynamics of ZFS blog entry, perhaps there's some aliasing of fsync() pressure as memory pressure/disk I/O saturation.

From Roch's blog entry:

    CAVEAT: The current state of the ZIL is such that if there is a lot
    of pending data in a Filesystem (written to the FS, not yet output
    to disk) and a process issues an fsync() for one of its files, then
    all pending operations will have to be sent to disk before the
    synchronous command can complete. This can lead to unexpected
    performance characteristics. Code is under review.

Since NFS is doing so many fsyncs, this, I reason, can force many writes into txg+2, thus triggering throttling of the NFS server threads.

Nico
--
On 6/22/06, Bill Moore <Bill.Moore at sun.com> wrote:
> Hey Joe. We're working on some ZFS changes in this area, and if you
> could run an experiment for us, that would be great. Just do this:
>
>         echo 'zil_disable/W1' | mdb -kw
>
> We're working on some fixes to the ZIL so it won't be a bottleneck when
> fsyncs come around. The above command will let us know what kind of
> improvement is on the table. After our fixes you could get from 30-80%
> of that improvement, but this would be a good data point. This change
> makes ZFS ignore the iSCSI/NFS fsync requests, but we still push out a
> txg every 5 seconds. So at most, your disk will be 5 seconds out of
> date compared to what it should be. It's a pretty small window, but it
> all depends on your appetite for such windows. :)
>
> After running the above command, you'll need to unmount/mount the
> filesystem in order for the change to take effect.
>
> If you don't have time, no big deal.
>
> --Bill
>
> On Thu, Jun 22, 2006 at 04:22:22PM -0700, Joe Little wrote:
> > On 6/22/06, Jeff Bonwick <bonwick at zion.eng.sun.com> wrote:
> > >> a test against the same iscsi targets using linux and XFS and the
> > >> NFS server implementation there gave me 1.25MB/sec writes. I was about
> > >> to throw in the towel and deem ZFS/NFS as unusable until B41 came
> > >> along and at least gave me 1.25MB/sec.
> > >
> > > That's still super slow -- is this over a 10Mb link or something?
> > >
> > > Jeff
> >
> > Nope, gig-e link (single e1000g, or aggregate, doesn't matter) to the
> > iscsi target, and single gig-e link (nge) to the NFS clients, who are
> > gig-e. Sun Ultra20 or AMD Quad Opteron, again with no difference.
> >
> > Again, the issue is the multiple fsyncs that NFS requires, and likely
> > the serialization of those iscsi requests. Apparently, there is a
> > basic latency in iscsi that one could improve upon with FC, but we are
> > definitely in the all ethernet/iscsi camp for multi-building storage
> > pool growth and don't have interest in a FC-based SAN.

Well, following Bill's advice and the previous note on disabling the zil, I ran my test on a B38 Opteron initiator, and if you do a time on the copy from the client, 6250 8k files transfer at 6MB/sec now. If you watch the entire commit on the backend using "zpool iostat 1", I see that it takes a few more seconds, and the actual rate there is 4MB/sec. Beats my best of 1.25MB/sec, and this is not B41.
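For anyone else repeating Bill's experiment, it helps to be able to confirm the current value and to back the change out again. Roughly, and assuming the standard mdb read/write syntax on your build (verify the variable name on yours; the dataset name below is a placeholder):

    # read the current value (0 = ZIL enabled, 1 = disabled)
    echo 'zil_disable/X' | mdb -k

    # disable the ZIL for the experiment, then remount the filesystem
    echo 'zil_disable/W1' | mdb -kw
    zfs umount pool/fs && zfs mount pool/fs

    # put it back when the test is done
    echo 'zil_disable/W0' | mdb -kw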
Joe Little writes:
> On 6/22/06, Bill Moore <Bill.Moore at sun.com> wrote:
> > Hey Joe. We're working on some ZFS changes in this area, and if you
> > could run an experiment for us, that would be great. Just do this:
> >
> >         echo 'zil_disable/W1' | mdb -kw
> >
> > We're working on some fixes to the ZIL so it won't be a bottleneck when
> > fsyncs come around. The above command will let us know what kind of
> > improvement is on the table. After our fixes you could get from 30-80%
> > of that improvement, but this would be a good data point. This change
> > makes ZFS ignore the iSCSI/NFS fsync requests, but we still push out a
> > txg every 5 seconds. So at most, your disk will be 5 seconds out of
> > date compared to what it should be. It's a pretty small window, but it
> > all depends on your appetite for such windows. :)
> >
> > After running the above command, you'll need to unmount/mount the
> > filesystem in order for the change to take effect.
> >
> > If you don't have time, no big deal.
> >
> > --Bill
> >
> > On Thu, Jun 22, 2006 at 04:22:22PM -0700, Joe Little wrote:
> > > On 6/22/06, Jeff Bonwick <bonwick at zion.eng.sun.com> wrote:
> > > >> a test against the same iscsi targets using linux and XFS and the
> > > >> NFS server implementation there gave me 1.25MB/sec writes. I was about
> > > >> to throw in the towel and deem ZFS/NFS as unusable until B41 came
> > > >> along and at least gave me 1.25MB/sec.
> > > >
> > > > That's still super slow -- is this over a 10Mb link or something?
> > > >
> > > > Jeff

I think the performance is in line with expectation for a small-file, single-threaded, open/write/close NFS workload (nfs must commit on close). Therefore I expect:

    (avg file size) / (I/O latency)

Joe, does this formula approach the 1.25 MB/s?

> > > Nope, gig-e link (single e1000g, or aggregate, doesn't matter) to the
> > > iscsi target, and single gig-e link (nge) to the NFS clients, who are
> > > gig-e. Sun Ultra20 or AMD Quad Opteron, again with no difference.
> > >
> > > Again, the issue is the multiple fsyncs that NFS requires, and likely
> > > the serialization of those iscsi requests. Apparently, there is a
> > > basic latency in iscsi that one could improve upon with FC, but we are
> > > definitely in the all ethernet/iscsi camp for multi-building storage
> > > pool growth and don't have interest in a FC-based SAN.
>
> Well, following Bill's advice and the previous note on disabling zil,
> I ran my test on a B38 opteron initiator and if you do a time on the
> copy from the client, 6250 8k files transfer at 6MB/sec now. If you
> watch the entire commit on the backend using "zpool iostat 1" I see
> that it takes a few more seconds, and the actual rate there is
> 4MB/sec. Beats my best of 1.25MB/sec, and this is not B41.

Joe, you know this, but for the benefit of others I have to highlight that running any NFS server this way may cause silent data corruption from the client's point of view.

Whenever a server keeps data in RAM this way and does not commit it to stable storage upon request from clients, that opens a time window for corruption. So a client writes to a page, then reads the same page, and if the server suffered a crash in between, the data may not match.

So this is performance at the expense of data integrity.

-r
On 6/23/06, Roch <Roch.Bourbonnais at sun.com> wrote:
> [...]
> I think the performance is in line with expectation for,
> small file, single threaded, open/write/close NFS
> workload (nfs must commit on close). Therefore I expect :
>
>     (avg file size) / (I/O latency).
>
> Joe does this formula approach the 1.25 MB/s ?

To this day, I still don't know how to calculate the i/o latency. Average file size is always expected to be close to kernel page size for NASes -- 4-8k. Always tune for that.

> [...]
> Joe, you know this but for the benefit of others, I have to
> highlight that running any NFS server this way, may cause
> silent data corruption from client's point of view.
>
> Whenever a server keeps data in RAM this way and does not
> commit it to stable storage upon request from clients, that
> opens a time window for corruption. So a client writes to a
> page, then reads the same page, and if the server suffered a
> crash in between, the data may not match.
>
> So this is performance at the expense of data integrity.
>
> -r

Yes.. ZFS in its normal mode has better data integrity. However, this may be a more ideal tradeoff if you have specific read/write patterns. In my case, I'm going to use ZFS initially for my tier2 storage, with nightly write periods (it needs to be a short-duration rsync from tier1) and mostly read periods throughout the rest of the day.

I'd love to use ZFS as a tier1 service as well, but then you'd have to perform as a NetApp does. Same tricks, same NVRAM or initial write to local stable storage before writing to backend storage. 6MB/sec is closer to expected behavior for first tier, at the expense of reliability. I don't know what the answer is for Sun to make ZFS 1st-tier quality with their NFS implementation and its sync happiness.
On 6/23/06, Roch <Roch.Bourbonnais at sun.com> wrote:
> [...]
> I think the performance is in line with expectation for,
> small file, single threaded, open/write/close NFS
> workload (nfs must commit on close). Therefore I expect :
>
>     (avg file size) / (I/O latency).
>
> Joe does this formula approach the 1.25 MB/s ?

Ok. I was given a dtrace script, and it would _appear_ that my average latency for the iscsi devices (regardless of zil on/off) is 387 ms. Using the above calculation with my artificial test of 8k files, that's still only around 20K/sec, which is what I saw in the RAIDZ case. With zil disabled, it's shoving 256k chunks down the pipe, or 677K/sec by your calculation. I'm seeing about 10x that in performance, so latency may not be so constant, or is most likely affected below the device layer by more aggressive ordering of data when I can write larger chunks and spend less time on committing smaller file sizes.

> [...]
Richard Elling
2006-Jun-23 16:39 UTC
[zfs-discuss] 15 minute fdsync problem and ZFS: Solved
Joe Little wrote:
> On 6/23/06, Roch <Roch.Bourbonnais at sun.com> wrote:
>> Joe, you know this but for the benefit of others, I have to
>> highlight that running any NFS server this way, may cause
>> silent data corruption from client's point of view.
>>
>> Whenever a server keeps data in RAM this way and does not
>> commit it to stable storage upon request from clients, that
>> opens a time window for corruption. So a client writes to a
>> page, then reads the same page, and if the server suffered a
>> crash in between, the data may not match.
>>
>> So this is performance at the expense of data integrity.

I agree; as a RAS guy this line of reasoning makes me nervous... I've never known anyone who regularly made this trade-off and didn't get burned.

> Yes.. ZFS in its normal mode has better data integrity. However, this
> may be a more ideal tradeoff if you have specific read/write patterns.

The only pattern this makes sense for is the write-only pattern. That pattern has near zero utility.

> In my case, I'm going to use ZFS initially for my tier2 storage, with
> nightly write periods (needs to be short duration rsync from tier1)
> and mostly read periods throughout the rest of the day. I'd love to
> use ZFS as a tier1 service as well, but then you'd have to perform as
> a NetApp does. Same tricks, same NVRAM or initial write to local
> stable storage before writing to backend storage. 6MB/sec is closer to
> expected behavior for first tier at the expense of reliability. I
> don't know what the answer is for Sun to make ZFS 1st Tier quality
> with their NFS implementation and its sync happiness.

I know the answer will not compromise data integrity.
 -- richard
Tao Chen
2006-Jun-23 16:56 UTC
Fwd: Re: [zfs-discuss] 15 minute fdsync problem and ZFS: Solved
I should copy this to the list.

---------- Forwarded message ----------

On 6/23/06, Joe Little <jmlittle at gmail.com> wrote:
> I can post back to Roch what this latency is. I think the latency is a
> constant regardless of the zil or not. All that I do by disabling the
> zil is that I'm able to submit larger chunks at a time (faster) than
> doing 1k or worse blocks 3 times per file (the NFS fsync penalty)

Please send the script (I attached a modified version) along with the result. They need to see how it works to trust (or dispute) the result. Rule #1 in performance tuning is do not trust the report from an unproven tool :)

I have some comments on the output below.

> This is for a bit longer (16 trees of 6250 8k files, again with zil
> disabled):
>
> Generating report from biorpt.sh.rec ...
>
> === Top 5 I/O types ==
>
> DEVICE    T  BLKs     COUNT
> --------  -  ----  --------
> sd2       W   256      3095
> sd1       W   256      2843
> sd1       W     2       201
> sd2       W     2       197
> sd1       W    32       185

This part tells me the majority of I/Os are 128KB writes on sd2 and sd1.

> === Top 5 worst I/O response time ==
>
> DEVICE    T  BLKs      OFFSET    TIMESTAMP  TIME.ms
> --------  -  ----  ----------  -----------  -------
> sd2       W   175   529070671    85.933843  3559.55
> sd1       W   256   521097680    47.561918  3097.21
> sd1       W   256   521151968    54.944253  3090.42
> sd1       W   256   521152224    54.944207  3090.23
> sd1       W    64   521152480    54.944241  3090.21

The longest response times are more than 3 seconds; ouch.

> === Top 5 Devices with largest number of I/Os ==
>
> DEVICE     READ  AVG.ms      MB    WRITE  AVG.ms      MB      IOs  SEEK
> -------  ------  ------  ------  -------  ------  ------  -------  ----
> sd1           6    0.34       0     4948  387.88     413     4954    0%
> sd2           6    0.25       0     4230  387.07     405     4236    0%
> cmdk0        23    8.11       0      152    0.84       0      175   10%

An average response time of > 300ms is bad. I calculate the "SEEK rate" on a 512-byte block basis; since the I/Os are mostly 128K, the seek rate is less than 1% ( 0 ) - in other words, I consider this mostly sequential I/O. I guess it's debatable whether a 512-byte-based calculation is meaningful.

> === Top 5 Devices with largest amount of data transfer ==
>
> DEVICE     READ  AVG.ms      MB    WRITE  AVG.ms      MB   Tol.MB  MB/s
> -------  ------  ------  ------  -------  ------  ------  -------  ----
> sd1           6    0.34       0     4948  387.88     413      413     4
> sd2           6    0.25       0     4230  387.07     405      405     4
> cmdk0        23    8.11       0      152    0.84       0        0     0
>
> ===> Report saved in biorpt.sh.rec.rpt <==

I calculate the MB/s on a per-second basis, meaning that as long as there is at least one finished I/O on the device in a second, that second is used in calculating throughput.

Tao
Richard Elling
2006-Jun-23 17:10 UTC
Fwd: Re: [zfs-discuss] 15 minute fdsync problem and ZFS: Solved
comment on analysis below...

Tao Chen wrote:
> [...]
>
> === Top 5 worst I/O response time ==
>
> DEVICE    T  BLKs      OFFSET    TIMESTAMP  TIME.ms
> --------  -  ----  ----------  -----------  -------
> sd2       W   175   529070671    85.933843  3559.55
> sd1       W   256   521097680    47.561918  3097.21
> sd1       W   256   521151968    54.944253  3090.42
> sd1       W   256   521152224    54.944207  3090.23
> sd1       W    64   521152480    54.944241  3090.21
>
> The longest response times are more than 3 seconds; ouch.

Very suspicious. I would check for a retrans, as 4 of these are suspiciously close to 3 seconds. You'll need to check retrans on both ends of the wire.

> === Top 5 Devices with largest number of I/Os ==
>
> DEVICE     READ  AVG.ms      MB    WRITE  AVG.ms      MB      IOs  SEEK
> -------  ------  ------  ------  -------  ------  ------  -------  ----
> sd1           6    0.34       0     4948  387.88     413     4954    0%
> sd2           6    0.25       0     4230  387.07     405     4236    0%
> cmdk0        23    8.11       0      152    0.84       0      175   10%
>
> An average response time of > 300ms is bad.

Average is totally useless with this sort of a distribution. I'd suggest using a statistical package to explore the distribution. Just a few 3-second latencies will skew the average quite a lot.
 -- richard
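For checking retransmissions on a Solaris box, something along these lines should show whether TCP (which carries both the iSCSI and the NFS traffic here) is retransmitting; the exact counter names can vary a little between releases, so treat this as a starting point rather than a definitive recipe:

    # TCP retransmission counters (run on both the NFS/iSCSI server and the other end)
    netstat -s -P tcp | egrep -i 'retrans'

    # watch whether the counters keep climbing while the test runs
    while true; do
            date
            netstat -s -P tcp | egrep -i 'retrans'
            sleep 10
    done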
Tao Chen
2006-Jun-23 17:29 UTC
Fwd: Re: [zfs-discuss] 15 minute fdsync problem and ZFS: Solved
On 6/23/06, Richard Elling <Richard.Elling at sun.com> wrote:
> comment on analysis below...
>
> Tao Chen wrote:
> > === Top 5 Devices with largest number of I/Os ==
> >
> > DEVICE     READ  AVG.ms      MB    WRITE  AVG.ms      MB      IOs  SEEK
> > -------  ------  ------  ------  -------  ------  ------  -------  ----
> > sd1           6    0.34       0     4948  387.88     413     4954    0%
> > sd2           6    0.25       0     4230  387.07     405     4236    0%
> > cmdk0        23    8.11       0      152    0.84       0      175   10%
> >
> > An average response time of > 300ms is bad.
>
> Average is totally useless with this sort of a distribution.
> I'd suggest using a statistical package to explore the distribution.
> Just a few 3-second latencies will skew the average quite a lot.
>  -- richard

A summary report is nothing more than an indication of issues, or non-issues. So I agree that an "average" is just that, an average. However, "a few 3-second latencies" will not spoil the result too much when there are more than 4000 I/Os sampled.

The script saves the "raw" data in a .rec file, so you can run whatever statistics tool you have against it. I am currently more worried about how accurate and useful the raw data is, which is generated from a DTrace command in the script. The "raw" record is in this format:

- Timestamp (sec.microsec)
- DeviceName
- W/R
- BLK_NO (offset)
- BLK_CNT (I/O size)
- IO_Time (I/O elapsed time, msec.xx)

Tao
On 6/23/06, Roch <Roch.Bourbonnais at sun.com> wrote:
> [...]
> I think the performance is in line with expectation for,
> small file, single threaded, open/write/close NFS
> workload (nfs must commit on close). Therefore I expect :
>
>     (avg file size) / (I/O latency).
>
> Joe does this formula approach the 1.25 MB/s ?

Joe sent me another set of DTrace output (biorpt.sh.rec.gz), running 105 seconds with zil_disable=1.

I generated a graph using Grace (rec.gif). The interesting parts for me:

1) How the I/O response time (at the bdev level) changes in a pattern.
2) Both iSCSI (sd2) and local (sd1) storage follow the same pattern and have almost identical latency on average.
3) The latency is very high, both on average and at peaks.

Although a low throughput is expected given the large number of small files, I don't expect such high latency, and of course 1.25MB/s is too low; even after turning on zil_disable, I see 4MB/s in this data set. I/O sizes at the bdev level are actually pretty decent: mostly (75%) 128KB.

Here's a summary:

# biorpt -i biorpt.sh.rec

Generating report from biorpt.sh.rec ...

=== Top 5 I/O types ==

DEVICE    T  BLKs     COUNT
--------  -  ----  --------
sd1       W   256      3122
sd2       W   256      3118
sd1       W     2       164
sd2       W     2       151
sd2       W     3       123

=== Top 5 worst I/O response time ==

DEVICE    T  BLKs      OFFSET    TIMESTAMP  TIME.ms
--------  -  ----  ----------  -----------  -------
sd1       W   256   529562656   104.322170  3316.90
sd1       W   256   529563424   104.322185  3281.97
sd2       W   256   521152480   104.262081  3262.49
sd2       W   256   521152736   104.262102  3258.56
sd1       W   256   529562912   104.262091  3249.85

=== Top 5 Devices with largest number of I/Os ==

DEVICE     READ  AVG.ms      MB    WRITE  AVG.ms      MB      IOs  SEEK
-------  ------  ------  ------  -------  ------  ------  -------  ----
sd1           7    2.70       0     4169  440.62     409     4176    0%
sd2           6    0.25       0     4131  444.79     407     4137    0%
cmdk0         5   21.50       0      138    0.82       0      143   11%

=== Top 5 Devices with largest amount of data transfer ==

DEVICE     READ  AVG.ms      MB    WRITE  AVG.ms      MB   Tol.MB  MB/s
-------  ------  ------  ------  -------  ------  ------  -------  ----
sd1           7    2.70       0     4169  440.62     409      409     4
sd2           6    0.25       0     4131  444.79     407      407     4
cmdk0         5   21.50       0      138    0.82       0        0     0

Tao
To clarify what has just been stated: with the zil disabled I got 4MB/sec; with the zil enabled I get 1.25MB/sec.

On 6/23/06, Tao Chen <nameistao at gmail.com> wrote:
> Joe sent me another set of DTrace output (biorpt.sh.rec.gz),
> running 105 seconds with zil_disable=1.
> [...]
So if you have a single thread doing open/write/close of 8K files and get 1.25MB/sec, that tells me you have something like a 6ms I/O latency, which looks reasonable also. What does iostat -x svc_t (client side) say?

400ms seems high for the workload _and_ doesn't match my formula, so I don't like it ;-) A quick look at your script seems fine, though; but something just does not compute here.

Why this formula (which applies to any NFS single-threaded client app working on small files): even if the open and write parts are infinitely fast, on close(2) NFS must ensure that the data is sent to disk. So at a minimum, every close(2) must wait one I/O latency. During that wait, the single-threaded client application will not initiate the following open/write/close sequence. At best you get one file output per I/O latency. The I/O latency is the one seen by the client and includes the network part, but that should be small compared to the physical I/O.

-r
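To make the arithmetic behind that estimate explicit, the lines below just plug the figures quoted in this thread into Roch's formula; the 2 ms entry is only the latency implied by the 4 MB/sec zil_disable result, not a measurement.

    throughput ~ (avg file size) / (per-file I/O latency)

    8 KB /   6 ms  ~ 1.3 MB/s   (matches the ~1.25 MB/sec Joe saw with the ZIL enabled)
    8 KB /   2 ms  ~ 4 MB/s     (roughly what the zil_disable run achieved)
    8 KB / 387 ms  ~ 20 KB/s    (why a 387 ms average cannot be the latency the
                                 client actually waits on per file)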