Martin, Marcia R
2006-Jun-21 20:14 UTC
[zfs-discuss] 15 minute fdsync problem and ZFS: Solved
Did I miss something on this thread? Was the root cause of the 15-minute fsync ever actually determined?

-----Original Message-----
From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-bounces at opensolaris.org] On Behalf Of eric kustarz
Sent: Wednesday, June 21, 2006 2:12 PM
To: Neil.Perrin at sun.com
Cc: zfs-discuss at opensolaris.org; Torrey McMahon
Subject: Re: [zfs-discuss] 15 minute fdsync problem and ZFS: Solved

Neil Perrin wrote:
> Robert Milkowski wrote On 06/21/06 11:09,:
>> Hello Neil,
>>
>>>> Why is this option available then? (Yes, that's a loaded question.)
>>
>> NP> I wouldn't call it an option, but an internal debugging switch that I
>> NP> originally added to allow progress when initially integrating the ZIL.
>> NP> As Roch says it really shouldn't ever be set (as it does negate POSIX
>> NP> synchronous semantics). Nor should it be mentioned to a customer.
>> NP> In fact I'm inclined to now remove it - however it does still have a use,
>> NP> as it helped root cause this problem.
>>
>> Isn't it similar to the unsupported fastfs for ufs?
>
> It is similar in the sense that it speeds up the file system.
> Using fastfs can be much more dangerous though, as it can lead to a
> badly corrupted file system because metadata writes are delayed and
> written out of order, whereas disabling the ZIL does not affect the
> integrity of the fs. The transaction group model of ZFS gives
> consistency in the event of a crash/power failure. However, any data that
> was promised to be on stable storage may not be unless the transaction
> group committed (an operation that is started every 5s).
>
> We once had plans to add a mount option to allow the admin to control
> the ZIL. Here's a brief section of the RFE (6280630):
>
>     sync={deferred,standard,forced}
>
>         Controls synchronous semantics for the dataset.
>
>         When set to 'standard' (the default), synchronous operations
>         such as fsync(3C) behave precisely as defined in fcntl.h(3HEAD).
>
>         When set to 'deferred', requests for synchronous semantics
>         are ignored. However, ZFS still guarantees that ordering
>         is preserved -- that is, consecutive operations reach stable
>         storage in order. (If a thread performs operation A followed
>         by operation B, then the moment that B reaches stable storage,
>         A is guaranteed to be on stable storage as well.) ZFS also
>         guarantees that all operations will be scheduled for write to
>         stable storage within a few seconds, so that an unexpected
>         power loss only takes the last few seconds of changes with it.
>
>         When set to 'forced', all operations become synchronous.
>         No operation will return until all previous operations
>         have been committed to stable storage. This option can be
>         useful if an application is found to depend on synchronous
>         semantics without actually requesting them; otherwise, it
>         will just make everything slow, and is not recommended.
>
> Of course we would need to stress the dangers of setting 'deferred'.
> What do you guys think?
>
> Neil.

Scares me, and it seems we should wait until people are demanding it and we *have* to do it (if that time ever comes) - that is, until we can't squeeze any more performance gain out of the 'standard' method.

If problems do occur because of 'deferred' mode, then once I wrap up zpool history we'll have the fact that they set it logged to disk.

eric
Some more info. Yesterday when I did a "zpool iostat 1" I got the following:

# here is sample of zpool iostat command
# canary      39.1G  14.9G    509      0  24.0M      0
# canary      39.1G  14.9G    477     41  20.3M   169K
# canary      39.1G  14.9G    496      1  22.7M  10.9K
# canary      39.1G  14.9G    477      0  21.6M   1013
# canary      39.1G  14.9G    489      0  20.8M   1012
# canary      39.1G  14.9G    467      0  20.1M   1015
# canary      39.1G  14.9G    478      0  21.7M   1013
# canary      39.1G  14.9G    454      0  22.9M   1014
# canary      39.1G  14.9G    458     26  18.3M   529K
# canary      39.1G  14.9G    384     71  15.9M   837K
# canary      39.1G  14.9G    439     21  18.5M   110K
# canary      39.1G  14.9G    510      2  21.7M  14.8K
# canary      39.1G  14.9G    466      0  20.9M  1.98K
# canary      39.1G  14.9G    492     39  20.2M   181K
# canary      39.1G  14.9G    489      7  20.5M  63.4K
# canary      39.1G  14.9G    500      0  21.1M  1.96K
# canary      39.1G  14.9G    499      1  22.5M  1.96K
# canary      39.1G  14.9G    523      0  23.8M   1012
# canary      39.1G  14.9G    519      0  21.9M   1013
# canary      39.1G  14.9G    480      0  20.0M    506
# canary      39.1G  14.9G    508      0  21.6M      0
# canary      39.1G  14.9G    488      0  20.5M      0

What seemed strange is the constant reads that were occurring. Today (after last night's reboot) we see this:

# zpool iostat 1
               capacity     operations    bandwidth
pool         used  avail   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
canary      40.0G  14.0G     96    189  5.30M  1.60M
canary      40.0G  14.0G      0      0      0      0
canary      40.0G  14.0G      9     60   761K   828K
canary      40.0G  14.0G     41    124  2.91M  1.55M
canary      40.0G  14.0G      3      0   253K      0
canary      40.0G  14.0G      8      0   570K      0
canary      40.0G  14.0G     11     25   760K   358K
canary      40.0G  14.0G     29    275  2.10M  3.70M
canary      40.0G  14.0G     82    273  5.45M  3.47M
canary      40.0G  14.0G     43    296  2.78M  3.90M
canary      40.0G  14.0G      9    297   634K  3.64M
canary      40.0G  14.0G     12    324   887K  4.24M
canary      40.0G  14.0G     15    271  1.18M  3.56M
canary      40.0G  14.0G      5    295   444K  3.75M
canary      40.0G  14.0G     14    319  1014K  4.14M
canary      40.0G  14.0G     10    321   950K  3.34M
canary      40.0G  14.0G     21    258  1.55M  2.46M
canary      40.0G  14.0G    349     29  23.3M  62.4K
canary      40.0G  14.0G    253     44  16.6M   541K
canary      40.0G  14.0G     48    324  3.03M  3.21M
canary      40.0G  14.0G     11    372   887K  3.74M
canary      40.0G  14.0G     35    268  2.29M  2.64M
canary      40.0G  14.0G     12    327   824K  3.20M
canary      40.0G  14.0G     21    298  1.42M  2.85M
canary      40.0G  14.0G     60    276  3.96M  2.07M
canary      40.0G  14.0G    203    126  13.0M   673K
canary      40.0G  14.0G     39    370  2.48M  1.84M
canary      40.0G  14.0G      6    559   444K  2.58M
canary      40.0G  14.0G      0    653      0  2.97M
canary      40.0G  14.0G      0    582  63.4K  2.43M
canary      40.0G  14.0G      0    697      0  3.30M
canary      40.0G  14.0G      8    528   570K  2.22M
canary      40.0G  14.0G      4    524   317K  2.48M
canary      40.0G  14.0G      0    656  63.4K  3.64M
canary      40.0G  14.0G      0    661      0  2.50M
canary      40.0G  14.0G      0    658      0  2.99M
canary      40.0G  14.0G      0    630      0  2.68M
canary      40.0G  14.0G      0    647      0  2.83M
canary      40.0G  14.0G      0    622      0  2.85M
canary      40.0G  14.0G      0    717      0  3.41M

With 110,000 file transfers a day (each 50kb tar file is untarred and produces 8 files), I did not expect the read rate seen yesterday. Could there have been something constantly reading the disk?

More info: when the problem was occurring, I ran this dtrace script (per Jim Fiori):

# cat prof.d
#!/usr/sbin/dtrace -s

BEGIN
{
        total = 0;
}

profile-1007
/arg0/
{
        @[stack(5)] = count();
        total++;
}

END
{
        printf("\nTOTAL %d\n", total);
        trunc(@, 30);
        printa(@);
}

Here was the output.
: genunix`anon_map_getpages+0x984 genunix`segvn_fault_anonpages+0x310 72 unix`cpu_halt+0x10 unix`idle+0x128 unix`thread_start+0x4 74 unix`init_mondo+0x20 unix`xt_some+0xbc unix`sfmmu_tlb_range_demap+0x170 unix`hat_unload_callback+0x7f0 unix`ppmapout+0x94 74 unix`idle+0xdc unix`thread_start+0x4 76 unix`shipit+0x28 unix`send_mondo_set+0x5c unix`xt_some+0xc4 unix`xt_sync+0x84 unix`sfmmu_chgattr+0x380 79 zfs`metaslab_pick_block+0x94 zfs`metaslab_getblock+0x3c zfs`metaslab_group_alloc+0xd0 zfs`metaslab_alloc+0x50 zfs`zio_dva_allocate+0x64 81 zfs`lzjb_decompress+0x98 zfs`zio_read_decompress+0x38 genunix`taskq_thread+0x1a4 unix`thread_start+0x4 84 zfs`mze_compare genunix`avl_find+0x2c zfs`mze_find+0x34 zfs`zap_lookup+0x70 zfs`zfs_dirent_lock+0x2bc 87 unix`disp_getwork+0xd8 unix`idle+0xd4 unix`thread_start+0x4 89 unix`disp_getwork+0x10c unix`idle+0xd4 unix`thread_start+0x4 95 unix`mutex_vector_enter+0x130 genunix`anon_dup_fill_holes+0x158 genunix`segvn_dup+0x3ac genunix`as_dup+0xd0 genunix`cfork+0x168 100 SUNW,UltraSPARC-T1`bcopy+0x2b4 unix`ppcopy+0x34 genunix`anon_map_privatepages+0x524 genunix`anon_map_getpages+0x984 genunix`segvn_fault_anonpages+0x310 100 zfs`lzjb_decompress+0x80 zfs`zio_read_decompress+0x38 genunix`taskq_thread+0x1a4 unix`thread_start+0x4 100 unix`disp_getwork+0xb8 unix`idle+0xd4 unix`thread_start+0x4 101 unix`mutex_vector_enter+0x144 genunix`anon_dup_fill_holes+0x158 genunix`segvn_dup+0x3ac genunix`as_dup+0xd0 genunix`cfork+0x168 102 zfs`metaslab_pick_block+0x9c zfs`metaslab_getblock+0x3c zfs`metaslab_group_alloc+0xd0 zfs`metaslab_alloc+0x50 zfs`zio_dva_allocate+0x64 103 zfs`lzjb_decompress+0x78 zfs`zio_read_decompress+0x38 genunix`taskq_thread+0x1a4 unix`thread_start+0x4 106 unix`shipit+0x28 unix`send_mondo_set+0x5c unix`xt_some+0xc4 unix`sfmmu_tlb_demap+0x100 unix`sfmmu_hblk_chgattr+0x2f8 112 unix`disp_getwork+0x8 unix`idle+0xd4 unix`thread_start+0x4 113 genunix`avl_walk+0xc zfs`metaslab_pick_block+0x98 zfs`metaslab_getblock+0x3c zfs`metaslab_group_alloc+0xd0 zfs`metaslab_alloc+0x50 116 zfs`metaslab_pick_block+0x80 zfs`metaslab_getblock+0x3c zfs`metaslab_group_alloc+0xd0 zfs`metaslab_alloc+0x50 zfs`zio_write_allocate_gang_members+0xf0 119 genunix`avl_walk+0x90 zfs`metaslab_pick_block+0x98 zfs`metaslab_getblock+0x3c zfs`metaslab_group_alloc+0xd0 zfs`metaslab_alloc+0x50 120 genunix`avl_walk+0x84 zfs`metaslab_pick_block+0x98 zfs`metaslab_getblock+0x3c zfs`metaslab_group_alloc+0xd0 zfs`metaslab_alloc+0x50 122 genunix`avl_walk+0x4 zfs`metaslab_pick_block+0x98 zfs`metaslab_getblock+0x3c zfs`metaslab_group_alloc+0xd0 zfs`metaslab_alloc+0x50 124 genunix`avl_walk+0xa4 zfs`metaslab_pick_block+0x98 zfs`metaslab_getblock+0x3c zfs`metaslab_group_alloc+0xd0 zfs`metaslab_alloc+0x50 124 unix`idle+0xec unix`thread_start+0x4 127 unix`page_exists+0x7c unix`free_vp_pages+0x48 unix`hat_unload_callback+0x854 genunix`segvn_unmap+0x1b8 genunix`as_free+0xf4 131 unix`idle+0x78 unix`thread_start+0x4 132 SUNW,UltraSPARC-T1`bcopy+0x2bc unix`ppcopy+0x34 genunix`anon_map_privatepages+0x524 genunix`anon_map_getpages+0x984 genunix`segvn_fault_anonpages+0x310 138 zfs`lzjb_decompress+0x88 zfs`zio_read_decompress+0x38 genunix`taskq_thread+0x1a4 unix`thread_start+0x4 138 genunix`avl_walk+0x2c zfs`metaslab_pick_block+0x98 zfs`metaslab_getblock+0x3c zfs`metaslab_group_alloc+0xd0 zfs`metaslab_alloc+0x50 139 SUNW,UltraSPARC-T1`bcopy+0x2c8 unix`ppcopy+0x34 genunix`anon_map_privatepages+0x524 genunix`anon_map_getpages+0x984 genunix`segvn_fault_anonpages+0x310 139 SUNW,UltraSPARC-T1`bcopy+0x2b8 unix`ppcopy+0x34 
genunix`anon_map_privatepages+0x524 genunix`anon_map_getpages+0x984 genunix`segvn_fault_anonpages+0x310 143 unix`mutex_vector_enter+0x130 genunix`anon_map_privatepages+0xf8 genunix`anon_map_getpages+0x984 genunix`segvn_fault_anonpages+0x310 genunix`segvn_fault+0x438 150 SUNW,UltraSPARC-T1`bcopy+0x2cc unix`ppcopy+0x34 genunix`anon_map_privatepages+0x524 genunix`anon_map_getpages+0x984 genunix`segvn_fault_anonpages+0x310 151 unix`cpu_halt+0xd8 unix`idle+0x128 unix`thread_start+0x4 161 genunix`kmem_cache_kstat_update+0x1fc kstat`read_kstat_data+0x14c genunix`fop_ioctl+0x20 genunix`ioctl+0x184 unix`syscall_trap32+0xcc 164 unix`atomic_and_64+0x4 unix`idle+0x128 unix`thread_start+0x4 174 unix`mutex_vector_enter+0x144 genunix`anon_map_privatepages+0xf8 genunix`anon_map_getpages+0x984 genunix`segvn_fault_anonpages+0x310 genunix`segvn_fault+0x438 184 unix`idle+0x9c unix`thread_start+0x4 187 unix`mutex_vector_enter+0x134 genunix`anon_dup_fill_holes+0x158 genunix`segvn_dup+0x3ac genunix`as_dup+0xd0 genunix`cfork+0x168 191 genunix`avl_walk+0xa0 zfs`metaslab_pick_block+0x98 zfs`metaslab_getblock+0x3c zfs`metaslab_group_alloc+0xd0 zfs`metaslab_alloc+0x50 191 unix`mutex_vector_enter+0x148 genunix`anon_dup_fill_holes+0x158 genunix`segvn_dup+0x3ac genunix`as_dup+0xd0 genunix`cfork+0x168 199 zfs`metaslab_pick_block+0xac zfs`metaslab_getblock+0x3c zfs`metaslab_group_alloc+0xd0 zfs`metaslab_alloc+0x50 zfs`zio_dva_allocate+0x64 206 unix`disp_getwork+0xd0 unix`idle+0xd4 unix`thread_start+0x4 210 unix`disp_getwork+0xf0 unix`idle+0xd4 unix`thread_start+0x4 220 unix`idle+0x128 unix`thread_start+0x4 244 unix`disp_getwork+0x9c unix`idle+0xd4 unix`thread_start+0x4 247 unix`mutex_vector_enter+0x130 genunix`anon_decref_pages+0x78 genunix`anon_free_pages+0x80 genunix`segvn_free+0xb0 genunix`seg_free+0x2c 260 unix`disp_getwork+0x80 unix`idle+0xd4 unix`thread_start+0x4 278 unix`disp_getwork+0x78 unix`idle+0xd4 unix`thread_start+0x4 281 unix`mutex_vector_enter+0x144 genunix`anon_decref_pages+0x78 genunix`anon_free_pages+0x80 genunix`segvn_free+0xb0 genunix`seg_free+0x2c 286 unix`disp_getwork+0xe8 unix`idle+0xd4 unix`thread_start+0x4 293 unix`disp_getwork+0x94 unix`idle+0xd4 unix`thread_start+0x4 295 zfs`metaslab_pick_block+0x7c zfs`metaslab_getblock+0x3c zfs`metaslab_group_alloc+0xd0 zfs`metaslab_alloc+0x50 zfs`zio_write_allocate_gang_members+0xf0 336 unix`atomic_and_64+0xc unix`idle+0x128 unix`thread_start+0x4 357 unix`disp_getwork+0xa0 unix`idle+0xd4 unix`thread_start+0x4 357 unix`mutex_vector_enter+0x134 genunix`anon_map_privatepages+0xf8 genunix`anon_map_getpages+0x984 genunix`segvn_fault_anonpages+0x310 genunix`segvn_fault+0x438 358 unix`mutex_vector_enter+0x148 genunix`anon_map_privatepages+0xf8 genunix`anon_map_getpages+0x984 genunix`segvn_fault_anonpages+0x310 genunix`segvn_fault+0x438 390 unix`utl0+0x4c 395 genunix`avl_walk+0x24 zfs`metaslab_pick_block+0x98 zfs`metaslab_getblock+0x3c zfs`metaslab_group_alloc+0xd0 zfs`metaslab_alloc+0x50 458 unix`disp_getwork+0x70 unix`idle+0xd4 unix`thread_start+0x4 484 unix`disp_getwork+0xf4 unix`idle+0xd4 unix`thread_start+0x4 487 unix`disp_getwork+0xb0 unix`idle+0xd4 unix`thread_start+0x4 496 unix`mutex_vector_enter+0x148 genunix`anon_decref_pages+0x78 genunix`anon_free_pages+0x80 genunix`segvn_free+0xb0 genunix`seg_free+0x2c 555 unix`disp_getwork+0x98 unix`idle+0xd4 unix`thread_start+0x4 562 unix`mutex_vector_enter+0x134 genunix`anon_decref_pages+0x78 genunix`anon_free_pages+0x80 genunix`segvn_free+0xb0 genunix`seg_free+0x2c 567 unix`disp_getwork+0x84 unix`idle+0xd4 
unix`thread_start+0x4 569 zfs`metaslab_pick_block+0x80 zfs`metaslab_getblock+0x3c zfs`metaslab_group_alloc+0xd0 zfs`metaslab_alloc+0x50 zfs`zio_dva_allocate+0x64 577 unix`disp_getwork+0x90 unix`idle+0xd4 unix`thread_start+0x4 600 unix`disp_getwork+0x7c unix`idle+0xd4 unix`thread_start+0x4 618 zfs`metaslab_pick_block+0x7c zfs`metaslab_getblock+0x3c zfs`metaslab_group_alloc+0xd0 zfs`metaslab_alloc+0x50 zfs`zio_dva_allocate+0x64 1338 unix`i_ddi_splx+0x1c unix`disp_getwork+0x160 unix`idle+0xd4 unix`thread_start+0x4 1564 genunix`avl_walk+0x64 zfs`metaslab_pick_block+0x98 zfs`metaslab_getblock+0x3c zfs`metaslab_group_alloc+0xd0 zfs`metaslab_alloc+0x50 1629 genunix`avl_walk+0x48 zfs`metaslab_pick_block+0x98 zfs`metaslab_getblock+0x3c zfs`metaslab_group_alloc+0xd0 zfs`metaslab_alloc+0x50 1639 genunix`avl_walk+0x60 zfs`metaslab_pick_block+0x98 zfs`metaslab_getblock+0x3c zfs`metaslab_group_alloc+0xd0 zfs`metaslab_alloc+0x50 2458 genunix`avl_walk+0x44 zfs`metaslab_pick_block+0x98 zfs`metaslab_getblock+0x3c zfs`metaslab_group_alloc+0xd0 zfs`metaslab_alloc+0x50 2479 unix`shipit+0x28 unix`send_mondo_set+0x5c unix`xt_some+0xc4 unix`sfmmu_tlb_range_demap+0x170 unix`hat_unload_callback+0x7f0 2585 unix`shipit+0x28 unix`send_mondo_set+0x5c unix`xt_some+0xc4 unix`xt_sync+0x84 unix`hat_unload_callback+0x808 2587 unix`disp_getwork+0xec unix`idle+0xd4 unix`thread_start+0x4 2933 unix`disp_getwork+0x8c unix`idle+0xd4 unix`thread_start+0x4 3122 unix`disp_getwork+0x88 unix`idle+0xd4 unix`thread_start+0x4 7751 unix`cpu_halt+0xc8 unix`idle+0x128 unix`thread_start+0x4 554827

Lastly, the reboot last night (with the /etc/system change) has ended the high fdsync issue. Since no one has said they know what caused the issue, I am not confident that it will not come back. The only suggestion made in the last week to fix this was to change the /etc/system value. This is a production box that hosts data from around the world (750 Sun Ray servers, 180 campuses); we needed to get the box back to normal service levels.

Can someone suggest a little program that we could run periodically to test fdsync? We could set an alarm to send an email if it degrades again.

Sean

Roch wrote On 06/21/06 07:31,:
> 15 minutes to do a fdsync is way outside the slowdown usually seen.
> The footprint for 6413510 is that when a huge amount of
> data is being written non-synchronously and a fsync comes in for the
> same filesystem, then all the non-synchronous data is also forced out
> synchronously. So is there a lot of data being written during the vi?
>
> vi will write the whole file in 4K chunks and fsync it
> (based on a single experiment).

Neil Perrin wrote:
Sean kindly gave me access to the system and so far I have reproduced the problem. It just requires a fsync on a file with 1 byte, and for me it takes 10 minutes to fsync! I have run a few D scripts but have yet to make much more progress. I do see that the zfs version is fairly old, so we may be chasing an old bug, or perhaps this really is an extreme version of 6413510, as there is often 3MB of data being collected and written to the pool, and probably to the same fs.

Neil

--
<http://www.sun.com>
Sean Meighan
Mgr ITSM Engineering
Sun Microsystems, Inc.
US Phone x32329 / +1 408 850-9537
Mobile 303-520-2024
Fax 408 850-9537
Email Sean.Meighan at Sun.COM

------------------------------------------------------------------------
NOTICE: This email message is for the sole use of the intended recipient(s) and may contain confidential and privileged information.
Any unauthorized review, use, disclosure or distribution is prohibited. If you are not the intended recipient, please contact the sender by reply email and destroy all copies of the original message.
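Sean asks above for a little program to run periodically to test fdsync. A minimal sketch of such a tester is below; it is not from the original thread, the file path and the one-second alarm threshold are arbitrary placeholders, and it only times a single synchronous write, but it should compile with cc or gcc and could be run from cron with an alert sent on a non-zero exit status.

/* fsynctest.c - time a single fsync() on a small file.
 * Illustrative sketch only; the path and threshold are placeholders.
 * Build: cc -o fsynctest fsynctest.c
 */
#include <stdio.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/time.h>

int
main(int argc, char **argv)
{
        const char *path = (argc > 1) ? argv[1] : "/tank/fsynctest.tmp";
        char buf[8192];
        struct timeval t0, t1;
        double ms;
        int fd;

        memset(buf, 'x', sizeof (buf));

        fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0600);
        if (fd < 0) {
                perror("open");
                return (1);
        }
        if (write(fd, buf, sizeof (buf)) != sizeof (buf)) {
                perror("write");
                return (1);
        }

        gettimeofday(&t0, NULL);
        if (fsync(fd) != 0) {   /* the call that was taking minutes */
                perror("fsync");
                return (1);
        }
        gettimeofday(&t1, NULL);

        ms = (t1.tv_sec - t0.tv_sec) * 1000.0 +
            (t1.tv_usec - t0.tv_usec) / 1000.0;
        printf("fsync of %d bytes took %.2f ms\n", (int)sizeof (buf), ms);

        (void) close(fd);
        (void) unlink(path);

        /* non-zero exit if fsync took over 1 second; handy for cron alerting */
        return (ms > 1000.0 ? 2 : 0);
}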
Martin, Marcia R writes:
> Did I miss something on this thread? Was the root cause of the
> 15-minute fsync ever actually determined?

I think so ;-)

-r
> a test against the same iscsi targets using linux and XFS and the
> NFS server implementation there gave me 1.25MB/sec writes. I was about
> to throw in the towel and deem ZFS/NFS as unusable until B41 came
> along and at least gave me 1.25MB/sec.

That's still super slow -- is this over a 10Mb link or something?

Jeff
On 6/22/06, Jeff Bonwick <bonwick at zion.eng.sun.com> wrote:
> > a test against the same iscsi targets using linux and XFS and the
> > NFS server implementation there gave me 1.25MB/sec writes. I was about
> > to throw in the towel and deem ZFS/NFS as unusable until B41 came
> > along and at least gave me 1.25MB/sec.
>
> That's still super slow -- is this over a 10Mb link or something?
>
> Jeff

Nope, a gig-e link (single e1000g, or aggregate, doesn't matter) to the iscsi target, and a single gig-e link (nge) to the NFS clients, who are gig-e. Sun Ultra20 or AMD quad Opteron, again with no difference.

Again, the issue is the multiple fsyncs that NFS requires, and likely the serialization of those iscsi requests. Apparently there is a basic latency in iscsi that one could improve upon with FC, but we are definitely in the all-ethernet/iscsi camp for multi-building storage pool growth and don't have interest in an FC-based SAN.
Nicolas Williams
2006-Jun-22 23:48 UTC
[zfs-discuss] 15 minute fdsync problem and ZFS: Solved
On Thu, Jun 22, 2006 at 04:22:22PM -0700, Joe Little wrote:
> Again, the issue is the multiple fsyncs that NFS requires, and likely
> the serialization of those iscsi requests. Apparently, there is a
> basic latency in iscsi that one could improve upon with FC, but we are
> definitely in the all ethernet/iscsi camp for multi-building storage
> pool growth and don't have interest in a FC-based SAN.

There may be two things going on here.

One is the fsync()s. So many fsync()s, required by the semantics of NFS, serialize I/O too much (which, with a client doing single-threaded things, kills), and, looking at Roch's Dynamics of ZFS blog entry, perhaps there's some aliasing of fsync() pressure as memory pressure/disk I/O saturation.

From Roch's blog entry:

    CAVEAT: The current state of the ZIL is such that if there is a lot
    of pending data in a Filesystem (written to the FS, not yet output
    to disk) and a process issues an fsync() for one of its files, then
    all pending operations will have to be sent to disk before the
    synchronous command can complete. This can lead to unexpected
    performance characteristics. Code is under review.

Since NFS is doing so many fsyncs, this, I reason, can force many writes into txg+2, thus triggering throttling of the NFS server threads.

Nico
--
On 6/22/06, Bill Moore <Bill.Moore at sun.com> wrote:
> Hey Joe. We're working on some ZFS changes in this area, and if you
> could run an experiment for us, that would be great. Just do this:
>
>         echo 'zil_disable/W1' | mdb -kw
>
> We're working on some fixes to the ZIL so it won't be a bottleneck when
> fsyncs come around. The above command will let us know what kind of
> improvement is on the table. After our fixes you could get from 30-80%
> of that improvement, but this would be a good data point. This change
> makes ZFS ignore the iSCSI/NFS fsync requests, but we still push out a
> txg every 5 seconds. So at most, your disk will be 5 seconds out of
> date compared to what it should be. It's a pretty small window, but it
> all depends on your appetite for such windows. :)
>
> After running the above command, you'll need to unmount/mount the
> filesystem in order for the change to take effect.
>
> If you don't have time, no big deal.
>
> --Bill
>
> On Thu, Jun 22, 2006 at 04:22:22PM -0700, Joe Little wrote:
> > On 6/22/06, Jeff Bonwick <bonwick at zion.eng.sun.com> wrote:
> > >> a test against the same iscsi targets using linux and XFS and the
> > >> NFS server implementation there gave me 1.25MB/sec writes. I was about
> > >> to throw in the towel and deem ZFS/NFS as unusable until B41 came
> > >> along and at least gave me 1.25MB/sec.
> > >
> > > That's still super slow -- is this over a 10Mb link or something?
> > >
> > > Jeff
> >
> > Nope, gig-e link (single e1000g, or aggregate, doesn't matter) to the
> > iscsi target, and single gig-e link (nge) to the NFS clients, who are
> > gig-e. Sun Ultra20 or AMD Quad Opteron, again with no difference.
> >
> > Again, the issue is the multiple fsyncs that NFS requires, and likely
> > the serialization of those iscsi requests. Apparently, there is a
> > basic latency in iscsi that one could improve upon with FC, but we are
> > definitely in the all ethernet/iscsi camp for multi-building storage
> > pool growth and don't have interest in a FC-based SAN.

Well, following Bill's advice and the previous note on disabling the zil, I ran my test on a B38 Opteron initiator, and if you do a time on the copy from the client, 6250 8k files transfer at 6MB/sec now. If you watch the entire commit on the backend using "zpool iostat 1", I see that it takes a few more seconds, and the actual rate there is 4MB/sec. Beats my best of 1.25MB/sec, and this is not B41.
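For anyone else repeating Bill's experiment, it helps to be able to confirm the current value and to back the change out again. Roughly, and assuming the standard mdb read/write syntax on your build (verify the variable name on yours; the dataset name below is a placeholder):

    # read the current value (0 = ZIL enabled, 1 = disabled)
    echo 'zil_disable/X' | mdb -k

    # disable the ZIL for the experiment, then remount the filesystem
    echo 'zil_disable/W1' | mdb -kw
    zfs umount pool/fs && zfs mount pool/fs

    # put it back when the test is done
    echo 'zil_disable/W0' | mdb -kw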
Joe Little writes:
> On 6/22/06, Bill Moore <Bill.Moore at sun.com> wrote:
> > Hey Joe. We're working on some ZFS changes in this area, and if you
> > could run an experiment for us, that would be great. Just do this:
> >
> >         echo 'zil_disable/W1' | mdb -kw
> >
> > We're working on some fixes to the ZIL so it won't be a bottleneck when
> > fsyncs come around. The above command will let us know what kind of
> > improvement is on the table. After our fixes you could get from 30-80%
> > of that improvement, but this would be a good data point. This change
> > makes ZFS ignore the iSCSI/NFS fsync requests, but we still push out a
> > txg every 5 seconds. So at most, your disk will be 5 seconds out of
> > date compared to what it should be. It's a pretty small window, but it
> > all depends on your appetite for such windows. :)
> >
> > After running the above command, you'll need to unmount/mount the
> > filesystem in order for the change to take effect.
> >
> > If you don't have time, no big deal.
> >
> > --Bill
> >
> > On Thu, Jun 22, 2006 at 04:22:22PM -0700, Joe Little wrote:
> > > On 6/22/06, Jeff Bonwick <bonwick at zion.eng.sun.com> wrote:
> > > >> a test against the same iscsi targets using linux and XFS and the
> > > >> NFS server implementation there gave me 1.25MB/sec writes. I was about
> > > >> to throw in the towel and deem ZFS/NFS as unusable until B41 came
> > > >> along and at least gave me 1.25MB/sec.
> > > >
> > > > That's still super slow -- is this over a 10Mb link or something?
> > > >
> > > > Jeff

I think the performance is in line with expectation for a small-file, single-threaded, open/write/close NFS workload (nfs must commit on close). Therefore I expect:

    (avg file size) / (I/O latency)

Joe, does this formula approach the 1.25 MB/s?

> > > Nope, gig-e link (single e1000g, or aggregate, doesn't matter) to the
> > > iscsi target, and single gig-e link (nge) to the NFS clients, who are
> > > gig-e. Sun Ultra20 or AMD Quad Opteron, again with no difference.
> > >
> > > Again, the issue is the multiple fsyncs that NFS requires, and likely
> > > the serialization of those iscsi requests. Apparently, there is a
> > > basic latency in iscsi that one could improve upon with FC, but we are
> > > definitely in the all ethernet/iscsi camp for multi-building storage
> > > pool growth and don't have interest in a FC-based SAN.
>
> Well, following Bill's advice and the previous note on disabling zil,
> I ran my test on a B38 opteron initiator and if you do a time on the
> copy from the client, 6250 8k files transfer at 6MB/sec now. If you
> watch the entire commit on the backend using "zpool iostat 1" I see
> that it takes a few more seconds, and the actual rate there is
> 4MB/sec. Beats my best of 1.25MB/sec, and this is not B41.

Joe, you know this, but for the benefit of others I have to highlight that running any NFS server this way may cause silent data corruption from the client's point of view.

Whenever a server keeps data in RAM this way and does not commit it to stable storage upon request from clients, that opens a time window for corruption. So a client writes to a page, then reads the same page, and if the server suffered a crash in between, the data may not match.

So this is performance at the expense of data integrity.

-r
On 6/23/06, Roch <Roch.Bourbonnais at sun.com> wrote:
> [...]
> I think the performance is in line with expectation for,
> small file, single threaded, open/write/close NFS
> workload (nfs must commit on close). Therefore I expect :
>
>     (avg file size) / (I/O latency).
>
> Joe does this formula approach the 1.25 MB/s ?

To this day, I still don't know how to calculate the i/o latency. Average file size is always expected to be close to kernel page size for NASes -- 4-8k. Always tune for that.

> [...]
> Joe, you know this but for the benefit of others, I have to
> highlight that running any NFS server this way, may cause
> silent data corruption from client's point of view.
>
> Whenever a server keeps data in RAM this way and does not
> commit it to stable storage upon request from clients, that
> opens a time window for corruption. So a client writes to a
> page, then reads the same page, and if the server suffered a
> crash in between, the data may not match.
>
> So this is performance at the expense of data integrity.
>
> -r

Yes.. ZFS in its normal mode has better data integrity. However, this may be a more ideal tradeoff if you have specific read/write patterns. In my case, I'm going to use ZFS initially for my tier2 storage, with nightly write periods (it needs to be a short-duration rsync from tier1) and mostly read periods throughout the rest of the day.

I'd love to use ZFS as a tier1 service as well, but then you'd have to perform as a NetApp does. Same tricks, same NVRAM or initial write to local stable storage before writing to backend storage. 6MB/sec is closer to expected behavior for first tier, at the expense of reliability. I don't know what the answer is for Sun to make ZFS 1st-tier quality with their NFS implementation and its sync happiness.
On 6/23/06, Roch <Roch.Bourbonnais at sun.com> wrote:
> [...]
> I think the performance is in line with expectation for,
> small file, single threaded, open/write/close NFS
> workload (nfs must commit on close). Therefore I expect :
>
>     (avg file size) / (I/O latency).
>
> Joe does this formula approach the 1.25 MB/s ?

Ok. I was given a dtrace script, and it would _appear_ that my average latency for the iscsi devices (regardless of zil on/off) is 387 ms. Using the above calculation with my artificial test of 8k files, that's still only around 20K/sec, which is what I saw in the RAIDZ case. With zil disabled, it's shoving 256k chunks down the pipe, or 677K/sec by your calculation. I'm seeing about 10x that in performance, so latency may not be so constant, or is most likely affected below the device layer by more aggressive ordering of data when I can write larger chunks and spend less time on committing smaller file sizes.

> [...]
Richard Elling
2006-Jun-23 16:39 UTC
[zfs-discuss] 15 minute fdsync problem and ZFS: Solved
Joe Little wrote:
> On 6/23/06, Roch <Roch.Bourbonnais at sun.com> wrote:
>> Joe, you know this but for the benefit of others, I have to
>> highlight that running any NFS server this way, may cause
>> silent data corruption from client's point of view.
>>
>> Whenever a server keeps data in RAM this way and does not
>> commit it to stable storage upon request from clients, that
>> opens a time window for corruption. So a client writes to a
>> page, then reads the same page, and if the server suffered a
>> crash in between, the data may not match.
>>
>> So this is performance at the expense of data integrity.

I agree; as a RAS guy this line of reasoning makes me nervous... I've never known anyone who regularly made this trade-off and didn't get burned.

> Yes.. ZFS in its normal mode has better data integrity. However, this
> may be a more ideal tradeoff if you have specific read/write patterns.

The only pattern this makes sense for is the write-only pattern. That pattern has near zero utility.

> In my case, I'm going to use ZFS initially for my tier2 storage, with
> nightly write periods (needs to be short duration rsync from tier1)
> and mostly read periods throughout the rest of the day. I'd love to
> use ZFS as a tier1 service as well, but then you'd have to perform as
> a NetApp does. Same tricks, same NVRAM or initial write to local
> stable storage before writing to backend storage. 6MB/sec is closer to
> expected behavior for first tier at the expense of reliability. I
> don't know what the answer is for Sun to make ZFS 1st Tier quality
> with their NFS implementation and its sync happiness.

I know the answer will not compromise data integrity.
 -- richard
Tao Chen
2006-Jun-23 16:56 UTC
Fwd: Re: [zfs-discuss] 15 minute fdsync problem and ZFS: Solved
I should copy this to the list.

---------- Forwarded message ----------

On 6/23/06, Joe Little <jmlittle at gmail.com> wrote:
> I can post back to Roch what this latency is. I think the latency is a
> constant regardless of the zil or not. All that I do by disabling the
> zil is that I'm able to submit larger chunks at a time (faster) than
> doing 1k or worse blocks 3 times per file (the NFS fsync penalty)

Please send the script (I attached a modified version) along with the result. They need to see how it works to trust (or dispute) the result. Rule #1 in performance tuning is do not trust the report from an unproven tool :)

I have some comments on the output below.

> This is for a bit longer (16 trees of 6250 8k files, again with zil
> disabled):
>
> Generating report from biorpt.sh.rec ...
>
> === Top 5 I/O types ==
>
> DEVICE    T  BLKs     COUNT
> --------  -  ----  --------
> sd2       W   256      3095
> sd1       W   256      2843
> sd1       W     2       201
> sd2       W     2       197
> sd1       W    32       185

This part tells me the majority of I/Os are 128KB writes on sd2 and sd1.

> === Top 5 worst I/O response time ==
>
> DEVICE    T  BLKs      OFFSET    TIMESTAMP  TIME.ms
> --------  -  ----  ----------  -----------  -------
> sd2       W   175   529070671    85.933843  3559.55
> sd1       W   256   521097680    47.561918  3097.21
> sd1       W   256   521151968    54.944253  3090.42
> sd1       W   256   521152224    54.944207  3090.23
> sd1       W    64   521152480    54.944241  3090.21

The longest response times are more than 3 seconds; ouch.

> === Top 5 Devices with largest number of I/Os ==
>
> DEVICE     READ  AVG.ms      MB    WRITE  AVG.ms      MB      IOs  SEEK
> -------  ------  ------  ------  -------  ------  ------  -------  ----
> sd1           6    0.34       0     4948  387.88     413     4954    0%
> sd2           6    0.25       0     4230  387.07     405     4236    0%
> cmdk0        23    8.11       0      152    0.84       0      175   10%

An average response time of > 300ms is bad. I calculate the "SEEK rate" on a 512-byte block basis; since the I/Os are mostly 128K, the seek rate is less than 1% ( 0 ) - in other words, I consider this mostly sequential I/O. I guess it's debatable whether a 512-byte-based calculation is meaningful.

> === Top 5 Devices with largest amount of data transfer ==
>
> DEVICE     READ  AVG.ms      MB    WRITE  AVG.ms      MB   Tol.MB  MB/s
> -------  ------  ------  ------  -------  ------  ------  -------  ----
> sd1           6    0.34       0     4948  387.88     413      413     4
> sd2           6    0.25       0     4230  387.07     405      405     4
> cmdk0        23    8.11       0      152    0.84       0        0     0
>
> ===> Report saved in biorpt.sh.rec.rpt <==

I calculate the MB/s on a per-second basis, meaning that as long as there is at least one finished I/O on the device in a second, that second is used in calculating throughput.

Tao
Richard Elling
2006-Jun-23 17:10 UTC
Fwd: Re: [zfs-discuss] 15 minute fdsync problem and ZFS: Solved
comment on analysis below...

Tao Chen wrote:
> [...]
>
> === Top 5 worst I/O response time ==
>
> DEVICE    T  BLKs      OFFSET    TIMESTAMP  TIME.ms
> --------  -  ----  ----------  -----------  -------
> sd2       W   175   529070671    85.933843  3559.55
> sd1       W   256   521097680    47.561918  3097.21
> sd1       W   256   521151968    54.944253  3090.42
> sd1       W   256   521152224    54.944207  3090.23
> sd1       W    64   521152480    54.944241  3090.21
>
> The longest response times are more than 3 seconds; ouch.

Very suspicious. I would check for a retrans, as 4 of these are suspiciously close to 3 seconds. You'll need to check retrans on both ends of the wire.

> === Top 5 Devices with largest number of I/Os ==
>
> DEVICE     READ  AVG.ms      MB    WRITE  AVG.ms      MB      IOs  SEEK
> -------  ------  ------  ------  -------  ------  ------  -------  ----
> sd1           6    0.34       0     4948  387.88     413     4954    0%
> sd2           6    0.25       0     4230  387.07     405     4236    0%
> cmdk0        23    8.11       0      152    0.84       0      175   10%
>
> An average response time of > 300ms is bad.

Average is totally useless with this sort of a distribution. I'd suggest using a statistical package to explore the distribution. Just a few 3-second latencies will skew the average quite a lot.
 -- richard
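For checking retransmissions on a Solaris box, something along these lines should show whether TCP (which carries both the iSCSI and the NFS traffic here) is retransmitting; the exact counter names can vary a little between releases, so treat this as a starting point rather than a definitive recipe:

    # TCP retransmission counters (run on both the NFS/iSCSI server and the other end)
    netstat -s -P tcp | egrep -i 'retrans'

    # watch whether the counters keep climbing while the test runs
    while true; do
            date
            netstat -s -P tcp | egrep -i 'retrans'
            sleep 10
    done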
Tao Chen
2006-Jun-23 17:29 UTC
Fwd: Re: [zfs-discuss] 15 minute fdsync problem and ZFS: Solved
On 6/23/06, Richard Elling <Richard.Elling at sun.com> wrote:
> comment on analysis below...
>
> Tao Chen wrote:
> > === Top 5 Devices with largest number of I/Os ==
> >
> > DEVICE     READ  AVG.ms      MB    WRITE  AVG.ms      MB      IOs  SEEK
> > -------  ------  ------  ------  -------  ------  ------  -------  ----
> > sd1           6    0.34       0     4948  387.88     413     4954    0%
> > sd2           6    0.25       0     4230  387.07     405     4236    0%
> > cmdk0        23    8.11       0      152    0.84       0      175   10%
> >
> > An average response time of > 300ms is bad.
>
> Average is totally useless with this sort of a distribution.
> I'd suggest using a statistical package to explore the distribution.
> Just a few 3-second latencies will skew the average quite a lot.
>  -- richard

A summary report is nothing more than an indication of issues, or non-issues. So I agree that an "average" is just that, an average. However, "a few 3-second latencies" will not spoil the result too much when there are more than 4000 I/Os sampled.

The script saves the "raw" data in a .rec file, so you can run whatever statistics tool you have against it. I am currently more worried about how accurate and useful the raw data is, which is generated from a DTrace command in the script. The "raw" record is in this format:

- Timestamp (sec.microsec)
- DeviceName
- W/R
- BLK_NO (offset)
- BLK_CNT (I/O size)
- IO_Time (I/O elapsed time, msec.xx)

Tao
On 6/23/06, Roch <Roch.Bourbonnais at sun.com> wrote:
> [...]
> I think the performance is in line with expectation for,
> small file, single threaded, open/write/close NFS
> workload (nfs must commit on close). Therefore I expect :
>
>     (avg file size) / (I/O latency).
>
> Joe does this formula approach the 1.25 MB/s ?

Joe sent me another set of DTrace output (biorpt.sh.rec.gz), running 105 seconds with zil_disable=1.

I generated a graph using Grace (rec.gif). The interesting parts for me:

1) How the I/O response time (at the bdev level) changes in a pattern.
2) Both iSCSI (sd2) and local (sd1) storage follow the same pattern and have almost identical latency on average.
3) The latency is very high, both on average and at peaks.

Although a low throughput is expected given the large number of small files, I don't expect such high latency, and of course 1.25MB/s is too low; even after turning on zil_disable, I see 4MB/s in this data set. I/O sizes at the bdev level are actually pretty decent: mostly (75%) 128KB.

Here's a summary:

# biorpt -i biorpt.sh.rec

Generating report from biorpt.sh.rec ...

=== Top 5 I/O types ==

DEVICE    T  BLKs     COUNT
--------  -  ----  --------
sd1       W   256      3122
sd2       W   256      3118
sd1       W     2       164
sd2       W     2       151
sd2       W     3       123

=== Top 5 worst I/O response time ==

DEVICE    T  BLKs      OFFSET    TIMESTAMP  TIME.ms
--------  -  ----  ----------  -----------  -------
sd1       W   256   529562656   104.322170  3316.90
sd1       W   256   529563424   104.322185  3281.97
sd2       W   256   521152480   104.262081  3262.49
sd2       W   256   521152736   104.262102  3258.56
sd1       W   256   529562912   104.262091  3249.85

=== Top 5 Devices with largest number of I/Os ==

DEVICE     READ  AVG.ms      MB    WRITE  AVG.ms      MB      IOs  SEEK
-------  ------  ------  ------  -------  ------  ------  -------  ----
sd1           7    2.70       0     4169  440.62     409     4176    0%
sd2           6    0.25       0     4131  444.79     407     4137    0%
cmdk0         5   21.50       0      138    0.82       0      143   11%

=== Top 5 Devices with largest amount of data transfer ==

DEVICE     READ  AVG.ms      MB    WRITE  AVG.ms      MB   Tol.MB  MB/s
-------  ------  ------  ------  -------  ------  ------  -------  ----
sd1           7    2.70       0     4169  440.62     409      409     4
sd2           6    0.25       0     4131  444.79     407      407     4
cmdk0         5   21.50       0      138    0.82       0        0     0

Tao
To clarify what has just been stated: with the zil disabled I got 4MB/sec; with the zil enabled I get 1.25MB/sec.

On 6/23/06, Tao Chen <nameistao at gmail.com> wrote:
> Joe sent me another set of DTrace output (biorpt.sh.rec.gz),
> running 105 seconds with zil_disable=1.
> [...]
So if you have a single thread doing open/write/close of 8K files and get 1.25MB/sec, that tells me you have something like a 6ms I/O latency, which looks reasonable also. What does iostat -x svc_t (client side) say?

400ms seems high for the workload _and_ doesn't match my formula, so I don't like it ;-) A quick look at your script seems fine, though; but something just does not compute here.

Why this formula (which applies to any NFS single-threaded client app working on small files): even if the open and write parts are infinitely fast, on close(2) NFS must ensure that the data is sent to disk. So at a minimum, every close(2) must wait one I/O latency. During that wait, the single-threaded client application will not initiate the following open/write/close sequence. At best you get one file output per I/O latency. The I/O latency is the one seen by the client and includes the network part, but that should be small compared to the physical I/O.

-r
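To make the arithmetic behind that estimate explicit, the lines below just plug the figures quoted in this thread into Roch's formula; the 2 ms entry is only the latency implied by the 4 MB/sec zil_disable result, not a measurement.

    throughput ~ (avg file size) / (per-file I/O latency)

    8 KB /   6 ms  ~ 1.3 MB/s   (matches the ~1.25 MB/sec Joe saw with the ZIL enabled)
    8 KB /   2 ms  ~ 4 MB/s     (roughly what the zil_disable run achieved)
    8 KB / 387 ms  ~ 20 KB/s    (why a 387 ms average cannot be the latency the
                                 client actually waits on per file)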