qihua wu
2008-Dec-25 03:49 UTC
[zfs-discuss] What happens when writing an 8k block if the recordsize is 128k? Will 128k be written instead of 8k?
Hi, All,

We have an Oracle standby running on ZFS and the database recovers very, very slowly. The problem is that the I/O performance is very bad. I find that the recordsize of the ZFS filesystems is 128K, while the Oracle block size is 8K.

My question is: when Oracle tries to write an 8k block, will ZFS read in 128K and then write 128K? If that's the case, then ZFS will do 16 times (128k/8k = 16) as much I/O as necessary.

                     extended device statistics
    r/s    w/s   kr/s   kw/s  wait  actv  wsvc_t  asvc_t  %w   %b  device
    0.0    0.2    0.0    1.6   0.0   0.0     6.0     7.7   0    0  md4
    0.0    0.2    0.0    1.6   0.0   0.0     0.0     7.4   0    0  md14
    0.0    0.2    0.0    1.6   0.0   0.0     0.0     7.6   0    0  md24
    0.0    0.4    0.0    1.7   0.0   0.0     0.0     6.7   0    0  sd0
    0.0    0.4    0.0    1.7   0.0   0.0     0.0     6.5   0    0  sd2
    0.0    1.4    0.0  105.2   0.0   4.9     0.0  3503.3   0  100  ssd97
    0.0    3.0    0.0  384.0   0.0  10.0     0.0  3332.9   0  100  ssd99
    0.0    2.6    0.0  332.8   0.0  10.0     0.0  3845.7   0  100  ssd101
    0.0    4.4    0.0  563.3   0.0  10.0     0.0  2272.4   0  100  ssd103
    0.0    3.4    0.0  435.2   0.0  10.0     0.0  2940.8   0  100  ssd105
    0.0    3.6    0.0  460.8   0.0  10.0     0.0  2777.4   0  100  ssd107
    0.0    0.2    0.0   25.6   0.0   0.0     0.0    72.8   0    1  ssd112

UC4-zuc4arch$> zfs list -o recordsize
RECSIZE
128K
128K
128K
128K
128K
128K
128K
128K
128K

Thanks,
Daniel
Neil Perrin
2008-Dec-25 04:25 UTC
[zfs-discuss] What happens when writing an 8k block if the recordsize is 128k? Will 128k be written instead of 8k?
The default recordsize is 128K. So you are correct: for random reads, performance will be bad because excess data is read. For Oracle it is recommended to set the recordsize to 8k. This can be done when creating the filesystem, using 'zfs create -o recordsize=8k <fs>'. If the filesystem has already been created, you can use 'zfs set recordsize=8k <fs>'; *however*, this only takes effect for new files, so existing database files will retain the old block size.

Hope this helps:

Neil.
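A minimal sketch of both approaches (the pool/dataset name tank/oradata is hypothetical):

    # Set the recordsize when the filesystem is created:
    zfs create -o recordsize=8k tank/oradata

    # Or change it on an existing filesystem:
    zfs set recordsize=8k tank/oradata
    zfs get recordsize tank/oradata      # verify the new value

    # Only files written after the change get the new recordsize;
    # existing database files keep their old 128K blocks until rewritten.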
qihua wu
2008-Dec-26 10:49 UTC
[zfs-discuss] What happens when writing an 8k block if the recordsize is 128k? Will 128k be written instead of 8k?
After I changed the recordsize to 8k, it seems the read/write size is not always 8k when I check with zpool iostat. So ZFS doesn't obey the recordsize strictly?

UC4-zuc4arch$> zfs get recordsize
NAME                            PROPERTY    VALUE  SOURCE
phximddb03data/zuc4arch/data01  recordsize  8K     local
phximddb03data/zuc4arch/data02  recordsize  8K     local

UC4-zuc4arch$> zpool iostat phximddb03data 1
                   capacity     operations    bandwidth
pool              used  avail   read  write   read  write
--------------   -----  -----  -----  -----  -----  -----
phximddb03data    487G   903G     13     62  1.26M  2.98M
phximddb03data    487G   903G    518      1  4.05M  23.8K   ===> here a write is of size 24k
phximddb03data    487G   903G    456     37  3.58M   111K
phximddb03data    487G   903G    551      0  4.34M  11.9K
phximddb03data    487G   903G    496      8  3.86M   239K
phximddb03data    487G   903G    472    229  3.68M   982K
phximddb03data    487G   903G    499      3  3.91M  3.96K
phximddb03data    487G   903G    525    138  4.12M   631K
phximddb03data    487G   903G    497      0  3.89M      0
phximddb03data    487G   903G    562      0  4.38M      0
phximddb03data    487G   903G    337      3  2.63M  47.5K
phximddb03data    487G   903G    140     35  4.55M  4.23M   ===> here a write is of size 128k
phximddb03data    487G   903G    484    272  7.12M  5.44M
phximddb03data    487G   903G    562      0  4.49M   127K
phximddb03data    487G   903G    514      4  4.03M   301K
phximddb03data    487G   903G    505     27  3.99M  1.00M
phximddb03data    487G   903G    518     14  4.10M   692K
phximddb03data    487G   903G    518      1  4.11M  14.4K
phximddb03data    487G   903G    504      2  3.98M   151K
phximddb03data    487G   903G    531      3  4.17M   392K
phximddb03data    487G   903G    375      2  2.95M   380K
phximddb03data    487G   903G    304      5  2.40M   296K
phximddb03data    487G   903G    438      3  3.45M   277K
phximddb03data    487G   903G    376      0  3.00M      0
phximddb03data    487G   903G    239     15  2.84M  1.98M
phximddb03data    487G   903G    221    857  4.51M  16.8M   ===> here a read is of size 20k
Kees Nuyt
2008-Dec-26 14:08 UTC
[zfs-discuss] What happens when writing an 8k block if the recordsize is 128k? Will 128k be written instead of 8k?
On Fri, 26 Dec 2008 18:49:41 +0800, "qihua wu" <staywithpin at gmail.com> wrote:

> After I changed the recordsize to 8k, seems the read/write size is not
> always 8k when using zpool iostat to check. So ZFS doesn't obey the
> recordsize strictly?

Did you recreate the database? Existing files keep the recordsize they were created with.

Also, it could be "chained I/O", where consecutive, adjacent records are handled in one I/O call.

--
( Kees Nuyt
)
c[_]
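A sketch of one way to recreate a datafile in place so it picks up the new recordsize (paths are hypothetical, and the database or standby recovery should be stopped first):

    # cp writes a brand-new file, which is allocated with the
    # filesystem's current recordsize:
    cp /tank/oradata/users01.dbf /tank/oradata/users01.dbf.new
    mv /tank/oradata/users01.dbf.new /tank/oradata/users01.dbf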
Richard Elling
2008-Dec-26 15:34 UTC
[zfs-discuss] What happens when writing an 8k block if the recordsize is 128k? Will 128k be written instead of 8k?
qihua wu wrote:

> After I changed the recordsize to 8k, seems the read/write size is not
> always 8k when using zpool iostat to check. So ZFS doesn't obey the
> recordsize strictly?

Writes can be coalesced -- it is more efficient to issue larger iops. Similarly, reads can be prefetched. In other words, there may not be a 1:1 relationship between the recordsize and the size of physical iops. The smaller recordsize is important for increasing efficiency when doing lots of random reads for fixed-blocksize workloads.
 -- richard
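A rough way to observe this, assuming a scratch dataset with recordsize=8k (tank/test is a hypothetical name):

    # Issue fixed-size 8k writes:
    dd if=/dev/zero of=/tank/test/f bs=8k count=100000

    # In other terminals, compare logical pool activity with physical
    # per-device I/O; kw/s divided by w/s gives the average physical
    # write size, often well above 8k because of write aggregation:
    zpool iostat tank 1
    iostat -xnz 1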
qihua wu
2008-Dec-27 07:04 UTC
[zfs-discuss] What happens when writing an 8k block if the recordsize is 128k? Will 128k be written instead of 8k?
After we changed the recordsize to 8k, we first used dd to move the data files around. We could see the time to recover an archive log drop from 40 minutes to 4 minutes. But when we checked with iostat, the read I/O was about 8K per read, while the write I/O was still 128k per write. Then we used cp to move the data files around, as someone said dd might not change the recordsize. After that, the time to recover a log file dropped from 4 minutes to a quarter of a minute. So it seems dd doesn't change the recordsize completely, and cp does.

And is there any utility that could check the recordsize of an existing file?
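One possibility, sketched with hypothetical paths (the exact zdb output varies between releases), is to dump the file's object with zdb and read off its data block size:

    # On ZFS the inode number reported by ls -i is the object number:
    ls -i /tank/oradata/users01.dbf

    # Dump that object; the "dblk" column is the file's data block size:
    zdb -ddddd tank/oradata <object-number>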
Roch Bourbonnais
2009-Jan-02 15:51 UTC
[zfs-discuss] What happens when writing an 8k block if the recordsize is 128k? Will 128k be written instead of 8k?
Hi Qihua,

There are many reasons why the recordsize does not govern the I/O size directly. Metadata I/O is one; ZFS I/O scheduler aggregation is another. The application's behavior might be a third.

Make sure to create the DB files after modifying the ZFS property.

-r
Robert Milkowski
2009-Jan-04 02:45 UTC
Re: What happens when writing an 8k block if the recordsize is 128k? Will 128k be written instead of 8k?
Hello qihua,

Saturday, December 27, 2008, 7:04:06 AM, you wrote:

> After we changed the recordsize to 8k, we first used dd to move the data
> files around. [...] Then we used cp to move the data files around [...]
> So it seems dd doesn't change the recordsize completely, and cp does.

Probably what happened was that when you did your dd, the old files were still occupying disk space, possibly the outer regions. Then you deleted them and did cp again; this time ZFS probably put most of the data on the outer regions of the disks, and your recovery got faster. (It all depends on your file sizes and disk sizes.)

--
Best regards,
Robert Milkowski

mailto:milek@task.gda.pl
http://milek.blogspot.com