qihua wu
2008-Dec-25 03:49 UTC
[zfs-discuss] What happens when writing an 8k block if the recordsize is 128k? Will 128k be written instead of 8k?
Hi, All,

We have an Oracle standby running on ZFS and the database recovers very, very slowly. The problem is that the I/O performance is very bad. I find that the recordsize of the ZFS filesystems is 128K, while the Oracle block size is 8K.

My question is: when Oracle tries to write an 8k block, will ZFS read in 128K and then write 128K? If that's the case, then ZFS will do 16 times (128k/8k = 16) as much I/O as necessary.

                     extended device statistics
    r/s    w/s   kr/s   kw/s  wait  actv  wsvc_t  asvc_t  %w   %b  device
    0.0    0.2    0.0    1.6   0.0   0.0     6.0     7.7   0    0  md4
    0.0    0.2    0.0    1.6   0.0   0.0     0.0     7.4   0    0  md14
    0.0    0.2    0.0    1.6   0.0   0.0     0.0     7.6   0    0  md24
    0.0    0.4    0.0    1.7   0.0   0.0     0.0     6.7   0    0  sd0
    0.0    0.4    0.0    1.7   0.0   0.0     0.0     6.5   0    0  sd2
    0.0    1.4    0.0  105.2   0.0   4.9     0.0  3503.3   0  100  ssd97
    0.0    3.0    0.0  384.0   0.0  10.0     0.0  3332.9   0  100  ssd99
    0.0    2.6    0.0  332.8   0.0  10.0     0.0  3845.7   0  100  ssd101
    0.0    4.4    0.0  563.3   0.0  10.0     0.0  2272.4   0  100  ssd103
    0.0    3.4    0.0  435.2   0.0  10.0     0.0  2940.8   0  100  ssd105
    0.0    3.6    0.0  460.8   0.0  10.0     0.0  2777.4   0  100  ssd107
    0.0    0.2    0.0   25.6   0.0   0.0     0.0    72.8   0    1  ssd112

UC4-zuc4arch$> zfs list -o recordsize
RECSIZE
128K
128K
128K
128K
128K
128K
128K
128K
128K

Thanks,
Daniel
Neil Perrin
2008-Dec-25 04:25 UTC
[zfs-discuss] What happens when writing an 8k block if the recordsize is 128k? Will 128k be written instead of 8k?
The default recordsize is 128K. So you are correct: for random reads, performance will be bad because excess data is read. For Oracle it is recommended to set the recordsize to 8k. This can be done when creating the filesystem, using 'zfs create -o recordsize=8k <fs>'. If the filesystem has already been created, you can use 'zfs set recordsize=8k <fs>'; *however*, this only takes effect for new files, so existing database files will retain the old block size.

Hope this helps:

Neil.
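A minimal sketch of both approaches (the pool/dataset name tank/oradata is hypothetical):

    # Set the recordsize when the filesystem is created:
    zfs create -o recordsize=8k tank/oradata

    # Or change it on an existing filesystem:
    zfs set recordsize=8k tank/oradata
    zfs get recordsize tank/oradata      # verify the new value

    # Only files written after the change get the new recordsize;
    # existing database files keep their old 128K blocks until rewritten.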
qihua wu
2008-Dec-26 10:49 UTC
[zfs-discuss] What happens when writing an 8k block if the recordsize is 128k? Will 128k be written instead of 8k?
After I changed the recordsize to 8k, it seems the read/write size is not always 8k when I check with zpool iostat. So ZFS doesn't obey the recordsize strictly?

UC4-zuc4arch$> zfs get recordsize
NAME                            PROPERTY    VALUE  SOURCE
phximddb03data/zuc4arch/data01  recordsize  8K     local
phximddb03data/zuc4arch/data02  recordsize  8K     local

UC4-zuc4arch$> zpool iostat phximddb03data 1
                   capacity     operations    bandwidth
pool              used  avail   read  write   read  write
--------------   -----  -----  -----  -----  -----  -----
phximddb03data    487G   903G     13     62  1.26M  2.98M
phximddb03data    487G   903G    518      1  4.05M  23.8K   ===> here a write is of size 24k
phximddb03data    487G   903G    456     37  3.58M   111K
phximddb03data    487G   903G    551      0  4.34M  11.9K
phximddb03data    487G   903G    496      8  3.86M   239K
phximddb03data    487G   903G    472    229  3.68M   982K
phximddb03data    487G   903G    499      3  3.91M  3.96K
phximddb03data    487G   903G    525    138  4.12M   631K
phximddb03data    487G   903G    497      0  3.89M      0
phximddb03data    487G   903G    562      0  4.38M      0
phximddb03data    487G   903G    337      3  2.63M  47.5K
phximddb03data    487G   903G    140     35  4.55M  4.23M   ===> here a write is of size 128k
phximddb03data    487G   903G    484    272  7.12M  5.44M
phximddb03data    487G   903G    562      0  4.49M   127K
phximddb03data    487G   903G    514      4  4.03M   301K
phximddb03data    487G   903G    505     27  3.99M  1.00M
phximddb03data    487G   903G    518     14  4.10M   692K
phximddb03data    487G   903G    518      1  4.11M  14.4K
phximddb03data    487G   903G    504      2  3.98M   151K
phximddb03data    487G   903G    531      3  4.17M   392K
phximddb03data    487G   903G    375      2  2.95M   380K
phximddb03data    487G   903G    304      5  2.40M   296K
phximddb03data    487G   903G    438      3  3.45M   277K
phximddb03data    487G   903G    376      0  3.00M      0
phximddb03data    487G   903G    239     15  2.84M  1.98M
phximddb03data    487G   903G    221    857  4.51M  16.8M   ===> here a read is of size 20k
Kees Nuyt
2008-Dec-26 14:08 UTC
[zfs-discuss] What happens when writing an 8k block if the recordsize is 128k? Will 128k be written instead of 8k?
On Fri, 26 Dec 2008 18:49:41 +0800, "qihua wu" <staywithpin at gmail.com> wrote:

> After I changed the recordsize to 8k, seems the read/write size is not
> always 8k when using zpool iostat to check. So ZFS doesn't obey the
> recordsize strictly?

Did you recreate the database? Existing files keep the recordsize they were created with.

Also, it could be "chained I/O", where consecutive, adjacent records are handled in one I/O call.

--
( Kees Nuyt
)
c[_]
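A sketch of one way to recreate a datafile in place so it picks up the new recordsize (paths are hypothetical, and the database or standby recovery should be stopped first):

    # cp writes a brand-new file, which is allocated with the
    # filesystem's current recordsize:
    cp /tank/oradata/users01.dbf /tank/oradata/users01.dbf.new
    mv /tank/oradata/users01.dbf.new /tank/oradata/users01.dbf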
Richard Elling
2008-Dec-26 15:34 UTC
[zfs-discuss] What happens when writing an 8k block if the recordsize is 128k? Will 128k be written instead of 8k?
qihua wu wrote:

> After I changed the recordsize to 8k, seems the read/write size is not
> always 8k when using zpool iostat to check. So ZFS doesn't obey the
> recordsize strictly?

Writes can be coalesced -- it is more efficient to issue larger iops. Similarly, reads can be prefetched. In other words, there may not be a 1:1 relationship between the recordsize and the size of physical iops. The smaller recordsize is important for increasing efficiency when doing lots of random reads for fixed-blocksize workloads.
 -- richard
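A rough way to observe this, assuming a scratch dataset with recordsize=8k (tank/test is a hypothetical name):

    # Issue fixed-size 8k writes:
    dd if=/dev/zero of=/tank/test/f bs=8k count=100000

    # In other terminals, compare logical pool activity with physical
    # per-device I/O; kw/s divided by w/s gives the average physical
    # write size, often well above 8k because of write aggregation:
    zpool iostat tank 1
    iostat -xnz 1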
qihua wu
2008-Dec-27 07:04 UTC
[zfs-discuss] What happens when writing an 8k block if the recordsize is 128k? Will 128k be written instead of 8k?
After we changed the recordsize to 8k, we first used dd to move the data files around. We could see the time to recover an archive log drop from 40 minutes to 4 minutes. But when we checked with iostat, the read I/O was about 8K per read, while the write I/O was still 128k per write. Then we used cp to move the data files around, as someone said dd might not change the recordsize. After that, the time to recover a log file dropped from 4 minutes to a quarter of a minute. So it seems dd doesn't change the recordsize completely, and cp does.

And is there any utility that could check the recordsize of an existing file?
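One possibility, sketched with hypothetical paths (the exact zdb output varies between releases), is to dump the file's object with zdb and read off its data block size:

    # On ZFS the inode number reported by ls -i is the object number:
    ls -i /tank/oradata/users01.dbf

    # Dump that object; the "dblk" column is the file's data block size:
    zdb -ddddd tank/oradata <object-number>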
Roch Bourbonnais
2009-Jan-02 15:51 UTC
[zfs-discuss] What happens when writing an 8k block if the recordsize is 128k? Will 128k be written instead of 8k?
Hi Qihua,

There are many reasons why the recordsize does not govern the I/O size directly. Metadata I/O is one; ZFS I/O scheduler aggregation is another. The application's behavior might be a third.

Make sure to create the DB files after modifying the ZFS property.

-r
Robert Milkowski
2009-Jan-04 02:45 UTC
Re: What happens when writing an 8k block if the recordsize is 128k? Will 128k be written instead of 8k?
Hello qihua,

Saturday, December 27, 2008, 7:04:06 AM, you wrote:

> After we changed the recordsize to 8k, we first used dd to move the data
> files around. [...] Then we used cp to move the data files around [...]
> So it seems dd doesn't change the recordsize completely, and cp does.

Probably what happened was that when you did your dd, the old files were still occupying disk space, possibly the outer regions. Then you deleted them and did cp again; this time ZFS probably put most of the data on the outer regions of the disks, and your recovery got faster. (It all depends on your file sizes and disk sizes.)

--
Best regards,
Robert Milkowski

mailto:milek@task.gda.pl
http://milek.blogspot.com