One question that has come up a number of times when I've been speaking with people (read: evangelizing :) ) about ZFS is about database storage. In conventional use, storage has separated redo logs from table space, on a spindle basis.

I'm not a database expert but I believe the reasons boil down to a combination of:
- Separation for redundancy
- Separation for reduction of bottlenecks (most write ops touch both the logs and the table)
- Separation of usage patterns (logs are mostly sequential writes, tables are random).

The question then comes up about whether in a ZFS world this separation is still needed. It seems to me that each of the above reasons is to some extent ameliorated by ZFS:
- Redundancy is performed at the filesystem level, probably on all disks in the pool.
- Dynamic striping and copy-on-write mean that all write ops can be striped across vdevs and the log writes can go right next to the table writes.
- Copy-on-write also turns almost all writes into sequential writes anyway.

So it seems that the old reasoning may no longer apply. Is my thinking correct here? Have I missed something? Do we have any information to support either the use of a single pool or of separate pools for database usage?

Boyd
Melbourne, Australia
Hi Boyd,

Boyd Adamson wrote:
> One question that has come up a number of times when I've been speaking
> with people (read: evangelizing :) ) about ZFS is about database
> storage. In conventional use storage has separated redo logs from table
> space, on a spindle basis.
> I'm not a database expert but I believe the reasons boil down to a
> combination of:
> - Separation for redundancy

correct

> - Separation for reduction of bottlenecks (most write ops touch both the
> logs and the table)

correct

> - Separation of usage patterns (logs are mostly sequential writes,
> tables are random).

correct

> The question then comes up about whether in a ZFS world this separation
> is still needed.

I don't think it is.

> It seems to me that each of the above reasons is to
> some extent ameliorated by ZFS:
> - Redundancy is performed at the filesystem level, probably on all disks
> in the pool.

more at the pool level iirc, but yes, over all the disks where you have
them mirrored or raid/raidZ-ed

> - Dynamic striping and copy-on-write mean that all write ops can be
> striped across vdevs and the log writes can go right next to the table
> writes

Yes. No need to separate metadata (and archive/rollback logs are just that)

> - Copy-on-write also turns almost all writes into sequential writes anyway.

yup.

> So it seems that the old reasoning may no longer apply. Is my thinking
> correct here? Have I missed something? Do we have any information to
> support either the use of a single pool or of separate pools for
> database usage?

To my way of thinking, you can still separate things out if you're not
comfortable with having everything all together in the one pool. My take
on that though is that it stems from an inability to appreciate just how
different zfs is - a lack of paradigm shifting lets you down.

If I was setting up a db server today and could use ZFS, then I'd be
making sure that the DBAs didn't get a say in how the filesystems were
laid out. I'd ask them what they want to see in a directory structure
and provide that. If they want raw ("don't you know that everything is
faster on raw?!?!") then I'd carve a zvol for them. Anything else would
be carefully delineated - they stick to the rdbms and don't tell me how
to do my job, and vice versa.

cheers,
James C. McPherson
--
Solaris Datapath Engineering
Data Management Group
Sun Microsystems
One word of caution about random writes. From my experience, they are not nearly as fast as sequential writes (like 10 to 20 times slower) unless they are carefully aligned on the same boundary as the file system record size. Otherwise, there is a heavy read penalty that you can easily observe by doing a zpool iostat. So, depending on the workload, it's really a stretch to say random writes can be done at sequential speed.

Chuck
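A rough sketch of the kind of aligned random-write pattern Chuck is describing; the file name, file size, and the 8 KB record size are assumptions for illustration and would have to match the dataset's actual recordsize for the alignment to help:

/* Sketch: random writes aligned to an assumed 8 KB record size.
 * File name, file size, and record size are hypothetical. */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define RECSIZE   8192                     /* assumed to match the dataset recordsize */
#define FILESIZE  (1024L * 1024 * 1024)    /* 1 GB test file, already populated */
#define NWRITES   1000

int
main(void)
{
    char buf[RECSIZE];
    long nrecs = FILESIZE / RECSIZE;
    int fd = open("/pool/db/testfile", O_WRONLY);
    int i;

    if (fd < 0) {
        perror("open");
        return (1);
    }
    memset(buf, 'x', sizeof (buf));

    for (i = 0; i < NWRITES; i++) {
        /* Aligning the offset to RECSIZE means each write replaces a
         * whole filesystem record, so no read of the old block is needed. */
        off_t off = (off_t)(lrand48() % nrecs) * RECSIZE;
        if (pwrite(fd, buf, RECSIZE, off) != RECSIZE) {
            perror("pwrite");
            break;
        }
    }
    (void) close(fd);
    return (0);
}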
On 11/05/2006, at 9:17 AM, James C. McPherson wrote:
>> - Redundancy is performed at the filesystem level, probably on all
>> disks in the pool.
>
> more at the pool level iirc, but yes, over all the disks where you
> have them mirrored or raid/raidZ-ed

Yes, of course. I meant at the filesystem level (ZFS as a whole) rather than at the sysadmin/application data layout level.

> To my way of thinking, you can still separate things out if you're not
> comfortable with having everything all together in the one pool. My
> take on that though is that it stems from an inability to appreciate
> just how different zfs is - a lack of paradigm shifting lets you down.
>
> If I was setting up a db server today and could use ZFS, then I'd be
> making sure that the DBAs didn't get a say in how the filesystems
> were laid out. I'd ask them what they want to see in a directory
> structure and provide that. If they want raw ("don't you know that
> everything is faster on raw?!?!") then I'd carve a zvol for them.
> Anything else would be carefully delineated - they stick to the rdbms
> and don't tell me how to do my job, and vice versa.

Old dogma dies hard. What we need is some clear blueprints/best practices docs on this, I think.
On 5/10/06, Boyd Adamson <boyd-adamson at usa.net> wrote:
> What we need is some clear blueprints/best practices docs on this, I
> think.

Most definitely. Key things that people I work with (including me...) would like to see are...

- Some success stories of people running large databases (working set much larger than RAM) on ZFS
- Configuration/tuning best practices
- Description of why I don't need directio, quickio, or ODM.
- Performance comparisons of ZFS vs. SVM/UFS, VxVM/VxFS/ODM, and ASM using standard benchmarks for OLTP and DSS workloads
- Same as above but with real workloads.
- How the ZFS feature set improves the lives of system administrators, DBAs, storage, and backup admins.

For general purpose file systems (especially zones and root) I am very eager to put zfs to work. It has great potential to simplify things like live upgrade and answering the question "what changed?" I don't yet have eagerness to propose that I get a cross-functional team together to perform purely exploratory database load tests on zfs.

Mike

--
Mike Gerdts
http://mgerdts.blogspot.com/
On Wed, 2006-05-10 at 20:42 -0500, Mike Gerdts wrote:
> On 5/10/06, Boyd Adamson <boyd-adamson at usa.net> wrote:
> > What we need is some clear blueprints/best practices docs on this, I
> > think.

In due time... it was only recently that some of the performance enhancements were put back. Note: ideally, this info makes the main doc set. BluePrints and Infodocs can fill in the missing bits because they have traditionally had a much shorter time-to-market. Ultimately, we'd like the info rolled into the main doc set. In other words, go for the main docs first.

> Most definitely. Key things that people I work with (including me...)
> would like to see are...
>
> - Some success stories of people running large databases (working set
>   much larger than RAM) on ZFS
> - Configuration/tuning best practices
> - Description of why I don't need directio, quickio, or ODM.
> - Performance comparisons of ZFS vs. SVM/UFS, VxVM/VxFS/ODM, and ASM
>   using standard benchmarks for OLTP and DSS workloads
> - Same as above but with real workloads.
> - How the ZFS feature set improves the lives of system
>   administrators, DBAs, storage, and backup admins.

I'm working on some RAS stuff... I can't talk for the performance guys, but I know some have been working on it. Don't expect an audited TPC-like benchmark as we tend to not use file systems for those.

-- richard
Roch Bourbonnais - Performance Engineering
2006-May-11 07:18 UTC
[zfs-discuss] ZFS and databases
Gehr, Chuck R writes:
> One word of caution about random writes. From my experience, they are
> not nearly as fast as sequential writes (like 10 to 20 times slower)
> unless they are carefully aligned on the same boundary as the file
> system record size. Otherwise, there is a heavy read penalty that you
> can easily observe by doing a zpool iostat. So, depending on the
> workload, it's really a stretch to say random writes can be done at
> sequential speed.
>
> Chuck

Could we agree on saying that partial writes to blocks that are not in cache are much slower than writes to blocks that are? Given that a sequential pattern can benefit from readahead, those will fall into the fast category most of the time. Performance of random writes will depend on the cached ratio. For DB working sets that greatly exceed system memory, which is common, this falls into the slower case, and that stays true for any filesystem.

Or said otherwise: there is no free lunch.

-r
Roch Bourbonnais - Performance Engineering
2006-May-11 10:28 UTC
[zfs-discuss] ZFS and databases
> - Description of why I don't need directio, quickio, or ODM.

The two main benefits that came out of using directio were reducing memory consumption by avoiding the page cache AND bypassing the UFS single-writer behavior.

ZFS does not have the single-writer lock. As for memory, the UFS code path would do I/O straight from the user buffer to disk, overwriting live data; so we won't do this. ZFS will hold the data in memory for the time it takes to ensure data integrity. ZFS concurrent O_DSYNC writes will gang together in the ZIL (ZFS intent log) and be released after the I/Os are done to the log.

Performance characteristics will be different between the filesystems, and certainly dynamic data from real workloads, as you point out, will be enlightening.

-r
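For illustration, a minimal sketch of the O_DSYNC log-style writes mentioned above, the kind of traffic the ZIL absorbs; the path and record size are hypothetical:

/* Sketch: synchronous log-style writes using O_DSYNC.
 * Path and record size are hypothetical. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int
main(void)
{
    char rec[512];
    int i;
    /* O_DSYNC: each write returns only after the data is on stable storage. */
    int fd = open("/pool/dblog/redo01.log", O_WRONLY | O_APPEND | O_DSYNC);

    if (fd < 0) {
        perror("open");
        return (1);
    }
    memset(rec, 0, sizeof (rec));
    for (i = 0; i < 100; i++) {
        if (write(fd, rec, sizeof (rec)) != sizeof (rec)) {
            perror("write");
            break;
        }
    }
    (void) close(fd);
    return (0);
}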
Absolutely, I have done hot spot tests using a Poisson random distribution. With that pattern (where there are many cache hits), the writes are 3-10 times faster than sequential speed. My comment was regarding purely random I/O across a large (at least much larger than available memory cache) area. A real workload is likely to have a combination of patterns, i.e. some fairly random, some hot spot, and some sequential.

Chuck
A couple of points/additions with regard to Oracle in particular:

When talking about large database installations, copy-on-write may or may not apply. The files are never completely rewritten, only changed internally via mmap(). When you lay down your database, you will generally allocate the storage for the anticipated capacity required. That will result in sparse files in the actual filesystems. This brings up the question: how does ZFS allocate sparse files, and how does the allocation occur as the sparse files have data added?

Regarding the separation of data files, you *really* want your logs to be in a different place (spindles-wise) than your DB. After all, should you have a catastrophic failure (crash, disk hiccup, etc.), your redo and transaction logs are your recovery system.

With this in mind, I'd envisioned using ZFS as such:

- Allocate a number of database filesystems using a 'db' or standard pool. Generally 1 per CPU, as Oracle will use parallel queries if the tables are spread across multiple filesystems.
- Allocate another pool from another storage system (internal, another disk array, etc.) for the log areas. Name the pool something like 'dblog'. That would guarantee that you don't mix your data types.

I'm interested to see how a ZFS pool using multiple LUNs on a large storage array will behave when using a database. I think the performance spreading across multiple LUNs will result in increased db performance.

On a side note, does anybody know of a way to track hot areas on the disk? One concern I've got with the ZFS pool layout is multiple tables being allocated on the same LUN and both tables being used heavily. The need to move the data around within the pool may be required to address a single LUN bottleneck. Has anybody thought of this situation?

-----
Gregory Shaw, IT Architect
Sun Microsystems Inc.
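As a point of reference for the sparse-file question, a minimal sketch of how a "sparse" tablespace file could be created (set the size, write nothing), as opposed to zero-filling every block; the path and size are hypothetical:

/* Sketch: a sparse tablespace file vs. a zero-filled one.
 * Path and size are hypothetical. */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

#define TABLESPACE_SIZE  (1024L * 1024 * 1024)   /* 1 GB for illustration */

int
main(void)
{
    int fd = open("/pool/db01/users01.dbf", O_CREAT | O_WRONLY, 0600);

    if (fd < 0) {
        perror("open");
        return (1);
    }
    /*
     * Setting the size without writing any data leaves a hole: no blocks
     * are allocated until the database actually writes into the file.
     * Zero-filling (what later posts in this thread say Oracle tablespace
     * creation actually does) would instead write every block once,
     * allocating real storage up front.
     */
    if (ftruncate(fd, (off_t)TABLESPACE_SIZE) != 0)
        perror("ftruncate");
    (void) close(fd);
    return (0);
}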
On Thu, 2006-05-11 at 10:31 -0600, Gregory Shaw wrote:
> A couple of points/additions with regard to Oracle in particular:
>
> When talking about large database installations, copy-on-write may
> or may not apply. The files are never completely rewritten, only
> changed internally via mmap(). When you lay down your database, you
> will generally allocate the storage for the anticipated capacity
> required. That will result in sparse files in the actual filesystems.

This is counter to my Oracle experience, which I'll admit is dated. Oracle will zero-fill the tablespace with 128kByte iops -- it is not sparse. I've got a scar. Has this changed in the past few years?

Now, if compression is enabled, then this represents an interesting challenge...

-- richard
Richard Elling wrote:
> This is counter to my Oracle experience, which I'll admit is dated.
> Oracle will zero-fill the tablespace with 128kByte iops -- it is not
> sparse. I've got a scar. Has this changed in the past few years?

I have the same scar (from a different company). We actually wound up calling it the "tablespace create benchmark". And that was valid as of work done just over a year ago. Multiple parallel tablespace creates are usually a big pain point for filesystem / cache interaction, and also fragmentation once in a while. The latter ZFS should take care of; the former, well, I dunno.

> Now, if compression is enabled, then this represents an interesting
> challenge...

Indeed. One would think the intents of these two are rather divergent... How about filling with /dev/random? ;)

- Pete
On Thu, 2006-05-11 at 10:27 -0700, Richard Elling wrote:
> On Thu, 2006-05-11 at 10:31 -0600, Gregory Shaw wrote:
> > When talking about large database installations, copy-on-write may
> > or may not apply. [...]
>
> This is counter to my Oracle experience, which I'll admit is dated.
> Oracle will zero-fill the tablespace with 128kByte iops -- it is not
> sparse. I've got a scar. Has this changed in the past few years?

I hit [send] too soon. Here is a writeup on how I got scarred, and healed :-) I wrote it up so that hopefully someone else would be spared the agony. I guess Peter didn't read Sun BluePrints :-O
http://www.sun.com/blueprints/0400/ram-vxfs.pdf

-- richard
Richard Elling wrote:
> On Thu, 2006-05-11 at 10:27 -0700, Richard Elling wrote:
>> This is counter to my Oracle experience, which I'll admit is dated.
>> Oracle will zero-fill the tablespace with 128kByte iops -- it is not
>> sparse. I've got a scar. Has this changed in the past few years?
>
> I hit [send] too soon. Here is a writeup on how I got scarred, and
> healed :-) I wrote it up so that hopefully someone else would be
> spared the agony. I guess Peter didn't read Sun BluePrints :-O
> http://www.sun.com/blueprints/0400/ram-vxfs.pdf

Don't blame me, I wasn't at Sun at the time. :) It does, however, sound much like what happened to me, other than the fact that the server and database were a couple orders of magnitude bigger in my case. And we weren't using VxFS so discovered_direct_io wasn't an option. Having a multi-terabyte database lock up is a great way to learn how important performance is to some people... ;)

I just hope ZFS doesn't have this problem with multi-terabyte databases.

- Pete
On 5/11/06, Peter Rival <peter.rival at sun.com> wrote:
> Richard Elling wrote:
> > Oracle will zero-fill the tablespace with 128kByte iops -- it is not
> > sparse. I've got a scar. Has this changed in the past few years?
>
> Multiple parallel tablespace creates are usually a big pain point for
> filesystem / cache interaction, and also fragmentation once in a while.
> The latter ZFS should take care of; the former, well, I dunno.

The purpose of a zero-filled tablespace is to prevent fragmentation by future writes, in the case when multiple tablespaces are being updated/filled on the same disk, correct?

This becomes pointless on ZFS, since it never overwrites the same pre-allocated block, i.e. the tablespace becomes fragmented in that case no matter what.

Also, in order to write a partial update to a new block, ZFS needs the rest of the original block, hence the note by Roch: "partial writes to blocks that are not in cache are much slower than writes to blocks that are." Fortunately I think a DB almost always does aligned full-block I/O, or is that right?

Tao
Regarding directio and quickio, is there a way with ZFS to skip the system buffer cache? I've seen big benefits for using directio when the data files have been segregated from the log files.

Having the system compete with the DB for read-ahead results in double work.

-----
Gregory Shaw, IT Architect
Sun Microsystems Inc.
This thread is useless without data. This thread is useless without data. This thread is useless without data. This thread is useless without data. This thread is useless without data. :-P
On 12/05/2006, at 3:59 AM, Richard Elling wrote:
> On Thu, 2006-05-11 at 10:27 -0700, Richard Elling wrote:
>> On Thu, 2006-05-11 at 10:31 -0600, Gregory Shaw wrote:
>>> When talking about large database installations, copy-on-write may
>>> or may not apply. The files are never completely rewritten, only
>>> changed internally via mmap(). When you lay down your database, you
>>> will generally allocate the storage for the anticipated capacity
>>> required. That will result in sparse files in the actual
>>> filesystems.

Sorry, I didn't receive Greg or Richard's original emails. Apologies for messing up threading (such as it is).

Are you saying that copy-on-write doesn't apply for mmap changes, but only file re-writes? I don't think that gels with anything else I know about ZFS.

Boyd
Melbourne, Australia
> Are you saying that copy-on-write doesn't apply for mmap changes, but
> only file re-writes? I don't think that gels with anything else I
> know about ZFS.

No, you're correct -- everything is copy-on-write.

Jeff
Roch Bourbonnais - Performance Engineering
2006-May-12 06:53 UTC
[zfs-discuss] ZFS and databases
Gregory Shaw <greg.shaw at Sun.COM> wrote:
> Regarding directio and quickio, is there a way with ZFS to skip the
> system buffer cache? I've seen big benefits for using directio when
> the data files have been segregated from the log files.

Were the benefits coming from extra concurrency (no single writer lock), or avoiding the extra copy to the page cache, or from too much readahead that is not used before pages need to be recycled?

ZFS already has the concurrency.

The page cache copy is really rather cheap, and I assert somewhat necessary to ensure data integrity.

The extra readahead is somewhat of a bug in UFS (read 2 pages, get a maxcontig chunk (1MB)).

ZFS is new; conventional wisdom may or may not apply.

-r
Roch Bourbonnais - Performance Engineering
2006-May-12 07:24 UTC
[zfs-discuss] ZFS and databases
Jeff Bonwick writes:
> > Are you saying that copy-on-write doesn't apply for mmap changes, but
> > only file re-writes? I don't think that gels with anything else I
> > know about ZFS.
>
> No, you're correct -- everything is copy-on-write.

Maybe the confusion comes from:

  mmap changes : application interaction with memory
  COW          : memory (ZFS) interaction with storage

There is an fsync() or the fsflush daemon in between.

-r
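A small sketch of that distinction, assuming a hypothetical datafile path (and a file at least a few KB long): the application only touches mapped memory, and it is the later msync()/fsync() that hands dirty pages to ZFS, which then writes them copy-on-write:

/* Sketch: mmap changes vs. the later flush that ZFS writes copy-on-write.
 * The path is hypothetical. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int
main(void)
{
    struct stat st;
    char *p;
    int fd = open("/pool/db01/users01.dbf", O_RDWR);

    if (fd < 0 || fstat(fd, &st) != 0) {
        perror("open/fstat");
        return (1);
    }
    p = mmap(NULL, st.st_size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) {
        perror("mmap");
        return (1);
    }

    memcpy(p + 4096, "updated", 7);   /* application changes memory only */

    /* Force the dirty page out; ZFS allocates a new block for it rather
     * than overwriting the old one (copy-on-write). */
    if (msync(p, st.st_size, MS_SYNC) != 0)
        perror("msync");

    (void) munmap(p, st.st_size);
    (void) close(fd);
    return (0);
}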
Roch Bourbonnais - Performance Engineering
2006-May-12 07:30 UTC
[zfs-discuss] ZFS and databases
Tao Chen writes:
> The purpose of a zero-filled tablespace is to prevent fragmentation by
> future writes, in the case when multiple tablespaces are being
> updated/filled on the same disk, correct?

That, and also there was a need for block reservation. Thus posix_fallocate was added (recently).

> This becomes pointless on ZFS, since it never overwrites the same
> pre-allocated block, i.e. the tablespace becomes fragmented in that
> case no matter what.

Is "fragmented" the right word here? Anyway: random writes can be turned into sequential.

> Also, in order to write a partial update to a new block, ZFS needs the
> rest of the original block, hence the note by Roch:
> "partial writes to blocks that are not in cache are much slower than
> writes to blocks that are."
> Fortunately I think a DB almost always does aligned full-block I/O, or
> is that right?

That's my understanding also.

-r
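A minimal sketch of the posix_fallocate() reservation mentioned above; the path and size are hypothetical:

/* Sketch: reserve space for a datafile without writing zeros by hand.
 * Path and size are hypothetical. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int
main(void)
{
    int err;
    int fd = open("/pool/db01/undo01.dbf", O_CREAT | O_RDWR, 0600);

    if (fd < 0) {
        perror("open");
        return (1);
    }
    /* Reserve 1 GB; posix_fallocate() returns an error number, not -1/errno. */
    err = posix_fallocate(fd, 0, 1024L * 1024 * 1024);
    if (err != 0)
        fprintf(stderr, "posix_fallocate: %s\n", strerror(err));
    (void) close(fd);
    return (0);
}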
On 5/12/06, Roch Bourbonnais - Performance Engineering <Roch.Bourbonnais at sun.com> wrote:
> > Regarding directio and quickio, is there a way with ZFS to skip the
> > system buffer cache? I've seen big benefits for using directio when
> > the data files have been segregated from the log files.
>
> Were the benefits coming from extra concurrency (no single writer lock)

Does DIO bypass the "writer lock" on Solaris? Not on AIX, which uses CIO (concurrent I/O) to bypass managing locks at the filesystem level:
http://oracle.ittoolbox.com/white-papers/improving-database-performance-with-aix-concurrent-io-2582

> or avoiding the extra copy to page cache

Certainly. Also to avoid VM overhead (DB does like raw devices).

> or from too much readahead that is not used before pages need to be recycled.

Not sure what you mean (avoid unnecessary readahead?)

> ZFS already has the concurrency.

Interesting, would like to find more on this.

> The page cache copy is really rather cheap

VM as a whole is certainly not cheap.

> and I assert somewhat necessary to ensure data integrity.

Not following you.

> The extra readahead is somewhat of a bug in UFS (read 2
> pages, get a maxcontig chunk (1MB)).

Ouch.

> ZFS is new; conventional wisdom may or may not apply.

This (zfs-discuss) is the place where we can be enlightened :-)

Tao
Roch Bourbonnais - Performance Engineering
2006-May-12 09:20 UTC
[zfs-discuss] ZFS and databases
Tao Chen writes:
> Does DIO bypass the "writer lock" on Solaris?

Yep.

> Not on AIX, which uses CIO (concurrent I/O) to bypass managing locks
> at the filesystem level:
> http://oracle.ittoolbox.com/white-papers/improving-database-performance-with-aix-concurrent-io-2582
>
> > or avoiding the extra copy to page cache
>
> Certainly. Also to avoid VM overhead (DB does like raw devices).

OK, but again, is it to avoid badly configured readahead, or get extra concurrency, or something else? I have a hard time believing that managing the page cache represents a cost when you compare this to a 5ms I/O.

> > or from too much readahead that is not used before pages need to be recycled.
>
> Not sure what you mean (avoid unnecessary readahead?)

There is this thing where a 2K read over UFS, if it crosses a page boundary, can lead UFS to assert sequential access to the file and do a clustered readahead. Since clusters are often set to 1MB, you can get a lot of spurious I/O from this 2K read. If the data readahead turns out never to be used later because of memory pressure, then you have a suboptimal configuration. This is one kind of issue that DIO would not have.

> > ZFS already has the concurrency.
>
> Interesting, would like to find more on this.

I'll have to blog this down one day.

> > The page cache copy is really rather cheap
>
> VM as a whole is certainly not cheap.

In some aspects, certainly. Compared to I/O I'd say it's really cheap, minus bugs. My point is to be cautious of this syllogism: DIO, a VM bypass mechanism, can be much faster than regular I/O; thus the VM is costly. DIO is a VM bypass _AND_ a different UFS codepath.

> > and I assert somewhat necessary to ensure data integrity.
>
> Not following you.

I'm on thin ground here. But I believe that you can't directly write a disk block and its checksum in the referring block in the way ZFS wants to do; or at least you couldn't do this and hold up the application in a way that is acceptable performance-wise. So to ensure the data integrity that ZFS delivers, it has to have the data cached for the time it takes to update the on-disk format properly. I'm willing to be corrected on this (and anything else for that matter, we live in a complex world).

> > The extra readahead is somewhat of a bug in UFS (read 2
> > pages, get a maxcontig chunk (1MB)).
>
> Ouch.

You said it. But people have learned to tune it down when this hits (tunefs -a), which is not that often.
-r
Roch Bourbonnais - Performance Engineering wrote On 05/12/06 09:30:
> Tao Chen writes:
> > This becomes pointless on ZFS, since it never overwrites the same
> > pre-allocated block, i.e. the tablespace becomes fragmented in that
> > case no matter what.
>
> Is "fragmented" the right word here?
> Anyway: random writes can be turned into sequential.

This really optimizes the data on disk for full table scans (sequential reads of the whole tables or portions of them). Random access may be supported from a DBMS perspective using indexes - then read access patterns are random anyway.

In contrast to usual files, DBMS files exhibit different access patterns: they are often loaded sequentially (resulting in a nice layout for later sequential reads), yet updates to the tables will erode this sequential layout over time (e.g. updating accounts as they are credited and debited online), so later full table scans will suffer.

ZFS optimizes random writes versus potential sequential reads. This may hurt if there are many full table scans - but DBMS designers try to avoid unnecessary full table scans anyway. On the other hand, there are use cases in which tables are updated randomly and read sequentially many times (e.g. in batch runs) - here overall performance may suffer.

A way to resolve this could be an option to transparently "reorganize" a file (a tool triggering an internal rewrite - such that the data become sequential on disk again - on-the-fly while the DBMS is running/file is open).

- Franz
Gregory Shaw wrote On 05/11/06 21:15:
> Regarding directio and quickio, is there a way with ZFS to skip the
> system buffer cache? I've seen big benefits for using directio when
> the data files have been segregated from the log files.
>
> Having the system compete with the DB for read-ahead results in double
> work.

Getting rid of the extra copy of data in a filesystem buffer also reduces the memory footprint. As the data are being cached in the DBMS buffer, the extra copy in the filesystem cache is useless (except occasionally as a workaround for deficiencies in the DBMS' caching mechanisms). This is also important if there is non-DBMS related filesystem activity which can benefit from having data in the filesystem buffer, as these compete with the (useless) DBMS data.

Is there a good short description about how the ZFS buffer cache works?

Thanks,
Franz
Roch Bourbonnais - Performance Engineering
2006-May-12 12:49 UTC
[zfs-discuss] ZFS and databases
"ZFS optimizes random writes versus potential sequential reads."

Now I don't think the current readahead code is where we want it to be yet but, in the same way that enough concurrent 128K I/O can saturate a disk (I sure hope that Milkowski's data will confirm this, otherwise I'm dead), enough concurrent read I/O will do the same. So it's a simple matter of programming to detect sequential file access and issue enough I/Os early enough.

With UFS, we had a simple algorithm and one tunable: touch 2 sequential pages, read a cluster ahead. Then, don't do any other I/O until all the data is processed. This is flawed in many respects. And it certainly requires a large cluster size to get good I/O throughput because it had stop-and-go behavior.

With ZFS (again, prefetch code being looked upon), I think we can manage to get good I/O throughput using 128K, through enough concurrency and intelligent coding.

-r
Roch Bourbonnais - Performance Engineering wrote:
> OK, but again, is it to avoid badly configured readahead, or get extra
> concurrency, or something else? I have a hard time believing that
> managing the page cache represents a cost when you compare this to a
> 5ms I/O.

I think you're missing one other thing - handling the memory overload of having orders of magnitude more accessed data than you have memory. Think about how you can handle having a couple hundred GB of dirty data being written by many threads (say either tablespace creates or temp table creation for a large table join) - fsflush and writebehind et al. just can't keep up with it. Of course, I know ZFS is "better", but to be useable in those situations it needs to be probably an order of magnitude better or so, and I haven't seen any data on a decently big rig with a proper storage config that shows that it is. I'm not saying it's not, I'm just saying I haven't seen the data. :)

Like you said, Roch, I've been down this road before and don't want to go down it again. ;)

- Pete
Roch Bourbonnais - Performance Engineering
2006-May-12 12:56 UTC
[zfs-discuss] ZFS and databases
Franz Haberhauer writes:
> Is there a good short description about how the ZFS buffer cache
> works?

You could start with the ARC paper, Megiddo/Modha, FAST '03 conference. ZFS uses a variation of that. It's an interesting read.

-r
> "ZFS optimizes random writes versus potential sequential reads."

This remark focused on the allocation policy during writes, not the readahead that occurs during reads. Data that are rewritten randomly but in place in a sequential, contiguous file (like a preallocated UFS file) are not optimized for these writes, but for later sequential read accesses.

Now with ZFS the writes are fast, but the later sequential reads probably not - readahead may help with this wrt. latency (data may already be available in the file buffer when the DBMS requests them - yet the DBMS does readaheads as well). But it will still be random I/O to the disk (higher utilization compared to a sequential pattern). This is not an issue for a single user, but could be one if there are many.

- Franz
Roch Bourbonnais - Performance Engineering
2006-May-12 13:19 UTC
[zfs-discuss] ZFS and databases
Peter Rival writes:
> I think you're missing one other thing - handling the memory overload of
> having orders of magnitude more accessed data than you have memory.
> Think about how you can handle having a couple hundred GB of dirty data
> being written by many threads (say either tablespace creates or temp
> table creation for a large table join) - fsflush and writebehind et al.
> just can't keep up with it.

When you dirty data enough, ZFS will start to throttle those writers; a bit like ufs_HW but at the system level. So most data in the ARC cache should be evictable on demand. There are issues in the current state of the code that make the amount of dirty data greater than we'd like, but it's limited by design.

> Of course, I know ZFS is "better" but to be useable in those situations
> it needs to be probably an order of magnitude better or so, and I
> haven't seen any data on a decently big rig with a proper storage
> config that shows that it is. I'm not saying it's not, I'm just saying
> I haven't seen the data. :)
> Like you said, Roch, I've been down this road before and don't want to
> go down it again. ;)

Yes, performance wise, ZFS is already fast on lots of tests _and_ a big moving target. That's another great thing about it. But keep those scenarios coming; it's always interesting to make sure they're covered.

-r
Roch Bourbonnais - Performance Engineering
2006-May-12 13:36 UTC
[zfs-discuss] ZFS and databases
Franz Haberhauer writes:
> > "ZFS optimizes random writes versus potential sequential reads."
>
> [...]
> Now with ZFS the writes are fast, but the later sequential reads
> probably not - readahead may help with this wrt. latency (data may
> already be available in the file buffer when the DBMS requests them -
> yet the DBMS does readaheads as well). But it will still be random I/O
> to the disk (higher utilization compared to a sequential pattern).
> This is not an issue for a single user, but could be one if there are
> many.

Last summer, a little experiment took me by surprise. We had a tight loop issuing single synchronous I/Os to raw. Results were:

  size: 2048, count 1000, secs 3.96 : random (same cyl ?)
  size: 2048, count 1000, secs 6.02 : sequential
  size: 2048, count 1000, secs 6.34 : random (random cyl ?)

So it looks like for a 2K write we have, in order:

  write to same cylinder, random offset (fastest)
  write to same cylinder, sequential offset (slower)
  write to random cylinder (slowest)

So it kind of makes sense; if I issue a write just after one completes, then it will take a full rotational latency for it to get going. If it's random on the same cylinder it will be more like half that. Sequential is good _if_ you can keep a pipe of I/Os hitting in stride. But with a pipe of enough concurrent I/Os we can be close to that kind of performance; or at least this has not been proven wrong yet.

-r
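A sketch of the sort of micro-test behind those numbers (not the original code): 1000 synchronous 2 KB writes, sequential or at random offsets; the raw device path and region size are hypothetical:

/* Sketch: time 1000 synchronous 2 KB writes, sequential vs. random.
 * Device path and region size are hypothetical. */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/time.h>
#include <unistd.h>

#define IOSIZE  2048
#define COUNT   1000
#define SPAN    (32L * 1024 * 1024)   /* region of the device to hit */

int
main(int argc, char **argv)
{
    int random_mode = (argc > 1 && strcmp(argv[1], "random") == 0);
    char buf[IOSIZE];
    struct timeval t0, t1;
    int fd = open("/dev/rdsk/c1t1d0s0", O_WRONLY | O_DSYNC);
    int i;

    if (fd < 0) {
        perror("open");
        return (1);
    }
    memset(buf, 0, sizeof (buf));
    gettimeofday(&t0, NULL);
    for (i = 0; i < COUNT; i++) {
        /* O_DSYNC makes each write wait for the media before returning. */
        off_t off = random_mode ?
            (off_t)(lrand48() % (SPAN / IOSIZE)) * IOSIZE :
            (off_t)i * IOSIZE;
        if (pwrite(fd, buf, IOSIZE, off) != IOSIZE) {
            perror("pwrite");
            break;
        }
    }
    gettimeofday(&t1, NULL);
    printf("%s: %.2f secs\n", random_mode ? "random" : "sequential",
        (t1.tv_sec - t0.tv_sec) + (t1.tv_usec - t0.tv_usec) / 1e6);
    (void) close(fd);
    return (0);
}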
> Were the benefits coming from extra concurrency (no
> single writer lock) or avoiding the extra copy to page cache or
> from too much readahead that is not used before pages need to
> be recycled.

With QFS, a major benefit we see for databases and direct I/O is an effective doubling of the memory available to the database for caching. Without direct I/O, every page read winds up in the file system cache and the database cache. For large databases, this is the difference between retaining key indexes in memory, or not. (The block copy into user space is also not cheap.)

Anton
Roch Bourbonnais - Performance Engineering
2006-May-12 15:23 UTC
[zfs-discuss] Re: ZFS and databases
Anton B. Rang writes:
> With QFS, a major benefit we see for databases and direct I/O is an
> effective doubling of the memory available to the database for
> caching. Without direct I/O, every page read winds up in the file
> system cache and the database cache. For large databases, this is the
> difference between retaining key indexes in memory, or not.

For read it is an interesting concept. Since

  reading into cache,
  then copying into user space,
  then keeping data around but never using it

is not optimal. So there are 2 issues: there is the cost of the copy and there is the memory.

Now could we detect the pattern that makes holding on to the cached block not optimal, and do a quick freebehind after the copyout? Something like random access + very large file + poor cache hit ratio?

Now about avoiding the copy; that would mean DMA straight into user space? But if the checksum does not validate the data, what do we do? If storage is not raid-protected and we have to return EIO, I don't think we can do this _and_ corrupt the user buffer also; not sure what POSIX says for this situation.

Now latency wise, the cost of the copy is small compared to the I/O, right? So it now turns into an issue of saving some CPU cycles.

-r
> Now could we detect the pattern that makes holding on to the
> cached block not optimal, and do a quick freebehind after the
> copyout? Something like random access + very large file +
> poor cache hit ratio?

We might detect it ... or we could let the application give us the hint, via the directio ioctl, which for ZFS might mean not "bypass the cache" but "free cache as soon as possible." (The problem with detecting this situation is that we don't know future access patterns, and we don't know whether the application is doing its own caching, in which case any that we do isn't particularly useful ... unless there are subblock writes in the future, in which case our cache can be used to avoid the read-modify-write.)

> Now about avoiding the copy; that would mean DMA straight
> into user space? But if the checksum does not validate the
> data, what do we do? If storage is not raid-protected and we
> have to return EIO, I don't think we can do this _and_
> corrupt the user buffer also; not sure what POSIX says for
> this situation.

Well, direct I/O behaves that way today. Actually, paged I/O does as well -- we move one page at a time into user space, so if we encounter an error while reading a later portion of the request, the earlier portion of the user buffer will already have been overwritten. SUSv3 doesn't specify anything about buffer contents in the event of an error. (It even leaves the file offset undefined.) So I think we're safe here.

> Now latency wise, the cost of the copy is small compared to the
> I/O, right? So it now turns into an issue of saving some
> CPU cycles.

CPU cycles and memory bandwidth (which both can be in short supply on a database server).

Anton
I thought the benefits were from skipping the read-ahead logic. What was seen prior to the implementation of directio was this:

- System running a high(er) load. It was difficult to see why the load was higher, as oracle was the primary process(es).

After the implementation, the load on the server dropped from a load of ~6 (on a 6-way box) to a load of 1.5 to 2. The system 'felt' faster, as well.

It should be noted that traditional filesystem access (creating tables, etc.) dropped in performance by a factor of about 10. Again, I attributed this to the lack of a read-ahead capability. Luckily, creating new database tables is a relatively infrequent event.

Does anybody else have any other points on this?

-----
Gregory Shaw, IT Architect
Sun Microsystems Inc.
On Fri, May 12, 2006 at 05:23:53PM +0200, Roch Bourbonnais - Performance Engineering wrote:
> Now could we detect the pattern that makes holding on to the
> cached block not optimal, and do a quick freebehind after the
> copyout? Something like random access + very large file + poor cache hit
> ratio?

An interface to request no caching on a per-file basis would be good (madvise(2) should do for mmap'ed files; an fcntl(2) or open(2) flag would be better).

> Now about avoiding the copy; that would mean DMA straight
> into user space? But if the checksum does not validate the
> data, what do we do?

Who cares? You DMA into user space, check the checksum and if there's a problem return an error; so there's [corrupted] data in the user space buffer... but the app knows it, so what's the problem (see below)?

> If storage is not raid-protected and we
> have to return EIO, I don't think we can do this _and_
> corrupt the user buffer also; not sure what POSIX says for
> this situation.

If POSIX compliance is an issue just add new interfaces (possibly as simple as an open(2) flag).

> Now latency wise, the cost of the copy is small compared to the
> I/O, right? So it now turns into an issue of saving some
> CPU cycles.

Can you build a system where the cost of the copy adds significantly to the latency numbers? (Think RAM disks.)

Nico
--
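A sketch of the madvise(2) route for mmap'ed files, with a hypothetical path; whether ZFS would honour these hints is exactly the open question here:

/* Sketch: per-mapping cache hints via madvise(). Path is hypothetical. */
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int
main(void)
{
    struct stat st;
    char *p;
    long sum = 0;
    size_t i;
    int fd = open("/pool/db01/big_table.dbf", O_RDONLY);

    if (fd < 0 || fstat(fd, &st) != 0) {
        perror("open/fstat");
        return (1);
    }
    p = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) {
        perror("mmap");
        return (1);
    }
    /* Hint that the access pattern is random, so no aggressive readahead. */
    (void) madvise(p, st.st_size, MADV_RANDOM);

    for (i = 0; i < (size_t)st.st_size; i += 8192)
        sum += p[i];                       /* touch each block once */

    /* Done with this range: tell the VM the pages need not be kept. */
    (void) madvise(p, st.st_size, MADV_DONTNEED);
    printf("touched value: %ld\n", sum);
    (void) munmap(p, st.st_size);
    (void) close(fd);
    return (0);
}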
Roch Bourbonnais - Performance Engineering
2006-May-12 16:33 UTC
[zfs-discuss] Re: ZFS and databases
Nicolas Williams writes:
> An interface to request no caching on a per-file basis would be good
> (madvise(2) should do for mmap'ed files; an fcntl(2) or open(2) flag
> would be better).
> [...]
> If POSIX compliance is an issue just add new interfaces (possibly as
> simple as an open(2) flag).

Finally I can agree with somebody today.

Directio is non-POSIX anyway, and given that people have been trained to inform the system that the cache won't be useful, and that it's a hard problem to detect automatically, let's avoid the copy and save memory all at once for the read path.

We could use the directio() call for that ...

-r
On Fri, May 12, 2006 at 06:33:00PM +0200, Roch Bourbonnais - Performance Engineering wrote:
> Directio is non-posix anyway, and given that people have been trained
> to inform the system that the cache won't be useful, and that it's a
> hard problem to detect automatically, let's avoid the copy and save
> memory all at once for the read path.
>
> We could use the directio() call for that ...

I had no idea about directio(3C)! We might want an interface for the app to know what the natural block size of the file is, so it can read at proper file offsets. Of course, if that block size is smaller than the ZFS filesystem record size then ZFS may yet grow it. How to deal with this? (One option: don't grow it as long as an app has turned direct I/O on for a fildes and the fildes remains open.)

Nico --
On Fri, 2006-05-12 at 10:42 -0500, Anton Rang wrote:> > Now latency wise, the cost of copy is small compared to the > > I/O; right ? So it now turns into an issue of saving some > > CPU cycles. > > CPU cycles and memory bandwidth (which both can be in short > supply on a database server).We can throw hardware at that :-) Imagine a machine with lots of extra CPU cycles and lots of parallel access to multiple memory banks. This is the strategy behind CMT. In the future, you will have many more CPU cycles and even better memory bandwidth than you do now, perhaps by an order of magnitude in the next few years. -- richard
On Fri, May 12, 2006 at 09:59:56AM -0700, Richard Elling wrote:> On Fri, 2006-05-12 at 10:42 -0500, Anton Rang wrote: > > > Now latency wise, the cost of copy is small compared to the > > > I/O; right ? So it now turns into an issue of saving some > > > CPU cycles. > > > > CPU cycles and memory bandwidth (which both can be in short > > supply on a database server). > > We can throw hardware at that :-) Imagine a machine with lots > of extra CPU cycles and lots of parallel access to multiple > memory banks. This is the strategy behind CMT. In the future, > you will have many more CPU cycles and even better memory > bandwidth than you do now, perhaps by an order of magnitude > in the next few years.Well, yes, of course, but I think the arguments for direct I/O are excellent. Another thing that I see an argument for is limiting the size of various caches, to avoid paging (even having no swap isn''t enough as you don''t want memory pressure evicting hot text pages). Nico --
> We might want an interface for the app to know what the natural block > size of the file is, so it can read at proper file offsets.Seems that stat(2) could be used for this ... long st_blksize; /* Preferred I/O block size */ This isn''t particularly useful for databases if they already have a fixed page size, though. There isn''t a comparable way for the application to indicate a record size to the file system, and there probably ought to be, particularly since the size of writes used to initially create a file (if any) may be quite different than the size of writes used for updating the file. (In the long term, it might be interesting to study dynamically splitting blocks which are written using small record sizes.) -- Anton
On May 12, 2006, at 11:59 AM, Richard Elling wrote:>> CPU cycles and memory bandwidth (which both can be in short >> supply on a database server). > > We can throw hardware at that :-) Imagine a machine with lots > of extra CPU cycles [ ... ]Yes, I''ve heard this story before, and I won''t believe it this time. ;-) Seriously, I believe a database can perform very well on a CMT system, but there won''t be any "extra" CPU cycles or memory bandwidth, because the demand for transaction rates will always exceed what we can supply. Anton
I really like the below idea: - the ability to defragment a file ''live''. I can see instances where that could be very useful. For instance, if you have multiple LUNs (or spindles, whatever) using ZFS, you could re-optimize large files to spread the chunks across as many spindles as possible. Or, as stated below, make it contiguous. I don''t know if that is automatic with ZFS today, but it''s an idea. On May 12, 2006, at 6:18 AM, Franz Haberhauer wrote:> > > Roch Bourbonnais - Performance Engineering wrote On 05/12/06 09:30,: >> Tao Chen writes: >> > On 5/11/06, Peter Rival <peter.rival at sun.com> wrote: >> > > Richard Elling wrote: >> > > > Oracle will zero-fill the tablespace with 128kByte iops -- >> it is not >> > > > sparse. I''ve got a scar. Has this changed in the past few >> years? >> > > >> > > Multiple parallel tablespace creates is usually a big pain >> point for filesystem / cache interaction, and also fragmentation >> once in a while. The latter ZFS should take care of; the former, >> well, I dunno. >> > > >> > > The purpose of zero-filled tablespace is to prevent >> fragmentation by >> > future writes, in the case when multiple tablespaces are being >> > updated/filled on the same disk, correct? >> That and also there was a need for block reservation. Thus >> posix_fallocate was added (recently). > > > > > This becomes pointless on ZFS, since it never overwrites the same > > > pre-allocated block, i.e. the tablespace becomes fragmented in > that > > > case no matter what. > > > > is fragmented the right word here ? > > Anyway: random writes can be turned into sequential. > > This really optimizes the data on disk for full table scans > (sequential reads of the whole tables or portions of them). > Random access may be supported from a DMBS perspective using > indexes - then read access patterns are random anyway. > In contrast to usual files DBMS files exhibit different access > patterns: > They are often loaded sequentially (resulting in a nice layout for > later sequential reads) yet updates to the tables will erode this > sequential layout over time (e.g. updating accounts as they are > credited > and debited online), so later full table scans will suffer. > ZFS optimizes random writes versus potential sequential reads. > this may hurt if there are many full table scans - but DBMS > designers try to avoid unnecessary full table scans anyway. > On the other hand there are use cases in which tables are updated > randomly and read sequentially many times (e.g. in batch runs) - > here overall performance may suffer. > An way to resolve this could be an option to transparently > "reorganize" > a file (a tool triggering an internal rewrite - such that the > data become sequential on disk again - on-the-fly while the > DBMS is running/file is open). > > - Franz > _______________________________________________ > zfs-discuss mailing list > zfs-discuss at opensolaris.org > http://mail.opensolaris.org/mailman/listinfo/zfs-discuss----- Gregory Shaw, IT Architect Phone: (303) 673-8273 Fax: (303) 673-8273 ITCTO Group, Sun Microsystems Inc. 1 StorageTek Drive ULVL4-382 greg.shaw at sun.com (work) Louisville, CO 80028-4382 shaw at fmsoft.com (home) "When Microsoft writes an application for Linux, I''ve Won." - Linus Torvalds
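For what it's worth, the "reorganize on request" idea can be crudely approximated from user space today by rewriting the file sequentially and renaming the copy into place. This is only a sketch - it is not safe while the file is open for writing, and it loses holes, ownership and extra links - but it shows the shape of what a built-in, online version would do better:

    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    #define CHUNK   (1024 * 1024)           /* copy in 1 MB pieces */

    int
    main(int argc, char **argv)
    {
            char tmp[1024];
            char *buf;
            int in, out;
            ssize_t n;

            if (argc != 2) {
                    (void) fprintf(stderr, "usage: %s file\n", argv[0]);
                    return (1);
            }
            (void) snprintf(tmp, sizeof (tmp), "%s.reorg", argv[1]);
            if ((buf = malloc(CHUNK)) == NULL)
                    return (1);
            if ((in = open(argv[1], O_RDONLY)) < 0 ||
                (out = open(tmp, O_WRONLY | O_CREAT | O_EXCL, 0600)) < 0) {
                    perror("open");
                    return (1);
            }

            /* Sequential copy: the allocator lays the new file out afresh. */
            while ((n = read(in, buf, CHUNK)) > 0) {
                    if (write(out, buf, n) != n) {
                            perror("write");
                            return (1);
                    }
            }
            if (n < 0 || fsync(out) != 0 || rename(tmp, argv[1]) != 0) {
                    perror("reorganize");
                    return (1);
            }
            (void) close(in);
            (void) close(out);
            return (0);
    }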
On Fri, May 12, 2006 at 12:36:53PM -0500, Anton Rang wrote:> >We might want an interface for the app to know what the natural block > >size of the file is, so it can read at proper file offsets. > > Seems that stat(2) could be used for this ... > > long st_blksize; /* Preferred I/O block size */And in fact, it is! :-) In general, this will be 128k on ZFS filesystems. If you have changed the ''recordsize'' property, then it will be that value (for files created after the property was changed). --matt
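A quick way to see this from an application is to look at st_blksize directly; the sketch below is hypothetical only in whatever sample paths you pass it - the interface itself is plain stat(2). Setting the property first (for example, zfs set recordsize=8k tank/db, with a made-up dataset name) and then creating files there should show up in the output.

    #include <sys/stat.h>
    #include <stdio.h>

    int
    main(int argc, char **argv)
    {
            struct stat st;
            int i;

            for (i = 1; i < argc; i++) {
                    if (stat(argv[i], &st) != 0) {
                            perror(argv[i]);
                            continue;
                    }
                    /* Preferred I/O size; on ZFS this reflects the record size. */
                    (void) printf("%s: st_blksize = %ld\n",
                        argv[i], (long)st.st_blksize);
            }
            return (0);
    }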
Given that ISV apps can only be changed by the ISV, who may or may not be willing to use such a new interface, having a "no cache" property for the file - or, given that filesystems are now really cheap with ZFS, for the filesystem - would be important as well, like the forcedirectio mount option for UFS. No caching at the filesystem level is always appropriate if the application itself maintains a buffer of application data and does its own application-specific buffer management, like DBMSes or large matrix solvers. Double caching these typically huge amounts of data in the filesystem is always a waste of RAM.

- Franz

Nicolas Williams wrote:
>An interface to request no caching on a per-file basis would be good
>(madvise(2) should do for mmap'ed files, an fcntl(2) or open(2) flag
>would be better).
[...]
On Sat, May 13, 2006 at 08:23:55AM +0200, Franz Haberhauer wrote:
> having a "no cache" property for the file - or, given that filesystems
> are now really cheap with ZFS, for the filesystem - would be important
> as well, like the forcedirectio mount option for UFS.
[...]
> Double caching these typically huge amounts of data in the filesystem
> is always a waste of RAM.

Yes, but remember, DB vendors have adopted new features before -- they want to have the fastest DB. Same with open source web servers. So I'm a bit optimistic.

Also, an LD_PRELOAD library could be provided to enable direct I/O as necessary.

Nico --
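A rough sketch of such a preload library, assuming the filesystem underneath honors directio(3C) - UFS does; ZFS would need to grow support. The library name and the choice to ask for direct I/O on every open(2) are purely illustrative:

    /* forcedio.c - interpose on open(2) and request direct I/O on every file.
     *
     * Build (roughly):  cc -D__EXTENSIONS__ -Kpic -G -o libforcedio.so forcedio.c -ldl
     * Use:              LD_PRELOAD=./libforcedio.so dbms ...
     */
    #include <dlfcn.h>
    #include <fcntl.h>
    #include <stdarg.h>
    #include <sys/types.h>

    extern int directio(int, int);          /* directio(3C) */

    int
    open(const char *path, int oflag, ...)
    {
            static int (*real_open)(const char *, int, ...);
            va_list ap;
            mode_t mode = 0;
            int fd;

            if (real_open == NULL)
                    real_open = (int (*)(const char *, int, ...))
                        dlsym(RTLD_NEXT, "open");

            va_start(ap, oflag);
            if (oflag & O_CREAT)
                    mode = va_arg(ap, mode_t);
            va_end(ap);

            fd = real_open(path, oflag, mode);

            /* Best effort: a filesystem that lacks the hint just says no. */
            if (fd >= 0)
                    (void) directio(fd, DIRECTIO_ON);

            return (fd);
    }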
Roch Bourbonnais - Performance Engineering
2006-May-15 14:47 UTC
[zfs-discuss] ZFS and databases
Gregory Shaw writes: > I really like the below idea: > - the ability to defragment a file ''live''. > > I can see instances where that could be very useful. For instance, > if you have multiple LUNs (or spindles, whatever) using ZFS, you > could re-optimize large files to spread the chunks across as many > spindles as possible. Or, as stated below, make it contiguous. > I don''t know if that is automatic with ZFS today, but it''s an idea. I think the expected benefits of making it contiguous is rooted in the belief that bigger I/Os is the only way to reach top performance. I think that before ZFS, both physical and logical contiguity was required to enable sufficient readahead and get performance. Once we have good readahead based on detected logical contiguous accesses, It may well be possible to get top device speed through reasonably-sized I/O concurrency. -r
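To make "reasonably-sized I/O concurrency" concrete: an application (or a prefetcher) can keep several 128K reads in flight with POSIX AIO instead of issuing one huge request. A sketch only, with most error handling trimmed (link with -lrt; the request count and size are arbitrary choices):

    #include <aio.h>
    #include <errno.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    #define NREQ    8               /* requests kept in flight */
    #define IOSIZE  (128 * 1024)    /* "reasonably sized" I/O */

    int
    main(int argc, char **argv)
    {
            struct aiocb cb[NREQ];
            const struct aiocb *list[NREQ];
            char *buf;
            off_t off = 0;
            long long total = 0;
            int i, fd, busy = 0;

            if (argc != 2 || (fd = open(argv[1], O_RDONLY)) < 0)
                    return (1);
            if ((buf = malloc((size_t)NREQ * IOSIZE)) == NULL)
                    return (1);
            (void) memset(cb, 0, sizeof (cb));

            /* Prime the pump: NREQ reads at consecutive 128K offsets. */
            for (i = 0; i < NREQ; i++) {
                    cb[i].aio_fildes = fd;
                    cb[i].aio_buf = buf + (size_t)i * IOSIZE;
                    cb[i].aio_nbytes = IOSIZE;
                    cb[i].aio_offset = off;
                    off += IOSIZE;
                    list[i] = &cb[i];
                    if (aio_read(&cb[i]) == 0)
                            busy++;
                    else
                            list[i] = NULL;
            }

            /* As each read completes, reissue that slot at the next offset. */
            while (busy > 0) {
                    (void) aio_suspend(list, NREQ, NULL);
                    for (i = 0; i < NREQ; i++) {
                            ssize_t n;

                            if (list[i] == NULL ||
                                aio_error(&cb[i]) == EINPROGRESS)
                                    continue;
                            n = aio_return(&cb[i]);
                            busy--;
                            if (n <= 0) {           /* EOF or error: retire slot */
                                    list[i] = NULL;
                                    continue;
                            }
                            total += n;
                            cb[i].aio_offset = off;
                            off += IOSIZE;
                            if (aio_read(&cb[i]) == 0)
                                    busy++;
                            else
                                    list[i] = NULL;
                    }
            }
            (void) printf("read %lld bytes\n", total);
            return (0);
    }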
Rich, correct me if I''m wrong, but here''s the scenario I was thinking of: - A large file is created. - Over time, the file grows and shrinks. The anticipated layout on disk due to this is that extents are allocated as the file changes. The extents may or may not be on multiple spindles. I envision a fragmentation over time that will cause sequential access to jump all over the place. If you use smart controllers or disks with read caching, their use of stripes and read-ahead (if enabled) could cause performance to be bad. So, my thought was to de-fragment the file to make it more contiguous and to allow hardware read-ahead to be effective. An additional benefit would be to spread it across multiple spindles in a contiguous fashion, such as: disk0: 32mb disk1: 32mb disk2: 32mb ... etc. Perhaps this is unnecessary. I''m simply trying to grasp the long term performance implications of COW data. On May 15, 2006, at 8:47 AM, Roch Bourbonnais - Performance Engineering wrote:> > Gregory Shaw writes: > >> I really like the below idea: >> - the ability to defragment a file ''live''. >> >> I can see instances where that could be very useful. For instance, >> if you have multiple LUNs (or spindles, whatever) using ZFS, you >> could re-optimize large files to spread the chunks across as many >> spindles as possible. Or, as stated below, make it contiguous. >> I don''t know if that is automatic with ZFS today, but it''s an idea. > > I think the expected benefits of making it contiguous is > rooted in the belief that bigger I/Os is the only way to > reach top performance. > > I think that before ZFS, both physical and logical > contiguity was required to enable sufficient readahead and > get performance. > > Once we have good readahead based on detected logical > contiguous accesses, It may well be possible to get top > device speed through reasonably-sized I/O concurrency. > > -r >----- Gregory Shaw, IT Architect Phone: (303) 673-8273 Fax: (303) 673-8273 ITCTO Group, Sun Microsystems Inc. 1 StorageTek Drive ULVL4-382 greg.shaw at sun.com (work) Louisville, CO 80028-4382 shaw at fmsoft.com (home) "When Microsoft writes an application for Linux, I''ve Won." - Linus Torvalds
The problem I see with "sequential access jump all over the place" is that this increases the utilization of the disks - over the years disks have become even faster for sequential access, whereas random access (as they have to move the actuator) has not improved at the same pace - this is what ZFS exploits when writing. With its fancy detection of sequential access patterns and improved readahead, ZFS should be able to deal with the latency aspect of randomized read accesses - but at the expense of a higher disk utilization. If you think of many processes accessing the same disks this may result in disks "running out of IOPS" earlier than in an environment with sequential accesses (though contiguos data). Obviously this heavily depends on the workload - but with the trend towards even higher capacity disks, IOPS become a valuable resource and it may be worth to think about how to most efficiently use disks - a "self-optimizing" mechanism that in the background or on request rearranges files to become contigous may therefore be useful. - Franz Gregory Shaw wrote:> Rich, correct me if I''m wrong, but here''s the scenario I was thinking > of: > > - A large file is created. > - Over time, the file grows and shrinks. > > The anticipated layout on disk due to this is that extents are > allocated as the file changes. The extents may or may not be on > multiple spindles. > > I envision a fragmentation over time that will cause sequential > access to jump all over the place. If you use smart controllers or > disks with read caching, their use of stripes and read-ahead (if > enabled) could cause performance to be bad. > > So, my thought was to de-fragment the file to make it more contiguous > and to allow hardware read-ahead to be effective. > > An additional benefit would be to spread it across multiple spindles > in a contiguous fashion, such as: > > disk0: 32mb > disk1: 32mb > disk2: 32mb > ... etc. > > Perhaps this is unnecessary. I''m simply trying to grasp the long > term performance implications of COW data. > > On May 15, 2006, at 8:47 AM, Roch Bourbonnais - Performance > Engineering wrote: > >> >> Gregory Shaw writes: >> >>> I really like the below idea: >>> - the ability to defragment a file ''live''. >>> >>> I can see instances where that could be very useful. For instance, >>> if you have multiple LUNs (or spindles, whatever) using ZFS, you >>> could re-optimize large files to spread the chunks across as many >>> spindles as possible. Or, as stated below, make it contiguous. >>> I don''t know if that is automatic with ZFS today, but it''s an idea. >> >> >> I think the expected benefits of making it contiguous is >> rooted in the belief that bigger I/Os is the only way to >> reach top performance. >> >> I think that before ZFS, both physical and logical >> contiguity was required to enable sufficient readahead and >> get performance. >> >> Once we have good readahead based on detected logical >> contiguous accesses, It may well be possible to get top >> device speed through reasonably-sized I/O concurrency. >> >> -r >> > > ----- > Gregory Shaw, IT Architect > Phone: (303) 673-8273 Fax: (303) 673-8273 > ITCTO Group, Sun Microsystems Inc. > 1 StorageTek Drive ULVL4-382 greg.shaw at sun.com (work) > Louisville, CO 80028-4382 shaw at fmsoft.com (home) > "When Microsoft writes an application for Linux, I''ve Won." - Linus > Torvalds > >
Nicolas Williams wrote:
>On Sat, May 13, 2006 at 08:23:55AM +0200, Franz Haberhauer wrote:
[...]
>Yes, but remember, DB vendors have adopted new features before -- they
>want to have the fastest DB.  Same with open source web servers.  So I'm
>a bit optimistic.

Yes, but they usually adopt it only with their latest releases, but it takes time until these are adopted by customers. And it's not just DB vendors, there are other apps around which could benefit, and there are always some who may not adopt a new feature in Solaris at all. Remember when UFS Directio was introduced - forcedirectio was in much wider use than apps which used the API directly.

>Also, an LD_PRELOAD library could be provided to enable direct I/O as
>necessary.

This would work technically, but whether ISVs are willing to support such usage is a different topic (there may be startup scripts involved, making it a little tricky to pass a library path to the app). So while having the app request no caching may be the architecturally cleaner approach, having it as a property on a file or filesystem is a pragmatic approach, with faster time-to-market and a potential for much broader use.

- Franz
Franz Haberhauer wrote:> This would work technically, but wether ISVs are willing to support such > usage is a different > topic (there may be startup scripts involved making it a little tricky > to pass an library path > to the app).Yet another reason to start the applications from SMF. -- Darren J Moffat
On Mon, May 15, 2006 at 07:16:38PM +0200, Franz Haberhauer wrote:> Nicolas Williams wrote: > >Yes, but remember, DB vendors have adopted new features before -- they > >want to have the fastest DB. Same with open source web servers. So I''m > >a bit optimistic. > > > > > Yes, but they usually adopt it only with their latest releases, but it > takes time until these are > adopted by customers. And it''s not just DB vendors, there are other apps > around which could > benefit, and there are always some who may not adopt a new feature in > Solaris at all. > Remember when UFS Directio was introduced - forcedirectio was in much > wider use than > apps which used the API directly.I (but I''m not in the ZFS team) don''t oppose a file attribute of some sort to provide hints to the FS about the utility of direct I/O to processes that open such files. Ideally the OS could just figure it out every time with enough accuracy that no interface should be necessary at all, but I''m not sure that this is possible. But really, the right interface is for the application to tell the OS. I don''t know what others (marketing particularly -- you may well be right about time to market) here think of it but if we could just stick to proper interfaces that would be best. Nico --
Nicolas Williams wrote:> On Mon, May 15, 2006 at 07:16:38PM +0200, Franz Haberhauer wrote: >> Nicolas Williams wrote: >>> Yes, but remember, DB vendors have adopted new features before -- they >>> want to have the fastest DB. Same with open source web servers. So I''m >>> a bit optimistic. >>> >>> >> Yes, but they usually adopt it only with their latest releases, but it >> takes time until these are >> adopted by customers. And it''s not just DB vendors, there are other apps >> around which could >> benefit, and there are always some who may not adopt a new feature in >> Solaris at all. >> Remember when UFS Directio was introduced - forcedirectio was in much >> wider use than >> apps which used the API directly. > > I (but I''m not in the ZFS team) don''t oppose a file attribute of some > sort to provide hints to the FS about the utility of direct I/O to > processes that open such files. > > Ideally the OS could just figure it out every time with enough accuracy > that no interface should be necessary at all, but I''m not sure that this > is possible. > > But really, the right interface is for the application to tell the OS. > I don''t know what others (marketing particularly -- you may well be > right about time to market) here think of it but if we could just stick > to proper interfaces that would be best. > > NicoPerhaps an fadvise call is in order? - Bart -- Bart Smaalders Solaris Kernel Performance barts at cyber.eng.sun.com http://blogs.sun.com/barts
On Mon, May 15, 2006 at 11:17:17AM -0700, Bart Smaalders wrote:> Perhaps an fadvise call is in order?We already have directio(3C). (That was a surprise for me also.)
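Using it is about as simple as an interface gets - the call is advisory, so a filesystem that does not implement it (ZFS, as of this thread) just fails the request and the caller can carry on. A minimal sketch; the path is made up, and where posix_fadvise() exists, POSIX_FADV_NOREUSE would be the portable way to say roughly the same thing:

    #include <sys/types.h>
    #include <sys/fcntl.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int
    main(void)
    {
            char buf[8192];
            int fd;

            /* The path is a made-up example. */
            fd = open("/tank/db/table.dbf", O_RDONLY);
            if (fd < 0) {
                    perror("open");
                    return (1);
            }

            /*
             * Advisory: ask that I/O on this descriptor bypass the
             * filesystem cache.  UFS honors this; a filesystem that does
             * not implement it fails the call and we simply continue.
             */
            if (directio(fd, DIRECTIO_ON) != 0)
                    perror("directio (ignored)");

            if (pread(fd, buf, sizeof (buf), 0) < 0)
                    perror("pread");

            (void) close(fd);
            return (0);
    }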
Roch Bourbonnais - Performance Engineering
2006-May-22 15:07 UTC
[zfs-discuss] ZFS and databases
Gregory Shaw writes:
 > So, my thought was to de-fragment the file to make it more contiguous
 > and to allow hardware read-ahead to be effective.
 [...]
 > Perhaps this is unnecessary.  I'm simply trying to grasp the long
 > term performance implications of COW data.

And my take is that, if I spread the 128K block all over but read then sufficiently ahead (say 2MB) then I shall be very much OK from the performance standpoint.

Actually I just ran this (8M reads) and am getting pretty good numbers (single disk pool):

rbourbon at crazycanucks(44): zpool iostat 1
               capacity     operations    bandwidth
pool         used  avail   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
zfs         24.4G  9.38G      0      0  59.5K     42
zfs         24.4G  9.38G    496      0  62.1M      0
zfs         24.4G  9.38G    497      0  62.1M      0
zfs         24.4G  9.38G    496      0  62.0M      0
zfs         24.4G  9.38G    496      0  62.0M      0
zfs         24.4G  9.38G    497      0  62.1M      0
zfs         24.4G  9.38G    493      0  61.6M      0
zfs         24.4G  9.38G    491      0  61.4M      0
zfs         24.4G  9.38G    492      0  61.5M      0
zfs         24.4G  9.38G    485      0  60.6M      0

So what benefit do you see from re-laying out the file, is it just performance, or something else ?

One benefit that ZFS gets out of working with smaller chunks is that every time one I/O completes then ZFS can decide which of the ZIOs in the in-kernel queue has the highest priority. If you swamp a disk with a ton of very large I/Os, they will take more time to complete and high priority ones that show up in-between will just have to wait more.

-r
On May 22, 2006, at 9:07 AM, Roch Bourbonnais - Performance Engineering wrote:
> And my take is that, if I spread the 128K block all over but read then
> sufficiently ahead (say 2MB) then I shall be very much OK from the
> performance standpoint.
[...]
> So what benefit do you see from re-laying out the file, is it just
> performance, or something else ?

Were the above reads random access? I was thinking of the case of a database. I've got a number of 2g (or larger) files that over time get written in a random fashion. With COW, I would expect the rewrite to cause the file extents to slowly migrate to be all over the disk. I think this would be sub-optimal, as the disk would have to do a lot of seeks to access data that may be contiguous in the file, but not on the disk.

An example of this would be to create a 2g file, and re-write the file in random chunks. After a set percentage of the data were replaced (say 10% intervals), a comparison sequential read test would be performed. If the numbers grow as the file is updated, that would indicate changes due to COW and file fragmentation.

Have you heard of stkio? It's a simple tool to simulate different types of disk loads that could be used for this testing. I've attached it for reference:

[Attachments scrubbed by the list archive:
 stkio.tar (application/x-tar, 133120 bytes):
 <http://mail.opensolaris.org/pipermail/zfs-discuss/attachments/20060522/09dd0c5c/attachment.tar>
 stkio.txt:
 <http://mail.opensolaris.org/pipermail/zfs-discuss/attachments/20060522/09dd0c5c/attachment.txt>]

-----
Gregory Shaw, IT Architect
Phone: (303) 673-8273        Fax: (303) 673-8273
ITCTO Group, Sun Microsystems Inc.
1 StorageTek Drive ULVL4-382          greg.shaw at sun.com (work)
Louisville, CO 80028-4382             shaw at fmsoft.com (home)
"When Microsoft writes an application for Linux, I've Won." - Linus Torvalds
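That experiment is small enough to sketch directly. The program below (the 8K page size and fill pattern are arbitrary choices) rewrites a chosen percentage of an existing file in random 8K chunks; after creating the file (mkfile 2g bigfile, say) one could alternate it with a timed sequential read such as dd if=bigfile of=/dev/null bs=128k at each 10% step and watch whether the read rate erodes:

    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/stat.h>
    #include <unistd.h>

    #define BLK     8192            /* database-style page size */

    int
    main(int argc, char **argv)
    {
            struct stat st;
            char buf[BLK];
            long long nblk, todo, i;
            int fd, pct;

            if (argc != 3) {
                    (void) fprintf(stderr, "usage: %s file percent\n", argv[0]);
                    return (1);
            }
            pct = atoi(argv[2]);
            if ((fd = open(argv[1], O_RDWR)) < 0 || fstat(fd, &st) != 0) {
                    perror(argv[1]);
                    return (1);
            }
            (void) memset(buf, 0x5a, sizeof (buf));
            nblk = st.st_size / BLK;
            todo = nblk * pct / 100;

            srand48(getpid());
            for (i = 0; i < todo; i++) {
                    /* pick a random 8K-aligned offset and overwrite it */
                    off_t off = (off_t)(drand48() * nblk) * BLK;

                    if (pwrite(fd, buf, BLK, off) != BLK) {
                            perror("pwrite");
                            return (1);
                    }
            }
            (void) fsync(fd);
            (void) close(fd);
            (void) printf("rewrote %lld of %lld blocks\n", todo, nblk);
            return (0);
    }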
Roch Bourbonnais - Performance Engineering
2006-May-22 15:50 UTC
[zfs-discuss] ZFS and databases
Cool, I'll try the tool. And for good measure: the data I posted was sequential access (from a logical point of view). As for the physical layout, I don't know; it's quite possible that ZFS has laid out all blocks sequentially on the physical side, so certainly this is not a good way to test random-read access. Looked too good. -r
Sorry for resurrecting this interesting discussion so late: I'm skimming backwards through the forum.

One comment about segregating database logs is that people who take their data seriously often want a 'belt plus suspenders' approach to recovery. Conventional RAID, even supplemented with ZFS's self-healing scrubbing, isn't sufficient (though RAID-6 might be): they want at least the redo logs separate so that in the extremely unlikely event that they lose something in the (already replicated) database the failure is guaranteed not to have affected the redo logs as well, from which they can reconstruct the current database state from a backup.

True, this will mean that you can't aggregate redo log activity with other transaction bulk-writes, but that's at least partly good as well: databases are often extremely sensitive to redo log write latency and would not want such writes delayed by combination with other updates, let alone by up to a 5-second delay. ZFS's synchronous write intent log could help here (if you replicate it: serious database people would consider even the very temporary exposure to a single failure inherent in an unmirrored log completely unacceptable), but that could also be slowed by other synch small write activity; conversely, databases often couldn't care less about the latency of many of their other writes, because their own (replicated) redo log has already established the persistence that they need.

As for direct I/O, it's not clear why ZFS couldn't support it: it could verify each read in user memory against its internal checksum and perform its self-healing magic if necessary before returning completion status (which would be the same status it would return if the same situation occurred during its normal mode of operation: either unconditional success or success-after-recovery if the application might care to know that); it could handle each synchronous write analogously, and if direct I/O mechanisms support lazy writes then presumably they tie up the user buffer until the write completes such that you could use your normal mechanisms there as well (just operating on the user buffer instead of your cache). In this I'm assuming that 'direct I/O' refers not to raw device access but to file-oriented access that simply avoids any internal cache use, such that you could still use your no-overwrite approach.

Of course, this also assumes that the direct I/O is always being performed in aligned integral multiples of checksum units by the application; if not, you'd either have to bag the checksum facility (this would not be an entirely unreasonable option to offer, given that some sophisticated applications might want to use their own even higher-level integrity mechanisms, e.g., across geographically-separated sites, and would not need yours) or run everything through cache as you normally do. In suitably-aligned cases where you do validate the data you could avoid half the copy overhead (an issue of memory bandwidth as well as simply operation latency: TPC-C submissions can be affected by this, though it may be rare in real-world use) by integrating the checksum calculation with the copy, but would still have multiple copies of the data taking up memory in a situation (direct I/O) where the application *by definition* does not expect you to be caching the data (quite likely because it is doing any desirable caching itself).
Tablespace contiguity may, however, be a deal-breaker for some users: it is common for tablespaces to be scanned sequentially (when selection criteria don't mesh with existing indexes, perhaps especially in joins where the smaller tablespace (still too large to be retained in cache, though) is scanned repeatedly in an inner loop), and a DBMS often goes to some effort to keep them defragmented. Until ZFS provides some effective continuous defragmenting mechanisms of its own, its no-overwrite policy may do more harm than good in such cases (since the database's own logs keep persistence latency low, while the backing tablespaces can then be updated at leisure).

I do want to comment on the observation that "enough concurrent 128K I/O can saturate a disk" - the apparent implication being that one could therefore do no better with larger accesses, an incorrect conclusion. Current disks can stream out 128 KB in 1.5 - 3 ms., while taking 5.5 - 12.5 ms. for the average-seek-plus-partial-rotation required to get to that 128 KB in the first place. Thus on a full drive serial random accesses to 128 KB chunks will yield only about 20% of the drive's streaming capability (by contrast, accessing data using serial random accesses in 4 MB contiguous chunks achieves around 90% of a drive's streaming capability): one can do better on disks that support queuing if one allows queues to form, but this trades significantly increased average operation latency for the increase in throughput (and said increase still falls far short of the 90% utilization one could achieve using 4 MB chunking).

Enough concurrent 0.5 KB I/O can also saturate a disk, after all - but this says little about effective utilization.

Others have touched on several of these points as well - apologies for any repetition arising from writing while I read.

- bill
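The arithmetic behind those percentages is worth spelling out. With purely illustrative numbers - say a 64 MB/s streaming rate and roughly 9 ms of seek plus partial rotation per random access - utilization is just transfer time divided by positioning-plus-transfer time:

    #include <stdio.h>

    int
    main(void)
    {
            double rate = 64.0;     /* assumed streaming rate, MB/s */
            double tpos = 9.0;      /* assumed avg seek + partial rotation, ms */
            double sizes_kb[] = { 0.5, 128.0, 4096.0 };
            int i;

            for (i = 0; i < 3; i++) {
                    /* transfer time for this chunk size, in milliseconds */
                    double txfer = sizes_kb[i] / 1024.0 / rate * 1000.0;
                    double util = txfer / (tpos + txfer);

                    (void) printf("%7.1f KB chunks: %5.1f%% of streaming rate\n",
                        sizes_kb[i], util * 100.0);
            }
            return (0);
    }

Under those assumptions this prints roughly 0.1% for 0.5 KB, 18% for 128 KB and 87% for 4 MB chunks - the ~20% and ~90% figures quoted above, and also why enough tiny concurrent I/Os can "saturate" a disk while accomplishing very little.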
billtodd wrote:> I do want to comment on the observation that "enough concurrent 128K I/O can > saturate a disk" - the apparent implication being that one could therefore do > no better with larger accesses, an incorrect conclusion. Current disks can > stream out 128 KB in 1.5 - 3 ms., while taking 5.5 - 12.5 ms. for the > average-seek-plus-partial-rotation required to get to that 128 KB in the first > place. Thus on a full drive serial random accesses to 128 KB chunks will yield > only about 20% of the drive''s streaming capability (by contrast, accessing > data using serial random accesses in 4 MB contiguous chunks achieves around > 90% of a drive''s streaming capability): one can do better on disks that > support queuing if one allows queues to form, but this trades significantly > increased average operation latency for the increase in throughput (and said > increase still falls far short of the 90% utilization one could achieve using > 4 MB chunking). > > Enough concurrent 0.5 KB I/O can also saturate a disk, after all - but this > says little about effective utilization.I think I can summarize where we are at on this. This is the classic big-{packet|block|$-line|bikini} versus small-{packet|block|$-line|bikini} argument. One size won''t fit all. The jury is still out on what all of this means for any given application. -- richard
For output ops, ZFS could set up a 10MB I/O transfer to disk starting at sector X, or chunk that up into 128K pieces while still assigning the same range of disk blocks for the operations. Yes, there will be more control information going around, a little more CPU consumed, but the disk will be streaming all right, I would guess. Most heavy output load will behave this way with ZFS, random or not. The throughput will depend more on the availability of contiguous chunks of disk blocks than on the actual record size in use.

As for random input, the issue is that ZFS does not get a say as to what the application is requesting in terms of size and location. Yes, doing a 4M input of contiguous disk blocks will be faster than random reading 128K chunks spread out. But if the application is manipulating 4M objects, those will stream out and land on contiguous disk blocks (if available) and those should stream in as well (if our read codepath is clever enough).

The fundamental question is really: is there something that ZFS does that causes data that represents an application's logical unit of information (likely to be read as one chunk, and so data we would like to have contiguous on disk) to actually spread out everywhere on the platter?

-r

Richard Elling writes:
 [...]
 > This is the classic big-{packet|block|$-line|bikini} versus
 > small-{packet|block|$-line|bikini} argument.  One size won't fit all.
 >
 > The jury is still out on what all of this means for any given application.
 > -- richard
-r: ZFS's output aggregation mechanisms seem entirely adequate in terms of throughput, given that the ZIL should mask what would otherwise be poor disk utilization in the event of many small, synchronous writes. The problems are purely on the input side (just as they are with RAID-Z).

The read-side fragmentation problem occurs when an application writes at fine grain and subsequently reads at coarse grain, as I mentioned in the example of a tablespace which is updated at fine grain and then streamed back in bulk for sequential scans. Ironically, you already have part of a solution in the ZIL, at least if the fine-grained updates are small enough to place there: once in the ZIL, you no longer need worry about over-writing the original data (ignoring for the moment the impact on your snapshot facility - a drawback of block-oriented snapshots, but one you'll need to resolve if you ever want to defragment anything), since you can simply reapply the ZIL images until they stick and update checksums (if they're maintained - see earlier comments) accordingly (this would require using the ZIL as a conventional transaction log to protect this action, but that's not all that much more of a stretch than its current small-update staging process).

ZFS does not appear to deal with such situations very well right now: either it uses coarse-grained checksumming, in which case each of those small (e.g., 4 KB) tablespace updates turns into a read/modify/write operation on a 128 KB entity, or it uses fine-grained (4 KB in this case) checksums, in which case these small blocks get spread all over the storage as they're individually updated and the subsequent sequential tablespace scans run at well under 1 MB/sec/disk (even worse if RAID-Z is used).

richard: Characterizing the disk-utilization problem as a classic big-block-vs.-small-block argument may be more a Unix mind-set issue than anything else: other file systems (including a few on Unix, for that matter) solve this by using extent-based allocation to aggregate many smaller (though still possibly variable-size) blocks into groups which can be streamed efficiently.

- bill