I had to let this go and get on with testing DB2 on Solaris. I had to abandon zfs on local discs in x64 Solaris 10 5/08.

The situation was that:

  * DB2 buffer pools occupied up to 90% of 32GB RAM on each host
  * DB2 cached the entire database in its buffer pools
      o having the file system repeat this was not helpful
  * running high-load DB2 tests for 2 weeks showed 100% file-system writes and practically no reads

Having database tables on zfs meant launching applications took minutes instead of sub-second.

The test configuration I ended up with was:

  * transaction logs worked well on zfs on SAN with compression (nearly 1 TB of logs)
  * database tables worked well with ufs directio to multiple SCSI discs on each of 4 hosts (using the DB2 database partitioning feature)

I mention DIRECTIO only because it already provides a reasonable set of hints to the OS:

  * reads and writes need not be cached
  * write() should not return until data is in non-volatile storage
      o DB2 has multiple concurrent write() threads to optimize this strategy
  * I/O will usually be in whole blocks aligned on block boundaries

As an aside: it should be possible to tell zfs that a device cache is non-volatile (e.g. on SAN) and does not need flushing. Otherwise the SAN must be configured to ignore all cache flush commands.
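For reference, a minimal shell sketch of the configuration described above: a compressed zfs dataset for the transaction logs and a ufs file system mounted with forcedirectio for table data. Pool, dataset, device, and mount-point names are hypothetical.

    #!/bin/sh
    # Transaction logs: zfs dataset on the SAN pool, with compression enabled.
    # "sanpool" and the mount points below are placeholder names.
    zfs create -o compression=on -o mountpoint=/db2/logs sanpool/db2logs

    # Table data: ufs on a local SCSI disc, mounted with forcedirectio so
    # reads and writes bypass the file system page cache.
    newfs /dev/rdsk/c1t1d0s0        # newfs will ask for confirmation
    mount -F ufs -o forcedirectio /dev/dsk/c1t1d0s0 /db2/data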
Andrew Robb wrote:
> I had to let this go and get on with testing DB2 on Solaris. I had to
> abandon zfs on local discs in x64 Solaris 10 5/08.

This version does not have the modern write throttle code, which should explain much of what you experience. The fix is available in Solaris 10 10/08. For more info, see Roch's excellent blog
http://blogs.sun.com/roch/entry/the_new_zfs_write_throttle

One CR to reference is
http://bugs.opensolaris.org/view_bug.do?bug_id=6429205

IMHO, if you are trying to make performance measurements on such old releases, then you are at great risk of wasting your time. You would be better served to look at more recent releases, within the constraints of your business, of course.
 -- richard
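As a small aside, the installed update can be checked directly on the host before comparing results between releases (output shown is only illustrative):

    # Print the Solaris release/update string, e.g. "Solaris 10 10/08 ..."
    cat /etc/release
    # Kernel patch level is also visible via uname
    uname -v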
Andrew Robb wrote:
> As an aside: it should be possible to tell zfs that a device cache
> is non-volatile (e.g. on SAN) and does not need flushing. Otherwise the SAN
> must be configured to ignore all cache flush commands.

ZFS already does the right thing. It sends a "flush volatile cache" command. If your storage array misbehaves and flushes non-volatile cache when it receives this command, get your storage vendor to fix their code. Disabling cache flushes entirely is just a hack to work around broken storage controller code.

-- Carson
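For completeness, the host-side hack Carson refers to is the zfs_nocacheflush tunable described in the ZFS Evil Tuning Guide; a sketch follows, assuming a release recent enough to have the tunable. It applies to every pool on the host, so it is only safe when all cache behind ZFS is genuinely non-volatile.

    # /etc/system workaround: stop ZFS from issuing cache flush commands.
    # Use only if every device under ZFS has non-volatile cache and the
    # array vendor cannot fix its SYNCHRONIZE CACHE handling.
    echo 'set zfs:zfs_nocacheflush = 1' >> /etc/system
    # A reboot is required for /etc/system changes to take effect.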
Richard Elling wrote:
> This version does not have the modern write throttle code, which
> should explain much of what you experience. The fix is available
> in Solaris 10 10/08. For more info, see Roch's excellent blog
> http://blogs.sun.com/roch/entry/the_new_zfs_write_throttle

This still misses the BIG point - DIRECTIO primarily tries to avoid data entering the file system cache (the database already caches it in its own much larger buffer pools). For this big-iron cluster, once written, the table data is only read back from file if the database is restarted. I suppose the same is also true of transaction logs, which are only replayed as part of data recovery. Typically, a database will be fastest if it avoids the file system altogether. However, this is difficult to manage, and we would benefit greatly if file systems were nearly as fast as raw devices.
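On the database side, DB2 can also request direct/concurrent I/O itself rather than relying on mount options. This is not mentioned in the thread; the sketch below assumes a DB2 9.x release that supports the NO FILE SYSTEM CACHING clause, and the database and tablespace names are placeholders.

    # Ask DB2 to bypass the file system cache for an existing tablespace.
    # "TESTDB" and "TS_DATA" are hypothetical names.
    db2 connect to TESTDB
    db2 "ALTER TABLESPACE TS_DATA NO FILE SYSTEM CACHING"
    db2 terminate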
Andrew Robb wrote:
> This still misses the BIG point - DIRECTIO primarily tries to avoid
> data entering the file system cache (the database already caches it in
> its own much larger buffer pools).

There are a lot of misunderstandings surrounding directio. UFS directio offers the following 4 features:

 1. no buffer cache (ZFS: primarycache property)
 2. concurrent I/O (ZFS: concurrent by design)
 3. async I/O code path (ZFS: more modern code path)
 4. long urban myth history (ZFS: forgetaboutit ;-)

The following pointers might be useful for you.
http://blogs.sun.com/bobs/entry/one_i_o_two_i
http://blogs.sun.com/roch/entry/people_ask_where_are_we

What is missing from the above (note to self: blog this :-) is that you can limit the size of the buffer cache and control what sort of data is cached via the "primarycache" parameter. It really doesn't make a lot of sense to have zero buffer cache since *any* disk IO is going to be much, much, much more painful than any bcopy/memcpy. If you really don't want the ARC resizing on your behalf, then you can cap it, as we describe in the Evil Tuning Guide.
http://www.solarisinternals.com/wiki/index.php/ZFS_Evil_Tuning_Guide#Limiting_the_ARC_Cache

But I think you'll find that the write throttle change is the big win and that primarycache gives you fine control of cache behaviour.
 -- richard
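The two knobs Richard mentions look roughly like the following. The dataset name is hypothetical, the primarycache property assumes a release new enough to support it, and the ARC cap value is only an example (see the Evil Tuning Guide for sizing).

    # Cache only metadata in the ARC for the table-space dataset;
    # the database buffer pools handle data caching.
    zfs set primarycache=metadata tank/db2data

    # Cap the ARC (example: 4 GBytes) so it cannot grow into memory the
    # database expects to own.  /etc/system changes need a reboot.
    echo 'set zfs:zfs_arc_max = 0x100000000' >> /etc/system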
Richard Elling wrote:
> What is missing from the above (note to self: blog this :-) is that
> you can limit the size of the buffer cache and control what sort
> of data is cached via the "primarycache" parameter.

Thanks for that information, Richard. It still doesn't explain what method an application has available to avoid caching on a particular file handle. On a given file system, most applications will benefit from caching most files. However, some applications will want to give a hint to the OS that it is REALLY a BAD idea to cache its file read/write operations on a file handle. The directio() call is a standard mechanism to achieve this on Solaris.

In my opinion, cache is good for directories and small files. Cache is bad for sequential access to files larger than physical memory (e.g. appending to transaction logs). In between, it should be up to an application to give a hint to the OS as to whether cache is worthwhile or not. *The problem is that pointlessly caching large files can push small files out of the cache.*

Your 'primarycache' parameter suggestion sounds like it applies to a whole file system. (I hope that 'metadata' includes directories.) This would be inadequate. If we have to set up a separate file system just for table space files, we might as well use ufs ;-). Actually, your suggestion is pretty good. We would naturally have separate zfs file systems for table spaces, in order to match the zfs record size to the table space block size.

I hope I understand you correctly. Let me summarise your suggestions:

 1. upgrade Solaris to 10/08 to enable zfs write-throttling
 2. set zfs 'primarycache' to 'metadata' for table space and transaction log file systems
 3. ensure that the SAN complies with 'flush volatile cache' semantics
 4. tune the ARC not to break large memory pages
 5. tune the ARC not to grow into the database buffer pools (which are typically 90% of RAM)

Also, for table space file systems, I would:

 6. set the zfs record size to match the database block size
 7. turn off atime
 8. turn off checksum calculation (if not using raidz)

Note: when creating a DB2 database, the database is typically restarted several times. It would only be worthwhile changing 'primarycache' from 'all' once the database is finally running. (Use the ARC to cache the empty database between restarts during configuration.)

Regards, Andy Robb.
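The per-dataset items in the list above might look like this in practice. Pool and dataset names and the 8K block size are placeholders; the primarycache setting and ARC cap are already sketched earlier, and the checksum item is intentionally not shown because Richard's follow-up below recommends leaving checksums on.

    # Table-space dataset: record size matched to an assumed 8K database
    # page size, no atime updates.  Checksums are left at the default (on).
    zfs create -o recordsize=8k -o atime=off tank/db2data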
Andrew Robb wrote:
> Thanks for that information, Richard. It still doesn't explain what
> method an application has available to avoid caching on a particular
> file handle. The directio() call is a standard mechanism to achieve
> this on Solaris.

I'd like to explore this a bit with real use cases. There are many reasons for the design of directio() that have to do with the limitations of UFS and NFS that may not exist in ZFS.

> In my opinion, cache is good for directories and small files. Cache is
> bad for sequential access to files larger than physical memory (e.g.
> appending to transaction logs).

Eh? Methinks you are getting a little confused with write caching vs read caching. If I append a lot of little 512-byte blocks to a file, then batching those up into a bigger record is a big win. Having to block while waiting for them to commit to disk is a big loss. But most transaction logs are written O_DSYNC, which follows a completely different code path than normal I/O in both UFS and ZFS.

> In between, it should be up to an application to give a hint to the OS
> as to whether cache is worthwhile or not. *The problem is that
> pointlessly caching large files can push small files out of the cache.*

This is particularly true for MRU caches, such as UFS. But ZFS implements an ARC, so the effect is mitigated. While I'm sure there are cases where the performance will be very bad, I don't think that is generally true. Again, a decent use case would be helpful. NB, this was a much bigger problem in the bad old days when 1 GByte of RAM cost $1M. Today, 1 GByte of RAM costs orders of magnitude less.

> Your 'primarycache' parameter suggestion sounds like it applies to a
> whole file system. We would naturally have separate zfs file systems
> for table spaces, in order to match the zfs record size to the table
> space block size.

From the use cases, we could determine the best match of workload to dataset, something which is not possible in UFS or NFS.

> 4. tune the ARC not to break large memory pages
> 5. tune the ARC not to grow into the database buffer pools (which are
>    typically 90% of RAM)

I haven't looked at the recommendations for when an app uses 90% of RAM lately. It may be time to revisit the recommendations to make sure they make good sense. In the past, there have been issues with large page availability, but that is a generic problem, not ZFS-specific. So the recommendations may not have been available if you were only searching for ZFS. There is, no doubt, opportunity for improvements in the docs here.

> Also, for table space file systems, I would:
>  6. set the zfs record size to match the database block size
>  7. turn off atime
>  8. turn off checksum calculation (if not using raidz)

Don't do that! Originally there was some concern that the CPU cycles needed for the checksum would be wasted when an app also checksummed its own data. But the measurements didn't support that claim. IMHO you are much better off allowing ZFS to detect problems in the data *in addition* to any verification done by an app.

> Note: when creating a DB2 database, the database is typically
> restarted several times. It would only be worthwhile changing
> 'primarycache' from 'all' once the database is finally running.

The use cases may reveal interesting opportunities for the L2ARC as well. I think the predominant feeling is that L2ARC is a big win for databases. This win will be much bigger than turning off all caching.
 -- richard
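To experiment with the L2ARC Richard mentions, a cache device can be added to a pool on releases whose zpool version supports cache vdevs. Pool and device names here are hypothetical.

    # Add a fast disk (e.g. an SSD) as an L2ARC cache device for the pool.
    zpool add tank cache c2t0d0

    # Watch cache-device usage alongside the rest of the pool.
    zpool iostat -v tank 5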