I had to let this go and get on with testing DB2 on Solaris. I had to abandon zfs on local discs in x64 Solaris 10 5/08.

The situation was that:

  * DB2 buffer pools occupied up to 90% of 32GB RAM on each host
  * DB2 cached the entire database in its buffer pools
      o having the file system repeat this was not helpful
  * running high-load DB2 tests for 2 weeks showed 100% file-system writes and practically no reads

Having database tables on zfs meant launching applications took minutes instead of sub-second.

The test configuration I ended up with was:

  * transaction logs worked well on zfs on SAN with compression (nearly 1 TB of logs)
  * database tables worked well with ufs directio to multiple SCSI discs on each of 4 hosts (using the DB2 database partitioning feature)

I mention DIRECTIO only because it already provides a reasonable set of hints to the OS:

  * reads and writes need not be cached
  * write() should not return until data is in non-volatile storage
      o DB2 has multiple concurrent write() threads to optimize this strategy
  * I/O will usually be in whole blocks aligned on block boundaries

As an aside: it should be possible to tell zfs that a device cache is non-volatile (e.g. on SAN) and does not need flushing. Otherwise the SAN must be configured to ignore all cache flush commands.
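For reference, a minimal shell sketch of the configuration described above: a compressed zfs dataset for the transaction logs and a ufs file system mounted with forcedirectio for table data. Pool, dataset, device, and mount-point names are hypothetical.

    #!/bin/sh
    # Transaction logs: zfs dataset on the SAN pool, with compression enabled.
    # "sanpool" and the mount points below are placeholder names.
    zfs create -o compression=on -o mountpoint=/db2/logs sanpool/db2logs

    # Table data: ufs on a local SCSI disc, mounted with forcedirectio so
    # reads and writes bypass the file system page cache.
    newfs /dev/rdsk/c1t1d0s0        # newfs will ask for confirmation
    mount -F ufs -o forcedirectio /dev/dsk/c1t1d0s0 /db2/data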
Andrew Robb wrote:
> I had to let this go and get on with testing DB2 on Solaris. I had to
> abandon zfs on local discs in x64 Solaris 10 5/08.

This version does not have the modern write throttle code, which should explain much of what you experience. The fix is available in Solaris 10 10/08. For more info, see Roch's excellent blog
http://blogs.sun.com/roch/entry/the_new_zfs_write_throttle

One CR to reference is
http://bugs.opensolaris.org/view_bug.do?bug_id=6429205

IMHO, if you are trying to make performance measurements on such old releases, then you are at great risk of wasting your time. You would be better served to look at more recent releases, within the constraints of your business, of course.
 -- richard
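As a small aside, the installed update can be checked directly on the host before comparing results between releases (output shown is only illustrative):

    # Print the Solaris release/update string, e.g. "Solaris 10 10/08 ..."
    cat /etc/release
    # Kernel patch level is also visible via uname
    uname -v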
Andrew Robb wrote:
> As an aside: it should be possible to tell zfs that a device cache
> is non-volatile (e.g. on SAN) and does not need flushing. Otherwise the SAN
> must be configured to ignore all cache flush commands.

ZFS already does the right thing. It sends a "flush volatile cache" command. If your storage array misbehaves and flushes non-volatile cache when it receives this command, get your storage vendor to fix their code. Disabling cache flushes entirely is just a hack to work around broken storage controller code.

-- Carson
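For completeness, the host-side hack Carson refers to is the zfs_nocacheflush tunable described in the ZFS Evil Tuning Guide; a sketch follows, assuming a release recent enough to have the tunable. It applies to every pool on the host, so it is only safe when all cache behind ZFS is genuinely non-volatile.

    # /etc/system workaround: stop ZFS from issuing cache flush commands.
    # Use only if every device under ZFS has non-volatile cache and the
    # array vendor cannot fix its SYNCHRONIZE CACHE handling.
    echo 'set zfs:zfs_nocacheflush = 1' >> /etc/system
    # A reboot is required for /etc/system changes to take effect.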
Richard Elling wrote:
> This version does not have the modern write throttle code, which
> should explain much of what you experience. The fix is available
> in Solaris 10 10/08. For more info, see Roch's excellent blog
> http://blogs.sun.com/roch/entry/the_new_zfs_write_throttle

This still misses the BIG point - DIRECTIO primarily tries to avoid data entering the file system cache (the database already caches it in its own much larger buffer pools). For this big-iron cluster, once written, the table data is only read back from file if the database is restarted. I suppose the same is also true of transaction logs, which are only replayed as part of data recovery. Typically, a database will be fastest if it avoids the file system altogether. However, this is difficult to manage, and we would benefit greatly if file systems were nearly as fast as raw devices.
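On the database side, DB2 can also request direct/concurrent I/O itself rather than relying on mount options. This is not mentioned in the thread; the sketch below assumes a DB2 9.x release that supports the NO FILE SYSTEM CACHING clause, and the database and tablespace names are placeholders.

    # Ask DB2 to bypass the file system cache for an existing tablespace.
    # "TESTDB" and "TS_DATA" are hypothetical names.
    db2 connect to TESTDB
    db2 "ALTER TABLESPACE TS_DATA NO FILE SYSTEM CACHING"
    db2 terminate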
Andrew Robb wrote:
> This still misses the BIG point - DIRECTIO primarily tries to avoid
> data entering the file system cache (the database already caches it in
> its own much larger buffer pools).

There are a lot of misunderstandings surrounding directio. UFS directio offers the following 4 features:

 1. no buffer cache (ZFS: primarycache property)
 2. concurrent I/O (ZFS: concurrent by design)
 3. async I/O code path (ZFS: more modern code path)
 4. long urban myth history (ZFS: forgetaboutit ;-)

The following pointers might be useful for you.
http://blogs.sun.com/bobs/entry/one_i_o_two_i
http://blogs.sun.com/roch/entry/people_ask_where_are_we

What is missing from the above (note to self: blog this :-) is that you can limit the size of the buffer cache and control what sort of data is cached via the "primarycache" parameter. It really doesn't make a lot of sense to have zero buffer cache since *any* disk IO is going to be much, much, much more painful than any bcopy/memcpy. If you really don't want the ARC resizing on your behalf, then you can cap it, as we describe in the Evil Tuning Guide.
http://www.solarisinternals.com/wiki/index.php/ZFS_Evil_Tuning_Guide#Limiting_the_ARC_Cache

But I think you'll find that the write throttle change is the big win and that primarycache gives you fine control of cache behaviour.
 -- richard
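The two knobs Richard mentions look roughly like the following. The dataset name is hypothetical, the primarycache property assumes a release new enough to support it, and the ARC cap value is only an example (see the Evil Tuning Guide for sizing).

    # Cache only metadata in the ARC for the table-space dataset;
    # the database buffer pools handle data caching.
    zfs set primarycache=metadata tank/db2data

    # Cap the ARC (example: 4 GBytes) so it cannot grow into memory the
    # database expects to own.  /etc/system changes need a reboot.
    echo 'set zfs:zfs_arc_max = 0x100000000' >> /etc/system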
Richard Elling wrote:
> What is missing from the above (note to self: blog this :-) is that
> you can limit the size of the buffer cache and control what sort
> of data is cached via the "primarycache" parameter.

Thanks for that information, Richard. It still doesn't explain what method an application has available to avoid caching on a particular file handle. On a given file system, most applications will benefit from caching most files. However, some applications will want to give a hint to the OS that it is REALLY a BAD idea to cache its file read/write operations on a file handle. The directio() call is a standard mechanism to achieve this on Solaris.

In my opinion, cache is good for directories and small files. Cache is bad for sequential access to files larger than physical memory (e.g. appending to transaction logs). In between, it should be up to an application to give a hint to the OS as to whether cache is worthwhile or not. *The problem is that pointlessly caching large files can push small files out of the cache.*

Your 'primarycache' parameter suggestion sounds like it applies to a whole file system. (I hope that 'metadata' includes directories.) This would be inadequate. If we have to set up a separate file system just for table space files, we might as well use ufs ;-). Actually, your suggestion is pretty good. We would naturally have separate zfs file systems for table spaces, in order to match the zfs record size to the table space block size.

I hope I understand you correctly. Let me summarise your suggestions:

 1. upgrade Solaris to 10/08 to enable zfs write-throttling
 2. set zfs 'primarycache' to 'metadata' for table space and transaction log file systems
 3. ensure that the SAN complies with 'flush volatile cache' semantics
 4. tune the ARC not to break large memory pages
 5. tune the ARC not to grow into the database buffer pools (which are typically 90% of RAM)

Also, for table space file systems, I would:

 6. set the zfs record size to match the database block size
 7. turn off atime
 8. turn off checksum calculation (if not using raidz)

Note: when creating a DB2 database, the database is typically restarted several times. It would only be worthwhile changing 'primarycache' from 'all' once the database is finally running. (Use the ARC to cache the empty database between restarts during configuration.)

Regards, Andy Robb.
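The per-dataset items in the list above might look like this in practice. Pool and dataset names and the 8K block size are placeholders; the primarycache setting and ARC cap are already sketched earlier, and the checksum item is intentionally not shown because Richard's follow-up below recommends leaving checksums on.

    # Table-space dataset: record size matched to an assumed 8K database
    # page size, no atime updates.  Checksums are left at the default (on).
    zfs create -o recordsize=8k -o atime=off tank/db2data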
Andrew Robb wrote:
> Thanks for that information, Richard. It still doesn't explain what
> method an application has available to avoid caching on a particular
> file handle. The directio() call is a standard mechanism to achieve
> this on Solaris.

I'd like to explore this a bit with real use cases. There are many reasons for the design of directio() that have to do with the limitations of UFS and NFS that may not exist in ZFS.

> In my opinion, cache is good for directories and small files. Cache is
> bad for sequential access to files larger than physical memory (e.g.
> appending to transaction logs).

Eh? Methinks you are getting a little confused with write caching vs read caching. If I append a lot of little 512-byte blocks to a file, then batching those up into a bigger record is a big win. Having to block while waiting for them to commit to disk is a big loss. But most transaction logs are written O_DSYNC, which follows a completely different code path than normal I/O in both UFS and ZFS.

> In between, it should be up to an application to give a hint to the OS
> as to whether cache is worthwhile or not. *The problem is that
> pointlessly caching large files can push small files out of the cache.*

This is particularly true for MRU caches, such as UFS. But ZFS implements an ARC, so the effect is mitigated. While I'm sure there are cases where the performance will be very bad, I don't think that is generally true. Again, a decent use case would be helpful. NB, this was a much bigger problem in the bad old days when 1 GByte of RAM cost $1M. Today, 1 GByte of RAM costs orders of magnitude less.

> Your 'primarycache' parameter suggestion sounds like it applies to a
> whole file system. We would naturally have separate zfs file systems
> for table spaces, in order to match the zfs record size to the table
> space block size.

From the use cases, we could determine the best match of workload to dataset, something which is not possible in UFS or NFS.

> 4. tune the ARC not to break large memory pages
> 5. tune the ARC not to grow into the database buffer pools (which are
>    typically 90% of RAM)

I haven't looked at the recommendations for when an app uses 90% of RAM lately. It may be time to revisit the recommendations to make sure they make good sense. In the past, there have been issues with large page availability, but that is a generic problem, not ZFS-specific. So the recommendations may not have been available if you were only searching for ZFS. There is, no doubt, opportunity for improvements in the docs here.

> Also, for table space file systems, I would:
>  6. set the zfs record size to match the database block size
>  7. turn off atime
>  8. turn off checksum calculation (if not using raidz)

Don't do that! Originally there was some concern that the CPU cycles needed for the checksum would be wasted when an app also checksummed its own data. But the measurements didn't support that claim. IMHO you are much better off allowing ZFS to detect problems in the data *in addition* to any verification done by an app.

> Note: when creating a DB2 database, the database is typically
> restarted several times. It would only be worthwhile changing
> 'primarycache' from 'all' once the database is finally running.

The use cases may reveal interesting opportunities for the L2ARC as well. I think the predominant feeling is that L2ARC is a big win for databases. This win will be much bigger than turning off all caching.
 -- richard
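To experiment with the L2ARC Richard mentions, a cache device can be added to a pool on releases whose zpool version supports cache vdevs. Pool and device names here are hypothetical.

    # Add a fast disk (e.g. an SSD) as an L2ARC cache device for the pool.
    zpool add tank cache c2t0d0

    # Watch cache-device usage alongside the rest of the pool.
    zpool iostat -v tank 5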