Jim Mauro
2010-Oct-21 19:41 UTC
[zfs-discuss] Performance problems due to smaller ZFS recordsize
There is nothing in here that requires zfs confidential. Cross-posted to zfs-discuss.

On Oct 21, 2010, at 3:37 PM, Jim Nissen wrote:

> Cross-posting.
>
> -------- Original Message --------
> Subject: Performance problems due to smaller ZFS recordsize
> Date: Thu, 21 Oct 2010 14:00:42 -0500
> From: Jim Nissen <jim.nissen at oracle.com>
> To: PERF-ROUNDTABLE_WW at ORACLE.COM, perf-roundtable at sun.com, kernel-support at sun.com
>
> I'm working with a customer who is having Directory Server backup performance
> problems since switching to ZFS. In short, backups that used to take 1 - 4 hours
> on UFS are now taking 12+ hours on ZFS. We've figured out that ZFS reads seem to
> be throttled, whereas writes seem really fast. Backend storage is IBM SVC.
>
> As part of their cutover, they were given the following Best Practice
> recommendations from the LDAP folks @Sun...
>
> /etc/system tunables:
>   set zfs:zfs_arc_max = 0x100000000
>   set zfs:zfs_vdev_cache_size = 0
>   set zfs:zfs_vdev_cache_bshift = 13
>   set zfs:zfs_prefetch_disable = 1
>   set zfs:zfs_nocacheflush = 1
>
> At the ZFS filesystem level:
>   recordsize = 32K
>   noatime
>
> One of the things they noticed is that simple dd reads from one of the 128K
> recordsize filesystems run much faster (4 - 7 times) than from their 32K
> filesystems. I joined a shared shell where we switched the same filesystem from
> 32K to 128K, and we could see the underlying disks getting 4x better throughput
> (from 1.5 - 2 MB/s to 8 - 10 MB/s), whereas a direct dd against one of the disks
> shows the disks are capable of much more (45+ MB/s).
>
> Here are some snippets from iostat...
>
> ZFS recordsize of 32K, dd if=./somelarge5gfile of=/dev/null bs=16k (to mimic application blocksizes)
>
>                        extended device statistics
>     r/s    w/s    kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
>    67.6    0.0  2132.7    0.0  0.0  0.3    0.0    4.5   0  30 c6t60050768018E82BDA800000000000565d0
>    67.4    0.0  2156.8    0.0  0.0  0.1    0.0    1.5   0  10 c6t60050768018E82BDA800000000000564d0
>    68.4    0.0  2158.3    0.0  0.0  0.3    0.0    4.5   0  31 c6t60050768018E82BDA800000000000563d0
>    66.2    0.0  2118.4    0.0  0.0  0.2    0.0    3.4   0  22 c6t60050768018E82BDA800000000000562d0
>
> ZFS recordsize of 128K, dd if=./somelarge5gfile of=/dev/null bs=16k (to mimic application blocksizes)
>
>                        extended device statistics
>     r/s    w/s    kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
>    78.2    0.0 10009.6    0.0  0.0  0.2    0.0    1.9   0  15 c6t60050768018E82BDA800000000000565d0
>    78.6    0.0  9960.0    0.0  0.0  0.1    0.0    1.2   0  10 c6t60050768018E82BDA800000000000564d0
>    79.4    0.0 10062.3    0.0  0.0  0.4    0.0    4.4   0  35 c6t60050768018E82BDA800000000000563d0
>    76.6    0.0  9804.8    0.0  0.0  0.2    0.0    2.3   0  17 c6t60050768018E82BDA800000000000562d0
>
> dd if=/dev/rdsk/c6t60050768018E82BDA800000000000564d0s0 of=/dev/null bs=32k (to mimic small ZFS blocksize)
>
>                        extended device statistics
>     r/s    w/s    kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
>  3220.9    0.0 51533.9    0.0  0.0  0.9    0.0    0.3   1  94 c6t60050768018E82BDA800000000000564d0
>
> So it's not as if the underlying disk isn't capable of much more than what ZFS
> is asking of it. I understand the part where it has to do 4x as much work with a
> 32K blocksize as with 128K, but it doesn't seem as if ZFS is doing much at all
> with the underlying disks.
>
> We've asked the customer to rerun the test without the /etc/system tunables.
> Has anybody else worked a similar issue? Any hints would be greatly appreciated.
>
> Thanks!
>
> Jim
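A quick sanity check on those numbers (derived only from the iostat output above, so approximate): kr/s divided by r/s gives the average physical read size ZFS is issuing. At recordsize=32K that is 2132.7 / 67.6 ≈ 31.5 KB per read; at 128K it is 10009.6 / 78.2 ≈ 128 KB per read. Each LUN delivers roughly the same ~70 - 80 reads per second in both cases, so with prefetch and the vdev read cache disabled, per-file throughput scales almost linearly with recordsize, which matches the 4x difference observed. The raw-device dd shows the same LUN sustaining ~3,200 reads/s, so the ~70 - 80 reads/s seen under ZFS is the throttle, not the hardware.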
Jim Mauro
2010-Oct-25 16:19 UTC
[zfs-discuss] Performance problems due to smaller ZFS recordsize
Hi Jim - cross-posting to zfs-discuss, because 20X is, to say the least, compelling.

Obviously, it would be awesome if we had the opportunity to whittle down which of
the changes made this fly, or whether it was a combination of the changes.
Looking at them individually....

> set zfs:zfs_vdev_cache_size = 0

The default for this is 10MB per vdev, and as I understand it (which may be wrong)
it is part of the device-level prefetch on reads.

> set zfs:zfs_vdev_cache_bshift = 13

This obscurely named parameter defines the amount of data read from disk for each
disk read (I think). The default value for this parameter is 16, equating to 64k
reads. The value of 13 reduces disk read sizes to 8k.

> set zfs:zfs_prefetch_disable = 1

The vdev parameters above relate to device-level prefetching.
zfs_prefetch_disable applies to file-level prefetching.

With regard to the COW/scattered-blocks query, it is certainly a possible
side-effect of COW that maintaining sequential file block layout can get
challenging, but the TXG model and coalescing of writes helps with that.

With regard to the changes (including the ARC size increase), it's really
impossible to say without data the extent to which prefetching at one or both
layers made the difference here. Was it the cumulative effect of both, or was one
a much larger contributing factor?

It would be interesting to reproduce this in a lab.

What release of Solaris 10 is this?

Thanks
/jim

> ... and increased their ARC to 8GB, and backups that took 15+ hours now take 45
> minutes. They are still analyzing what effects re-enabling prefetch has on their
> applications.
>
> One other thing they noticed, before removing these tunables, is that the
> backups were taking progressively longer each day. For instance, at the
> beginning of last week they took 12 hours; by Friday they were taking 17 hours.
> This is with similar-sized datasets. They will be keeping an eye on this, too,
> but I'm interested in any possible causes that might be related to ZFS. One
> thing I've been told is that ZFS COW (copy-on-write) operations can cause blocks
> to be scattered across a disk where they were once located closer to one another.
>
> We'll see how it behaves in the next week or so.
>
> Thanks for the feedback,
> Jim
>
> On 10/21/10 02:49 PM, Amer Ather wrote:
>> Jim,
>>
>> For sequential IO read performance, you need file system read-ahead. By setting:
>>
>>   set zfs:zfs_prefetch_disable = 1
>>
>> you have disabled the ZFS prefetch that is needed to boost sequential IO
>> performance. Normally, we recommend disabling it for Oracle OLTP types of
>> workload, to avoid IO inflation due to read-ahead. However, for backups it
>> needs to be enabled. Take this setting out of the /etc/system file and retest.
>>
>> Amer.
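When the customer reruns without the /etc/system entries, it may be worth confirming what the running kernel actually has for these values before and after each test. A minimal sketch, assuming standard mdb(1) and kstat(1M) usage and that the variable names match the tunables above (2^16 = 64k and 2^13 = 8k for the bshift values):

   echo "zfs_prefetch_disable/D"  | mdb -k    # 0 = file-level prefetch enabled, 1 = disabled
   echo "zfs_vdev_cache_bshift/D" | mdb -k    # 16 -> 64k device reads, 13 -> 8k
   echo "zfs_vdev_cache_size/D"   | mdb -k    # 0 disables the per-vdev read cache
   kstat -p zfs:0:arcstats:size               # current ARC size, in bytes
   kstat -p zfs:0:arcstats:c_max              # effective ARC ceiling (zfs_arc_max)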
Jim Nissen
2010-Oct-25 18:25 UTC
[zfs-discuss] Performance problems due to smaller ZFS recordsize
Jim,
They are running Solaris 10 11/06 (u3) with kernel patch 142900-12.

See inline for the rest...

On 10/25/10 11:19 AM, Jim Mauro wrote:
>
> Hi Jim - cross-posting to zfs-discuss, because 20X is, to say the least,
> compelling.
>
> Obviously, it would be awesome if we had the opportunity to whittle down which
> of the changes made this fly, or whether it was a combination of the changes.
> Looking at them individually....
>
Not sure we can convince this customer to comply on the system in question.
HOWEVER, they also have another set of LDAP servers that are experiencing the
same types of problems with backups. I will see if they would be willing to make
single changes in steps.

I think the best bet would be to reproduce it in a lab somewhere.

>> set zfs:zfs_vdev_cache_size = 0
>
> The default for this is 10MB per vdev, and as I understand it (which may be
> wrong) it is part of the device-level prefetch on reads.
>
>> set zfs:zfs_vdev_cache_bshift = 13
>
> This obscurely named parameter defines the amount of data read from disk for
> each disk read (I think). The default value for this parameter is 16, equating
> to 64k reads. The value of 13 reduces disk read sizes to 8k.
>
>> set zfs:zfs_prefetch_disable = 1
>
> The vdev parameters above relate to device-level prefetching.
> zfs_prefetch_disable applies to file-level prefetching.
>
> With regard to the COW/scattered-blocks query, it is certainly a possible
> side-effect of COW that maintaining sequential file block layout can get
> challenging, but the TXG model and coalescing of writes helps with that.
>
I was describing to the customer how ZFS uses COW for modifications, and how it
is possible, over time, to get fragmentation. From an LDAP standpoint, they
suggested there are lots of cases where modifications are made to already
existing larger files. In some respects, LDAP is much like an Oracle database,
just on a smaller scale.

Is there any way to monitor for fragmentation? Any dtrace scripts, perhaps?

> With regard to the changes (including the ARC size increase), it's really
> impossible to say without data the extent to which prefetching at one or both
> layers made the difference here. Was it the cumulative effect of both, or was
> one a much larger contributing factor?
>
Understood. On the next system they try this on, they will leave ARC at 4GB to
see if they still have large gains.

> It would be interesting to reproduce this in a lab.
>
I was thinking the same thing. We would have to come up with some sort of
workload where portions of larger files get modified many times, and then try
some of the tunables on sequential reads.

Jim

> What release of Solaris 10 is this?
>
> Thanks
> /jim
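One rough way to get at the fragmentation question without any ZFS internals is to watch the size of the physical reads the pool devices are asked to do during a backup or a sequential dd. This is only a sketch using the standard DTrace io provider, and not a direct fragmentation metric, but a sequential scan of a badly scattered file should show up as a large population of small reads per device:

   dtrace -n 'io:::start /args[0]->b_flags & B_READ/
   {
           @[args[1]->dev_statname] = quantize(args[0]->b_bcount);
   }'

Comparing the distribution for a freshly written copy of the file against the aged copy would show whether the per-read size (and therefore the IOPS needed to move the same amount of data) is degrading over time.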
david.lutz at oracle.com
2010-Oct-26 00:43 UTC
[zfs-discuss] Performance problems due to smaller ZFS recordsize
This combination of tunables is probably a worst-case set for doing sequential or
multi-block reads, particularly from a COW file system. We know that
disaggregation can occur due to small, random writes, and that this can result in
an increase in the IOPS required to do sequential or multi-block reads. Each of
these tunables may be multiplying that effect. On an IOPS-limited disk config,
the impact could be pretty dramatic.

I have been doing some work related to COW effects on sequential and multi-block
workloads, and I am also getting ready to look at prefetch in ZFS. I'd like to
discuss this offline to see how this case fits with what I am working on. I will
give you a call on Tuesday.

David
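For the lab reproduction discussed above, something along these lines might be enough to age a file the way a directory server would. This is purely a hypothetical sketch (pool name, file size, and iteration count are made up): it rewrites small, scattered regions of a large file in place to force COW re-allocation, then times a sequential read.

   # create a test filesystem matching the customer's settings (names illustrative)
   zfs create -o recordsize=32k -o atime=off tank/fragtest
   dd if=/dev/zero of=/tank/fragtest/bigfile bs=1024k count=5120

   # rewrite many small, random 8k regions in place (ksh; 655360 = 5g / 8k)
   i=0
   while [ $i -lt 50000 ]; do
           dd if=/dev/urandom of=/tank/fragtest/bigfile bs=8k count=1 \
               seek=$((RANDOM * RANDOM % 655360)) conv=notrunc 2>/dev/null
           i=$((i + 1))
   done

   # time a sequential read of the aged file, and compare iostat r/s and kr/s
   # against a freshly written copy of the same file
   ptime dd if=/tank/fragtest/bigfile of=/dev/null bs=16k

One caveat: the rewrites will leave much of the file in the ARC, so the pool would need to be exported and re-imported (or the ARC otherwise emptied) before the timed read for the comparison to mean anything.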