Hi everyone,

This is my first post to zfs-discuss, so be gentle with me :-)

I've been doing some testing with ZFS - in particular, in checkpointing
the large, proprietary in-memory database which is a key part of the
application I work on. In doing this I've found what seems to be some
fairly unhelpful write throttling behaviour from ZFS.

In summary, the environment is:

* An x4600 with 8 CPUs and 128GBytes of memory
* A 50GByte in-memory database
* A big, fast disk array (a 6140 with a LUN comprised of 4 SATA drives)
* Running Solaris 10 update 4 (problems initially seen on U3, so I got it patched)

The problems happen when I checkpoint the database, which involves
putting that database on disk as quickly as possible, using the
write(2) system call.

The first time the checkpoint is run, it's quick - about 160MBytes/sec,
even though the disk array is only sustaining 80MBytes/sec. So we're
dirtying stuff in the ARC (and growing the ARC) at a pretty impressive
rate.

After letting the IO subside, running the checkpoint again results in
very different behaviour. It starts running very quickly, again at
160MBytes/sec (with the underlying device doing 80MBytes/sec), and
after a while (presumably once the ARC is full) things go badly wrong.
In particular, a write(2) system call hangs for 6-7 minutes, apparently
until all the outstanding IO is done. Any reads from that device also
take a huge amount of time, making the box very unresponsive.

Obviously this isn't good behaviour, but it's particularly unfortunate
given that this checkpoint is stuff that I don't want to retain in any
kind of cache anyway - in fact, preferably I wouldn't pollute the ARC
with it in the first place. But it seems directio(3C) doesn't work with
ZFS (unsurprisingly, as I guess this is implemented in segmap), and
madvise(..., MADV_DONTNEED) doesn't drop data from the ARC (again, I
guess, as it's working on segmap/segvn).

Of course, limiting the ARC size to something fairly small makes it
behave much better. But this isn't really the answer.

I also tried using O_DSYNC, which stops the pathological behaviour but
makes things pretty slow - I only get a maximum of about 20MBytes/sec,
which is obviously much less than the hardware can sustain.

It sounds like we could do with different write throttling behaviour to
head this sort of thing off. Of course, the ideal would be to have some
way of telling ZFS not to bother keeping pages in the ARC.

The latter appears to be bug 6429855. But the underlying behaviour
doesn't really seem desirable; are there plans afoot to do any work on
ZFS write throttling to address this kind of thing?

Regards,

-- 

Philip Beevers
Fidessa Infrastructure Development

mailto:philip.beevers at fidessa.com
phone: +44 1483 206571
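For anyone who wants a concrete picture of the workload, here is a
minimal sketch of the kind of checkpoint write path described above -
plain large sequential write(2) calls, plus the directio(3C) request
that turned out not to help on ZFS. The path, chunk size and total size
are invented for illustration; this is not the actual application code,
and the madvise(MADV_DONTNEED) attempt (which applies to mapped pages)
is omitted for brevity.

#include <sys/types.h>
#include <sys/fcntl.h>      /* directio(), DIRECTIO_ON */
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define CHUNK   (1024 * 1024)                   /* 1MByte per write(2) */
#define TOTAL   (2ULL * 1024 * 1024 * 1024)     /* 2GBytes, for illustration */

int
main(void)
{
        unsigned long long written;
        char *buf;
        int fd;

        fd = open("/tank/ckpt/db.chk", O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (fd < 0) {
                perror("open");
                return (1);
        }

        /*
         * Ask for unbuffered IO. This is honoured by UFS but, as noted
         * in the post above, not by ZFS, so the data still goes
         * through the ARC.
         */
        if (directio(fd, DIRECTIO_ON) < 0)
                (void) fprintf(stderr, "directio: %s\n", strerror(errno));

        buf = malloc(CHUNK);
        if (buf == NULL) {
                perror("malloc");
                return (1);
        }
        (void) memset(buf, 0xab, CHUNK);        /* stand-in for database pages */

        for (written = 0; written < TOTAL; written += CHUNK) {
                /*
                 * On the second run, once the ARC fills, one of these
                 * calls can block for minutes while ZFS flushes.
                 */
                if (write(fd, buf, CHUNK) != CHUNK) {
                        perror("write");
                        return (1);
                }
        }

        free(buf);
        return (close(fd) == 0 ? 0 : 1);
}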
On 15 Feb 2008, at 11:38, Philip Beevers wrote:

[...]

> The latter appears to be bug 6429855. But the underlying behaviour
> doesn't really seem desirable; are there plans afoot to do any work on
> ZFS write throttling to address this kind of thing?

Throttling is being addressed:

http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6429205

BTW, the new code will adjust write speed to disk speed very quickly.
You will not see those ultra fast initial checkpoints. Is this a
concern?

-r
Hi Roch,

Thanks for the response.

> Throttling is being addressed:
>
> http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6429205
>
> BTW, the new code will adjust write speed to disk speed very quickly.
> You will not see those ultra fast initial checkpoints. Is this a
> concern?

That's good news. No, the loss of initial performance isn't a big
problem - I'd be happy for it to go at spindle speed.

Regards,

-- 

Philip Beevers
Fidessa Infrastructure Development

mailto:philip.beevers at fidessa.com
phone: +44 1483 206571
On 2/15/08, Roch Bourbonnais <Roch.Bourbonnais at sun.com> wrote:
>
> On 15 Feb 2008, at 11:38, Philip Beevers wrote:
>
> [...]
>
> > The latter appears to be bug 6429855. But the underlying behaviour
> > doesn't really seem desirable; are there plans afoot to do any work
> > on ZFS write throttling to address this kind of thing?
>
> Throttling is being addressed:
>
> http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6429205
>
> BTW, the new code will adjust write speed to disk speed very quickly.
> You will not see those ultra fast initial checkpoints. Is this a
> concern?

I'll wait for more details on how you address this. Maybe a blog, like
this one?

http://blogs.technet.com/markrussinovich/archive/2008/02/04/2826167.aspx
"Inside Vista SP1 File Copy Improvements":

"One of the biggest problems with the engine's implementation is that
for copies involving lots of data, the Cache Manager write-behind
thread on the target system often can't keep up with the rate at which
data is written and cached in memory. That causes the data to fill up
memory, possibly forcing other useful code and data out, and
eventually, the target system's memory to become a tunnel through
which all the copied data flows at a rate limited by the disk."

Sounds familiar? ;-)

Tao
On Fri, 15 Feb 2008, Roch Bourbonnais wrote:

>> The latter appears to be bug 6429855. But the underlying behaviour
>> doesn't really seem desirable; are there plans afoot to do any work
>> on ZFS write throttling to address this kind of thing?
>
> Throttling is being addressed:
>
> http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6429205

I have observed similar behavior when using 'iozone' on a large file to
benchmark ZFS on my StorageTek 2540 array. Fsstat, run on a 10-second
update cycle, shows gaps of up to 30 seconds with no I/O, but when I go
and look at the lights on the array, I see that it is actually fully
busy. It seems that the application is stalled during this load. It
also seems that simple operations like 'ls' get stalled under such
heavy load.

Bob

======================================
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
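The 30-second fsstat gaps Bob mentions and the multi-minute write(2)
hangs from the original post are the same stall seen from two angles.
If it is useful to anyone chasing this, here is a rough sketch (purely
illustrative, not from either of our applications) of timing individual
write(2) calls with gethrtime(3C), so the stalls show up in the
application's own output rather than only as gaps in fsstat:

#include <sys/time.h>   /* hrtime_t, gethrtime() */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define CHUNK   (1024 * 1024)

/*
 * Write 'count' one-megabyte chunks to fd and report any individual
 * write(2) call that takes longer than a second.
 */
int
timed_writes(int fd, int count)
{
        char *buf = malloc(CHUNK);
        int i;

        if (buf == NULL)
                return (-1);
        (void) memset(buf, 0, CHUNK);

        for (i = 0; i < count; i++) {
                hrtime_t start, elapsed;

                start = gethrtime();
                if (write(fd, buf, CHUNK) != CHUNK) {
                        perror("write");
                        free(buf);
                        return (-1);
                }
                elapsed = gethrtime() - start;
                if (elapsed > 1000000000LL)     /* longer than 1 second */
                        (void) printf("write %d stalled for %.1f s\n",
                            i, (double)elapsed / 1e9);
        }
        free(buf);
        return (0);
}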
Philip.Beevers at fidessa.com said:
> I also tried using O_DSYNC, which stops the pathological behaviour but
> makes things pretty slow - I only get a maximum of about 20MBytes/sec,
> which is obviously much less than the hardware can sustain.

I may misunderstand this situation, but while you're waiting for the
new code from Sun, you might try O_DSYNC and at the same time tell the
6140 to ignore cache-flush requests from the host. That should get you
running at spindle speed:

http://blogs.digitar.com/jjww/?itemid=44

Regards,

Marion
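For what it's worth, the application side of that suggestion is just an
open(2) flag; the cache-flush behaviour itself is an array-side setting,
configured on the 6140 as described in the linked post, not something
the application controls. A minimal sketch (the helper name and path
handling are invented):

#include <fcntl.h>
#include <stdio.h>

/*
 * Open the checkpoint file with O_DSYNC so each write(2) returns only
 * once the data has reached stable storage. With the array configured
 * to ignore cache-flush requests (an array-side setting, not shown
 * here), "stable storage" effectively means the 6140's battery-backed
 * cache rather than the spindles.
 */
int
open_checkpoint(const char *path)
{
        int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC | O_DSYNC, 0644);

        if (fd < 0)
                perror("open");
        return (fd);
}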