Hi, I was directed here after posting in CIFS discuss (as I first thought that it could be a CIFS problem). I posted the following in CIFS:

When using iometer from Windows to the file share on OpenSolaris snv_101 and snv_111 I get pauses every 5 seconds of around 5 seconds (maybe a little less) where no data is transferred. When data is transferred it is at a fair speed and gets around 1000-2000 IOPS with 1 thread (depending on the work type). The maximum read response time is 200ms and the maximum write response time is 9824ms, which is very bad: an almost 10 second delay in being able to send data to the server.

This has been experienced on 2 test servers. The same servers have also been tested with Windows Server 2008 and they haven't shown this problem (the share performance was slightly lower than CIFS, but it was consistent, and the average access time and maximums were very close).

I just noticed that if the server hasn't hit its target ARC size, the pauses are for maybe .5 seconds, but as soon as it hits its ARC target, the IOPS drop to around 50% of what they were and then there are the longer pauses of around 4-5 seconds, and after every pause the performance slows even more. So it appears it is definitely server side.

This is with 100% random IO with a spread of 33% write / 66% read, 2KB blocks, over a 50GB file, no compression, and a 5.5GB target ARC size.

Also, I have just run some tests with different IO patterns. 100% sequential writes produce a consistent 2100 IOPS, except when it pauses for maybe .5 seconds every 10-15 seconds.

100% random writes produce around 200 IOPS with a 4-6 second pause around every 10 seconds.

100% sequential reads produce around 3700 IOPS with no pauses, just random peaks in response time (only 16ms) after about 1 minute of running, so nothing to complain about.

100% random reads produce around 200 IOPS, with no pauses.

So it appears that writes cause a problem. What is causing these very long write delays?

A network capture shows that the server doesn't respond to the write from the client when these pauses occur.

Also, when using iometer, the initial file creation doesn't have any pauses, so it might only happen when modifying files.

Any help on finding a solution to this would be really appreciated.

David
--
This message posted from opensolaris.org
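(One simple way to confirm this from the server side while iometer runs is to watch the pool and the underlying disks at one-second intervals; a sketch only, with "tank" used as a placeholder pool name since the real pool name isn't given here:

zpool iostat tank 1
iostat -xn 1

If the client-side pauses line up with bursts of write activity and idle gaps in this output, the stall is in the storage stack rather than in CIFS or the network.)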
On Aug 27, 2009, at 4:30 AM, David Bond <david.bond at tag.no> wrote:

> Hi,
>
> I was directed here after posting in CIFS discuss (as I first thought
> that it could be a CIFS problem).
>
> [original report trimmed]
>
> Any help on finding a solution to this would be really appreciated.

What version? And system configuration?

I think it might be the issue where ZFS/ARC write caches more than the underlying storage can handle writing in a reasonable time.

There is a parameter to control how much is write cached, I believe it is zfs_write_limit_override.

-Ross
On Thu, 27 Aug 2009, David Bond wrote:

> I just noticed that if the server hasn't hit its target ARC size, the
> pauses are for maybe .5 seconds, but as soon as it hits its ARC target,
> the IOPS drop to around 50% of what they were and then there are the
> longer pauses of around 4-5 seconds, and after every pause the
> performance slows even more. So it appears it is definitely server side.

This is known behavior of zfs for asynchronous writes. Recent zfs defers/aggregates writes up to one of these limits:

* 7/8ths of available RAM
* 5 seconds worth of write I/O (full speed write)
* 30 seconds aggregation time

Notice the 5 seconds. This 5 seconds results in the 4-6 second pause, and it seems that the aggregation time is 10 seconds on your system with this write load. Systems with large amounts of RAM encounter this issue more than systems with limited RAM.

I encountered the same problem so I put this in /etc/system:

* Set ZFS maximum TXG group size to 3932160000
set zfs:zfs_write_limit_override = 0xea600000

By limiting the TXG group size, the size of the data burst is limited, but since zfs still writes the TXG as fast as it can, other I/O will cease during that time.

Bob
--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
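(For experimenting without a reboot, the same variable can, as far as I know, be poked on a live kernel with mdb; a sketch only, and the 512MB value (0x20000000) is just an example, not something recommended in the thread:

echo 'zfs_write_limit_override/Z 0x20000000' | mdb -kw
echo 'zfs_write_limit_override/J' | mdb -k

The first line writes the new limit, the second reads it back, so different sizes can be tried while the iometer run is in progress.)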
Hi David,

Just wanted to ask you: how does your Windows server behave during these pauses? Are there any clients connected to it?

The issue you've described might be related to one I saw on my server, see here:
http://www.opensolaris.org/jive/thread.jspa?threadID=110013&tstart=0

I just wonder how Windows behaves during these pauses.

--
Roman Naumenko
roman at frontline.ca
--
This message posted from opensolaris.org
Ross Walker wrote:
> On Aug 27, 2009, at 4:30 AM, David Bond <david.bond at tag.no> wrote:
>
>> [original report trimmed]
>
> What version? And system configuration?
>
> I think it might be the issue where ZFS/ARC write caches more than the
> underlying storage can handle writing in a reasonable time.
>
> There is a parameter to control how much is write cached, I believe it
> is zfs_write_limit_override.

You should be able to disable the write throttle mechanism altogether with the undocumented zfs_no_write_throttle tunable.

I never got around to testing this though ...

> -Ross

--
Med venlig hilsen / Best Regards

Henrik Johansen
henrik at scannet.dk
Tlf. 75 53 35 00

ScanNet Group
A/S ScanNet
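(For anyone wanting to try that, the /etc/system form would presumably follow the same pattern as the write-limit example earlier in the thread; a sketch only, the tunable is undocumented and untested here, and I'm assuming a non-zero value disables the throttle:

* Disable the ZFS write throttle (undocumented, untested)
set zfs:zfs_no_write_throttle = 1

Note that turning the throttle off entirely may make the write bursts larger rather than smoother, so the write-limit override is probably the safer experiment.)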
I saw similar behavior when I was running under the kernel debugger (the -k switch to the kernel). It largely went away when I went back to "normal".

T

David Bond wrote:
> Hi,
>
> I was directed here after posting in CIFS discuss (as I first thought
> that it could be a CIFS problem).
>
> [original report trimmed]
On Aug 27, 2009, at 11:29 AM, Bob Friesenhahn <bfriesen at simple.dallas.tx.us> wrote:

> This is known behavior of zfs for asynchronous writes. Recent zfs
> defers/aggregates writes up to one of these limits:
>
> * 7/8ths of available RAM
> * 5 seconds worth of write I/O (full speed write)
> * 30 seconds aggregation time
>
> [...]
>
> I encountered the same problem so I put this in /etc/system:
>
> * Set ZFS maximum TXG group size to 3932160000
> set zfs:zfs_write_limit_override = 0xea600000

That's the option.

When I was experiencing my writes starving reads I set this to 512MB, the size of the NVRAM cache on my controller, and everything was happy again. Write flushes happened in less than a second and my IO flattened out nicely.

-Ross
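(For reference, a 512MB limit in the /etc/system form Bob showed would presumably look like the line below; the value is just 512 * 1024 * 1024 expressed in hex, sized to match the controller's NVRAM write-back cache:

* Limit TXG size to 512MB to match the controller NVRAM cache
set zfs:zfs_write_limit_override = 0x20000000
)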
Hi, this happens on OpenSolaris builds 101b and 111b. The ARC max is set to 6GB, the server is joined to a Windows 2003 R2 AD domain, and the pool is 4 15Krpm drives in a 2-way mirror. The bnx driver has been changed to have offloading enabled. Not much else has been changed.

Ok, so when the cache fills and needs to be flushed, the flush locks access to it, so no reads or writes can occur from cache, and as everything goes through the ARC, nothing can happen until the ARC has finished its flush?

And to compensate for this, I would have to reduce the cache size to one that is small enough that the disk array can write it at such a speed that the pauses are reduced to ones that are not really noticeable. Wouldn't that then impact the overall burst write performance also?

Why doesn't the ARC allow writes while flushing? Or just have 2 caches so that one can keep taking writes while the other flushes. If it allowed writes to the buffer while it was flushing, it would just reduce the write speed down to what the disks can handle, wouldn't it?

Anyway, thanks for the info. I will give that parameter a go and see how it works.

Thanks
--
This message posted from opensolaris.org
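(For completeness, a 6GB ARC cap of the kind David describes is normally set in /etc/system along the same lines as the other tunables in this thread; a sketch, assuming the usual zfs_arc_max tunable, with the value being 6 * 1024^3 bytes in hex:

* Cap the ARC at 6GB
set zfs:zfs_arc_max = 0x180000000
)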
Ok, so by limiting the write cache to that of the controller you were able to remove the pauses? How did that affect your overall write performance, if at all?

Thanks, I will give that a go.

David
--
This message posted from opensolaris.org
I don't have any Windows machine connected to it over iSCSI (yet). My reference to the Windows servers was that the same hardware running Windows doesn't show these problems on reads and writes, so it isn't the hardware causing it. But when I do eventually get iSCSI going I will send a message if I have the same problems.

Also, with your replication, what's the performance like? Does having it enabled impact the overall write performance of your server? Is the replication continuous?

David
--
This message posted from opensolaris.org
On Sat, 29 Aug 2009, David Bond wrote:

> Ok, so when the cache fills and needs to be flushed, the flush locks
> access to it, so no reads or writes can occur from cache, and as
> everything goes through the ARC, nothing can happen until the ARC has
> finished its flush.

It has not been proven that reads from the ARC stop. It is clear that reads from physical disk temporarily stop. It is not clear (to me) if reads from physical disk stop because of the huge number of TXG sync write operations (up to 5 seconds worth) which are queued prior to the read request, or if reads are intentionally blocked due to some sort of coherency management.

> And to compensate for this, I would have to reduce the cache size to
> one that is small enough that the disk array can write it at such a
> speed that the pauses are reduced to ones that are not really
> noticeable.

That would work. There is likely to be more total physical I/O though, since delaying the writes tends to eliminate many redundant writes. For example, an application which re-writes the same file over and over again would be sending more of that data to physical disk.

Bob
--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
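(One way to see whether reads stop at the physical disks, as opposed to in the ARC, is to watch per-device statistics while the burst is being written; a sketch:

iostat -xn 1

If the r/s column for the pool's disks drops to zero exactly while w/s spikes, then reads that miss the ARC are being starved by the TXG sync at the device level.)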
"100% random writes produce around 200 IOPS with a 4-6 second pause around every 10 seconds. " This indicates that the bandwidth you''re able to transfer through the protocol is about 50% greater than the bandwidth the pool can offer to ZFS. Since, this is is not sustainable, you see here ZFS trying to balance the 2 numbers. -r David Bond writes: > Hi, > > I was directed here after posting in CIFS discuss (as i first thought that it could be a CIFS problem). > > I posted the following in CIFS: > > When using iometer from windows to the file share on opensolaris > svn101 and svn111 I get pauses every 5 seconds of around 5 seconds > (maybe a little less) where no data is transfered, when data is > transfered it is at a fair speed and gets around 1000-2000 iops with 1 > thread (depending on the work type). The maximum read response time is > 200ms and the maximum write response time is 9824ms, which is very > bad, an almost 10 seconds delay in being able to send data to the > server. > This has been experienced on 2 test servers, the same servers have > also been tested with windows server 2008 and they havent shown this > problem (the share performance was slightly lower than CIFS, but it > was consistent, and the average access time and maximums were very > close. > > > I just noticed that if the server hasnt hit its target arc size, the > pauses are for maybe .5 seconds, but as soon as it hits its arc > target, the iops drop to around 50% of what it was and then there are > the longer pauses around 4-5 seconds. and then after every pause the > performance slows even more. So it appears it is definately server > side. > > This is with 100% random io with a spread of 33% write 66% read, 2KB > blocks. over a 50GB file, no compression, and a 5.5GB target arc > size. > > > > Also I have just ran some tests with different IO patterns and 100 > sequencial writes produce and consistent IO of 2100IOPS, except when > it pauses for maybe .5 seconds every 10 - 15 seconds. > > 100% random writes produce around 200 IOPS with a 4-6 second pause > around every 10 seconds. > > 100% sequencial reads produce around 3700IOPS with no pauses, just > random peaks in response time (only 16ms) after about 1 minute of > running, so nothing to complain about. > > 100% random reads produce around 200IOPS, with no pauses. > > So it appears that writes cause a problem, what is causing these very > long write delays? > > A network capture shows that the server doesnt respond to the write > from the client when these pauses occur. > > Also, when using iometer, the initial file creation doesnt have and > pauses in the creation, so it might only happen when modifying > files. > > Any help on finding a solution to this would be really appriciated. > > David > -- > This message posted from opensolaris.org > _______________________________________________ > zfs-discuss mailing list > zfs-discuss at opensolaris.org > http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Roch Bourbonnais wrote:
> "100% random writes produce around 200 IOPS with a 4-6 second pause
> around every 10 seconds."
>
> This indicates that the bandwidth you're able to transfer through the
> protocol is about 50% greater than the bandwidth the pool can offer to
> ZFS. Since this is not sustainable, you see here ZFS trying to balance
> the 2 numbers.

When I have tested using 50% reads, 60% random using iometer over NFS, I can see the data going straight to disk due to the sync nature of NFS. But I also see writes coming to a standstill every 10 seconds or so, which I have attributed to the ZIL dumping to disk. Therefore I conclude that it is the process of dumping the ZIL to disk that (mostly?) blocks writes during the dumping.

I do agree with Bob and others who suggest that making the size of the dump smaller will mask this behavior, and that seems like a good idea, although I have not yet tried and tested it myself.

-Scott
--
This message posted from opensolaris.org
On 09/04/09 09:54, Scott Meilicke wrote:

> When I have tested using 50% reads, 60% random using iometer over NFS,
> I can see the data going straight to disk due to the sync nature of
> NFS. But I also see writes coming to a standstill every 10 seconds or
> so, which I have attributed to the ZIL dumping to disk. Therefore I
> conclude that it is the process of dumping the ZIL to disk that
> (mostly?) blocks writes during the dumping.

The ZIL does not work like that. It is not a journal.

Under a typical write load, write transactions are batched and written out in a group transaction (txg). This txg sync occurs every 30s under light load but more frequently or continuously under heavy load.

When writing synchronous data (eg NFS) the transactions get written immediately to the intent log and are made stable. When the txg later commits, the intent log blocks containing those committed transactions can be freed.

So as you can see, there is no periodic dumping of the ZIL to disk. What you are probably observing is the periodic txg commit.

Hope that helps: Neil.
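(If you want to check that the stalls line up with the txg commit Neil describes, one way is to time spa_sync() in the kernel with DTrace; a rough sketch, not verified on the exact builds discussed here:

dtrace -n 'fbt:zfs:spa_sync:entry { self->ts = timestamp; }
           fbt:zfs:spa_sync:return /self->ts/ {
               printf("spa_sync took %d ms", (timestamp - self->ts) / 1000000);
               self->ts = 0;
           }'

If each client-side pause coincides with a multi-second spa_sync, the culprit is the txg commit rather than the ZIL.)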
So what happens during the txg commit?

For example, if the ZIL is a separate device, SSD for this example, does it not work like:

1. A sync operation commits the data to the SSD
2. A txg commit happens, and the data from the SSD are written to the spinning disk

So this is two writes, correct?

-Scott
--
This message posted from opensolaris.org
On Fri, 4 Sep 2009, Scott Meilicke wrote:

> So what happens during the txg commit?
>
> For example, if the ZIL is a separate device, SSD for this example,
> does it not work like:
>
> 1. A sync operation commits the data to the SSD
> 2. A txg commit happens, and the data from the SSD are written to the
>    spinning disk
>
> So this is two writes, correct?

From past descriptions, the slog is basically a list of pending write system calls. The only time the slog is read is after a reboot. Otherwise, the slog is simply updated as write operations proceed.

Bob
--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
I am still not buying it :) I need to research this to satisfy myself.

I can understand that the writes come from memory to disk during a txg write for async, and that is the behavior I see in testing.

But for sync, data must be committed, and an SSD/ZIL makes that faster because you are writing to the SSD/ZIL, and not to spinning disk. Eventually that data on the SSD must get to spinning disk.

To the books I go!

-Scott
--
This message posted from opensolaris.org
Scott Meilicke wrote:
> So what happens during the txg commit?
>
> For example, if the ZIL is a separate device, SSD for this example,
> does it not work like:
>
> 1. A sync operation commits the data to the SSD
> 2. A txg commit happens, and the data from the SSD are written to the
>    spinning disk

#1 is correct. #2 is incorrect. The TXG commit goes from memory into the main pool. The SSD data is simply left there in case something bad happens before the TXG commit succeeds. Once it succeeds, then the SSD data can be overwritten.

The only time you need to read from a ZIL device is if a crash occurs and you need those blocks to repair the pool.

Eric
Doh! I knew that, but then forgot...

So, for the case of no separate device for the ZIL, the ZIL lives on the disk pool. In which case, the data are written to the pool twice during a sync:

1. To the ZIL (on disk)
2. From RAM to disk during the txg commit

If this is correct (and my history in this thread is not so good, so...), would that then explain some sort of pulsing write behavior for sync write operations?
--
This message posted from opensolaris.org
So, I just re-read the thread, and you can forget my last post. I had thought the argument was that the data were not being written to disk twice (assuming no separate device for the ZIL), but it was just explained to me that the data are not read from the ZIL and written to disk, but rather written from memory to disk. I need more coffee...
--
This message posted from opensolaris.org
Scott Meilicke wrote:
> I am still not buying it :) I need to research this to satisfy myself.
>
> I can understand that the writes come from memory to disk during a txg
> write for async, and that is the behavior I see in testing.
>
> But for sync, data must be committed, and an SSD/ZIL makes that faster
> because you are writing to the SSD/ZIL, and not to spinning disk.
> Eventually that data on the SSD must get to spinning disk.

But the txg (which may contain more data than just the sync data that was written to the ZIL) is still written from memory. Just because the sync data was written to the ZIL doesn't mean it's not still in memory.

-Kyle

> To the books I go!
>
> -Scott
On Sep 4, 2009, at 2:22 PM, Scott Meilicke <scott.meilicke at craneaerospace.com> wrote:

> So, I just re-read the thread, and you can forget my last post. I had
> thought the argument was that the data were not being written to disk
> twice (assuming no separate device for the ZIL), but it was just
> explained to me that the data are not read from the ZIL and written to
> disk, but rather written from memory to disk. I need more coffee...

I think you're confusing ARC write-back with the ZIL, and it isn't the sync writes that are blocking IO, it's the async writes that have been cached and are now being flushed.

Just tell the ARC to cache less IO for your hardware with the kernel config Bob mentioned way back.

-Ross
Yes, I was getting confused. Thanks to you (and everyone else) for clarifying.

Sync or async, I see the txg flushing to disk starve read IO.

Scott
--
This message posted from opensolaris.org
On Sep 4, 2009, at 4:33 PM, Scott Meilicke <scott.meilicke at craneaerospace.com> wrote:

> Yes, I was getting confused. Thanks to you (and everyone else) for
> clarifying.
>
> Sync or async, I see the txg flushing to disk starve read IO.

Well, try the kernel setting and see how it helps.

Honestly though, if you can say it's all sync writes with certainty and IO is still blocking, you need a better storage sub-system, or an additional pool.

-Ross
I only see the blocking while load testing, not during regular usage, so I am not so worried. I will try the kernel settings to see if that helps if/when I see the issue in production.

For what it is worth, here is the pattern I see when load testing NFS (iometer, 60% random, 65% read, 8k chunks, 32 outstanding I/Os):

data01      59.6G  20.4T     46     24   757K  3.09M
data01      59.6G  20.4T     39     24   593K  3.09M
data01      59.6G  20.4T     45     25   687K  3.22M
data01      59.6G  20.4T     45     23   683K  2.97M
data01      59.6G  20.4T     33     23   492K  2.97M
data01      59.6G  20.4T     16     41   214K  1.71M
data01      59.6G  20.4T      3  2.36K  53.4K  30.4M
data01      59.6G  20.4T      1  2.23K  20.3K  29.2M
data01      59.6G  20.4T      0  2.24K  30.2K  28.9M
data01      59.6G  20.4T      0  1.93K  30.2K  25.1M
data01      59.6G  20.4T      0  2.22K      0  28.4M
data01      59.7G  20.4T     21    295   317K  4.48M
data01      59.7G  20.4T     32     12   495K  1.61M
data01      59.7G  20.4T     35     25   515K  3.22M
data01      59.7G  20.4T     36     11   522K  1.49M
data01      59.7G  20.4T     33     24   508K  3.09M

LSI SAS HBA, 3 x 5 disk raidz, Dell 2950, 16GB RAM.

-Scott
--
This message posted from opensolaris.org
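(That output looks like one-second zpool iostat samples, i.e. presumably something along the lines of:

zpool iostat data01 1

The columns are pool, allocated, free, read ops/s, write ops/s, read bandwidth and write bandwidth. The run of rows with ~2.2K write ops and reads pinned at 0-3 is presumably the txg commit discussed above starving the reads.)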
On Fri, 4 Sep 2009, Scott Meilicke wrote:

> I only see the blocking while load testing, not during regular usage,
> so I am not so worried. I will try the kernel settings to see if that
> helps if/when I see the issue in production.

The flipside of the "pulsing" is that the deferred writes diminish contention for precious read IOPS, and quite a few programs have a habit of updating/rewriting a file over and over again. If the file is completely asynchronously rewritten once per second and zfs writes a transaction group every 30 seconds, then 29 of those updates avoided consuming write IOPS. Another benefit is that if zfs has more data in hand to write, then it can do a much better job of avoiding fragmentation, avoid unnecessary COW by diminishing short tail writes, and achieve more optimum write patterns.

Bob
--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
On Sep 4, 2009, at 6:33 PM, Bob Friesenhahn <bfriesen at simple.dallas.tx.us> wrote:

> The flipside of the "pulsing" is that the deferred writes diminish
> contention for precious read IOPS, and quite a few programs have a
> habit of updating/rewriting a file over and over again. If the file is
> completely asynchronously rewritten once per second and zfs writes a
> transaction group every 30 seconds, then 29 of those updates avoided
> consuming write IOPS. Another benefit is that if zfs has more data in
> hand to write, then it can do a much better job of avoiding
> fragmentation, avoid unnecessary COW by diminishing short tail writes,
> and achieve more optimum write patterns.

I guess one can find a silver lining in any grey cloud, but for myself I'd just rather see a more linear approach to writes. Anyway, I have never seen any reads happen during these write flushes.

-Ross
On Sep 4, 2009, at 5:25 PM, Scott Meilicke <scott.meilicke at craneaerospace.com> wrote:

> I only see the blocking while load testing, not during regular usage,
> so I am not so worried. I will try the kernel settings to see if that
> helps if/when I see the issue in production.
>
> For what it is worth, here is the pattern I see when load testing NFS
> (iometer, 60% random, 65% read, 8k chunks, 32 outstanding I/Os):
>
> [zpool iostat output trimmed]
>
> LSI SAS HBA, 3 x 5 disk raidz, Dell 2950, 16GB RAM.

With that setup you'll see at most 3x the IOPS of the type of disks used, which is not really the kind of setup for a 60% random workload. Assuming 2TB SATA drives, the max would be around 240 IOPS. Now if it were mirror vdevs you'd get 7x, or 560 IOPS.

Is this for VMware or data warehousing?

You'll also need an SSD drive in the mix if you're not using a controller with NVRAM write-back, especially when sharing over NFS.

I guess since it's 15 drives it's an MD1000. I might have gone with the newer 2.5" drive enclosure as it holds 24 drives over 15, and most SSDs come in 2.5". Since you've got it already, invest in a PERC 6/E with 512MB of cache and stick it in the other PCIe 8x slot.

-Ross
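(The arithmetic behind those estimates, assuming roughly 80 random IOPS per 7,200 rpm SATA drive, which is the figure the 240 number implies:

3 x 5-disk raidz vdevs        ->  3 vdevs * ~80 IOPS  ~= 240 IOPS
7 x 2-way mirrors (+1 spare)  ->  7 vdevs * ~80 IOPS  ~= 560 IOPS

Each raidz vdev behaves like a single disk for small random I/O, so it is the vdev count, not the disk count, that sets the random IOPS ceiling.)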
On Fri, 4 Sep 2009, Ross Walker wrote:

> I guess one can find a silver lining in any grey cloud, but for myself
> I'd just rather see a more linear approach to writes. Anyway, I have
> never seen any reads happen during these write flushes.

I have yet to see a read happen during the write flush either. That impacts my application since it needs to read in order to proceed, and it does a similar amount of writes as it does reads.

Bob
--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
On Sep 4, 2009, at 8:59 PM, Bob Friesenhahn <bfriesen at simple.dallas.tx.us> wrote:

> I have yet to see a read happen during the write flush either. That
> impacts my application since it needs to read in order to proceed, and
> it does a similar amount of writes as it does reads.

The ARC makes it hard to tell if they are satisfied from cache or blocked due to writes.

I suppose if you have the hardware to go sync, that might be the best bet. That and limiting the write cache.

Though I have only heard good comments from my ESX admins since moving the VMs off iSCSI and on to ZFS over NFS, so it can't be that bad.

-Ross
On Fri, 4 Sep 2009, Ross Walker wrote:

> > I have yet to see a read happen during the write flush either. That
> > impacts my application since it needs to read in order to proceed,
> > and it does a similar amount of writes as it does reads.
>
> The ARC makes it hard to tell if they are satisfied from cache or
> blocked due to writes.

The existing prefetch bug makes it doubly hard. :-)

First I complained about the blocking reads, and then I complained about the blocking writes (presumed responsible for the blocking reads), and now I am waiting for working prefetch in order to feed my hungry application.

Bob
--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
On Sep 4, 2009, at 21:44, Ross Walker wrote:

> Though I have only heard good comments from my ESX admins since moving
> the VMs off iSCSI and on to ZFS over NFS, so it can't be that bad.

What's your pool configuration? Striped mirrors? RAID-Z with SSDs? Other?
On Sep 4, 2009, at 10:02 PM, David Magda <dmagda at ee.ryerson.ca> wrote:

> What's your pool configuration? Striped mirrors? RAID-Z with SSDs?
> Other?

Striped mirrors off an NVRAM-backed controller (Dell PERC 6/E).

RAID-Z isn't the best for many VMs as the whole vdev acts as a single disk for random IO.

-Ross
True, this setup is not designed for high random I/O, but rather lots of storage with fair performance. This box is for our dev/test backend storage. Our production VI runs in the 500-700 IOPS range (80+ VMs, production plus dev/test) on average, so for our development VI we are expecting half of that at most, on average. Testing with parameters that match the observed behavior of the production VI gets us about 750 IOPS with compression (NFS, 2009.06), so I am happy with the performance and very happy with the amount of available space.

Striped mirrors are much faster, ~2200 IOPS with 16 disks (but alas, tested with iSCSI on 2008.11, compression on; we got about 1,000 IOPS with the 3x5 raidz setup with compression, to compare iSCSI on 2008.11 vs NFS on 2009.06), but again we are shooting for available space, with performance being a secondary goal. And yes, we would likely get much better performance using SSDs for the ZIL and L2ARC.

This has been an interesting thread! Sorry for the bit of hijacking...
--
This message posted from opensolaris.org
I'm playing around with a home raidz2 install and I can see this pulsing as well. The only difference is I have 6 external USB drives with activity lights on them, so I can see what's actually being written to the disk and when :)

What I see is about 8 second pauses while data is being sent over the network into what appears to be some sort of in-memory cache. Then the cache is flushed to disk and the drives all spring into life, and the network activity drops to zero. After a few seconds of writing, the drives stop and the whole process begins again.

Something funny is going on there...
--
This message posted from opensolaris.org