My application processes thousands of files sequentially, reading input files, and outputting new files. I am using Solaris 10U4. While running the application in a verbose mode, I see that it runs very fast but pauses about every 7 seconds for a second or two. This is while reading 50MB/second and writing 73MB/second (ARC cache miss rate of 87%). The pause does not occur if the application spends more time doing real work. However, it would be nice if the pause went away.

I have tried turning down the ARC size (from 14GB to 10GB) but the behavior did not noticeably improve. The storage device is trained to ignore cache flush requests. According to the Evil Tuning Guide, the pause I am seeing is due to a cache flush after the uberblock updates.

It does not seem like a wise choice to disable ZFS cache flushing entirely. Is there a better way other than adding a small delay into my application?

Bob
=====================================
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
Bob Friesenhahn wrote:
> My application processes thousands of files sequentially, reading
> input files, and outputting new files. I am using Solaris 10U4.
> While running the application in a verbose mode, I see that it runs
> very fast but pauses about every 7 seconds for a second or two.

When you experience the pause at the application level, do you see an increase in writes to disk? This might be the regular syncing of the transaction group to disk. This is normal behavior. The "amount" of pause is determined by how much data needs to be synced. You could of course decrease it by reducing the time between syncs (either by reducing the ARC and/or decreasing txg_time); however, I am not sure it will translate to better performance for you.

hth,
-neel
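For readers who want to poke at the two knobs Neel mentions, here is a rough sketch of how to inspect them on bits of this vintage. The tunable name varies across builds (txg_time on older bits, zfs_txg_synctime on newer ones), so treat the names below as assumptions to verify against your own kernel:

  # current ARC size and target, from the arcstats kstat
  kstat -p zfs:0:arcstats:size zfs:0:arcstats:c

  # current transaction group interval, in seconds
  echo txg_time/D | mdb -k

  # cap the ARC across reboots via /etc/system (example value: 10GB)
  #   set zfs:zfs_arc_max=0x280000000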
The question is: does the "IO pausing" behaviour you noticed penalize your application? What are the consequences at the application level?

For instance, we have seen applications doing some kind of data capture from an external device (video, for example) requiring a constant throughput to disk (data feed), risking otherwise loss of data. In this case QFS might be a better option (not free, though).

If your application is not suffering, then you should be able to live with these apparent "IO hangs".

s-

On Thu, Mar 27, 2008 at 3:35 AM, Bob Friesenhahn <bfriesen at simple.dallas.tx.us> wrote:
> My application processes thousands of files sequentially, reading
> input files, and outputting new files. [...]

--
------------------------------------------------------
Blog: http://fakoli.blogspot.com/
On Wed, 26 Mar 2008, Neelakanth Nadgir wrote:
> When you experience the pause at the application level,
> do you see an increase in writes to disk? This might be the
> regular syncing of the transaction group to disk.

If I use 'zpool iostat' with a one second interval, what I see is two or three samples with no write I/O at all followed by a huge write of 100 to 312MB/second. Writes claimed to be at a lower rate are split across two sample intervals. It seems that writes are being cached and then issued all at once. This behavior assumes that the file may be written multiple times, so a delayed write is more efficient.

If I run a script like

  while true
  do
    sync
  done

then the write data rate is much more consistent (at about 66MB/second) and the program does not stall. Of course this is not very efficient.

Are the 'zpool iostat' statistics accurate?

Bob
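One way to check whether these write bursts line up with the transaction group sync is to time spa_sync() directly. A rough DTrace sketch, assuming the fbt provider and the spa_sync symbol are available on this kernel:

  # dtrace -n '
    fbt::spa_sync:entry  { self->ts = timestamp; }
    fbt::spa_sync:return /self->ts/ {
            @["spa_sync duration (ms)"] = quantize((timestamp - self->ts) / 1000000);
            self->ts = 0;
    }
    tick-10sec { printa(@); trunc(@); }'

If the application stalls coincide with spa_sync firing every few seconds, that supports the regular-syncing explanation.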
Selim Daoud wrote:
> the question is: does the "IO pausing" behaviour you noticed penalize
> your application?
> what are the consequences at the application level?
> [...]

I would look at txg_time first... for lots of streaming writes on a machine with limited memory, you can smooth out the sawtooth.

QFS is open sourced.
http://blogs.sun.com/samqfs
-- richard
Bob Friesenhahn wrote:
> On Wed, 26 Mar 2008, Neelakanth Nadgir wrote:
>> When you experience the pause at the application level,
>> do you see an increase in writes to disk? This might be the
>> regular syncing of the transaction group to disk.
>
> If I use 'zpool iostat' with a one second interval, what I see is two
> or three samples with no write I/O at all followed by a huge write of
> 100 to 312MB/second. Writes claimed to be at a lower rate are split
> across two sample intervals.
>
> It seems that writes are being cached and then issued all at once.
> This behavior assumes that the file may be written multiple times, so a
> delayed write is more efficient.

This does sound like the regular syncing.

> If I run a script like
>
>   while true
>   do
>     sync
>   done
>
> then the write data rate is much more consistent (at about
> 66MB/second) and the program does not stall. Of course this is not
> very efficient.

This causes the sync to happen much faster, but as you say, suboptimal. Haven't had the time to go through the bug report, but probably CR 6429205 ("each zpool needs to monitor its throughput and throttle heavy writers") will help.

> Are the 'zpool iostat' statistics accurate?

Yes. You could also look at regular iostat and correlate it.

-neel
On Thu, 27 Mar 2008, Neelakanth Nadgir wrote:
>
> This causes the sync to happen much faster, but as you say, suboptimal.
> Haven't had the time to go through the bug report, but probably
> CR 6429205 ("each zpool needs to monitor its throughput
> and throttle heavy writers") will help.

I hope that this feature is implemented soon, and works well. :-)

I tested with my application outputting to a UFS filesystem on a single 15K RPM SAS disk and saw that it writes about 50MB/second, without the bursty behavior of ZFS. When writing to a ZFS filesystem on a RAID array, 'zpool iostat' reports an average (over 10 seconds) write rate of 54MB/second. Given that the throughput is not much higher on the RAID array, I assume that the bottleneck is in my application.

>> Are the 'zpool iostat' statistics accurate?
>
> Yes. You could also look at regular iostat
> and correlate it.

iostat shows that my RAID array disks are loafing, with only 9MB/second written to each but with 82 writes/second.

Bob
On Mar 27, 2008, at 9:24 AM, Bob Friesenhahn wrote:
> On Thu, 27 Mar 2008, Neelakanth Nadgir wrote:
>>
>> This causes the sync to happen much faster, but as you say,
>> suboptimal.
>> Haven't had the time to go through the bug report, but probably
>> CR 6429205 ("each zpool needs to monitor its throughput
>> and throttle heavy writers") will help.
>
> I hope that this feature is implemented soon, and works well. :-)

Actually, this has gone back into snv_87 (and no, we don't know which s10uX it will go into yet).

eric
You may want to try disabling the disk write cache on the single disk. Also, for the RAID, disable 'host cache flush' if such an option exists. That solved the problem for me. Let me know.

Bob Friesenhahn <bfriesen at simple.dallas.tx.us> wrote:
> I tested with my application outputting to a UFS filesystem on a single
> 15K RPM SAS disk and saw that it writes about 50MB/second, without
> the bursty behavior of ZFS. [...]
Hello eric,

Thursday, March 27, 2008, 9:36:42 PM, you wrote:

ek> Actually, this has gone back into snv_87 (and no, we don't know which
ek> s10uX it will go into yet).

Could you share more details on how it works now, after the change?

--
Best regards,
Robert Milkowski                      mailto:milek at task.gda.pl
                                      http://milek.blogspot.com
ZFS has always done a certain amount of "write throttling". In the past (or the present, for those of you running S10 or pre-build-87 bits) this throttling was controlled by a timer and the size of the ARC: we would "cut" a transaction group every 5 seconds based off of our timer, and we would also "cut" a transaction group if we had more than 1/4 of the ARC size worth of dirty data in the transaction group. So, for example, if you have a machine with 16GB of physical memory it wouldn't be unusual to see an ARC size of around 12GB. This means we would allow up to 3GB of dirty data into a single transaction group (if the writes complete in less than 5 seconds). Now we can have up to three transaction groups "in progress" at any time: open context, quiesce context, and sync context. As a final wrinkle, we also don't allow more than 1/2 the ARC to be composed of dirty write data. All taken together, this means that there can be up to 6GB of writes "in the pipe" (using the 12GB ARC example from above).

Problems with this design start to show up when the write-to-disk bandwidth can't keep up with the application: if the application is writing at a rate of, say, 1GB/sec, it will "fill the pipe" within 6 seconds. But if the IO bandwidth to disk is only 512MB/sec, it's going to take 12sec to get this data onto the disk. This "impedance mis-match" is going to manifest as pauses: the application fills the pipe, then waits for the pipe to empty, then starts writing again. Note that this won't be smooth, since we need to complete an entire sync phase before allowing things to progress. So you can end up with IO gaps. This is probably what the original submitter is experiencing. Note there are a few other subtleties here that I have glossed over, but the general picture is accurate.

The new write throttle code put back into build 87 attempts to smooth out the process. We now measure the amount of time it takes to sync each transaction group, and the amount of data in that group. We dynamically resize our write throttle to try to keep the sync time constant (at 5secs) under write load. We also introduce "fairness" delays on writers when we near pipeline capacity: each write is delayed 1/100sec when we are about to "fill up". This prevents a single heavy writer from "starving out" occasional writers. So instead of coming to an abrupt halt when the pipeline fills, we slow down our write pace. The result should be a constant, even IO load.

There is one "down side" to this new model: if a write load is very "bursty", e.g., a large 5GB write followed by 30secs of idle, the new code may be less efficient than the old. In the old code, all of this IO would be let in at memory speed and then more slowly make its way out to disk. In the new code, the writes may be slowed down. The data makes its way to the disk in the same amount of time, but the application takes longer. Conceptually: we are sizing the write buffer to the pool bandwidth, rather than to the memory size.

Robert Milkowski wrote:
> Could you share more details on how it works now, after the change?
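To put Mark's 16GB example in one place, here is the arithmetic worked through with bc (these are just the figures from his explanation above, not new measurements):

  $ bc
  scale=1
  12 / 4     /* old throttle: up to 1/4 of a ~12GB ARC dirty per txg */
  3.0
  12 / 2     /* no more than 1/2 of the ARC dirty overall: the "pipe" */
  6.0
  6 / 1      /* at 1GB/sec the application fills that pipe in ~6 sec */
  6.0
  6 / .5     /* at 512MB/sec to disk it takes ~12 sec to drain */
  12.0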
On Tue, 15 Apr 2008, Mark Maybee wrote:
> going to take 12sec to get this data onto the disk. This "impedance
> mis-match" is going to manifest as pauses: the application fills
> the pipe, then waits for the pipe to empty, then starts writing again.
> Note that this won't be smooth, since we need to complete an entire
> sync phase before allowing things to progress. So you can end up
> with IO gaps. This is probably what the original submitter is

Yes. With an application which also needs to make best use of available CPU, these I/O "gaps" cut into available CPU time (by blocking the process) unless the application uses multithreading and an intermediate write queue (more memory) to separate the CPU-centric parts from the I/O-centric parts. While the single-threaded application is waiting for data to be written, it is not able to read and process more data. Since reads take time to complete, being blocked on write stops new reads from being started so the data is ready when it is needed.

> There is one "down side" to this new model: if a write load is very
> "bursty", e.g., a large 5GB write followed by 30secs of idle, the
> new code may be less efficient than the old. In the old code, all

This is also a common scenario. :-)

Presumably the special "slow I/O" code would not kick in unless the burst was large enough to fill quite a bit of the ARC. Real time throttling is quite a challenge to do in software.

Bob
Hello Mark,

Tuesday, April 15, 2008, 8:32:32 PM, you wrote:

MM> There is one "down side" to this new model: if a write load is very
MM> "bursty", e.g., a large 5GB write followed by 30secs of idle, the
MM> new code may be less efficient than the old. [...]
MM> Conceptually: we are sizing the write buffer to the pool bandwidth,
MM> rather than to the memory size.

First - thank you for your explanation - it is very helpful.

I'm worried about the last part - but it's hard to be optimal for all workloads. Nevertheless, sometimes the problem is that you change the behavior from the application's perspective. With other file systems I guess you are able to fill most of memory and still keep the disks busy 100% of the time without IO gaps. My biggest concern is these gaps in IO, as ZFS should keep the disks 100% busy if needed.

--
Best regards,
Robert Milkowski
Bob Friesenhahn writes:
 > Presumably the special "slow I/O" code would not kick in unless the
 > burst was large enough to fill quite a bit of the ARC.

Bursts of 1/8th of physical memory or 5 seconds of storage throughput, whichever is smallest.

-r

 > Real time throttling is quite a challenge to do in software.
 >
 > Bob
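For the same 16GB machine and 512MB/sec pool used in Mark's example, that limit works out to roughly (simple arithmetic on figures already in the thread, nothing more):

  $ bc
  scale=1
  16 / 8     /* 1/8th of 16GB of physical memory */
  2.0
  5 * .5     /* 5 seconds at ~512MB/sec of pool throughput */
  2.5

i.e. about a 2GB burst before the throttle starts delaying writers.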
Hello Mark, Tuesday, April 15, 2008, 8:32:32 PM, you wrote: MM> The new write throttle code put back into build 87 attempts to MM> smooth out the process. We now measure the amount of time it takes MM> to sync each transaction group, and the amount of data in that group. MM> We dynamically resize our write throttle to try to keep the sync MM> time constant (at 5secs) under write load. We also introduce MM> "fairness" delays on writers when we near pipeline capacity: each MM> write is delayed 1/100sec when we are about to "fill up". This MM> prevents a single heavy writer from "starving out" occasional MM> writers. So instead of coming to an abrupt halt when the pipeline MM> fills, we slow down our write pace. The result should be a constant MM> even IO load. snv_91, 48x 500GB sata drives in one large stripe: # zpool create -f test c1t0d0 c1t1d0 c1t2d0 c1t3d0 c1t4d0 c1t5d0 c1t6d0 c1t7d0 c2t0d0 c2t1d0 c2t2d0 c2t3d0 c2t4d0 c2t5d0 c2t6d0 c2t7d0 c3t0d0 c3t1d0 c3t2d0 c3t3d0 c3t4d0 c3t5d0 c3t6d0 c3t7d0 c4t0d0 c4t1d0 c4t2d0 c4t3d0 c4t4d0 c4t5d0 c4t6d0 c4t7d0 c5t0d0 c5t1d0 c5t2d0 c5t3d0 c5t4d0 c5t5d0 c5t6d0 c5t7d0 c6t0d0 c6t1d0 c6t2d0 c6t3d0 c6t4d0 c6t5d0 c6t6d0 c6t7d0 # zfs set atime=off test # dd if=/dev/zero of=/test/q1 bs=1024k ^C34374+0 records in 34374+0 records out # zpool iostat 1 capacity operations bandwidth pool used avail read write read write ---------- ----- ----- ----- ----- ----- ----- [...] test 58.9M 21.7T 0 1.19K 0 80.8M test 862M 21.7T 0 6.67K 0 776M test 1.52G 21.7T 0 5.50K 0 689M test 1.52G 21.7T 0 9.28K 0 1.16G test 2.88G 21.7T 0 1.14K 0 135M test 2.88G 21.7T 0 1.61K 0 206M test 2.88G 21.7T 0 18.0K 0 2.24G test 5.60G 21.7T 0 79 0 264K test 5.60G 21.7T 0 0 0 0 test 5.60G 21.7T 0 10.9K 0 1.36G test 9.59G 21.7T 0 7.09K 0 897M test 9.59G 21.7T 0 0 0 0 test 9.59G 21.7T 0 6.33K 0 807M test 9.59G 21.7T 0 17.9K 0 2.24G test 13.6G 21.7T 0 1.96K 0 239M test 13.6G 21.7T 0 0 0 0 test 13.6G 21.7T 0 11.9K 0 1.49G test 17.6G 21.7T 0 9.91K 0 1.23G test 17.6G 21.7T 0 0 0 0 test 17.6G 21.7T 0 5.48K 0 700M test 17.6G 21.7T 0 20.0K 0 2.50G test 21.6G 21.7T 0 2.03K 0 244M test 21.6G 21.7T 0 0 0 0 test 21.6G 21.7T 0 0 0 0 test 21.6G 21.7T 0 4.03K 0 513M test 21.6G 21.7T 0 23.7K 0 2.97G test 25.6G 21.7T 0 1.83K 0 225M test 25.6G 21.7T 0 0 0 0 test 25.6G 21.7T 0 13.9K 0 1.74G test 29.6G 21.7T 1 1.40K 127K 167M test 29.6G 21.7T 0 0 0 0 test 29.6G 21.7T 0 7.14K 0 912M test 29.6G 21.7T 0 19.2K 0 2.40G test 33.6G 21.7T 1 378 127K 34.8M test 33.6G 21.7T 0 0 0 0 ^C Well, doesn''t actually look good. Checking with iostat I don''t see any problems like long service times, etc. Reducing zfs_txg_synctime to 1 helps a little bit but still it''s not even stream of data. If I start 3 dd streams at the same time then it is slightly better (zfs_txg_synctime set back to 5) but still very jumpy. 
Reading with one dd produces steady throghput but I''m disapointed with actual performance: test 161G 21.6T 9.94K 0 1.24G 0 test 161G 21.6T 10.0K 0 1.25G 0 test 161G 21.6T 10.3K 0 1.29G 0 test 161G 21.6T 10.1K 0 1.27G 0 test 161G 21.6T 10.4K 0 1.31G 0 test 161G 21.6T 10.1K 0 1.27G 0 test 161G 21.6T 10.4K 0 1.30G 0 test 161G 21.6T 10.2K 0 1.27G 0 test 161G 21.6T 10.3K 0 1.29G 0 test 161G 21.6T 10.0K 0 1.25G 0 test 161G 21.6T 9.96K 0 1.24G 0 test 161G 21.6T 10.6K 0 1.33G 0 test 161G 21.6T 10.1K 0 1.26G 0 test 161G 21.6T 10.2K 0 1.27G 0 test 161G 21.6T 10.4K 0 1.30G 0 test 161G 21.6T 9.62K 0 1.20G 0 test 161G 21.6T 8.22K 0 1.03G 0 test 161G 21.6T 9.61K 0 1.20G 0 test 161G 21.6T 10.2K 0 1.28G 0 test 161G 21.6T 9.12K 0 1.14G 0 test 161G 21.6T 9.96K 0 1.25G 0 test 161G 21.6T 9.72K 0 1.22G 0 test 161G 21.6T 10.6K 0 1.32G 0 test 161G 21.6T 9.93K 0 1.24G 0 test 161G 21.6T 9.94K 0 1.24G 0 zpool scrub produces: test 161G 21.6T 25 69 2.70M 392K test 161G 21.6T 10.9K 0 1.35G 0 test 161G 21.6T 13.4K 0 1.66G 0 test 161G 21.6T 13.2K 0 1.63G 0 test 161G 21.6T 11.8K 0 1.46G 0 test 161G 21.6T 13.8K 0 1.72G 0 test 161G 21.6T 12.4K 0 1.53G 0 test 161G 21.6T 12.9K 0 1.59G 0 test 161G 21.6T 12.9K 0 1.59G 0 test 161G 21.6T 13.4K 0 1.67G 0 test 161G 21.6T 12.2K 0 1.51G 0 test 161G 21.6T 12.9K 0 1.59G 0 test 161G 21.6T 12.5K 0 1.55G 0 test 161G 21.6T 13.3K 0 1.64G 0 So sequential reading gives steady thruput but numbers are a little bit lower than expected. Sequential writing is still jumpy with single or multiple dd streams for pool with many disk drives. Lets destroy the pool and create a new one, smaller one. # zpool create -f test c1t0d0 c2t0d0 c3t0d0 c4t0d0 c5t0d0 c6t0d0 # zfs set atime=off test # dd if=/dev/zero of=/test/q1 bs=1024k ^C15905+0 records in 15905+0 records out # zpool iostat 1 capacity operations bandwidth pool used avail read write read write ---------- ----- ----- ----- ----- ----- ----- [...] test 688M 2.72T 0 3.29K 0 401M test 1.01G 2.72T 0 3.69K 0 462M test 1.35G 2.72T 0 3.59K 0 450M test 1.35G 2.72T 0 2.95K 0 372M test 2.03G 2.72T 0 3.37K 0 428M test 2.03G 2.72T 0 1.94K 0 248M test 2.71G 2.72T 0 2.44K 0 301M test 2.71G 2.72T 0 3.88K 0 497M test 2.71G 2.72T 0 3.86K 0 494M test 4.07G 2.71T 0 3.42K 0 425M test 4.07G 2.71T 0 3.89K 0 498M test 4.07G 2.71T 0 3.88K 0 497M test 5.43G 2.71T 0 3.44K 0 429M test 5.43G 2.71T 0 3.94K 0 504M test 5.43G 2.71T 0 3.88K 0 497M test 5.43G 2.71T 0 3.88K 0 497M test 7.62G 2.71T 0 2.34K 0 286M test 7.62G 2.71T 0 4.23K 0 539M test 7.62G 2.71T 0 3.89K 0 498M test 7.62G 2.71T 0 3.87K 0 495M test 7.62G 2.71T 0 3.88K 0 497M test 9.81G 2.71T 0 3.33K 0 418M test 9.81G 2.71T 0 4.12K 0 526M test 9.81G 2.71T 0 3.88K 0 497M Much more steady - interesting. Let''s do it again with yet bigger pool and lets keep distributing disks in "rows" across controllers. 
# zpool create -f test c1t0d0 c2t0d0 c3t0d0 c4t0d0 c5t0d0 c6t0d0 c1t1d0 c2t1d0 c3t1d0 c4t1d0 c5t1d0 c6t1d0 # zfs set atime=off test test 1.35G 5.44T 0 5.42K 0 671M test 2.03G 5.44T 0 7.01K 0 883M test 2.71G 5.43T 0 6.22K 0 786M test 2.71G 5.43T 0 8.09K 0 1.01G test 4.07G 5.43T 0 7.14K 0 902M test 5.43G 5.43T 0 4.02K 0 507M test 5.43G 5.43T 0 5.52K 0 700M test 5.43G 5.43T 0 8.04K 0 1.00G test 5.43G 5.43T 0 7.70K 0 986M test 8.15G 5.43T 0 6.13K 0 769M test 8.15G 5.43T 0 7.77K 0 995M test 8.15G 5.43T 0 7.67K 0 981M test 10.9G 5.43T 0 4.15K 0 517M test 10.9G 5.43T 0 7.74K 0 986M test 10.9G 5.43T 0 7.76K 0 994M test 10.9G 5.43T 0 7.75K 0 993M test 14.9G 5.42T 0 6.79K 0 860M test 14.9G 5.42T 0 7.50K 0 958M test 14.9G 5.42T 0 8.25K 0 1.03G test 14.9G 5.42T 0 7.77K 0 995M test 18.9G 5.42T 0 4.86K 0 614M starting to be more jumpy, but still not as bad as in first case. So lets create a pool out of all disks again but this time lets continue to provide disks in "rows" across controllers. # zpool create -f test c1t0d0 c2t0d0 c3t0d0 c4t0d0 c5t0d0 c6t0d0 c1t1d0 c2t1d0 c3t1d0 c4t1d0 c5t1d0 c6t1d0 c1t2d0 c2t2d0 c3t2d0 c4t2d0 c5t2d0 c6t2d0 c1t3d0 c2t3d0 c3t3d0 c4t3d0 c5t3d0 c6t3d0 c1t4d0 c2t4d0 c3t4d0 c4t4d0 c5t4d0 c6t4d0 c1t5d0 c2t5d0 c3t5d0 c4t5d0 c5t5d0 c6t5d0 c1t6d0 c2t6d0 c3t6d0 c4t6d0 c5t6d0 c6t6d0 c1t7d0 c2t7d0 c3t7d0 c4t7d0 c5t7d0 c6t7d0 # zfs set atime=off test test 862M 21.7T 0 5.81K 0 689M test 1.52G 21.7T 0 5.50K 0 689M test 2.88G 21.7T 0 10.9K 0 1.35G test 2.88G 21.7T 0 0 0 0 test 2.88G 21.7T 0 9.49K 0 1.18G test 5.60G 21.7T 0 11.1K 0 1.38G test 5.60G 21.7T 0 0 0 0 test 5.60G 21.7T 0 0 0 0 test 5.60G 21.7T 0 15.3K 0 1.90G test 9.59G 21.7T 0 15.4K 0 1.91G test 9.59G 21.7T 0 0 0 0 test 9.59G 21.7T 0 0 0 0 test 9.59G 21.7T 0 16.8K 0 2.09G test 13.6G 21.7T 0 8.60K 0 1.06G test 13.6G 21.7T 0 0 0 0 test 13.6G 21.7T 0 4.01K 0 512M test 13.6G 21.7T 0 20.2K 0 2.52G test 17.6G 21.7T 0 2.86K 0 353M test 17.6G 21.7T 0 0 0 0 test 17.6G 21.7T 0 11.6K 0 1.45G test 21.6G 21.7T 0 14.1K 0 1.75G test 21.6G 21.7T 0 0 0 0 test 21.6G 21.7T 0 0 0 0 test 21.6G 21.7T 0 4.74K 0 602M test 21.6G 21.7T 0 17.6K 0 2.20G test 25.6G 21.7T 0 8.00K 0 1008M test 25.6G 21.7T 0 0 0 0 test 25.6G 21.7T 0 0 0 0 test 25.6G 21.7T 0 16.8K 0 2.09G test 25.6G 21.7T 0 15.0K 0 1.86G test 29.6G 21.7T 0 11 0 11.9K Any idea? -- Best regards, Robert Milkowski mailto:milek at task.gda.pl http://milek.blogspot.com
On 28 Jun 08, at 05:14, Robert Milkowski wrote:

> snv_91, 48x 500GB sata drives in one large stripe:
>
> # dd if=/dev/zero of=/test/q1 bs=1024k
> [...]
> Well, doesn't actually look good. Checking with iostat I don't see any
> problems like long service times, etc.

I suspect, a single dd is cpu bound.

> Reducing zfs_txg_synctime to 1 helps a little bit but still it's not
> an even stream of data.
>
> If I start 3 dd streams at the same time then it is slightly better
> (zfs_txg_synctime set back to 5) but still very jumpy.

Try zfs_txg_synctime to 10; that reduces the txg overhead.

> Reading with one dd produces steady throughput but I'm disappointed
> with actual performance:

Again, probably cpu bound. What's "ptime dd..." saying?

> # zpool create -f test c1t0d0 c2t0d0 c3t0d0 c4t0d0 c5t0d0 c6t0d0
> [...]
> Much more steady - interesting.

Now it's disk bound.
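For anyone repeating the experiment, the tunable can be read and changed on a live system with mdb. A sketch, using the variable name from this thread (verify it exists on your build before writing to it):

  # read the current value, in seconds
  echo zfs_txg_synctime/D | mdb -k

  # set it to 10 on the running kernel (0t marks decimal input)
  echo zfs_txg_synctime/W0t10 | mdb -kw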
Hello Roch, Saturday, June 28, 2008, 11:25:17 AM, you wrote: RB> I suspect, a single dd is cpu bound. I don''t think so. Se below one with a stripe of 48x disks again. Single dd with 1024k block size and 64GB to write. bash-3.2# zpool iostat 1 capacity operations bandwidth pool used avail read write read write ---------- ----- ----- ----- ----- ----- ----- test 333K 21.7T 1 1 147K 147K test 333K 21.7T 0 0 0 0 test 333K 21.7T 0 0 0 0 test 333K 21.7T 0 0 0 0 test 333K 21.7T 0 0 0 0 test 333K 21.7T 0 0 0 0 test 333K 21.7T 0 0 0 0 test 333K 21.7T 0 0 0 0 test 333K 21.7T 0 1.60K 0 204M test 333K 21.7T 0 20.5K 0 2.55G test 4.00G 21.7T 0 9.19K 0 1.13G test 4.00G 21.7T 0 0 0 0 test 4.00G 21.7T 0 1.78K 0 228M test 4.00G 21.7T 0 12.5K 0 1.55G test 7.99G 21.7T 0 16.2K 0 2.01G test 7.99G 21.7T 0 0 0 0 test 7.99G 21.7T 0 13.4K 0 1.68G test 12.0G 21.7T 0 4.31K 0 530M test 12.0G 21.7T 0 0 0 0 test 12.0G 21.7T 0 6.91K 0 882M test 12.0G 21.7T 0 21.8K 0 2.72G test 16.0G 21.7T 0 839 0 88.4M test 16.0G 21.7T 0 0 0 0 test 16.0G 21.7T 0 4.42K 0 565M test 16.0G 21.7T 0 18.5K 0 2.31G test 20.0G 21.7T 0 8.87K 0 1.10G test 20.0G 21.7T 0 0 0 0 test 20.0G 21.7T 0 12.2K 0 1.52G test 24.0G 21.7T 0 9.28K 0 1.14G test 24.0G 21.7T 0 0 0 0 test 24.0G 21.7T 0 0 0 0 test 24.0G 21.7T 0 0 0 0 test 24.0G 21.7T 0 14.5K 0 1.81G test 28.0G 21.7T 0 10.1K 63.6K 1.25G test 28.0G 21.7T 0 0 0 0 test 28.0G 21.7T 0 10.7K 0 1.34G test 32.0G 21.7T 0 13.6K 63.2K 1.69G test 32.0G 21.7T 0 0 0 0 test 32.0G 21.7T 0 0 0 0 test 32.0G 21.7T 0 11.1K 0 1.39G test 36.0G 21.7T 0 19.9K 0 2.48G test 36.0G 21.7T 0 0 0 0 test 36.0G 21.7T 0 0 0 0 test 36.0G 21.7T 0 17.7K 0 2.21G test 40.0G 21.7T 0 5.42K 63.1K 680M test 40.0G 21.7T 0 0 0 0 test 40.0G 21.7T 0 6.62K 0 844M test 44.0G 21.7T 1 19.8K 125K 2.46G test 44.0G 21.7T 0 0 0 0 test 44.0G 21.7T 0 0 0 0 test 44.0G 21.7T 0 18.0K 0 2.24G test 47.9G 21.7T 1 13.2K 127K 1.63G test 47.9G 21.7T 0 0 0 0 test 47.9G 21.7T 0 0 0 0 test 47.9G 21.7T 0 15.6K 0 1.94G test 47.9G 21.7T 1 16.1K 126K 1.99G test 51.9G 21.7T 0 0 0 0 test 51.9G 21.7T 0 0 0 0 test 51.9G 21.7T 0 14.2K 0 1.77G test 55.9G 21.7T 0 14.0K 63.2K 1.73G test 55.9G 21.7T 0 0 0 0 test 55.9G 21.7T 0 0 0 0 test 55.9G 21.7T 0 16.3K 0 2.04G test 59.9G 21.7T 0 14.5K 63.2K 1.80G test 59.9G 21.7T 0 0 0 0 test 59.9G 21.7T 0 0 0 0 test 59.9G 21.7T 0 17.7K 0 2.21G test 63.9G 21.7T 0 4.84K 62.6K 603M test 63.9G 21.7T 0 0 0 0 test 63.9G 21.7T 0 0 0 0 test 63.9G 21.7T 0 0 0 0 test 63.9G 21.7T 0 0 0 0 test 63.9G 21.7T 0 0 0 0 test 63.9G 21.7T 0 0 0 0 test 63.9G 21.7T 0 0 0 0 ^C bash-3.2# bash-3.2# ptime dd if=/dev/zero of=/test/q1 bs=1024k count=65536 65536+0 records in 65536+0 records out real 1:06.312 user 0.074 sys 54.060 bash-3.2# Doesn''t look like it''s CPU bound. 
Let''s try to read the file after zpool export test; zpool import test bash-3.2# zpool iostat 1 capacity operations bandwidth pool used avail read write read write ---------- ----- ----- ----- ----- ----- ----- test 64.0G 21.7T 15 46 1.22M 1.76M test 64.0G 21.7T 0 0 0 0 test 64.0G 21.7T 0 0 0 0 test 64.0G 21.7T 6.64K 0 849M 0 test 64.0G 21.7T 10.2K 0 1.27G 0 test 64.0G 21.7T 10.7K 0 1.33G 0 test 64.0G 21.7T 9.91K 0 1.24G 0 test 64.0G 21.7T 10.1K 0 1.27G 0 test 64.0G 21.7T 10.7K 0 1.33G 0 test 64.0G 21.7T 10.4K 0 1.30G 0 test 64.0G 21.7T 10.6K 0 1.33G 0 test 64.0G 21.7T 10.6K 0 1.33G 0 test 64.0G 21.7T 10.2K 0 1.27G 0 test 64.0G 21.7T 10.3K 0 1.29G 0 test 64.0G 21.7T 10.5K 0 1.31G 0 test 64.0G 21.7T 9.16K 0 1.14G 0 test 64.0G 21.7T 1.98K 0 253M 0 test 64.0G 21.7T 2.48K 0 317M 0 test 64.0G 21.7T 1.98K 0 253M 0 test 64.0G 21.7T 1.98K 0 254M 0 test 64.0G 21.7T 2.23K 0 285M 0 test 64.0G 21.7T 1.73K 0 221M 0 test 64.0G 21.7T 2.23K 0 285M 0 test 64.0G 21.7T 2.23K 0 285M 0 test 64.0G 21.7T 1.49K 0 191M 0 test 64.0G 21.7T 2.47K 0 317M 0 test 64.0G 21.7T 1.46K 0 186M 0 test 64.0G 21.7T 2.01K 0 258M 0 test 64.0G 21.7T 1.98K 0 254M 0 test 64.0G 21.7T 1.97K 0 253M 0 test 64.0G 21.7T 2.23K 0 286M 0 test 64.0G 21.7T 1.98K 0 254M 0 test 64.0G 21.7T 1.73K 0 221M 0 test 64.0G 21.7T 1.98K 0 254M 0 test 64.0G 21.7T 2.42K 0 310M 0 test 64.0G 21.7T 1.78K 0 228M 0 test 64.0G 21.7T 2.23K 0 285M 0 test 64.0G 21.7T 1.67K 0 214M 0 test 64.0G 21.7T 1.80K 0 230M 0 test 64.0G 21.7T 2.23K 0 285M 0 test 64.0G 21.7T 2.47K 0 317M 0 test 64.0G 21.7T 1.73K 0 221M 0 test 64.0G 21.7T 1.99K 0 254M 0 test 64.0G 21.7T 1.24K 0 159M 0 test 64.0G 21.7T 2.47K 0 316M 0 test 64.0G 21.7T 2.47K 0 317M 0 test 64.0G 21.7T 1.99K 0 254M 0 test 64.0G 21.7T 2.23K 0 285M 0 test 64.0G 21.7T 1.73K 0 221M 0 test 64.0G 21.7T 2.48K 0 317M 0 test 64.0G 21.7T 2.48K 0 317M 0 test 64.0G 21.7T 1.49K 0 190M 0 test 64.0G 21.7T 2.23K 0 285M 0 test 64.0G 21.7T 2.23K 0 285M 0 test 64.0G 21.7T 1.81K 0 232M 0 test 64.0G 21.7T 1.90K 0 243M 0 test 64.0G 21.7T 2.48K 0 317M 0 test 64.0G 21.7T 1.49K 0 191M 0 test 64.0G 21.7T 2.47K 0 317M 0 test 64.0G 21.7T 1.99K 0 254M 0 test 64.0G 21.7T 1.97K 0 253M 0 test 64.0G 21.7T 1.49K 0 190M 0 test 64.0G 21.7T 2.23K 0 286M 0 test 64.0G 21.7T 1.82K 0 232M 0 test 64.0G 21.7T 2.15K 0 275M 0 test 64.0G 21.7T 2.22K 0 285M 0 test 64.0G 21.7T 1.73K 0 222M 0 test 64.0G 21.7T 2.23K 0 286M 0 test 64.0G 21.7T 1.90K 0 244M 0 test 64.0G 21.7T 1.81K 0 231M 0 test 64.0G 21.7T 2.23K 0 285M 0 test 64.0G 21.7T 1.97K 0 252M 0 test 64.0G 21.7T 2.00K 0 255M 0 test 64.0G 21.7T 8.42K 0 1.05G 0 test 64.0G 21.7T 10.3K 0 1.29G 0 test 64.0G 21.7T 10.2K 0 1.28G 0 test 64.0G 21.7T 10.4K 0 1.30G 0 test 64.0G 21.7T 10.4K 0 1.30G 0 test 64.0G 21.7T 10.4K 0 1.30G 0 test 64.0G 21.7T 10.2K 0 1.27G 0 test 64.0G 21.7T 10.4K 0 1.30G 0 test 64.0G 21.7T 10.6K 0 1.32G 0 test 64.0G 21.7T 10.5K 0 1.31G 0 test 64.0G 21.7T 10.4K 0 1.30G 0 test 64.0G 21.7T 9.23K 0 1.15G 0 test 64.0G 21.7T 10.5K 0 1.31G 0 test 64.0G 21.7T 10.0K 0 1.25G 0 test 64.0G 21.7T 9.55K 0 1.19G 0 test 64.0G 21.7T 10.2K 0 1.27G 0 test 64.0G 21.7T 10.0K 0 1.25G 0 test 64.0G 21.7T 9.91K 0 1.24G 0 test 64.0G 21.7T 10.6K 0 1.32G 0 test 64.0G 21.7T 9.24K 0 1.15G 0 test 64.0G 21.7T 10.1K 0 1.26G 0 test 64.0G 21.7T 10.3K 0 1.29G 0 test 64.0G 21.7T 10.3K 0 1.29G 0 test 64.0G 21.7T 10.6K 0 1.33G 0 test 64.0G 21.7T 10.6K 0 1.33G 0 test 64.0G 21.7T 8.54K 0 1.07G 0 test 64.0G 21.7T 0 0 0 0 test 64.0G 21.7T 0 0 0 0 test 64.0G 21.7T 0 0 0 0 ^C bash-3.2# ptime dd if=/test/q1 of=/dev/null bs=1024k 65536+0 records 
in 65536+0 records out real 1:36.732 user 0.046 sys 48.069 bash-3.2# Well, that drop for several dozen seconds was interesting... Lets run it again without export/import: bash-3.2# zpool iostat 1 capacity operations bandwidth pool used avail read write read write ---------- ----- ----- ----- ----- ----- ----- test 64.0G 21.7T 3.00K 6 384M 271K test 64.0G 21.7T 0 0 0 0 test 64.0G 21.7T 2.58K 0 330M 0 test 64.0G 21.7T 6.02K 0 771M 0 test 64.0G 21.7T 8.37K 0 1.05G 0 test 64.0G 21.7T 10.4K 0 1.30G 0 test 64.0G 21.7T 10.4K 0 1.30G 0 test 64.0G 21.7T 9.64K 0 1.20G 0 test 64.0G 21.7T 10.5K 0 1.31G 0 test 64.0G 21.7T 10.6K 0 1.32G 0 test 64.0G 21.7T 10.6K 0 1.33G 0 test 64.0G 21.7T 10.4K 0 1.30G 0 test 64.0G 21.7T 9.65K 0 1.21G 0 test 64.0G 21.7T 9.84K 0 1.23G 0 test 64.0G 21.7T 9.22K 0 1.15G 0 test 64.0G 21.7T 10.9K 0 1.36G 0 test 64.0G 21.7T 10.9K 0 1.36G 0 test 64.0G 21.7T 10.4K 0 1.30G 0 test 64.0G 21.7T 10.7K 0 1.34G 0 test 64.0G 21.7T 10.6K 0 1.33G 0 test 64.0G 21.7T 10.9K 0 1.36G 0 test 64.0G 21.7T 10.6K 0 1.32G 0 test 64.0G 21.7T 10.4K 0 1.30G 0 test 64.0G 21.7T 10.7K 0 1.34G 0 test 64.0G 21.7T 10.5K 0 1.32G 0 test 64.0G 21.7T 10.6K 0 1.32G 0 test 64.0G 21.7T 10.8K 0 1.34G 0 test 64.0G 21.7T 10.4K 0 1.29G 0 test 64.0G 21.7T 10.5K 0 1.31G 0 test 64.0G 21.7T 9.15K 0 1.14G 0 test 64.0G 21.7T 10.8K 0 1.35G 0 test 64.0G 21.7T 9.76K 0 1.22G 0 test 64.0G 21.7T 8.67K 0 1.08G 0 test 64.0G 21.7T 10.8K 0 1.36G 0 test 64.0G 21.7T 10.9K 0 1.36G 0 test 64.0G 21.7T 10.3K 0 1.28G 0 test 64.0G 21.7T 9.76K 0 1.22G 0 test 64.0G 21.7T 10.5K 0 1.31G 0 test 64.0G 21.7T 10.6K 0 1.33G 0 test 64.0G 21.7T 9.23K 0 1.15G 0 test 64.0G 21.7T 9.63K 0 1.20G 0 test 64.0G 21.7T 9.79K 0 1.22G 0 test 64.0G 21.7T 10.2K 0 1.28G 0 test 64.0G 21.7T 10.4K 0 1.30G 0 test 64.0G 21.7T 10.3K 0 1.29G 0 test 64.0G 21.7T 10.2K 0 1.28G 0 test 64.0G 21.7T 10.6K 0 1.33G 0 test 64.0G 21.7T 10.8K 0 1.35G 0 test 64.0G 21.7T 10.5K 0 1.32G 0 test 64.0G 21.7T 11.0K 0 1.37G 0 test 64.0G 21.7T 10.2K 0 1.27G 0 test 64.0G 21.7T 9.69K 0 1.21G 0 test 64.0G 21.7T 6.07K 0 777M 0 test 64.0G 21.7T 0 0 0 0 test 64.0G 21.7T 0 0 0 0 test 64.0G 21.7T 0 0 0 0 ^C bash-3.2# bash-3.2# ptime dd if=/test/q1 of=/dev/null bs=1024k 65536+0 records in 65536+0 records out real 50.521 user 0.043 sys 48.971 bash-3.2# Now looks like reading from the pool using single dd is actually CPU bound. Reading the same file again and again does produce, more or less, consistent timing. However every time I export/import the pool during the first read there is that drop in throughput during first read and total time increases to almost 100 seconds.... some meta-data? (of course there are no errors oof any sort, etc.)>> Reducing zfs_txg_synctime to 1 helps a little bit but still it''s not >> even stream of data. >> >> If I start 3 dd streams at the same time then it is slightly better >> (zfs_txg_synctime set back to 5) but still very jumpy. >>RB> Try zfs_txg_synctime to 10; that reduces the txg overhead. Doesn''t help... [...] test 13.6G 21.7T 0 0 0 0 test 13.6G 21.7T 0 8.46K 0 1.05G test 17.6G 21.7T 0 19.3K 0 2.40G test 17.6G 21.7T 0 0 0 0 test 17.6G 21.7T 0 0 0 0 test 17.6G 21.7T 0 8.04K 0 1022M test 17.6G 21.7T 0 20.2K 0 2.51G test 21.6G 21.7T 0 76 0 249K test 21.6G 21.7T 0 0 0 0 test 21.6G 21.7T 0 0 0 0 test 21.6G 21.7T 0 10.1K 0 1.25G test 25.6G 21.7T 0 18.6K 0 2.31G test 25.6G 21.7T 0 0 0 0 test 25.6G 21.7T 0 0 0 0 test 25.6G 21.7T 0 6.34K 0 810M test 25.6G 21.7T 0 19.9K 0 2.48G test 29.6G 21.7T 0 88 63.2K 354K [...] 
bash-3.2# ptime dd if=/dev/zero of=/test/q1 bs=1024k count=65536 65536+0 records in 65536+0 records out real 1:10.074 user 0.074 sys 52.250 bash-3.2# Increasing it even further (up-to 32s) doesn''t help either. However lowering it to 1s gives: [...] test 2.43G 21.7T 0 8.62K 0 1.07G test 4.46G 21.7T 0 7.23K 0 912M test 4.46G 21.7T 0 624 0 77.9M test 6.66G 21.7T 0 10.7K 0 1.33G test 6.66G 21.7T 0 6.66K 0 850M test 8.86G 21.7T 0 10.6K 0 1.31G test 8.86G 21.7T 0 1.96K 0 251M test 11.2G 21.7T 0 16.5K 0 2.04G test 11.2G 21.7T 0 0 0 0 test 11.2G 21.7T 0 18.6K 0 2.31G test 13.5G 21.7T 0 11 0 11.9K test 13.5G 21.7T 0 2.60K 0 332M test 13.5G 21.7T 0 19.1K 0 2.37G test 16.3G 21.7T 0 11 0 11.9K test 16.3G 21.7T 0 9.61K 0 1.20G test 18.4G 21.7T 0 7.41K 0 936M test 18.4G 21.7T 0 11.6K 0 1.45G test 20.3G 21.7T 0 3.26K 0 407M test 20.3G 21.7T 0 7.66K 0 977M test 22.5G 21.7T 0 7.62K 0 963M test 22.5G 21.7T 0 6.86K 0 875M test 24.5G 21.7T 0 8.41K 0 1.04G test 24.5G 21.7T 0 10.4K 0 1.30G test 26.5G 21.7T 1 2.19K 127K 270M test 26.5G 21.7T 0 0 0 0 test 26.5G 21.7T 0 4.56K 0 584M test 28.5G 21.7T 0 11.5K 0 1.42G [...] bash-3.2# ptime dd if=/dev/zero of=/test/q1 bs=1024k count=65536 65536+0 records in 65536+0 records out real 1:09.541 user 0.072 sys 53.421 bash-3.2# Looks slightly less jumpy but the total real time is about the same so average throughput is actually the same (about 1GB/s).>> Reading with one dd produces steady throghput but I''m disapointed with >> actual performance: >>RB> Again, probably cpu bound. What''s "ptime dd..." saying ? You were right here. Reading with single dd seems to be cpu bound. However multiple streams for reading do not seem to increase performance considerably. Nevertheless the main issu is jumpy writing... -- Best regards, Robert Milkowski mailto:milek at task.gda.pl http://milek.blogspot.com
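A quick way to settle the "is a single dd cpu bound?" question is per-thread microstate accounting. A sketch, assuming the dd of interest is the only one running:

  # USR+SYS near 100% for the dd thread means it is CPU bound;
  # significant time under LCK/TFL/DFL points somewhere else
  prstat -mL -p `pgrep -x dd` 1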
Hello Robert, Tuesday, July 1, 2008, 12:01:03 AM, you wrote: RM> Nevertheless the main issu is jumpy writing... I was just wondering how much thruoughput I can get running multiple dd - one per disk drive and what kind of aggregated throughput I would get. So for each out of 48 disks I did: dd if=/dev/zero of=/dev/rdsk/c6t7d0s0 bs=128k& The iostat looks like: bash-3.2# iostat -xnzC 1|egrep " c[0-6]$|devic" [skipped the first output] extended device statistics r/s w/s kr/s kw/s wait actv wsvc_t asvc_t %w %b device 0.0 5308.0 0.0 679418.9 0.1 7.2 0.0 1.4 0 718 c1 0.0 5264.2 0.0 673813.1 0.1 7.2 0.0 1.4 0 720 c2 0.0 4047.6 0.0 518095.1 0.1 7.3 0.0 1.8 0 725 c3 0.0 5340.1 0.0 683532.5 0.1 7.2 0.0 1.3 0 718 c4 0.0 5325.1 0.0 681608.0 0.1 7.1 0.0 1.3 0 714 c5 0.0 4089.3 0.0 523434.0 0.1 7.3 0.0 1.8 0 727 c6 extended device statistics r/s w/s kr/s kw/s wait actv wsvc_t asvc_t %w %b device 0.0 5283.1 0.0 676231.2 0.1 7.2 0.0 1.4 0 723 c1 0.0 5215.2 0.0 667549.5 0.1 7.2 0.0 1.4 0 720 c2 0.0 4009.0 0.0 513152.8 0.1 7.3 0.0 1.8 0 725 c3 0.0 5281.9 0.0 676082.5 0.1 7.2 0.0 1.4 0 722 c4 0.0 5316.6 0.0 680520.9 0.1 7.2 0.0 1.4 0 720 c5 0.0 4159.5 0.0 532420.9 0.1 7.3 0.0 1.7 0 726 c6 extended device statistics r/s w/s kr/s kw/s wait actv wsvc_t asvc_t %w %b device 0.0 5322.0 0.0 681213.6 0.1 7.2 0.0 1.4 0 720 c1 0.0 5292.9 0.0 677494.0 0.1 7.2 0.0 1.4 0 722 c2 0.0 4051.4 0.0 518573.3 0.1 7.3 0.0 1.8 0 727 c3 0.0 5315.0 0.0 680318.8 0.1 7.2 0.0 1.4 0 721 c4 0.0 5313.1 0.0 680074.3 0.1 7.2 0.0 1.4 0 723 c5 0.0 4184.8 0.0 535648.7 0.1 7.3 0.0 1.7 0 730 c6 extended device statistics r/s w/s kr/s kw/s wait actv wsvc_t asvc_t %w %b device 0.0 5296.4 0.0 677940.2 0.1 7.1 0.0 1.3 0 714 c1 0.0 5236.4 0.0 670265.3 0.1 7.2 0.0 1.4 0 720 c2 0.0 4023.5 0.0 515011.5 0.1 7.3 0.0 1.8 0 728 c3 0.0 5291.4 0.0 677300.7 0.1 7.2 0.0 1.4 0 723 c4 0.0 5297.4 0.0 678072.8 0.1 7.2 0.0 1.4 0 720 c5 0.0 4095.6 0.0 524236.0 0.1 7.3 0.0 1.8 0 726 c6 ^C one full output: extended device statistics r/s w/s kr/s kw/s wait actv wsvc_t asvc_t %w %b device 0.0 5302.0 0.0 678658.6 0.1 7.2 0.0 1.4 0 722 c1 0.0 664.0 0.0 84992.8 0.0 0.9 0.0 1.4 1 90 c1t0d0 0.0 657.0 0.0 84090.5 0.0 0.9 0.0 1.3 1 89 c1t1d0 0.0 666.0 0.0 85251.4 0.0 0.9 0.0 1.3 1 89 c1t2d0 0.0 662.0 0.0 84735.6 0.0 0.9 0.0 1.4 1 91 c1t3d0 0.0 669.1 0.0 85638.4 0.0 0.9 0.0 1.4 1 92 c1t4d0 0.0 665.0 0.0 85122.9 0.0 0.9 0.0 1.4 1 91 c1t5d0 0.0 652.9 0.0 83575.1 0.0 0.9 0.0 1.4 1 90 c1t6d0 0.0 666.0 0.0 85251.8 0.0 0.9 0.0 1.4 1 91 c1t7d0 0.0 5293.3 0.0 677537.5 0.1 7.3 0.0 1.4 0 725 c2 0.0 660.0 0.0 84481.2 0.0 0.9 0.0 1.4 1 91 c2t0d0 0.0 661.0 0.0 84610.3 0.0 0.9 0.0 1.4 1 90 c2t1d0 0.0 664.0 0.0 84997.4 0.0 0.9 0.0 1.4 1 90 c2t2d0 0.0 662.0 0.0 84739.4 0.0 0.9 0.0 1.4 1 92 c2t3d0 0.0 655.0 0.0 83836.6 0.0 0.9 0.0 1.4 1 89 c2t4d0 0.0 663.1 0.0 84871.3 0.0 0.9 0.0 1.4 1 90 c2t5d0 0.0 663.1 0.0 84871.5 0.0 0.9 0.0 1.4 1 92 c2t6d0 0.0 665.1 0.0 85129.7 0.0 0.9 0.0 1.4 1 92 c2t7d0 0.0 4072.1 0.0 521228.9 0.1 7.3 0.0 1.8 0 728 c3 0.0 506.9 0.0 64879.3 0.0 0.9 0.0 1.8 1 90 c3t0d0 0.0 513.9 0.0 65782.4 0.0 0.9 0.0 1.8 1 92 c3t1d0 0.0 511.9 0.0 65524.4 0.0 0.9 0.0 1.8 1 91 c3t2d0 0.0 505.9 0.0 64750.5 0.0 0.9 0.0 1.8 1 91 c3t3d0 0.0 502.8 0.0 64363.6 0.0 0.9 0.0 1.8 1 90 c3t4d0 0.0 506.9 0.0 64879.6 0.0 0.9 0.0 1.8 1 91 c3t5d0 0.0 513.9 0.0 65782.6 0.0 0.9 0.0 1.8 1 92 c3t6d0 0.0 509.9 0.0 65266.6 0.0 0.9 0.0 1.8 1 91 c3t7d0 0.0 5298.7 0.0 678232.6 0.1 7.3 0.0 1.4 0 725 c4 0.0 664.1 0.0 85001.4 0.0 0.9 0.0 1.4 1 92 c4t0d0 0.0 662.1 0.0 84743.4 0.0 0.9 0.0 1.4 1 90 c4t1d0 0.0 663.1 0.0 
84872.4 0.0 0.9 0.0 1.4 1 92 c4t2d0
0.0 664.1 0.0 85001.4 0.0 0.9 0.0 1.3 1 88 c4t3d0
0.0 657.1 0.0 84105.4 0.0 0.9 0.0 1.4 1 91 c4t4d0
0.0 658.1 0.0 84234.5 0.0 0.9 0.0 1.4 1 91 c4t5d0
0.0 669.2 0.0 85653.4 0.0 0.9 0.0 1.3 1 90 c4t6d0
0.0 661.1 0.0 84620.5 0.0 0.9 0.0 1.4 1 91 c4t7d0
0.0 5314.1 0.0 680209.2 0.1 7.2 0.0 1.3 0 717 c5
0.0 666.1 0.0 85265.7 0.0 0.9 0.0 1.3 1 89 c5t0d0
0.0 662.1 0.0 84749.8 0.0 0.9 0.0 1.3 1 88 c5t1d0
0.0 660.1 0.0 84491.8 0.0 0.9 0.0 1.3 1 89 c5t2d0
0.0 665.2 0.0 85140.3 0.0 0.9 0.0 1.3 1 89 c5t3d0
0.0 668.2 0.0 85527.3 0.0 0.9 0.0 1.4 1 92 c5t4d0
0.0 666.2 0.0 85269.5 0.0 0.9 0.0 1.3 1 89 c5t5d0
0.0 664.2 0.0 85011.4 0.0 0.9 0.0 1.4 1 91 c5t6d0
0.0 662.1 0.0 84753.5 0.0 0.9 0.0 1.4 1 90 c5t7d0
0.0 4229.8 0.0 541418.9 0.1 7.3 0.0 1.7 0 726 c6
0.0 518.0 0.0 66306.4 0.0 0.9 0.0 1.7 1 89 c6t0d0
0.0 533.1 0.0 68241.7 0.0 0.9 0.0 1.7 1 91 c6t1d0
0.0 531.1 0.0 67983.6 0.0 0.9 0.0 1.7 1 91 c6t2d0
0.0 524.1 0.0 67080.6 0.0 0.9 0.0 1.7 1 90 c6t3d0
0.0 540.2 0.0 69144.7 0.0 0.9 0.0 1.7 1 92 c6t4d0
0.0 525.1 0.0 67209.8 0.0 0.9 0.0 1.7 1 90 c6t5d0
0.0 535.2 0.0 68500.0 0.0 0.9 0.0 1.7 1 92 c6t6d0
0.0 523.1 0.0 66952.1 0.0 0.9 0.0 1.7 1 90 c6t7d0

bash-3.2# bc
scale=4
678658.6+677537.5+521228.9+678232.6+680209.2+541418.9
3777285.7
3777285.7/(1024*1024)
3.6023
bash-3.2#

So it's about 3.6 GB/s of aggregate write throughput at the device level - pretty good :)
Average throughput with one large striped pool using zfs is less than
half of the above performance... :(
And yes, even with multiple dd streams to the same pool.

Additionally, turning checksums off helps:

bash-3.2# zpool iostat 1
capacity operations bandwidth
pool used avail read write read write
---------- ----- ----- ----- ----- ----- -----
test 366G 21.4T 0 10.7K 43.2K 1.33G
test 370G 21.4T 0 14.7K 63.4K 1.82G
test 370G 21.4T 0 22.0K 0 2.69G
test 374G 21.4T 0 12.4K 0 1.54G
test 374G 21.4T 0 23.6K 0 2.91G
test 378G 21.4T 0 12.5K 0 1.53G
test 378G 21.4T 0 17.3K 0 2.13G
test 382G 21.4T 1 16.6K 126K 2.05G
test 382G 21.4T 2 17.7K 190K 2.19G
test 386G 21.4T 0 20.4K 0 2.51G
test 390G 21.4T 11 11.6K 762K 1.44G
test 390G 21.4T 0 28.9K 0 3.55G
test 394G 21.4T 2 12.5K 157K 1.51G
test 398G 21.4T 1 20.0K 127K 2.49G
test 398G 21.4T 0 16.3K 0 2.00G
test 402G 21.4T 4 15.3K 311K 1.90G
test 402G 21.4T 0 21.9K 0 2.70G
test 406G 21.4T 4 9.73K 314K 1.19G
test 406G 21.4T 0 22.7K 0 2.78G
test 410G 21.4T 2 14.4K 131K 1.78G
test 414G 21.3T 0 19.9K 61.4K 2.43G
test 414G 21.3T 0 19.1K 0 2.35G
^C
bash-3.2# zfs set checksum=on test
bash-3.2# zpool iostat 1
capacity operations bandwidth
pool used avail read write read write
---------- ----- ----- ----- ----- ----- -----
test 439G 21.3T 0 11.4K 50.2K 1.41G
test 439G 21.3T 0 5.52K 0 702M
test 439G 21.3T 0 24.6K 0 3.07G
test 443G 21.3T 0 13.7K 0 1.70G
test 447G 21.3T 1 13.1K 123K 1.62G
test 447G 21.3T 0 16.1K 0 2.00G
test 451G 21.3T 1 3.97K 116K 498M
test 451G 21.3T 0 17.5K 0 2.19G
test 455G 21.3T 1 12.4K 66.9K 1.54G
test 455G 21.3T 0 13.0K 0 1.60G
test 459G 21.3T 0 11 0 11.9K
test 459G 21.3T 0 16.8K 0 2.09G
test 463G 21.3T 0 9.34K 0 1.16G
test 467G 21.3T 0 15.4K 0 1.91G
test 467G 21.3T 0 16.3K 0 2.03G
test 471G 21.3T 0 9.67K 0 1.20G
test 475G 21.3T 0 17.3K 0 2.13G
test 475G 21.3T 0 3.71K 0 472M
test 475G 21.3T 0 21.9K 0 2.73G
test 479G 21.3T 0 17.4K 0 2.16G
test 483G 21.3T 0 848 0 96.4M
test 483G 21.3T 0 17.4K 0 2.17G
^C
bash-3.2#
bash-3.2# zpool iostat 5
capacity operations bandwidth
pool used avail read write read write
---------- ----- ----- ----- ----- ----- -----
test 582G 21.2T 0 11.8K 44.4K 1.46G
test 590G 21.2T 1 13.8K 76.5K 1.72G
test 598G 21.2T 1 12.4K 102K 1.54G
test 610G 21.2T 1 14.0K 76.7K 1.73G
test 618G 21.1T 0 12.9K 25.5K 1.59G
test 626G 21.1T 0 14.8K 11.1K 1.83G
test 634G 21.1T 0 14.2K 11.9K 1.76G
test 642G 21.1T 0 12.8K 12.8K 1.59G
test 650G 21.1T 0 12.9K 12.8K 1.60G
^C
bash-3.2# zfs set checksum=off test
bash-3.2# zpool iostat 5
capacity operations bandwidth
pool used avail read write read write
---------- ----- ----- ----- ----- ----- -----
test 669G 21.1T 0 12.0K 43.5K 1.48G
test 681G 21.1T 0 17.7K 25.2K 2.18G
test 693G 21.1T 0 16.0K 12.7K 1.97G
test 701G 21.1T 0 19.4K 25.5K 2.38G
test 713G 21.1T 0 16.6K 12.8K 2.03G
test 725G 21.0T 0 17.8K 24.9K 2.18G
test 737G 21.0T 0 17.2K 12.7K 2.11G
test 745G 21.0T 0 19.0K 38.3K 2.34G
test 757G 21.0T 0 16.9K 12.8K 2.08G
test 769G 21.0T 0 17.6K 50.7K 2.16G
^C
bash-3.2#

So without checksums it is much better, but the write rate is still
jumpy rather than a steady, constant stream - especially at a 1-second
iostat resolution.

--
Best regards,
Robert Milkowski  mailto:milek at task.gda.pl
http://milek.blogspot.com
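A rough way to put a number on the "jumpy" behaviour described above is to post-process the zpool iostat output itself. The sketch below is illustrative only: it assumes a pool named test, the seven-column single-pool layout shown in the listings above (write bandwidth in the seventh field), only K/M/G suffixes in that field, and nawk being available; the first sample is skipped because it reports the since-boot average.

zpool iostat test 1 61 | nawk '
    NR > 4 {                              # skip 3 header lines + the first (cumulative) sample
        v = $7 + 0                        # numeric part, e.g. 1.33 from "1.33G"
        if      ($7 ~ /G$/) v *= 1024     # normalise everything to MB/s
        else if ($7 ~ /K$/) v /= 1024
        else if ($7 !~ /M$/) v /= 1048576 # plain byte counts
        sum += v; sumsq += v * v; n++
    }
    END {
        if (n == 0) exit
        mean = sum / n
        var  = sumsq / n - mean * mean
        if (var < 0) var = 0
        printf("%d samples: mean %.0f MB/s, stddev %.0f MB/s\n", n, mean, sqrt(var))
    }'

The same data sampled at a 5-second interval naturally looks smoother, which is essentially what the zpool iostat 5 runs above show.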
Robert Milkowski writes:
> Hello Roch,
>
> Saturday, June 28, 2008, 11:25:17 AM, you wrote:
>
>
> RB> I suspect, a single dd is cpu bound.
>
> I don't think so.
>

We're nearly so, as you show. More below.

> See below, one with a stripe of 48x disks again. Single dd with 1024k
> block size and 64GB to write.
>
> bash-3.2# zpool iostat 1
> capacity operations bandwidth
> pool used avail read write read write
> ---------- ----- ----- ----- ----- ----- -----
> test 333K 21.7T 1 1 147K 147K
> test 333K 21.7T 0 0 0 0
> test 333K 21.7T 0 0 0 0
> test 333K 21.7T 0 0 0 0
> test 333K 21.7T 0 0 0 0
> test 333K 21.7T 0 0 0 0
> test 333K 21.7T 0 0 0 0
> test 333K 21.7T 0 0 0 0
> test 333K 21.7T 0 1.60K 0 204M
> test 333K 21.7T 0 20.5K 0 2.55G
> test 4.00G 21.7T 0 9.19K 0 1.13G
> test 4.00G 21.7T 0 0 0 0
> test 4.00G 21.7T 0 1.78K 0 228M
> test 4.00G 21.7T 0 12.5K 0 1.55G
> test 7.99G 21.7T 0 16.2K 0 2.01G
> test 7.99G 21.7T 0 0 0 0
> test 7.99G 21.7T 0 13.4K 0 1.68G
> test 12.0G 21.7T 0 4.31K 0 530M
> test 12.0G 21.7T 0 0 0 0
> test 12.0G 21.7T 0 6.91K 0 882M
> test 12.0G 21.7T 0 21.8K 0 2.72G
> test 16.0G 21.7T 0 839 0 88.4M
> test 16.0G 21.7T 0 0 0 0
> test 16.0G 21.7T 0 4.42K 0 565M
> test 16.0G 21.7T 0 18.5K 0 2.31G
> test 20.0G 21.7T 0 8.87K 0 1.10G
> test 20.0G 21.7T 0 0 0 0
> test 20.0G 21.7T 0 12.2K 0 1.52G
> test 24.0G 21.7T 0 9.28K 0 1.14G
> test 24.0G 21.7T 0 0 0 0
> test 24.0G 21.7T 0 0 0 0
> test 24.0G 21.7T 0 0 0 0
> test 24.0G 21.7T 0 14.5K 0 1.81G
> test 28.0G 21.7T 0 10.1K 63.6K 1.25G
> test 28.0G 21.7T 0 0 0 0
> test 28.0G 21.7T 0 10.7K 0 1.34G
> test 32.0G 21.7T 0 13.6K 63.2K 1.69G
> test 32.0G 21.7T 0 0 0 0
> test 32.0G 21.7T 0 0 0 0
> test 32.0G 21.7T 0 11.1K 0 1.39G
> test 36.0G 21.7T 0 19.9K 0 2.48G
> test 36.0G 21.7T 0 0 0 0
> test 36.0G 21.7T 0 0 0 0
> test 36.0G 21.7T 0 17.7K 0 2.21G
> test 40.0G 21.7T 0 5.42K 63.1K 680M
> test 40.0G 21.7T 0 0 0 0
> test 40.0G 21.7T 0 6.62K 0 844M
> test 44.0G 21.7T 1 19.8K 125K 2.46G
> test 44.0G 21.7T 0 0 0 0
> test 44.0G 21.7T 0 0 0 0
> test 44.0G 21.7T 0 18.0K 0 2.24G
> test 47.9G 21.7T 1 13.2K 127K 1.63G
> test 47.9G 21.7T 0 0 0 0
> test 47.9G 21.7T 0 0 0 0
> test 47.9G 21.7T 0 15.6K 0 1.94G
> test 47.9G 21.7T 1 16.1K 126K 1.99G
> test 51.9G 21.7T 0 0 0 0
> test 51.9G 21.7T 0 0 0 0
> test 51.9G 21.7T 0 14.2K 0 1.77G
> test 55.9G 21.7T 0 14.0K 63.2K 1.73G
> test 55.9G 21.7T 0 0 0 0
> test 55.9G 21.7T 0 0 0 0
> test 55.9G 21.7T 0 16.3K 0 2.04G
> test 59.9G 21.7T 0 14.5K 63.2K 1.80G
> test 59.9G 21.7T 0 0 0 0
> test 59.9G 21.7T 0 0 0 0
> test 59.9G 21.7T 0 17.7K 0 2.21G
> test 63.9G 21.7T 0 4.84K 62.6K 603M
> test 63.9G 21.7T 0 0 0 0
> test 63.9G 21.7T 0 0 0 0
> test 63.9G 21.7T 0 0 0 0
> test 63.9G 21.7T 0 0 0 0
> test 63.9G 21.7T 0 0 0 0
> test 63.9G 21.7T 0 0 0 0
> test 63.9G 21.7T 0 0 0 0
> ^C
> bash-3.2#
>
> bash-3.2# ptime dd if=/dev/zero of=/test/q1 bs=1024k count=65536
> 65536+0 records in
> 65536+0 records out
>
> real 1:06.312
> user 0.074
> sys 54.060
> bash-3.2#
>
> Doesn't look like it's CPU bound.
>

So, going by the sys time, we're at 81% of CPU saturation. If you make
this 100% you will still have zeros in the zpool iostat. We might be
waiting on memory pages and a few other locks.
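As a quick cross-check of the "81% of CPU saturation" figure: a single dd is one thread, so it can accumulate at most one CPU-second of sys time per second of wall-clock time. Using the ptime numbers quoted just above (user time, 0.074s, is small enough to ignore) and bc, as elsewhere in this thread:

bash-3.2# echo "scale=4; 54.060 / 66.312" | bc
.8152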
>
>
> Let's try to read the file after zpool export test; zpool import test
>
> bash-3.2# zpool iostat 1
> capacity operations bandwidth
> pool used avail read write read write
> ---------- ----- ----- ----- ----- ----- -----
> test 64.0G 21.7T 15 46 1.22M 1.76M
> test 64.0G 21.7T 0 0 0 0
> test 64.0G 21.7T 0 0 0 0
> test 64.0G 21.7T 6.64K 0 849M 0
> test 64.0G 21.7T 10.2K 0 1.27G 0
> test 64.0G 21.7T 10.7K 0 1.33G 0
> test 64.0G 21.7T 9.91K 0 1.24G 0
> test 64.0G 21.7T 10.1K 0 1.27G 0
> test 64.0G 21.7T 10.7K 0 1.33G 0
> test 64.0G 21.7T 10.4K 0 1.30G 0
> test 64.0G 21.7T 10.6K 0 1.33G 0
> test 64.0G 21.7T 10.6K 0 1.33G 0
> test 64.0G 21.7T 10.2K 0 1.27G 0
> test 64.0G 21.7T 10.3K 0 1.29G 0
> test 64.0G 21.7T 10.5K 0 1.31G 0
> test 64.0G 21.7T 9.16K 0 1.14G 0
> test 64.0G 21.7T 1.98K 0 253M 0
> test 64.0G 21.7T 2.48K 0 317M 0
> test 64.0G 21.7T 1.98K 0 253M 0
> test 64.0G 21.7T 1.98K 0 254M 0
> test 64.0G 21.7T 2.23K 0 285M 0
> test 64.0G 21.7T 1.73K 0 221M 0
> test 64.0G 21.7T 2.23K 0 285M 0
> test 64.0G 21.7T 2.23K 0 285M 0
> test 64.0G 21.7T 1.49K 0 191M 0
> test 64.0G 21.7T 2.47K 0 317M 0
> test 64.0G 21.7T 1.46K 0 186M 0
> test 64.0G 21.7T 2.01K 0 258M 0
> test 64.0G 21.7T 1.98K 0 254M 0
> test 64.0G 21.7T 1.97K 0 253M 0
> test 64.0G 21.7T 2.23K 0 286M 0
> test 64.0G 21.7T 1.98K 0 254M 0
> test 64.0G 21.7T 1.73K 0 221M 0
> test 64.0G 21.7T 1.98K 0 254M 0
> test 64.0G 21.7T 2.42K 0 310M 0
> test 64.0G 21.7T 1.78K 0 228M 0
> test 64.0G 21.7T 2.23K 0 285M 0
> test 64.0G 21.7T 1.67K 0 214M 0
> test 64.0G 21.7T 1.80K 0 230M 0
> test 64.0G 21.7T 2.23K 0 285M 0
> test 64.0G 21.7T 2.47K 0 317M 0
> test 64.0G 21.7T 1.73K 0 221M 0
> test 64.0G 21.7T 1.99K 0 254M 0
> test 64.0G 21.7T 1.24K 0 159M 0
> test 64.0G 21.7T 2.47K 0 316M 0
> test 64.0G 21.7T 2.47K 0 317M 0
> test 64.0G 21.7T 1.99K 0 254M 0
> test 64.0G 21.7T 2.23K 0 285M 0
> test 64.0G 21.7T 1.73K 0 221M 0
> test 64.0G 21.7T 2.48K 0 317M 0
> test 64.0G 21.7T 2.48K 0 317M 0
> test 64.0G 21.7T 1.49K 0 190M 0
> test 64.0G 21.7T 2.23K 0 285M 0
> test 64.0G 21.7T 2.23K 0 285M 0
> test 64.0G 21.7T 1.81K 0 232M 0
> test 64.0G 21.7T 1.90K 0 243M 0
> test 64.0G 21.7T 2.48K 0 317M 0
> test 64.0G 21.7T 1.49K 0 191M 0
> test 64.0G 21.7T 2.47K 0 317M 0
> test 64.0G 21.7T 1.99K 0 254M 0
> test 64.0G 21.7T 1.97K 0 253M 0
> test 64.0G 21.7T 1.49K 0 190M 0
> test 64.0G 21.7T 2.23K 0 286M 0
> test 64.0G 21.7T 1.82K 0 232M 0
> test 64.0G 21.7T 2.15K 0 275M 0
> test 64.0G 21.7T 2.22K 0 285M 0
> test 64.0G 21.7T 1.73K 0 222M 0
> test 64.0G 21.7T 2.23K 0 286M 0
> test 64.0G 21.7T 1.90K 0 244M 0
> test 64.0G 21.7T 1.81K 0 231M 0
> test 64.0G 21.7T 2.23K 0 285M 0
> test 64.0G 21.7T 1.97K 0 252M 0
> test 64.0G 21.7T 2.00K 0 255M 0
> test 64.0G 21.7T 8.42K 0 1.05G 0
> test 64.0G 21.7T 10.3K 0 1.29G 0
> test 64.0G 21.7T 10.2K 0 1.28G 0
> test 64.0G 21.7T 10.4K 0 1.30G 0
> test 64.0G 21.7T 10.4K 0 1.30G 0
> test 64.0G 21.7T 10.4K 0 1.30G 0
> test 64.0G 21.7T 10.2K 0 1.27G 0
> test 64.0G 21.7T 10.4K 0 1.30G 0
> test 64.0G 21.7T 10.6K 0 1.32G 0
> test 64.0G 21.7T 10.5K 0 1.31G 0
> test 64.0G 21.7T 10.4K 0 1.30G 0
> test 64.0G 21.7T 9.23K 0 1.15G 0
> test 64.0G 21.7T 10.5K 0 1.31G 0
> test 64.0G 21.7T 10.0K 0 1.25G 0
> test 64.0G 21.7T 9.55K 0 1.19G 0
> test 64.0G 21.7T 10.2K 0 1.27G 0
> test 64.0G 21.7T 10.0K 0 1.25G 0
> test 64.0G 21.7T 9.91K 0 1.24G 0
> test 64.0G 21.7T 10.6K 0 1.32G 0
> test 64.0G 21.7T 9.24K 0 1.15G 0
> test 64.0G 21.7T 10.1K 0 1.26G 0
> test 64.0G 21.7T 10.3K 0 1.29G 0
> test 64.0G 21.7T 10.3K 0 1.29G 0
> test 64.0G 21.7T 10.6K 0 1.33G 0
> test 64.0G 21.7T 10.6K 0 1.33G 0
> test 64.0G 21.7T 8.54K 0 1.07G 0
> test 64.0G 21.7T 0 0 0 0
> test 64.0G 21.7T 0 0 0 0
> test 64.0G 21.7T 0 0 0 0
> ^C
>
> bash-3.2# ptime dd if=/test/q1 of=/dev/null bs=1024k
> 65536+0 records in
> 65536+0 records out
>
> real 1:36.732
> user 0.046
> sys 48.069
> bash-3.2#
>
>
> Well, that drop for several dozen seconds was interesting...
> Let's run it again without export/import:
>
> bash-3.2# zpool iostat 1
> capacity operations bandwidth
> pool used avail read write read write
> ---------- ----- ----- ----- ----- ----- -----
> test 64.0G 21.7T 3.00K 6 384M 271K
> test 64.0G 21.7T 0 0 0 0
> test 64.0G 21.7T 2.58K 0 330M 0
> test 64.0G 21.7T 6.02K 0 771M 0
> test 64.0G 21.7T 8.37K 0 1.05G 0
> test 64.0G 21.7T 10.4K 0 1.30G 0
> test 64.0G 21.7T 10.4K 0 1.30G 0
> test 64.0G 21.7T 9.64K 0 1.20G 0
> test 64.0G 21.7T 10.5K 0 1.31G 0
> test 64.0G 21.7T 10.6K 0 1.32G 0
> test 64.0G 21.7T 10.6K 0 1.33G 0
> test 64.0G 21.7T 10.4K 0 1.30G 0
> test 64.0G 21.7T 9.65K 0 1.21G 0
> test 64.0G 21.7T 9.84K 0 1.23G 0
> test 64.0G 21.7T 9.22K 0 1.15G 0
> test 64.0G 21.7T 10.9K 0 1.36G 0
> test 64.0G 21.7T 10.9K 0 1.36G 0
> test 64.0G 21.7T 10.4K 0 1.30G 0
> test 64.0G 21.7T 10.7K 0 1.34G 0
> test 64.0G 21.7T 10.6K 0 1.33G 0
> test 64.0G 21.7T 10.9K 0 1.36G 0
> test 64.0G 21.7T 10.6K 0 1.32G 0
> test 64.0G 21.7T 10.4K 0 1.30G 0
> test 64.0G 21.7T 10.7K 0 1.34G 0
> test 64.0G 21.7T 10.5K 0 1.32G 0
> test 64.0G 21.7T 10.6K 0 1.32G 0
> test 64.0G 21.7T 10.8K 0 1.34G 0
> test 64.0G 21.7T 10.4K 0 1.29G 0
> test 64.0G 21.7T 10.5K 0 1.31G 0
> test 64.0G 21.7T 9.15K 0 1.14G 0
> test 64.0G 21.7T 10.8K 0 1.35G 0
> test 64.0G 21.7T 9.76K 0 1.22G 0
> test 64.0G 21.7T 8.67K 0 1.08G 0
> test 64.0G 21.7T 10.8K 0 1.36G 0
> test 64.0G 21.7T 10.9K 0 1.36G 0
> test 64.0G 21.7T 10.3K 0 1.28G 0
> test 64.0G 21.7T 9.76K 0 1.22G 0
> test 64.0G 21.7T 10.5K 0 1.31G 0
> test 64.0G 21.7T 10.6K 0 1.33G 0
> test 64.0G 21.7T 9.23K 0 1.15G 0
> test 64.0G 21.7T 9.63K 0 1.20G 0
> test 64.0G 21.7T 9.79K 0 1.22G 0
> test 64.0G 21.7T 10.2K 0 1.28G 0
> test 64.0G 21.7T 10.4K 0 1.30G 0
> test 64.0G 21.7T 10.3K 0 1.29G 0
> test 64.0G 21.7T 10.2K 0 1.28G 0
> test 64.0G 21.7T 10.6K 0 1.33G 0
> test 64.0G 21.7T 10.8K 0 1.35G 0
> test 64.0G 21.7T 10.5K 0 1.32G 0
> test 64.0G 21.7T 11.0K 0 1.37G 0
> test 64.0G 21.7T 10.2K 0 1.27G 0
> test 64.0G 21.7T 9.69K 0 1.21G 0
> test 64.0G 21.7T 6.07K 0 777M 0
> test 64.0G 21.7T 0 0 0 0
> test 64.0G 21.7T 0 0 0 0
> test 64.0G 21.7T 0 0 0 0
> ^C
> bash-3.2#
>
> bash-3.2# ptime dd if=/test/q1 of=/dev/null bs=1024k
> 65536+0 records in
> 65536+0 records out
>
> real 50.521
> user 0.043
> sys 48.971
> bash-3.2#
>
> Now it looks like reading from the pool using a single dd is actually
> CPU bound.
>
> Reading the same file again and again does produce, more or less,
> consistent timing. However, every time I export/import the pool there
> is that drop in throughput during the first read and the total time
> increases to almost 100 seconds... some meta-data? (of course there
> are no errors of any sort, etc.)
>

That might fall in either of these buckets:

        6412053 zfetch needs some love
        6579975 dnode_new_blkid should check before it locks

>
>
>
>
> >> Reducing zfs_txg_synctime to 1 helps a little bit but still it's not
> >> an even stream of data.
> >>
> >> If I start 3 dd streams at the same time then it is slightly better
> >> (zfs_txg_synctime set back to 5) but still very jumpy.
> >>
>
> RB> Try zfs_txg_synctime to 10; that reduces the txg overhead.
>
>
You need multiple dd; we're basically CPU bound here. With multiple dd
and zfs_txg_synctime at 10 you will get more write throughput. The
drops you will see then will correspond to the metadata-update phase of
the transaction group. But the drops below correspond to the txg being
written out (memory to disk) faster than dd is able to fill memory
(some of that speed governed by suboptimal locking).

-r

> Doesn't help...
>
> [...]
> test 13.6G 21.7T 0 0 0 0
> test 13.6G 21.7T 0 8.46K 0 1.05G
> test 17.6G 21.7T 0 19.3K 0 2.40G
> test 17.6G 21.7T 0 0 0 0
> test 17.6G 21.7T 0 0 0 0
> test 17.6G 21.7T 0 8.04K 0 1022M
> test 17.6G 21.7T 0 20.2K 0 2.51G
> test 21.6G 21.7T 0 76 0 249K
> test 21.6G 21.7T 0 0 0 0
> test 21.6G 21.7T 0 0 0 0
> test 21.6G 21.7T 0 10.1K 0 1.25G
> test 25.6G 21.7T 0 18.6K 0 2.31G
> test 25.6G 21.7T 0 0 0 0
> test 25.6G 21.7T 0 0 0 0
> test 25.6G 21.7T 0 6.34K 0 810M
> test 25.6G 21.7T 0 19.9K 0 2.48G
> test 29.6G 21.7T 0 88 63.2K 354K
> [...]
>
> bash-3.2# ptime dd if=/dev/zero of=/test/q1 bs=1024k count=65536
> 65536+0 records in
> 65536+0 records out
>
> real 1:10.074
> user 0.074
> sys 52.250
> bash-3.2#
>
>
> Increasing it even further (up to 32s) doesn't help either.
>
> However lowering it to 1s gives:
>
> [...]
> test 2.43G 21.7T 0 8.62K 0 1.07G
> test 4.46G 21.7T 0 7.23K 0 912M
> test 4.46G 21.7T 0 624 0 77.9M
> test 6.66G 21.7T 0 10.7K 0 1.33G
> test 6.66G 21.7T 0 6.66K 0 850M
> test 8.86G 21.7T 0 10.6K 0 1.31G
> test 8.86G 21.7T 0 1.96K 0 251M
> test 11.2G 21.7T 0 16.5K 0 2.04G
> test 11.2G 21.7T 0 0 0 0
> test 11.2G 21.7T 0 18.6K 0 2.31G
> test 13.5G 21.7T 0 11 0 11.9K
> test 13.5G 21.7T 0 2.60K 0 332M
> test 13.5G 21.7T 0 19.1K 0 2.37G
> test 16.3G 21.7T 0 11 0 11.9K
> test 16.3G 21.7T 0 9.61K 0 1.20G
> test 18.4G 21.7T 0 7.41K 0 936M
> test 18.4G 21.7T 0 11.6K 0 1.45G
> test 20.3G 21.7T 0 3.26K 0 407M
> test 20.3G 21.7T 0 7.66K 0 977M
> test 22.5G 21.7T 0 7.62K 0 963M
> test 22.5G 21.7T 0 6.86K 0 875M
> test 24.5G 21.7T 0 8.41K 0 1.04G
> test 24.5G 21.7T 0 10.4K 0 1.30G
> test 26.5G 21.7T 1 2.19K 127K 270M
> test 26.5G 21.7T 0 0 0 0
> test 26.5G 21.7T 0 4.56K 0 584M
> test 28.5G 21.7T 0 11.5K 0 1.42G
> [...]
>
> bash-3.2# ptime dd if=/dev/zero of=/test/q1 bs=1024k count=65536
> 65536+0 records in
> 65536+0 records out
>
> real 1:09.541
> user 0.072
> sys 53.421
> bash-3.2#
>
>
>
> It looks slightly less jumpy, but the total real time is about the
> same, so average throughput is actually the same (about 1GB/s).
>
>
>
>
> >> Reading with one dd produces steady throughput but I'm disappointed
> >> with actual performance:
> >>
>
> RB> Again, probably cpu bound. What's "ptime dd..." saying ?
>
> You were right here. Reading with a single dd seems to be CPU bound.
> However, multiple streams for reading do not seem to increase
> performance considerably.
>
> Nevertheless, the main issue is jumpy writing...
>
>
>
> --
> Best regards,
> Robert Milkowski  mailto:milek at task.gda.pl
> http://milek.blogspot.com
>
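For anyone who wants to reproduce the "multiple dd" experiment Roch recommends, here is a minimal sketch. The file names, the stream count (4) and the per-stream size (16384 x 1 MB, i.e. 64 GB in total, matching the single-stream runs above) are made up for illustration; the number to compare against the single-stream result is the real time reported by ptime.

bash-3.2# ptime bash -c 'for i in 1 2 3 4; do dd if=/dev/zero of=/test/q$i bs=1024k count=16384 & done; wait'

As for the zfs_txg_synctime experiments discussed above: on OpenSolaris builds of that era the value (in seconds) could, if memory serves, be changed on a live system with mdb, roughly as below. Treat this as an assumption and verify that the symbol exists on your release before poking it.

bash-3.2# echo zfs_txg_synctime/W0t10 | mdb -kw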