In this email, when I say PERC, I really mean either a PERC or any other hardware WriteBack-buffered RAID controller with a BBU.

For future server purchases, I want to know which is faster: (a) a bunch of hard disks with PERC and WriteBack enabled, or (b) a bunch of hard disks plus one SSD for ZIL. Unfortunately, I don't have an SSD available for testing. So here is what I was able to do:

I measured the write speed of the naked disks (PERC set to WriteThrough). Results were around 350 ops/sec.

I measured the write speed with WriteBack enabled. Results were around 1250 ops/sec.

So right from the start, we can see there's a huge performance boost from enabling WriteBack. Even for large sequential writes, buffering allows the disks to operate much more continuously. The next question is how it compares against an SSD ZIL.

Since I don't have an SSD available, I created a ram device and put the ZIL there. This is not a measure of the speed I would get with an SSD; rather, it is a measure which the SSD cannot possibly achieve. So it serves to establish an upper bound. If the upper and lower bounds are near each other, then we have a good estimate of the speed with an SSD ZIL; but if the upper bound and lower bound are far apart, we haven't gained much information.

With the ZIL in RAM, results were around 1700 ops/sec.

This measure is very far from the lower bound, but it still provides some useful knowledge. The take-home knowledge is:

* There's a lot of performance to be gained by accelerating the ZIL. Potentially up to 6x or 7x faster than naked disks.

* The WriteBack RAID controller achieves a lot of this performance increase. About 3x or 4x acceleration.

* I don't know how much an SSD would help. I don't know if it's better, the same, or worse than the PERC. I don't know if the combination of PERC and SSD together would go faster than either one individually.

I have a hypothesis. I think the best configuration will be to use a PERC with WriteBack enabled on all the spindle hard drives, but include an SSD for ZIL, and set the PERC to WriteThrough on the SSD. This has yet to be proven or disproven.
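For anyone wanting to reproduce the ram-device ZIL test above, a minimal sketch of the setup on OpenSolaris might look like the following; the pool name "tank" and the 2g ramdisk size are assumptions, not details given above.

# Create a ramdisk and attach it to the pool as a dedicated log (ZIL) device.
ramdiskadm -a zilbench 2g
zpool add tank log /dev/ramdisk/zilbench

# Watch the log device absorb the synchronous writes while the benchmark runs.
zpool iostat -v tank 5

# Tear it down afterwards (log device removal requires a recent zpool version).
zpool remove tank /dev/ramdisk/zilbench
ramdiskadm -d zilbench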
Edward Ned Harvey wrote:
> In this email, when I say PERC, I really mean either a PERC or any other hardware WriteBack-buffered RAID controller with a BBU.
>
> For future server purchases, I want to know which is faster: (a) a bunch of hard disks with PERC and WriteBack enabled, or (b) a bunch of hard disks plus one SSD for ZIL. Unfortunately, I don't have an SSD available for testing. So here is what I was able to do:
>
> I measured the write speed of the naked disks (PERC set to WriteThrough). Results were around 350 ops/sec.
>
> I measured the write speed with WriteBack enabled. Results were around 1250 ops/sec.
>
> So right from the start, we can see there's a huge performance boost from enabling WriteBack. Even for large sequential writes, buffering allows the disks to operate much more continuously. The next question is how it compares against an SSD ZIL.
>
> Since I don't have an SSD available, I created a ram device and put the ZIL there. This is not a measure of the speed I would get with an SSD; rather, it is a measure which the SSD cannot possibly achieve. So it serves to establish an upper bound. If the upper and lower bounds are near each other, then we have a good estimate of the speed with an SSD ZIL; but if the upper bound and lower bound are far apart, we haven't gained much information.
>
> With the ZIL in RAM, results were around 1700 ops/sec.
>
> This measure is very far from the lower bound, but it still provides some useful knowledge. The take-home knowledge is:
>
> * There's a lot of performance to be gained by accelerating the ZIL. Potentially up to 6x or 7x faster than naked disks.
>
> * The WriteBack RAID controller achieves a lot of this performance increase. About 3x or 4x acceleration.
>
> * I don't know how much an SSD would help. I don't know if it's better, the same, or worse than the PERC. I don't know if the combination of PERC and SSD together would go faster than either one individually.
>
> I have a hypothesis. I think the best configuration will be to use a PERC with WriteBack enabled on all the spindle hard drives, but include an SSD for ZIL, and set the PERC to WriteThrough on the SSD. This has yet to be proven or disproven.

Cache on controllers is almost always battery-backed DRAM (NVRAM having made an exit from the scene a while ago). As such, it has fabulous latency and throughput compared to anything else. From a performance standpoint, it will beat an SSD ZIL on a 1-for-1 basis. However, it's almost never the case that you find an HBA cache even a fraction of the size of a small SSD. So, what happens when you flood the HBA with more I/O than the on-board cache can handle? It reduces performance back to the level of NO cache.

From everything I've seen, an SSD wins simply because it's 20-100x the size. HBAs almost never have more than 512MB of cache, and even fancy SAN boxes generally have 1-2GB max. So, HBAs are subject to being overwhelmed with heavy I/O. The SSD ZIL has a much better chance of being able to weather a heavy I/O period without being filled. Thus, SSDs are better at "average" performance - they provide a relatively steady performance profile, whereas HBA cache is very spiky.

The other real advantage of an SSD ZIL is that it covers the entire pool. Most larger pools spread their disks over multiple controllers, each of which must have a cache in order for the whole pool to perform evenly.
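To put the sizing argument in concrete terms, here is a back-of-the-envelope sketch; the 200 MB/s incoming write rate and the 32 GB slog size are assumed example numbers, not measurements from this thread.

# How long each device can absorb a sustained burst of incoming writes
# before it is full (hypothetical 200 MB/s rate, sizes in MB).
RATE_MB_S=200
for dev in "HBA cache:512" "SSD slog:32768"; do
  name=${dev%:*}
  size_mb=${dev#*:}
  echo "$name (${size_mb}MB) absorbs roughly $((size_mb / RATE_MB_S)) seconds at ${RATE_MB_S}MB/s"
done

The exact numbers matter less than the ratio: the slog has roughly two orders of magnitude more room to ride out a burst, which is the 20-100x argument above in arithmetic form.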
If you know you're going to be doing a very intermittent or modest level of I/O, and that I/O is likely to fit within an HBA's cache, then it will outperform an SSD. For a continuous heavy load, or for extremely spiky loads (which are considerably in excess of the HBA's cache), an SSD will win handily. The key here is the hard drives - the faster they are, the faster the HBA will be able to empty its cache to disk, and the lower the likelihood it will get overwhelmed by new I/O.

All that said, I'm pretty sure most I/O patterns heavily favor an SSD ZIL over HBA cache. Note that the ZIL applies only to synchronous I/O, while HBA cache is for all writes. Also note that HBA cache can be used (or shared) as a read cache as well, depending on the HBA setup. And a good (SLC) SSD can handle 50,000 IOPS until it's filled, which takes a very long time relative to an HBA cache.

------

I've always wondered what the benefit (and difficulty to add to ZFS) would be of having an async write cache for ZFS - that is, ZFS currently buffers async writes in RAM until it decides to aggregate enough of them to flush to disk. I think it would be interesting to see what would happen if an async SSD cache was available, since the write pattern is "large, streaming", which means that the same devices useful for L2ARC would perform well as an async write cache. In essence, use the async write SSD as an extra-large buffer, in the same way that L2ARC on SSD supplements main memory for the read cache.

-- 
Erik Trimble
Java System Support
Mailstop: usca22-123
Phone: x17195
Santa Clara, CA
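One way to see the sync/async split described above is to watch a pool that already has a dedicated log device; a minimal sketch, with "tank" as a placeholder pool name:

# The log vdev line shows the synchronous (ZIL) traffic; the data vdevs show
# the asynchronous txg flushes. "tank" is a placeholder pool name.
zpool iostat -v tank 5

On a pool without a separate log device, the same ZIL traffic is interleaved onto the data vdevs, which is why the split is harder to observe there.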
Hi, Erik,

> I've always wondered what the benefit (and difficulty to add to ZFS) would be of having an async write cache for ZFS - that is, ZFS currently buffers async writes in RAM until it decides to aggregate enough of them to flush to disk. I think it would be interesting to see what would happen if an async SSD cache was available, since the write pattern is "large, streaming", which means that the same devices useful for L2ARC would perform well as an async write cache. In essence, use the async write SSD as an extra-large buffer, in the same way that L2ARC on SSD supplements main memory for the read cache.

I think HDDs can handle those "large and streaming" workloads quite well. You can observe that SSDs don't perform much better than HDDs on streaming writes/reads. So it should be good enough to flush the aggregated async write data back to HDD directly, instead of introducing another level of write cache.
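This claim is easy to check on one's own hardware; a rough sketch of a streaming-write comparison follows, with the mount points as placeholders and /dev/zero standing in for streaming data (disable compression on the test datasets, or the numbers will be meaningless):

# Streaming write throughput, HDD pool vs SSD, using a bandwidth-bound block size.
ptime dd if=/dev/zero of=/hddpool/streamtest bs=1024k count=8192
ptime dd if=/dev/zero of=/ssdpool/streamtest bs=1024k count=8192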
[moved off osol-discuss]

Zhu Han wrote:
> Hi, Erik,
>
> > I've always wondered what the benefit (and difficulty to add to ZFS) would be of having an async write cache for ZFS - that is, ZFS currently buffers async writes in RAM until it decides to aggregate enough of them to flush to disk. I think it would be interesting to see what would happen if an async SSD cache was available, since the write pattern is "large, streaming", which means that the same devices useful for L2ARC would perform well as an async write cache. In essence, use the async write SSD as an extra-large buffer, in the same way that L2ARC on SSD supplements main memory for the read cache.
>
> I think HDDs can handle those "large and streaming" workloads quite well. You can observe that SSDs don't perform much better than HDDs on streaming writes/reads. So it should be good enough to flush the aggregated async write data back to HDD directly, instead of introducing another level of write cache.

This is true. SSDs and HDs differ little in their ability to handle raw throughput. However, we often still see problems in ZFS associated with periodic system "pauses" where ZFS effectively monopolizes the HDs to write out its currently buffered I/O. People have been complaining about this for quite a while. SSDs have a huge advantage where IOPS are concerned, and given that the backing-store HDs have to service both read and write requests, they're severely limited in the number of IOPS they can give to incoming data.

You have a good point, but I'd still be curious to see what an async cache would do. After all, that is effectively what the HBA cache is, and we see a significant improvement with it, and not just for sync writes.

I also don't know what the threshold is in ZFS for it to consider it time to do an async buffer flush. Is it time based? % of RAM based? An absolute amount? All of that would impact whether an SSD async cache would be useful.

-- 
Erik Trimble
Java System Support
Mailstop: usca22-123
Phone: x17195
Santa Clara, CA
On Sat, Mar 6, 2010 at 12:50 PM, Erik Trimble <Erik.Trimble at sun.com> wrote:
> This is true. SSDs and HDs differ little in their ability to handle raw throughput. However, we often still see problems in ZFS associated with periodic system "pauses" where ZFS effectively monopolizes the HDs to write out its currently buffered I/O. People have been complaining about this for quite a while. SSDs have a huge advantage where IOPS are concerned, and given that the backing-store HDs have to service both read and write requests, they're severely limited in the number of IOPS they can give to incoming data.
>
> You have a good point, but I'd still be curious to see what an async cache would do. After all, that is effectively what the HBA cache is, and we see a significant improvement with it, and not just for sync writes.

I might see what you mean here. Because ZFS has to aggregate write data during a short period (the txg's lifetime) to avoid generating too many random HDD writes, the bandwidth of the HDDs during this time is wasted. For a write-heavy streaming workload, especially one that can easily saturate the HDD pool's bandwidth, ZFS will perform worse than legacy filesystems such as UFS or EXT3. The IOPS of the HDDs is not the limitation here; the bandwidth of the HDDs is the root cause.

This is a design choice of ZFS. Reducing the length of the txg commit period can alleviate the problem, so that the amount of data that needs to be flushed to disk each time is smaller. Replacing the HDDs with some high-end FC disks may also solve this problem.

> I also don't know what the threshold is in ZFS for it to consider it time to do an async buffer flush. Is it time based? % of RAM based? An absolute amount? All of that would impact whether an SSD async cache would be useful.

IMHO, ZFS flushes the data back to disk asynchronously every 5 seconds, which is the default txg commit period. ZFS will also flush the data back to disk before that period expires, based on an estimate of the amount of memory used by the current txg. This is called the write throttle. See this link:
http://blogs.sun.com/roch/entry/the_new_zfs_write_throttle
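For reference, the txg interval and the write-throttle limit mentioned above are exposed as kernel tunables on OpenSolaris; a minimal sketch of inspecting and (cautiously) overriding them, with the values shown purely as illustrations (tunable names can vary between builds, so check your source):

# Read the live values of the txg timeout and the write-limit override.
echo "zfs_txg_timeout/D" | mdb -k
echo "zfs_write_limit_override/E" | mdb -k

# Example /etc/system settings (take effect at the next boot) -- illustrative
# values only, not recommendations:
#   set zfs:zfs_txg_timeout = 5
#   set zfs:zfs_write_limit_override = 0x20000000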
On Mar 6, 2010, at 1:38 AM, Zhu Han wrote:
> On Sat, Mar 6, 2010 at 12:50 PM, Erik Trimble <Erik.Trimble at sun.com> wrote:
> > This is true. SSDs and HDs differ little in their ability to handle raw throughput. However, we often still see problems in ZFS associated with periodic system "pauses" where ZFS effectively monopolizes the HDs to write out its currently buffered I/O. People have been complaining about this for quite a while. SSDs have a huge advantage where IOPS are concerned, and given that the backing-store HDs have to service both read and write requests, they're severely limited in the number of IOPS they can give to incoming data.
> >
> > You have a good point, but I'd still be curious to see what an async cache would do. After all, that is effectively what the HBA cache is, and we see a significant improvement with it, and not just for sync writes.
>
> I might see what you mean here. Because ZFS has to aggregate write data during a short period (the txg's lifetime) to avoid generating too many random HDD writes, the bandwidth of the HDDs during this time is wasted. For a write-heavy streaming workload, especially one that can easily saturate the HDD pool's bandwidth, ZFS will perform worse than legacy filesystems such as UFS or EXT3. The IOPS of the HDDs is not the limitation here; the bandwidth of the HDDs is the root cause.

This statement is too simple, and thus does not represent reality very well. For a fully streaming workload where the load is near the capacity of the storage, the algorithms in ZFS will work to optimize the match. There is still some work to be done, but I don't believe UFS has beaten ZFS on Solaris in a significant streaming benchmark for several years now.

What we do see is that high-performance SSDs can saturate the SAS/SATA link for extended periods of time. For example, a Western Digital SiliconEdge Blue (a new, midrange model) can read at 250 MB/sec, in contrast to a WD RE4, which has a media transfer rate of 138 MB/sec. High-speed SSDs are already putting the hurt on 6Gbps SAS/SATA -- the Micron models claim 370 MB/sec sustained. Since this can be easily parallelized, expect that high-end SSDs will saturate whatever you can connect them to. This is one reason why the F5100 has 64 SAS channels for host connections.

> This is a design choice of ZFS. Reducing the length of the txg commit period can alleviate the problem, so that the amount of data that needs to be flushed to disk each time is smaller. Replacing the HDDs with some high-end FC disks may also solve this problem.

Properly matching I/O source and sink is still important; no file system can relieve you of that duty :-)

> I also don't know what the threshold is in ZFS for it to consider it time to do an async buffer flush. Is it time based? % of RAM based? An absolute amount? All of that would impact whether an SSD async cache would be useful.

The answer is "yes" to all of these questions, but there are many variables to consider, so YMMV.
 -- richard

ZFS storage and performance consulting at http://www.RichardElling.com
ZFS training on deduplication, NexentaStor, and NAS performance
http://nexenta-atlanta.eventbrite.com (March 16-18, 2010)
Edward Ned Harvey
2010-Mar-07 03:10 UTC
[zfs-discuss] [osol-discuss] WriteBack versus SSD-ZIL
> From everything I've seen, an SSD wins simply because it's 20-100x the
> size. HBAs almost never have more than 512MB of cache, and even fancy
> SAN boxes generally have 1-2GB max. So, HBAs are subject to being
> overwhelmed with heavy I/O. The SSD ZIL has a much better chance of
> being able to weather a heavy I/O period without being filled. Thus,
> SSDs are better at "average" performance - they provide a relatively
> steady performance profile, whereas HBA cache is very spiky.

This is a really good point. So you think I may actually get better performance by disabling the WriteBack on all the spindle disks and enabling it on the SSD instead. This is precisely the opposite of what I was thinking.

I'm planning to publish some more results soon, but haven't gathered it all yet. But see these:

Just naked disks, no acceleration:
http://nedharvey.com/iozone/iozone-DellPE2970-32G-3-mirrors-striped-WriteThrough.txt

Same configuration as above, but WriteBack enabled:
http://nedharvey.com/iozone/iozone-DellPE2970-32G-3-mirrors-striped-WriteBack.txt

Same configuration as the naked disks, but a ramdrive was created for ZIL:
http://nedharvey.com/iozone/iozone-DellPE2970-32G-3-mirrors-striped-ramZIL.txt

Using the ramdrive for ZIL, and also WriteBack enabled on the PERC:
http://nedharvey.com/iozone/iozone-DellPE2970-32G-3-mirrors-striped-WriteBack_and_ramZIL.txt

These results show that enabling WriteBack makes a huge performance difference (3-4x higher) for writes, compared to the naked disks. I don't think it's because an entire write operation fits into the HBA DRAM, or because the HBA remains unsaturated. The PERC has 256MB, but the test includes 8 threads all simultaneously writing separate 4G files in various sized chunks and patterns. I think that when the PERC RAM is full of stuff queued for write to disk, it's simply able to order, organize, and optimize the write operations to leverage the disks as much as possible.
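For reference, an iozone run matching the description above (8 threads, separate 4G files, a sweep of record sizes) might look roughly like this; the exact flags behind the published results are not given in the thread, and the /tank file paths are placeholders:

# Throughput mode: 8 threads, 4 GB per thread, write (-i 0) and read (-i 1),
# with -e including fsync time in the timing. Sweep a few record sizes.
for rs in 4k 64k 128k 1024k; do
  iozone -e -i 0 -i 1 -t 8 -s 4g -r $rs \
    -F /tank/f1 /tank/f2 /tank/f3 /tank/f4 /tank/f5 /tank/f6 /tank/f7 /tank/f8
done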
Damon Atkins
2010-Mar-08 04:10 UTC
[zfs-discuss] Should ZFS write data out when disk are idle
I think ZFS should look for more opportunities to write to disk, rather than leaving it to the last second (5 seconds) as it appears to do. For example, if a file has a record size worth of data outstanding, it should be queued within ZFS to be written out. If the record is updated again before a txg, it can be re-queued (if it has left the queue) and written to the same block or a new block. The write queue would empty when there is spare I/O bandwidth and spare cache capacity on the disks, determined through outstanding I/Os. Once the data is on disk it could be freed for re-use even before the txg has occurred, but checksum details would need to be recorded first. The txg then comes along after X seconds and finds that most of the data writes have already happened, and only metadata writes are left to do. One would assume this would help with the delays at txg time talked about in this thread.

The example below shows 28 x 128k writes to the same file before anything is written to disk, while the disks are idle the entire time. There is no cost to writing to disk if the disk is not doing anything or is under capacity. (Not a perfect example.)

At the other end, maybe access-time updates should not be written to disk until there is some real data to write, or until 30 minutes have passed, to allow green disks to power down for a while. (atime=on|off|delay)

Cheers

No dedup, but compression on:

while sleep 1 ; do echo `dd if=/dev/random of=xxxx bs=128k count=1 2>&1` ; done &
iostat -zxcnT d 1

     cpu
 us sy wt id
  0  5  0 94
                    extended device statistics
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
    0.0   53.0    0.0  301.5  0.0  0.2    0.0    3.4   0   4 c5t0d0
    0.0   53.0    0.0  301.5  0.0  0.2    0.0    3.1   0   4 c5t2d0
    0.0   58.0    0.0  127.0  0.0  0.0    0.0    0.1   0   0 c5t1d0
    0.0   58.0    0.0  127.0  0.0  0.0    0.0    0.1   0   0 c5t3d0
0+1 records in
0+1 records out
Monday, 8 March 2010 02:51:41 PM EST
     cpu
 us sy wt id
  0  4  0 96
                    extended device statistics
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
    0.0    3.0    0.0    2.0  0.0  0.0    0.0    0.5   0   0 c5t0d0
    0.0    3.0    0.0    2.0  0.0  0.0    0.0    0.5   0   0 c5t2d0
    0.0    1.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 c5t1d0
    0.0    1.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 c5t3d0
0+1 records in
0+1 records out
Monday, 8 March 2010 02:51:42 PM EST
     cpu
 us sy wt id
  1  3  0 96
                    extended device statistics
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
0+1 records in
0+1 records out

[... the same pattern repeats once per second from 02:51:43 through 02:52:09: the dd loop reports another 128k write each second, while iostat (-z suppresses idle devices) shows no disk activity at all ...]

Monday, 8 March 2010 02:52:10 PM EST
     cpu
 us sy wt id
  1  4  0 95
                    extended device statistics
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
    0.0   37.0    0.0  140.5  0.0  0.1    0.0    1.9   0   2 c5t0d0
    0.0   37.0    0.0  140.5  0.0  0.1    0.0    1.9   0   2 c5t2d0
    0.0   41.0    0.0   79.5  0.0  0.0    0.0    0.1   0   0 c5t1d0
    0.0   40.0    0.0   79.5  0.0  1.6    0.0   38.8   0  39 c5t3d0
0+1 records in
0+1 records out
Monday, 8 March 2010 02:52:11 PM EST
     cpu
 us sy wt id
  0  4  0 96
                    extended device statistics
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
    0.0    1.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 c5t0d0
    0.0    1.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 c5t2d0
    0.0    1.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 c5t1d0
    0.0    1.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 c5t3d0
Monday, 8 March 2010 02:52:12 PM EST
     cpu
 us sy wt id
  0  1  0 99
                    extended device statistics
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
0+1 records in
0+1 records out
-- 
This message posted from opensolaris.org
Bob Friesenhahn
2010-Mar-08 15:54 UTC
[zfs-discuss] Should ZFS write data out when disk are idle
On Sun, 7 Mar 2010, Damon Atkins wrote:
> The example below shows 28 x 128k writes to the same file before
> anything is written to disk, while the disks are idle the entire time.
> There is no cost to writing to disk if the disk is not doing
> anything or is under capacity. (Not a perfect example.)

ZFS can be tuned to write out transaction groups much more quickly if you like. It is not true that there is "no cost", though. Since ZFS uses COW, this approach requires that new blocks be allocated and written at a much higher rate. There is also an "opportunity cost" in that if a read comes in while these continuous writes are occurring, the read will be delayed.

There are many applications which continually write/overwrite file content, or which update a file at a slow pace. For example, log files are typically updated at a slow rate. Updating a block requires reading it first (if it is not already cached in the ARC), which can be quite expensive. By waiting a bit longer, there is a much better chance that the whole block is overwritten, so ZFS can discard the existing block on disk without bothering to re-read it.

Bob
-- 
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
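The read-before-partial-overwrite point can be seen with a rough sketch like the one below; "tank/demo" is a placeholder dataset, and for a clean measurement the file's blocks should not already be sitting in the ARC (e.g. export/import the pool between steps):

# Dataset with the default 128k recordsize.
zfs create -o recordsize=128k tank/demo

# Create a 128 MB file.
dd if=/dev/urandom of=/tank/demo/file bs=128k count=1024

# Overwrite it in 8k pieces: each write touches only part of a 128k record,
# so uncached blocks must be read back before they can be rewritten
# (watch the read columns in "iostat -xnz 1").
dd if=/dev/urandom of=/tank/demo/file bs=8k count=16384 conv=notrunc

# Overwrite it in full 128k records: whole blocks are replaced, so the old
# blocks can be freed without being re-read.
dd if=/dev/urandom of=/tank/demo/file bs=128k count=1024 conv=notrunc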
Damon Atkins
2010-Mar-09 12:45 UTC
[zfs-discuss] Should ZFS write data out when disk are idle
I am talking about having a write queue which points to ready-to-write full stripes. Ready-to-write full stripes would be:

* The last byte of the full stripe has been updated.
* The file has been closed for writing. (Exception to the above rule.)

I believe there is now a scheduler in ZFS to handle read and write conflicts. For example, on a large multi-gigabyte NVRAM array, the only big considerations are how big the Fibre Channel pipe is and the limit on outstanding I/Os. But on SATA off the motherboard, it is about how much RAM cache each disk has, as well as the speed of the SATA connection and the number of outstanding I/Os.

When it comes time to do a txg, some of the record blocks (most of the full 128k ones) will already have been written out. If we have only written out full record blocks, then there is no performance loss. Eventually a txg is going to happen, and eventually these full writes will need to happen, but if we can choose a less busy time for them, all the better.

e.g. on a raidz with 5 disks, if I have 4 x 128k worth of data to write, let's write it. On a mirror, if I have 128k worth to write, let's write it (record size 128k). Or let it be a tunable for the zpool, as some arrays (RAID5) like to have larger chunks of data. Why wait for the txg, and the resulting pause every 30 seconds, if the disks are not being pressured for reads?

Bob wrote (I may not have explained it well enough):
> It is not true that there is "no cost", though. Since ZFS uses COW,
> this approach requires that new blocks be allocated and written at a
> much higher rate. There is also an "opportunity cost" in that if a
> read comes in while these continuous writes are occurring, the read
> will be delayed.

At some stage a write needs to happen. **Full** writes have a very small COW cost compared with small writes. As I said above, I am talking about a write of 4 x 128k on a 5-disk raidz before the write would happen early.

> There are many applications which continually write/overwrite file
> content, or which update a file at a slow pace. For example, log
> files are typically updated at a slow rate. Updating a block requires
> reading it first (if it is not already cached in the ARC), which can
> be quite expensive. By waiting a bit longer, there is a much better
> chance that the whole block is overwritten, so ZFS can discard the
> existing block on disk without bothering to re-read it.

Apps which update at a slow pace will not trigger the above early write until they have written at least a record size worth of data; applications which write less than 128k (recordsize) in 30 seconds will never trigger the early write on a mirrored disk or even a raidz setup. What this will catch is the big writers: files greater than 128k (recordsize) on mirrored disks, and files larger than 4 x 128k on 5-disk RaidZ sets. So commands like dd if=x of=y bs=512k will not cause issues (pauses/delays) when the txg times out.

PS: I have already set zfs:zfs_write_limit_override, and I would not recommend anyone set it very low to get the above effect. It's just an idea on how to prevent the delay effect; it may not be practical?
-- 
This message posted from opensolaris.org
Damon Atkins
2010-Mar-10 02:13 UTC
[zfs-discuss] Should ZFS write data out when disk are idle
Sorry - a full stripe on a RaidZ is the recordsize, i.e. if the record size is 128k on a RaidZ made up of 5 disks, then the 128k is spread across 4 disks, with the calculated parity on the 5th disk, which means the writes are 32k to each disk.

For a RaidZ, when data is written to a disk, are the individual 32k writes destined for the same disk joined together and written out as a single I/O to that disk?

e.g. 128k for file a, 128k for file b, 128k for file c. When written out, does ZFS do 32k+32k+32k I/Os to each disk, or will it do one 96k I/O if the space is available sequentially?

Cheers
-- 
This message posted from opensolaris.org
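The per-disk arithmetic above generalizes to any single-parity raidz width; a trivial sketch, ignoring padding and partial records:

# Data chunk per disk for one full record on single-parity raidz:
# recordsize / (disks - 1 parity disk). 128k on 5 disks -> 32k per data disk.
RECORDSIZE_K=128
for DISKS in 3 5 9; do
  echo "raidz of $DISKS disks: $((RECORDSIZE_K / (DISKS - 1)))k per data disk"
done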
Richard Elling
2010-Mar-10 04:29 UTC
[zfs-discuss] Should ZFS write data out when disk are idle
On Mar 9, 2010, at 6:13 PM, Damon Atkins wrote:
> Sorry - a full stripe on a RaidZ is the recordsize, i.e. if the record size is 128k on a RaidZ made up of 5 disks, then the 128k is spread across 4 disks, with the calculated parity on the 5th disk, which means the writes are 32k to each disk.

Nominally.

> For a RaidZ, when data is written to a disk, are the individual 32k writes destined for the same disk joined together and written out as a single I/O to that disk?

I/Os can be coalesced, but there is no restriction as to what can be coalesced. In other words, subsequent writes can also be coalesced if they are contiguous.

> e.g. 128k for file a, 128k for file b, 128k for file c. When written out, does ZFS do
> 32k+32k+32k I/Os to each disk, or will it do one 96k I/O if the space is available sequentially?

I'm not sure how one could write one 96KB physical I/O to three different disks?
 -- richard

ZFS storage and performance consulting at http://www.RichardElling.com
ZFS training on deduplication, NexentaStor, and NAS performance
http://nexenta-atlanta.eventbrite.com (March 16-18, 2010)
Damon Atkins
2010-Mar-10 15:18 UTC
[zfs-discuss] Should ZFS write data out when disk are idle
> > For a RaidZ, when data is written to a disk, are the individual 32k writes
> > destined for the same disk joined together and written out as a single I/O to that disk?
>
> I/Os can be coalesced, but there is no restriction as to what can be coalesced.
> In other words, subsequent writes can also be coalesced if they are contiguous.
>
> > e.g. 128k for file a, 128k for file b, 128k for file c. When written out, does ZFS do
> > 32k+32k+32k I/Os to each disk, or will it do one 96k I/O if the space is available sequentially?

I should have written it this way: for a 5-disk RaidZ, does it do 5 x (32k(a) + 32k(b) + 32k(c)) I/Os to each disk, or will it attempt 5 x (96k(a+b+c)) combined larger I/Os to each disk, if all the allocated blocks for a, b and c are sequential on some or every physical disk?

> I'm not sure how one could write one 96KB physical I/O to three different disks?

I meant to a single disk: three sequential 32k I/Os targeted to the same disk become a single 96k I/O (raidz, or even if it was mirrored).

> -- richard

Given you have said ZFS will coalesce contiguous writes together ("targeted to an individual disk"?), what is the largest physical write ZFS will do to an individual disk?
-- 
This message posted from opensolaris.org
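The thread leaves that last question open. One place to look on OpenSolaris -- an assumption on my part, not an answer given by anyone above -- is the vdev I/O aggregation limit, which caps how large a coalesced write to a single disk can grow:

# Inspect the vdev aggregation cap (tunable name and availability are an
# assumption here; verify against your build's vdev_queue source).
echo "zfs_vdev_aggregation_limit/D" | mdb -k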