In this email, when I say PERC, I really mean either a PERC or any other hardware WriteBack-buffered RAID controller with a BBU.

For future server purchases, I want to know which is faster: (a) a bunch of hard disks with PERC and WriteBack enabled, or (b) a bunch of hard disks plus one SSD for ZIL. Unfortunately, I don't have an SSD available for testing. So here is what I was able to do:

I measured the write speed of the naked disks (PERC set to WriteThrough). Results were around 350 ops/sec.

I measured the write speed with WriteBack enabled. Results were around 1250 ops/sec.

So right from the start, we can see there's a huge performance boost from enabling WriteBack. Even for large sequential writes, buffering allows the disks to operate much more continuously. The next question is how it compares against an SSD ZIL.

Since I don't have an SSD available, I created a ram device and put the ZIL there. This is not a measure of the speed I would get with an SSD; rather, it is a measure which the SSD cannot possibly achieve. So it serves to establish an upper bound. If the upper and lower bounds are near each other, then we have a good estimate of the speed with an SSD ZIL; but if the upper bound and lower bound are far apart, we haven't gained much information.

With the ZIL in RAM, results were around 1700 ops/sec.

This measure is very far from the lower bound, but it still provides some useful knowledge. The take-home knowledge is:

* There's a lot of performance to be gained by accelerating the ZIL. Potentially up to 6x or 7x faster than naked disks.

* The WriteBack RAID controller achieves a lot of this performance increase. About 3x or 4x acceleration.

* I don't know how much an SSD would help. I don't know if it's better, the same, or worse than the PERC. I don't know if the combination of PERC and SSD together would go faster than either one individually.

I have a hypothesis. I think the best configuration will be to use a PERC with WriteBack enabled on all the spindle hard drives, but include an SSD for ZIL, and set the PERC to WriteThrough on the SSD. This has yet to be proven or disproven.
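For anyone wanting to reproduce the ram-device ZIL test above, a minimal sketch of the setup on OpenSolaris might look like the following; the pool name "tank" and the 2g ramdisk size are assumptions, not details given above.

# Create a ramdisk and attach it to the pool as a dedicated log (ZIL) device.
ramdiskadm -a zilbench 2g
zpool add tank log /dev/ramdisk/zilbench

# Watch the log device absorb the synchronous writes while the benchmark runs.
zpool iostat -v tank 5

# Tear it down afterwards (log device removal requires a recent zpool version).
zpool remove tank /dev/ramdisk/zilbench
ramdiskadm -d zilbench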
Edward Ned Harvey wrote:
> In this email, when I say PERC, I really mean either a PERC or any other hardware WriteBack-buffered RAID controller with a BBU.
>
> For future server purchases, I want to know which is faster: (a) a bunch of hard disks with PERC and WriteBack enabled, or (b) a bunch of hard disks plus one SSD for ZIL. Unfortunately, I don't have an SSD available for testing. So here is what I was able to do:
>
> I measured the write speed of the naked disks (PERC set to WriteThrough). Results were around 350 ops/sec.
>
> I measured the write speed with WriteBack enabled. Results were around 1250 ops/sec.
>
> So right from the start, we can see there's a huge performance boost from enabling WriteBack. Even for large sequential writes, buffering allows the disks to operate much more continuously. The next question is how it compares against an SSD ZIL.
>
> Since I don't have an SSD available, I created a ram device and put the ZIL there. This is not a measure of the speed I would get with an SSD; rather, it is a measure which the SSD cannot possibly achieve. So it serves to establish an upper bound. If the upper and lower bounds are near each other, then we have a good estimate of the speed with an SSD ZIL; but if the upper bound and lower bound are far apart, we haven't gained much information.
>
> With the ZIL in RAM, results were around 1700 ops/sec.
>
> This measure is very far from the lower bound, but it still provides some useful knowledge. The take-home knowledge is:
>
> * There's a lot of performance to be gained by accelerating the ZIL. Potentially up to 6x or 7x faster than naked disks.
>
> * The WriteBack RAID controller achieves a lot of this performance increase. About 3x or 4x acceleration.
>
> * I don't know how much an SSD would help. I don't know if it's better, the same, or worse than the PERC. I don't know if the combination of PERC and SSD together would go faster than either one individually.
>
> I have a hypothesis. I think the best configuration will be to use a PERC with WriteBack enabled on all the spindle hard drives, but include an SSD for ZIL, and set the PERC to WriteThrough on the SSD. This has yet to be proven or disproven.

Cache on controllers is almost always battery-backed DRAM (NVRAM having made an exit from the scene a while ago). As such, it has fabulous latency and throughput compared to anything else. From a performance standpoint, it will beat an SSD ZIL on a 1-for-1 basis. However, it's almost never the case that you find an HBA cache even a fraction of the size of a small SSD. So, what happens when you flood the HBA with more I/O than the on-board cache can handle? It reduces performance back to the level of NO cache.

From everything I've seen, an SSD wins simply because it's 20-100x the size. HBAs almost never have more than 512MB of cache, and even fancy SAN boxes generally have 1-2GB max. So, HBAs are subject to being overwhelmed with heavy I/O. The SSD ZIL has a much better chance of being able to weather a heavy I/O period without being filled. Thus, SSDs are better at "average" performance - they provide a relatively steady performance profile, whereas HBA cache is very spiky.

The other real advantage of an SSD ZIL is that it covers the entire pool. Most larger pools spread their disks over multiple controllers, each of which must have a cache in order for the whole pool to perform evenly.
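To put the sizing argument in concrete terms, here is a back-of-the-envelope sketch; the 200 MB/s incoming write rate and the 32 GB slog size are assumed example numbers, not measurements from this thread.

# How long each device can absorb a sustained burst of incoming writes
# before it is full (hypothetical 200 MB/s rate, sizes in MB).
RATE_MB_S=200
for dev in "HBA cache:512" "SSD slog:32768"; do
  name=${dev%:*}
  size_mb=${dev#*:}
  echo "$name (${size_mb}MB) absorbs roughly $((size_mb / RATE_MB_S)) seconds at ${RATE_MB_S}MB/s"
done

The exact numbers matter less than the ratio: the slog has roughly two orders of magnitude more room to ride out a burst, which is the 20-100x argument above in arithmetic form.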
If you know you're going to be doing a very intermittent or modest level of I/O, and that I/O is likely to fit within an HBA's cache, then it will outperform an SSD. For a continuous heavy load, or for extremely spiky loads (which are considerably in excess of the HBA's cache), an SSD will win handily. The key here is the hard drives - the faster they are, the faster the HBA will be able to empty its cache to disk, and the lower the likelihood it will get overwhelmed by new I/O.

All that said, I'm pretty sure most I/O patterns heavily favor an SSD ZIL over HBA cache. Note that the ZIL applies only to synchronous I/O, while HBA cache is for all writes. Also note that HBA cache can be used (or shared) as a read cache as well, depending on the HBA setup. And a good (SLC) SSD can handle 50,000 IOPS until it's filled, which takes a very long time relative to an HBA cache.

------

I've always wondered what the benefit (and difficulty to add to ZFS) would be of having an async write cache for ZFS - that is, ZFS currently buffers async writes in RAM until it decides to aggregate enough of them to flush to disk. I think it would be interesting to see what would happen if an async SSD cache was available, since the write pattern is "large, streaming", which means that the same devices useful for L2ARC would perform well as an async write cache. In essence, use the async write SSD as an extra-large buffer, in the same way that L2ARC on SSD supplements main memory for the read cache.

-- 
Erik Trimble
Java System Support
Mailstop: usca22-123
Phone: x17195
Santa Clara, CA
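One way to see the sync/async split described above is to watch a pool that already has a dedicated log device; a minimal sketch, with "tank" as a placeholder pool name:

# The log vdev line shows the synchronous (ZIL) traffic; the data vdevs show
# the asynchronous txg flushes. "tank" is a placeholder pool name.
zpool iostat -v tank 5

On a pool without a separate log device, the same ZIL traffic is interleaved onto the data vdevs, which is why the split is harder to observe there.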
Hi, Erik,

> I've always wondered what the benefit (and difficulty to add to ZFS) would be of having an async write cache for ZFS - that is, ZFS currently buffers async writes in RAM until it decides to aggregate enough of them to flush to disk. I think it would be interesting to see what would happen if an async SSD cache was available, since the write pattern is "large, streaming", which means that the same devices useful for L2ARC would perform well as an async write cache. In essence, use the async write SSD as an extra-large buffer, in the same way that L2ARC on SSD supplements main memory for the read cache.

I think HDDs can handle those "large and streaming" workloads quite well. You can observe that SSDs don't perform much better than HDDs on streaming writes/reads. So it should be good enough to flush the aggregated async write data back to HDD directly, instead of introducing another level of write cache.
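This claim is easy to check on one's own hardware; a rough sketch of a streaming-write comparison follows, with the mount points as placeholders and /dev/zero standing in for streaming data (disable compression on the test datasets, or the numbers will be meaningless):

# Streaming write throughput, HDD pool vs SSD, using a bandwidth-bound block size.
ptime dd if=/dev/zero of=/hddpool/streamtest bs=1024k count=8192
ptime dd if=/dev/zero of=/ssdpool/streamtest bs=1024k count=8192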
[moved off osol-discuss]

Zhu Han wrote:
> Hi, Erik,
>
> > I've always wondered what the benefit (and difficulty to add to ZFS) would be of having an async write cache for ZFS - that is, ZFS currently buffers async writes in RAM until it decides to aggregate enough of them to flush to disk. I think it would be interesting to see what would happen if an async SSD cache was available, since the write pattern is "large, streaming", which means that the same devices useful for L2ARC would perform well as an async write cache. In essence, use the async write SSD as an extra-large buffer, in the same way that L2ARC on SSD supplements main memory for the read cache.
>
> I think HDDs can handle those "large and streaming" workloads quite well. You can observe that SSDs don't perform much better than HDDs on streaming writes/reads. So it should be good enough to flush the aggregated async write data back to HDD directly, instead of introducing another level of write cache.

This is true. SSDs and HDs differ little in their ability to handle raw throughput. However, we often still see problems in ZFS associated with periodic system "pauses" where ZFS effectively monopolizes the HDs to write out its currently buffered I/O. People have been complaining about this for quite a while. SSDs have a huge advantage where IOPS are concerned, and given that the backing-store HDs have to service both read and write requests, they're severely limited in the number of IOPS they can give to incoming data.

You have a good point, but I'd still be curious to see what an async cache would do. After all, that is effectively what the HBA cache is, and we see a significant improvement with it, and not just for sync writes.

I also don't know what the threshold is in ZFS for it to consider it time to do an async buffer flush. Is it time based? % of RAM based? An absolute amount? All of that would impact whether an SSD async cache would be useful.

-- 
Erik Trimble
Java System Support
Mailstop: usca22-123
Phone: x17195
Santa Clara, CA
On Sat, Mar 6, 2010 at 12:50 PM, Erik Trimble <Erik.Trimble at sun.com> wrote:
> This is true. SSDs and HDs differ little in their ability to handle raw throughput. However, we often still see problems in ZFS associated with periodic system "pauses" where ZFS effectively monopolizes the HDs to write out its currently buffered I/O. People have been complaining about this for quite a while. SSDs have a huge advantage where IOPS are concerned, and given that the backing-store HDs have to service both read and write requests, they're severely limited in the number of IOPS they can give to incoming data.
>
> You have a good point, but I'd still be curious to see what an async cache would do. After all, that is effectively what the HBA cache is, and we see a significant improvement with it, and not just for sync writes.

I might see what you mean here. Because ZFS has to aggregate write data during a short period (the txg's lifetime) to avoid generating too many random HDD writes, the bandwidth of the HDDs during this time is wasted. For a write-heavy streaming workload, especially one that can easily saturate the HDD pool's bandwidth, ZFS will perform worse than legacy filesystems such as UFS or EXT3. The IOPS of the HDDs is not the limitation here; the bandwidth of the HDDs is the root cause.

This is a design choice of ZFS. Reducing the length of the txg commit period can alleviate the problem, so that the amount of data that needs to be flushed to disk each time is smaller. Replacing the HDDs with some high-end FC disks may also solve this problem.

> I also don't know what the threshold is in ZFS for it to consider it time to do an async buffer flush. Is it time based? % of RAM based? An absolute amount? All of that would impact whether an SSD async cache would be useful.

IMHO, ZFS flushes the data back to disk asynchronously every 5 seconds, which is the default txg commit period. ZFS will also flush the data back to disk before that period expires, based on an estimate of the amount of memory used by the current txg. This is called the write throttle. See this link:
http://blogs.sun.com/roch/entry/the_new_zfs_write_throttle
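For reference, the txg interval and the write-throttle limit mentioned above are exposed as kernel tunables on OpenSolaris; a minimal sketch of inspecting and (cautiously) overriding them, with the values shown purely as illustrations (tunable names can vary between builds, so check your source):

# Read the live values of the txg timeout and the write-limit override.
echo "zfs_txg_timeout/D" | mdb -k
echo "zfs_write_limit_override/E" | mdb -k

# Example /etc/system settings (take effect at the next boot) -- illustrative
# values only, not recommendations:
#   set zfs:zfs_txg_timeout = 5
#   set zfs:zfs_write_limit_override = 0x20000000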
On Mar 6, 2010, at 1:38 AM, Zhu Han wrote:
> On Sat, Mar 6, 2010 at 12:50 PM, Erik Trimble <Erik.Trimble at sun.com> wrote:
> > This is true. SSDs and HDs differ little in their ability to handle raw throughput. However, we often still see problems in ZFS associated with periodic system "pauses" where ZFS effectively monopolizes the HDs to write out its currently buffered I/O. People have been complaining about this for quite a while. SSDs have a huge advantage where IOPS are concerned, and given that the backing-store HDs have to service both read and write requests, they're severely limited in the number of IOPS they can give to incoming data.
> >
> > You have a good point, but I'd still be curious to see what an async cache would do. After all, that is effectively what the HBA cache is, and we see a significant improvement with it, and not just for sync writes.
>
> I might see what you mean here. Because ZFS has to aggregate write data during a short period (the txg's lifetime) to avoid generating too many random HDD writes, the bandwidth of the HDDs during this time is wasted. For a write-heavy streaming workload, especially one that can easily saturate the HDD pool's bandwidth, ZFS will perform worse than legacy filesystems such as UFS or EXT3. The IOPS of the HDDs is not the limitation here; the bandwidth of the HDDs is the root cause.

This statement is too simple, and thus does not represent reality very well. For a fully streaming workload where the load is near the capacity of the storage, the algorithms in ZFS will work to optimize the match. There is still some work to be done, but I don't believe UFS has beaten ZFS on Solaris in a significant streaming benchmark for several years now.

What we do see is that high-performance SSDs can saturate the SAS/SATA link for extended periods of time. For example, a Western Digital SiliconEdge Blue (a new, midrange model) can read at 250 MB/sec, in contrast to a WD RE4, which has a media transfer rate of 138 MB/sec. High-speed SSDs are already putting the hurt on 6Gbps SAS/SATA -- the Micron models claim 370 MB/sec sustained. Since this can be easily parallelized, expect that high-end SSDs will saturate whatever you can connect them to. This is one reason why the F5100 has 64 SAS channels for host connections.

> This is a design choice of ZFS. Reducing the length of the txg commit period can alleviate the problem, so that the amount of data that needs to be flushed to disk each time is smaller. Replacing the HDDs with some high-end FC disks may also solve this problem.

Properly matching I/O source and sink is still important; no file system can relieve you of that duty :-)

> I also don't know what the threshold is in ZFS for it to consider it time to do an async buffer flush. Is it time based? % of RAM based? An absolute amount? All of that would impact whether an SSD async cache would be useful.

The answer is "yes" to all of these questions, but there are many variables to consider, so YMMV.
 -- richard

ZFS storage and performance consulting at http://www.RichardElling.com
ZFS training on deduplication, NexentaStor, and NAS performance
http://nexenta-atlanta.eventbrite.com (March 16-18, 2010)
Edward Ned Harvey
2010-Mar-07 03:10 UTC
[zfs-discuss] [osol-discuss] WriteBack versus SSD-ZIL
> From everything I've seen, an SSD wins simply because it's 20-100x the
> size. HBAs almost never have more than 512MB of cache, and even fancy
> SAN boxes generally have 1-2GB max. So, HBAs are subject to being
> overwhelmed with heavy I/O. The SSD ZIL has a much better chance of
> being able to weather a heavy I/O period without being filled. Thus,
> SSDs are better at "average" performance - they provide a relatively
> steady performance profile, whereas HBA cache is very spiky.

This is a really good point. So you think I may actually get better performance by disabling the WriteBack on all the spindle disks and enabling it on the SSD instead. This is precisely the opposite of what I was thinking.

I'm planning to publish some more results soon, but haven't gathered it all yet. But see these:

Just naked disks, no acceleration:
http://nedharvey.com/iozone/iozone-DellPE2970-32G-3-mirrors-striped-WriteThrough.txt

Same configuration as above, but WriteBack enabled:
http://nedharvey.com/iozone/iozone-DellPE2970-32G-3-mirrors-striped-WriteBack.txt

Same configuration as the naked disks, but a ramdrive was created for ZIL:
http://nedharvey.com/iozone/iozone-DellPE2970-32G-3-mirrors-striped-ramZIL.txt

Using the ramdrive for ZIL, and also WriteBack enabled on the PERC:
http://nedharvey.com/iozone/iozone-DellPE2970-32G-3-mirrors-striped-WriteBack_and_ramZIL.txt

These results show that enabling WriteBack makes a huge performance difference (3-4x higher) for writes, compared to the naked disks. I don't think it's because an entire write operation fits into the HBA DRAM, or because the HBA remains unsaturated. The PERC has 256MB, but the test includes 8 threads all simultaneously writing separate 4G files in various sized chunks and patterns. I think that when the PERC RAM is full of stuff queued for write to disk, it's simply able to order, organize, and optimize the write operations to leverage the disks as much as possible.
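For reference, an iozone run matching the description above (8 threads, separate 4G files, a sweep of record sizes) might look roughly like this; the exact flags behind the published results are not given in the thread, and the /tank file paths are placeholders:

# Throughput mode: 8 threads, 4 GB per thread, write (-i 0) and read (-i 1),
# with -e including fsync time in the timing. Sweep a few record sizes.
for rs in 4k 64k 128k 1024k; do
  iozone -e -i 0 -i 1 -t 8 -s 4g -r $rs \
    -F /tank/f1 /tank/f2 /tank/f3 /tank/f4 /tank/f5 /tank/f6 /tank/f7 /tank/f8
done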
Damon Atkins
2010-Mar-08 04:10 UTC
[zfs-discuss] Should ZFS write data out when disk are idle
I think ZFS should look for more opportunities to write to disk, rather than leaving it to the last second (5 seconds) as it appears to do. For example, if a file has a record size worth of data outstanding, it should be queued within ZFS to be written out. If the record is updated again before a txg, it can be re-queued (if it has left the queue) and written to the same block or a new block. The write queue would empty when there is spare I/O bandwidth and spare cache capacity on the disks, determined through outstanding I/Os. Once the data is on disk it could be freed for re-use even before the txg has occurred, but checksum details would need to be recorded first. The txg then comes along after X seconds and finds that most of the data writes have already happened, and only metadata writes are left to do. One would assume this would help with the delays at txg time talked about in this thread.

The example below shows 28 x 128k writes to the same file before anything is written to disk, while the disks are idle the entire time. There is no cost to writing to disk if the disk is not doing anything or is under capacity. (Not a perfect example.)

At the other end, maybe access-time updates should not be written to disk until there is some real data to write, or until 30 minutes have passed, to allow green disks to power down for a while. (atime=on|off|delay)

Cheers

No dedup, but compression on:

while sleep 1 ; do echo `dd if=/dev/random of=xxxx bs=128k count=1 2>&1` ; done &
iostat -zxcnT d 1

     cpu
 us sy wt id
  0  5  0 94
                    extended device statistics
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
    0.0   53.0    0.0  301.5  0.0  0.2    0.0    3.4   0   4 c5t0d0
    0.0   53.0    0.0  301.5  0.0  0.2    0.0    3.1   0   4 c5t2d0
    0.0   58.0    0.0  127.0  0.0  0.0    0.0    0.1   0   0 c5t1d0
    0.0   58.0    0.0  127.0  0.0  0.0    0.0    0.1   0   0 c5t3d0
0+1 records in
0+1 records out
Monday, 8 March 2010 02:51:41 PM EST
     cpu
 us sy wt id
  0  4  0 96
                    extended device statistics
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
    0.0    3.0    0.0    2.0  0.0  0.0    0.0    0.5   0   0 c5t0d0
    0.0    3.0    0.0    2.0  0.0  0.0    0.0    0.5   0   0 c5t2d0
    0.0    1.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 c5t1d0
    0.0    1.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 c5t3d0
0+1 records in
0+1 records out
Monday, 8 March 2010 02:51:42 PM EST
     cpu
 us sy wt id
  1  3  0 96
                    extended device statistics
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
0+1 records in
0+1 records out

[... the same pattern repeats once per second from 02:51:43 through 02:52:09: the dd loop reports another 128k write each second, while iostat (-z suppresses idle devices) shows no disk activity at all ...]

Monday, 8 March 2010 02:52:10 PM EST
     cpu
 us sy wt id
  1  4  0 95
                    extended device statistics
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
    0.0   37.0    0.0  140.5  0.0  0.1    0.0    1.9   0   2 c5t0d0
    0.0   37.0    0.0  140.5  0.0  0.1    0.0    1.9   0   2 c5t2d0
    0.0   41.0    0.0   79.5  0.0  0.0    0.0    0.1   0   0 c5t1d0
    0.0   40.0    0.0   79.5  0.0  1.6    0.0   38.8   0  39 c5t3d0
0+1 records in
0+1 records out
Monday, 8 March 2010 02:52:11 PM EST
     cpu
 us sy wt id
  0  4  0 96
                    extended device statistics
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
    0.0    1.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 c5t0d0
    0.0    1.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 c5t2d0
    0.0    1.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 c5t1d0
    0.0    1.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 c5t3d0
Monday, 8 March 2010 02:52:12 PM EST
     cpu
 us sy wt id
  0  1  0 99
                    extended device statistics
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
0+1 records in
0+1 records out
-- 
This message posted from opensolaris.org
Bob Friesenhahn
2010-Mar-08 15:54 UTC
[zfs-discuss] Should ZFS write data out when disk are idle
On Sun, 7 Mar 2010, Damon Atkins wrote:
> The example below shows 28 x 128k writes to the same file before
> anything is written to disk, while the disks are idle the entire time.
> There is no cost to writing to disk if the disk is not doing
> anything or is under capacity. (Not a perfect example.)

ZFS can be tuned to write out transaction groups much more quickly if you like. It is not true that there is "no cost", though. Since ZFS uses COW, this approach requires that new blocks be allocated and written at a much higher rate. There is also an "opportunity cost" in that if a read comes in while these continuous writes are occurring, the read will be delayed.

There are many applications which continually write/overwrite file content, or which update a file at a slow pace. For example, log files are typically updated at a slow rate. Updating a block requires reading it first (if it is not already cached in the ARC), which can be quite expensive. By waiting a bit longer, there is a much better chance that the whole block is overwritten, so ZFS can discard the existing block on disk without bothering to re-read it.

Bob
-- 
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
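The read-before-partial-overwrite point can be seen with a rough sketch like the one below; "tank/demo" is a placeholder dataset, and for a clean measurement the file's blocks should not already be sitting in the ARC (e.g. export/import the pool between steps):

# Dataset with the default 128k recordsize.
zfs create -o recordsize=128k tank/demo

# Create a 128 MB file.
dd if=/dev/urandom of=/tank/demo/file bs=128k count=1024

# Overwrite it in 8k pieces: each write touches only part of a 128k record,
# so uncached blocks must be read back before they can be rewritten
# (watch the read columns in "iostat -xnz 1").
dd if=/dev/urandom of=/tank/demo/file bs=8k count=16384 conv=notrunc

# Overwrite it in full 128k records: whole blocks are replaced, so the old
# blocks can be freed without being re-read.
dd if=/dev/urandom of=/tank/demo/file bs=128k count=1024 conv=notrunc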
Damon Atkins
2010-Mar-09 12:45 UTC
[zfs-discuss] Should ZFS write data out when disk are idle
I am talking about having a write queue which points to ready-to-write full stripes. Ready-to-write full stripes would be:

* The last byte of the full stripe has been updated.
* The file has been closed for writing. (Exception to the above rule.)

I believe there is now a scheduler in ZFS to handle read and write conflicts. For example, on a large multi-gigabyte NVRAM array, the only big considerations are how big the Fibre Channel pipe is and the limit on outstanding I/Os. But on SATA off the motherboard, it is about how much RAM cache each disk has, as well as the speed of the SATA connection and the number of outstanding I/Os.

When it comes time to do a txg, some of the record blocks (most of the full 128k ones) will already have been written out. If we have only written out full record blocks, then there is no performance loss. Eventually a txg is going to happen, and eventually these full writes will need to happen, but if we can choose a less busy time for them, all the better.

e.g. on a raidz with 5 disks, if I have 4 x 128k worth of data to write, let's write it. On a mirror, if I have 128k worth to write, let's write it (record size 128k). Or let it be a tunable for the zpool, as some arrays (RAID5) like to have larger chunks of data. Why wait for the txg, and the resulting pause every 30 seconds, if the disks are not being pressured for reads?

Bob wrote (I may not have explained it well enough):
> It is not true that there is "no cost", though. Since ZFS uses COW,
> this approach requires that new blocks be allocated and written at a
> much higher rate. There is also an "opportunity cost" in that if a
> read comes in while these continuous writes are occurring, the read
> will be delayed.

At some stage a write needs to happen. **Full** writes have a very small COW cost compared with small writes. As I said above, I am talking about a write of 4 x 128k on a 5-disk raidz before the write would happen early.

> There are many applications which continually write/overwrite file
> content, or which update a file at a slow pace. For example, log
> files are typically updated at a slow rate. Updating a block requires
> reading it first (if it is not already cached in the ARC), which can
> be quite expensive. By waiting a bit longer, there is a much better
> chance that the whole block is overwritten, so ZFS can discard the
> existing block on disk without bothering to re-read it.

Apps which update at a slow pace will not trigger the above early write until they have written at least a record size worth of data; applications which write less than 128k (recordsize) in 30 seconds will never trigger the early write on a mirrored disk or even a raidz setup. What this will catch is the big writers: files greater than 128k (recordsize) on mirrored disks, and files larger than 4 x 128k on 5-disk RaidZ sets. So commands like dd if=x of=y bs=512k will not cause issues (pauses/delays) when the txg times out.

PS: I have already set zfs:zfs_write_limit_override, and I would not recommend anyone set it very low to get the above effect. It's just an idea on how to prevent the delay effect; it may not be practical?
-- 
This message posted from opensolaris.org
Damon Atkins
2010-Mar-10 02:13 UTC
[zfs-discuss] Should ZFS write data out when disk are idle
Sorry - a full stripe on a RaidZ is the recordsize, i.e. if the record size is 128k on a RaidZ made up of 5 disks, then the 128k is spread across 4 disks, with the calculated parity on the 5th disk, which means the writes are 32k to each disk.

For a RaidZ, when data is written to a disk, are the individual 32k writes destined for the same disk joined together and written out as a single I/O to that disk?

e.g. 128k for file a, 128k for file b, 128k for file c. When written out, does ZFS do 32k+32k+32k I/Os to each disk, or will it do one 96k I/O if the space is available sequentially?

Cheers
-- 
This message posted from opensolaris.org
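The per-disk arithmetic above generalizes to any single-parity raidz width; a trivial sketch, ignoring padding and partial records:

# Data chunk per disk for one full record on single-parity raidz:
# recordsize / (disks - 1 parity disk). 128k on 5 disks -> 32k per data disk.
RECORDSIZE_K=128
for DISKS in 3 5 9; do
  echo "raidz of $DISKS disks: $((RECORDSIZE_K / (DISKS - 1)))k per data disk"
done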
Richard Elling
2010-Mar-10 04:29 UTC
[zfs-discuss] Should ZFS write data out when disk are idle
On Mar 9, 2010, at 6:13 PM, Damon Atkins wrote:
> Sorry - a full stripe on a RaidZ is the recordsize, i.e. if the record size is 128k on a RaidZ made up of 5 disks, then the 128k is spread across 4 disks, with the calculated parity on the 5th disk, which means the writes are 32k to each disk.

Nominally.

> For a RaidZ, when data is written to a disk, are the individual 32k writes destined for the same disk joined together and written out as a single I/O to that disk?

I/Os can be coalesced, but there is no restriction as to what can be coalesced. In other words, subsequent writes can also be coalesced if they are contiguous.

> e.g. 128k for file a, 128k for file b, 128k for file c. When written out, does ZFS do
> 32k+32k+32k I/Os to each disk, or will it do one 96k I/O if the space is available sequentially?

I'm not sure how one could write one 96KB physical I/O to three different disks?
 -- richard

ZFS storage and performance consulting at http://www.RichardElling.com
ZFS training on deduplication, NexentaStor, and NAS performance
http://nexenta-atlanta.eventbrite.com (March 16-18, 2010)
Damon Atkins
2010-Mar-10 15:18 UTC
[zfs-discuss] Should ZFS write data out when disk are idle
> > For a RaidZ, when data is written to a disk, are the individual 32k writes
> > destined for the same disk joined together and written out as a single I/O to that disk?
>
> I/Os can be coalesced, but there is no restriction as to what can be coalesced.
> In other words, subsequent writes can also be coalesced if they are contiguous.
>
> > e.g. 128k for file a, 128k for file b, 128k for file c. When written out, does ZFS do
> > 32k+32k+32k I/Os to each disk, or will it do one 96k I/O if the space is available sequentially?

I should have written it this way: for a 5-disk RaidZ, does it do 5 x (32k(a) + 32k(b) + 32k(c)) I/Os to each disk, or will it attempt 5 x (96k(a+b+c)) combined larger I/Os to each disk, if all the allocated blocks for a, b and c are sequential on some or every physical disk?

> I'm not sure how one could write one 96KB physical I/O to three different disks?

I meant to a single disk: three sequential 32k I/Os targeted to the same disk become a single 96k I/O (raidz, or even if it was mirrored).

> -- richard

Given you have said ZFS will coalesce contiguous writes together ("targeted to an individual disk"?), what is the largest physical write ZFS will do to an individual disk?
-- 
This message posted from opensolaris.org
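The thread leaves that last question open. One place to look on OpenSolaris -- an assumption on my part, not an answer given by anyone above -- is the vdev I/O aggregation limit, which caps how large a coalesced write to a single disk can grow:

# Inspect the vdev aggregation cap (tunable name and availability are an
# assumption here; verify against your build's vdev_queue source).
echo "zfs_vdev_aggregation_limit/D" | mdb -k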