Jim Klimov
2012-Jan-07 12:47 UTC
[zfs-discuss] ZIL on a dedicated HDD slice (1-2 disk systems)
Hello all,

For smaller systems such as laptops or low-end servers, which can house 1-2 disks, would it make sense to dedicate a 2-4Gb slice to the ZIL for the data pool, separate from rpool? Example layout (single-disk or mirrored):

  s0 - 16Gb - rpool
  s1 -  4Gb - data-zil
  s3 -  *Gb - data pool

The idea would be to decrease fragmentation (committed writes to data pool would be more coalesced) and to keep the ZIL at faster tracks of the HDD drive. I'm actually more interested in the former: would the dedicated ZIL decrease fragmentation of the pool?

Likewise, for larger pools (such as my 6-disk raidz2), can fragmentation and/or performance benefit from some dedicated ZIL slices (i.e. s0 = 1-2Gb ZIL per 2Tb disk, with 3 mirrored ZIL sets overall)? Can several ZIL (mirrors) be concatenated for a single data pool, or can only one dedicated ZIL vdev be used?

Thanks,
//Jim Klimov
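For concreteness, the layout asked about above would translate into something like the following commands. This is only a rough sketch of the question, not a recommendation: the pool name "data" and the device names c0t0d0/c0t1d0 are hypothetical, and the slices are assumed to have been created with format(1M) beforehand.

  # rpool lives on s0 as usual; data pool on s3 of both disks, mirrored
  zpool create data mirror c0t0d0s3 c0t1d0s3

  # attach the 4Gb s1 slices as a mirrored dedicated log (slog) vdev
  zpool add data log mirror c0t0d0s1 c0t1d0s1

  # single-disk variant: one unmirrored log slice
  # zpool add data log c0t0d0s1

  # as far as I know, more than one log vdev can be added to a pool and
  # writes are spread across them, e.g. for the 6-disk case:
  # zpool add data log mirror c0t0d0s0 c0t1d0s0 mirror c0t2d0s0 c0t3d0s0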
Edward Ned Harvey
2012-Jan-07 15:12 UTC
[zfs-discuss] ZIL on a dedicated HDD slice (1-2 disk systems)
> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-
> bounces at opensolaris.org] On Behalf Of Jim Klimov
>
> For smaller systems such as laptops or low-end servers,
> which can house 1-2 disks, would it make sense to dedicate
> a 2-4Gb slice to the ZIL for the data pool, separate from
> rpool? Example layout (single-disk or mirrored):
>
> The idea would be to decrease fragmentation (committed
> writes to data pool would be more coalesced) and to keep
> the ZIL at faster tracks of the HDD drive.

I'm not authoritative, I'm speaking from memory of former discussions on this list and various sources of documentation.

No, it won't help you.

First of all, all your writes to the storage pool are aggregated, so you're already minimizing fragmentation of writes in your main pool. However, over time, as snapshots are created & destroyed, small changes are made to files, and file contents are overwritten incrementally and internally... The only fragmentation you get creeps in as a result of COW. This fragmentation only impacts sequential reads of files which were previously written in random order. This type of fragmentation has no relation to ZIL or writes.

If you don't split out your ZIL separate from the storage pool, zfs already chooses disk blocks that it believes to be optimized for minimal access time. In fact, I believe, zfs will dedicate a few sectors at the low end, a few at the high end, and various other locations scattered throughout the pool, so whatever the current head position, it tries to go to the closest "landing zone" that's available for ZIL writes. If anything, splitting out your ZIL to a different partition might actually hurt your performance.

Also, the concept of "faster tracks of the HDD" is also incorrect. Yes, there was a time when HDD speeds were limited by rotational speed and magnetic density, so the outer tracks of the disk could serve up more data because more magnetic material passed over the head in each rotation. But nowadays, the hard drive sequential speed is limited by the head speed, which is invariably right around 1Gbps. So the inner and outer sectors of the HDD are equally fast - the outer sectors are actually less magnetically dense because the head can't handle it. And the random IO speed is limited by head seek + rotational latency, where seek is typically several times longer than latency.

So basically, the only thing that matters, to optimize the performance of any modern typical HDD, is to minimize the head travel. You want to be seeking sectors which are on tracks that are nearby to the present head position.

Of course, if you want to test & benchmark the performance of splitting apart the ZIL to a different partition, I encourage that. I'm only speaking my beliefs based on my understanding of the architectures and limitations involved. This is my best prediction. And I've certainly been wrong before. ;-) Sometimes, being wrong is my favorite thing, because you learn so much from it. ;-)
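For anyone who takes up that benchmarking suggestion, a minimal A/B sketch might look like the following. The pool name "data" and the slice c0t0d0s1 are hypothetical, and removing the log device again requires a pool version recent enough to support slog removal.

  # Baseline: run the sync-heavy workload with the ZIL inside the main pool,
  # watching how the I/O is spread across the vdevs
  zpool iostat -v data 5

  # Add the slice as a dedicated log device and repeat the same workload
  zpool add data log c0t0d0s1
  zpool iostat -v data 5

  # Return to the baseline configuration
  zpool remove data c0t0d0s1

Comparing throughput and read latency between the two runs should show whether the separate slice helps or hurts on a single spindle.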
Richard Elling
2012-Jan-08 01:45 UTC
[zfs-discuss] ZIL on a dedicated HDD slice (1-2 disk systems)
On Jan 7, 2012, at 7:12 AM, Edward Ned Harvey wrote:

>> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-
>> bounces at opensolaris.org] On Behalf Of Jim Klimov
>>
>> For smaller systems such as laptops or low-end servers,
>> which can house 1-2 disks, would it make sense to dedicate
>> a 2-4Gb slice to the ZIL for the data pool, separate from
>> rpool? Example layout (single-disk or mirrored):
>>
>> The idea would be to decrease fragmentation (committed
>> writes to data pool would be more coalesced) and to keep
>> the ZIL at faster tracks of the HDD drive.
>
> I'm not authoritative, I'm speaking from memory of former discussions on
> this list and various sources of documentation.
>
> No, it won't help you.

Correct :-)

> First of all, all your writes to the storage pool are aggregated, so you're
> already minimizing fragmentation of writes in your main pool. However, over
> time, as snapshots are created & destroyed, small changes are made to files,
> and file contents are overwritten incrementally and internally... The only
> fragmentation you get creeps in as a result of COW. This fragmentation only
> impacts sequential reads of files which were previously written in random
> order. This type of fragmentation has no relation to ZIL or writes.
>
> If you don't split out your ZIL separate from the storage pool, zfs already
> chooses disk blocks that it believes to be optimized for minimal access
> time. In fact, I believe, zfs will dedicate a few sectors at the low end, a
> few at the high end, and various other locations scattered throughout the
> pool, so whatever the current head position, it tries to go to the closest
> "landing zone" that's available for ZIL writes. If anything, splitting out
> your ZIL to a different partition might actually hurt your performance.
>
> Also, the concept of "faster tracks of the HDD" is also incorrect. Yes,
> there was a time when HDD speeds were limited by rotational speed and
> magnetic density, so the outer tracks of the disk could serve up more data
> because more magnetic material passed over the head in each rotation. But
> nowadays, the hard drive sequential speed is limited by the head speed,
> which is invariably right around 1Gbps. So the inner and outer sectors of
> the HDD are equally fast - the outer sectors are actually less magnetically
> dense because the head can't handle it. And the random IO speed is limited
> by head seek + rotational latency, where seek is typically several times
> longer than latency.

Disagree. My data, and the vendor specs, continue to show different sequential
media bandwidth speed for inner vs outer cylinders.

> So basically, the only thing that matters, to optimize the performance of
> any modern typical HDD, is to minimize the head travel. You want to be
> seeking sectors which are on tracks that are nearby to the present head
> position.
>
> Of course, if you want to test & benchmark the performance of splitting
> apart the ZIL to a different partition, I encourage that. I'm only speaking
> my beliefs based on my understanding of the architectures and limitations
> involved. This is my best prediction. And I've certainly been wrong
> before. ;-) Sometimes, being wrong is my favorite thing, because you learn
> so much from it. ;-)

Good idea. I think you will see a tradeoff on the read side of the mixed read/write workload. Sync writes have higher priority than reads so the order of I/O sent to the disk will appear to be very random and not significantly coalesced.
This is the pathological worst case workload for an HDD.

OTOH, you're not trying to get high performance from an HDD, are you? That game is over.

-- richard

--
ZFS and performance consulting
http://www.RichardElling.com
illumos meetup, Jan 10, 2012, Menlo Park, CA
http://www.meetup.com/illumos-User-Group/events/41665962/
Edward Ned Harvey
2012-Jan-08 14:56 UTC
[zfs-discuss] ZIL on a dedicated HDD slice (1-2 disk systems)
> From: Richard Elling [mailto:richard.elling at gmail.com]
>
> > Also, the concept of "faster tracks of the HDD" is also incorrect. Yes,
> > there was a time when HDD speeds were limited by rotational speed and
> > magnetic density, so the outer tracks of the disk could serve up more data
> > because more magnetic material passed over the head in each rotation. But
> > nowadays, the hard drive sequential speed is limited by the head speed,
> > which is invariably right around 1Gbps. So the inner and outer sectors of
> > the HDD are equally fast - the outer sectors are actually less magnetically
> > dense because the head can't handle it. And the random IO speed is limited
> > by head seek + rotational latency, where seek is typically several times
> > longer than latency.
>
> Disagree. My data, and the vendor specs, continue to show different
> sequential media bandwidth speed for inner vs outer cylinders.

Any reference?

I know, as I sit and dd from some disk | pv > /dev/null, it will tell me something like 1.0Gbps... I periodically check its progress while it's in progress, and while it varies a little (say, sometimes 1.0, 1.1, 1.2) it goes up and down throughout the process. There is no noticeable difference between the early, mid, and late behavior, sequentially reading the whole disk.

If the performance of the outer tracks is better than the performance of the inner tracks due to limitations of magnetic density or rotation speed (not being limited by the head speed or bus speed), then the sequential performance of the drive should increase as a square function, going toward the outer tracks. c = pi * r^2

It is my belief, based on specs I've previously looked at, that manufacturers break the drive down into zones. So, something like the inner 20% of the tracks will have magnetic layout pattern A, and the next 20% will have magnetic layout pattern B, and so forth... Within a single magnetic layout pattern, jumping from individual track to individual track can yield a difference of performance, but it's not a huge step from one to the next. And when you transition from layout pattern to layout pattern, the pattern just repeats itself again. They're trying to optimize, to a first order, to ensure the performance limitations are mostly caused by head and/or bus speed. If those are the bottlenecks, let them be the bottlenecks, and at least solve all the other problems that are solvable.

So, small variations of sequential performance are possible, jumping from track to track, but based on what I've seen, the maximum performance difference from the absolute slowest track to the absolute fastest track (which may or may not have any relation to inner vs outer) ... maximum variation on-par with 10% performance difference. Not a square function.

> OTOH, you're not trying to get high performance from an HDD, are you? That
> game is over.

Lots of us still have to live with HDDs, due to capacity and cost requirements. We accept a relative definition of "high performance," and still want to get all the performance we can out of whatever device we're using. Even if there exists a faster device somewhere in the world.

Also, for sequential performance, HDDs are on-par with, and often better than, SSDs. (For now.) While many SSDs publish specs including something like "220 MB/s", which is higher than HDDs can reach... SSDs publish their maximum performance, which is not typical performance. After you use them for a month, they slow down.
Often to half, or worse, of the speed they were originally able to run at. Which is... as I say... on-par with, or worse than, the sequential speed of an HDD.

Even crappy SSDs can have random IO worse than HDDs. Just benchmark any high-cost top-tier USB3 flash memory stick, and you'll see what I mean. ;-) The only SSDs that are faster than HDDs in any way are *actual* internal sas/sata/etc SSDs, which are faster than HDDs in terms of random IOPS and maybe sequential throughput.
Jim Klimov
2012-Jan-08 16:34 UTC
[zfs-discuss] ZIL on a dedicated HDD slice (1-2 disk systems)
2012-01-08 18:56, Edward Ned Harvey wrote:

>> From: Richard Elling [mailto:richard.elling at gmail.com]
>>
>> Disagree. My data, and the vendor specs, continue to show different
>> sequential media bandwidth speed for inner vs outer cylinders.
>
> Any reference?

Well, Richard's data matches mine from tests of my HDDs at home: I read in some 10-GB blocks at different offsets (dd > /dev/null), and "linear" speeds dropped from about 150 MB/s to about 80-100 MB/s. This was tested on a relatively modern 2TB Seagate drive.

Random IOs are still crappy on mechanical drives, often under 10 MB/s ;)

//Jim
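The kind of measurement described above could be reproduced with a rough sketch like the following, using Solaris dd options (iseek counts blocks of the given block size). The raw device name and the offsets are hypothetical and sized for a ~2TB drive; adjust them for the disk actually being measured.

  # ~10GB sequential read near the start of the disk (outer cylinders)
  ptime dd if=/dev/rdsk/c0t1d0s2 of=/dev/null bs=1024k iseek=0 count=10240

  # the same amount roughly 1TB into the drive
  ptime dd if=/dev/rdsk/c0t1d0s2 of=/dev/null bs=1024k iseek=1000000 count=10240

  # and near the end of the drive (inner cylinders)
  ptime dd if=/dev/rdsk/c0t1d0s2 of=/dev/null bs=1024k iseek=1890000 count=10240

Dividing the ~10GB read by each elapsed time gives the sustained rate at the three positions.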
Bob Friesenhahn
2012-Jan-08 18:21 UTC
[zfs-discuss] ZIL on a dedicated HDD slice (1-2 disk systems)
On Sat, 7 Jan 2012, Edward Ned Harvey wrote:

> If you don't split out your ZIL separate from the storage pool, zfs already
> chooses disk blocks that it believes to be optimized for minimal access
> time. In fact, I believe, zfs will dedicate a few sectors at the low end, a
> few at the high end, and various other locations scattered throughout the
> pool, so whatever the current head position, it tries to go to the closest
> "landing zone" that's available for ZIL writes. If anything, splitting out
> your ZIL to a different partition might actually hurt your performance.

Something else to be aware of is that even if you don't have a dedicated ZIL device, zfs will create a ZIL using devices in the main pool so there is always a ZIL, even if you don't see it.

Also, the ZIL is only used to record pending small writes. Larger writes (I think 128K or more) are written to their pre-allocated final location in the main pool. This choice is made since the purpose of the ZIL is to minimize random I/O to disk, and writing large amounts of data to the ZIL would create a bandwidth bottleneck.

There are postings by Matt Ahrens to this list (and elsewhere) which provide an accurate description of how the ZIL works.

Bob
--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
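As far as I can tell, the cut-off Bob describes corresponds to the zfs_immediate_write_sz tunable in the Solaris/illumos kernel (32K by default in the sources I have seen, with logbias also influencing the decision); the exact value and behaviour may differ between releases. One way to check it on a live system might be:

  # print the current threshold (bytes) from the running kernel
  echo zfs_immediate_write_sz/D | mdb -k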
Casper.Dik at oracle.com
2012-Jan-08 18:39 UTC
[zfs-discuss] ZIL on a dedicated HDD slice (1-2 disk systems)
>If the performance of the outer tracks is better than the performance of the
>inner tracks due to limitations of magnetic density or rotation speed (not
>being limited by the head speed or bus speed), then the sequential
>performance of the drive should increase as a square function, going toward
>the outer tracks. c = pi * r^2

Decrease, because the outer tracks are the lower numbered tracks; they have the same density but they are larger.

>So, small variations of sequential performance are possible, jumping from
>track to track, but based on what I've seen, the maximum performance
>difference from the absolute slowest track to the absolute fastest track
>(which may or may not have any relation to inner vs outer) ... maximum
>variation on-par with 10% performance difference. Not a square function.

I've noticed a change of 50% in speed or more between the lower and the higher numbers (60 MB/s to 30 MB/s).

In benchmark land, they short-stroke disks for better performance; I believe the Pillar boxes do similar tricks under the covers (if you want more performance, it gives you the faster tracks).

Casper
Darren J Moffat
2012-Jan-09 10:34 UTC
[zfs-discuss] ZIL on a dedicated HDD slice (1-2 disk systems)
On 01/08/12 18:21, Bob Friesenhahn wrote:

> Something else to be aware of is that even if you don't have a dedicated
> ZIL device, zfs will create a ZIL using devices in the main pool so

Terminology nit: the log device is a SLOG. Every ZFS dataset has a ZIL. Where the ZIL writes go for a given dataset (slog or main pool devices) is determined by a combination of things including (but not limited to) the presence of a SLOG device, the logbias property, and the size of the data.

--
Darren J Moffat
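For example, logbias is set per dataset; with logbias=throughput, ZIL writes for that dataset go to the main pool devices rather than any configured log device. A quick sketch, with a hypothetical pool/dataset name:

  # default is latency: use a slog if one is present
  zfs get logbias data/db

  # send this dataset's ZIL traffic to the main pool devices instead
  zfs set logbias=throughput data/db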
Jim Klimov
2012-Jan-09 13:38 UTC
[zfs-discuss] ZIL on a dedicated HDD slice (1-2 disk systems)
2012-01-08 5:45, Richard Elling wrote:

> I think you will see a tradeoff on the read side of the mixed read/write workload.
> Sync writes have higher priority than reads so the order of I/O sent to the disk
> will appear to be very random and not significantly coalesced. This is the
> pathological worst case workload for an HDD.

I guess this is what I'm trying to combat when thinking about a dedicated ZIL (SLOG device) in order to reduce the pool's fragmentation.

It is my understanding (which may be wrong and often is) that without a dedicated SLOG:

1) Sync writes will land on disk randomly into nearest (to disk heads) available blocks, in order to have them committed ASAP;

2) Coalesced writes (at TXG sync) may have intermixed data and metadata blocks, of which metadata may soon expire due to whatever updates, snapshots or deletions involving the blocks this metadata references. If this is true, then after a while there will be many "available" cheese-holes from expired metadata among larger data blocks.

3) Now, this might be further complicated (or relieved) if the metadata blocks are stored in separate groupings from the "bulk" user-data, which I don't know about yet. In that case it would be easier for ZFS to prefetch metadata from disk in one IO (as we discussed in another thread), as well as to effectively reuse the small cheese-holes from freed older metadata blocks.

---

If any of the above is true, then it is my "blind expectation" that a dedicated ZIL/SLOG area would decrease fragmentation at least due to sync writes of metadata, and possibly of data, into nearest HDD locations. Again, this is based on my possibly wrong understanding that the blocks committed to a SLOG would be neatly recommitted to the main pool during a TXG close with coalesced writes.

I do understand the argument that if the SLOG is dedicated from a certain area on the same HDD, then in fact this would be slowing down the writes by creating more random IO and extra seeks. But as a trade-off I hope for more linear, faster reads, including pool import, scrubbing and ZDB walks, and less fragmented free space.

Is there any truth to these words? ;)

Thanks,
//Jim Klimov
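One way to get a feel for how chopped-up the free space actually is, before and after such an experiment, might be zdb's metaslab report; a rough sketch, assuming a pool named "data" (the exact flags and output format vary between builds):

  # summarize metaslab allocations; the second -m adds per-metaslab
  # space map detail (many small free segments = fragmented free space)
  zdb -mm data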
Edward Ned Harvey
2012-Jan-09 13:55 UTC
[zfs-discuss] ZIL on a dedicated HDD slice (1-2 disk systems)
> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-
> bounces at opensolaris.org] On Behalf Of Jim Klimov
>
> 1) Sync writes will land on disk randomly into nearest
> (to disk heads) available blocks, in order to have them
> committed ASAP;

This is true - but you need to make the distinction - if you don't have a dedicated slog, and you haven't disabled the zil, then the sync writes you're talking about land in dedicated zil sectors of the disk. This is write-only space; consider it temporary. The only time it will ever be read is after an ungraceful system reboot, when the system will scan these sectors to see if anything is there.

As soon as the sync writes are written to the zil, they become async writes, which are buffered in memory with all the other async writes, and they will be written *again* into permanent storage in the main pool. At that point, the previously written copy in the zil becomes irrelevant.

> If any of the above is true, then it is my "blind
> expectation" that a dedicated ZIL/SLOG area would
> decrease fragmentation at least due to sync writes

Sync writes to the zil aren't causing fragmentation, because they're only temporary writes as long as they're in sync mode. Then they become async mode, and they will be aggregated with all the other async writes.

This isn't saying fragmentation doesn't happen. It's just saying there's no special relationship between sync mode and fragmentation.
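On a pool that does have a separate log device, this two-step behaviour can be watched directly; a small sketch, with a hypothetical pool name. Sync writes show up against the log vdev first, and roughly the same data shows up again a few seconds later when the txg is committed to the main vdevs.

  # per-vdev I/O, sampled every second while a sync-write workload runs
  zpool iostat -v data 1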