Jim Klimov
2012-Jan-07 12:47 UTC
[zfs-discuss] ZIL on a dedicated HDD slice (1-2 disk systems)
Hello all,

For smaller systems such as laptops or low-end servers, which can house 1-2 disks, would it make sense to dedicate a 2-4Gb slice to the ZIL for the data pool, separate from rpool? Example layout (single-disk or mirrored):

  s0 - 16Gb - rpool
  s1 -  4Gb - data-zil
  s3 -  *Gb - data pool

The idea would be to decrease fragmentation (committed writes to data pool would be more coalesced) and to keep the ZIL at faster tracks of the HDD drive. I'm actually more interested in the former: would the dedicated ZIL decrease fragmentation of the pool?

Likewise, for larger pools (such as my 6-disk raidz2), can fragmentation and/or performance benefit from some dedicated ZIL slices (i.e. s0 = 1-2Gb ZIL per 2Tb disk, with 3 mirrored ZIL sets overall)? Can several ZIL (mirrors) be concatenated for a single data pool, or can only one dedicated ZIL vdev be used?

Thanks,
//Jim Klimov
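For concreteness, the layout asked about above would translate into something like the following commands. This is only a rough sketch of the question, not a recommendation: the pool name "data" and the device names c0t0d0/c0t1d0 are hypothetical, and the slices are assumed to have been created with format(1M) beforehand.

  # rpool lives on s0 as usual; data pool on s3 of both disks, mirrored
  zpool create data mirror c0t0d0s3 c0t1d0s3

  # attach the 4Gb s1 slices as a mirrored dedicated log (slog) vdev
  zpool add data log mirror c0t0d0s1 c0t1d0s1

  # single-disk variant: one unmirrored log slice
  # zpool add data log c0t0d0s1

  # as far as I know, more than one log vdev can be added to a pool and
  # writes are spread across them, e.g. for the 6-disk case:
  # zpool add data log mirror c0t0d0s0 c0t1d0s0 mirror c0t2d0s0 c0t3d0s0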
Edward Ned Harvey
2012-Jan-07 15:12 UTC
[zfs-discuss] ZIL on a dedicated HDD slice (1-2 disk systems)
> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-
> bounces at opensolaris.org] On Behalf Of Jim Klimov
>
> For smaller systems such as laptops or low-end servers,
> which can house 1-2 disks, would it make sense to dedicate
> a 2-4Gb slice to the ZIL for the data pool, separate from
> rpool? Example layout (single-disk or mirrored):
>
> The idea would be to decrease fragmentation (committed
> writes to data pool would be more coalesced) and to keep
> the ZIL at faster tracks of the HDD drive.

I'm not authoritative, I'm speaking from memory of former discussions on this list and various sources of documentation.

No, it won't help you.

First of all, all your writes to the storage pool are aggregated, so you're already minimizing fragmentation of writes in your main pool. However, over time, as snapshots are created & destroyed, small changes are made to files, and file contents are overwritten incrementally and internally... The only fragmentation you get creeps in as a result of COW. This fragmentation only impacts sequential reads of files which were previously written in random order. This type of fragmentation has no relation to ZIL or writes.

If you don't split out your ZIL separate from the storage pool, zfs already chooses disk blocks that it believes to be optimized for minimal access time. In fact, I believe, zfs will dedicate a few sectors at the low end, a few at the high end, and various other locations scattered throughout the pool, so whatever the current head position, it tries to go to the closest "landing zone" that's available for ZIL writes. If anything, splitting out your ZIL to a different partition might actually hurt your performance.

Also, the concept of "faster tracks of the HDD" is also incorrect. Yes, there was a time when HDD speeds were limited by rotational speed and magnetic density, so the outer tracks of the disk could serve up more data because more magnetic material passed over the head in each rotation. But nowadays, the hard drive sequential speed is limited by the head speed, which is invariably right around 1Gbps. So the inner and outer sectors of the HDD are equally fast - the outer sectors are actually less magnetically dense because the head can't handle it. And the random IO speed is limited by head seek + rotational latency, where seek is typically several times longer than latency.

So basically, the only thing that matters, to optimize the performance of any modern typical HDD, is to minimize the head travel. You want to be seeking sectors which are on tracks that are nearby to the present head position.

Of course, if you want to test & benchmark the performance of splitting apart the ZIL to a different partition, I encourage that. I'm only speaking my beliefs based on my understanding of the architectures and limitations involved. This is my best prediction. And I've certainly been wrong before. ;-) Sometimes, being wrong is my favorite thing, because you learn so much from it. ;-)
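For anyone who takes up that benchmarking suggestion, a minimal A/B sketch might look like the following. The pool name "data" and the slice c0t0d0s1 are hypothetical, and removing the log device again requires a pool version recent enough to support slog removal.

  # Baseline: run the sync-heavy workload with the ZIL inside the main pool,
  # watching how the I/O is spread across the vdevs
  zpool iostat -v data 5

  # Add the slice as a dedicated log device and repeat the same workload
  zpool add data log c0t0d0s1
  zpool iostat -v data 5

  # Return to the baseline configuration
  zpool remove data c0t0d0s1

Comparing throughput and read latency between the two runs should show whether the separate slice helps or hurts on a single spindle.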
Richard Elling
2012-Jan-08 01:45 UTC
[zfs-discuss] ZIL on a dedicated HDD slice (1-2 disk systems)
On Jan 7, 2012, at 7:12 AM, Edward Ned Harvey wrote:

>> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-
>> bounces at opensolaris.org] On Behalf Of Jim Klimov
>>
>> For smaller systems such as laptops or low-end servers,
>> which can house 1-2 disks, would it make sense to dedicate
>> a 2-4Gb slice to the ZIL for the data pool, separate from
>> rpool? Example layout (single-disk or mirrored):
>>
>> The idea would be to decrease fragmentation (committed
>> writes to data pool would be more coalesced) and to keep
>> the ZIL at faster tracks of the HDD drive.
>
> I'm not authoritative, I'm speaking from memory of former discussions on
> this list and various sources of documentation.
>
> No, it won't help you.

Correct :-)

> First of all, all your writes to the storage pool are aggregated, so you're
> already minimizing fragmentation of writes in your main pool. However, over
> time, as snapshots are created & destroyed, small changes are made to files,
> and file contents are overwritten incrementally and internally... The only
> fragmentation you get creeps in as a result of COW. This fragmentation only
> impacts sequential reads of files which were previously written in random
> order. This type of fragmentation has no relation to ZIL or writes.
>
> If you don't split out your ZIL separate from the storage pool, zfs already
> chooses disk blocks that it believes to be optimized for minimal access
> time. In fact, I believe, zfs will dedicate a few sectors at the low end, a
> few at the high end, and various other locations scattered throughout the
> pool, so whatever the current head position, it tries to go to the closest
> "landing zone" that's available for ZIL writes. If anything, splitting out
> your ZIL to a different partition might actually hurt your performance.
>
> Also, the concept of "faster tracks of the HDD" is also incorrect. Yes,
> there was a time when HDD speeds were limited by rotational speed and
> magnetic density, so the outer tracks of the disk could serve up more data
> because more magnetic material passed over the head in each rotation. But
> nowadays, the hard drive sequential speed is limited by the head speed,
> which is invariably right around 1Gbps. So the inner and outer sectors of
> the HDD are equally fast - the outer sectors are actually less magnetically
> dense because the head can't handle it. And the random IO speed is limited
> by head seek + rotational latency, where seek is typically several times
> longer than latency.

Disagree. My data, and the vendor specs, continue to show different sequential
media bandwidth speed for inner vs outer cylinders.

> So basically, the only thing that matters, to optimize the performance of
> any modern typical HDD, is to minimize the head travel. You want to be
> seeking sectors which are on tracks that are nearby to the present head
> position.
>
> Of course, if you want to test & benchmark the performance of splitting
> apart the ZIL to a different partition, I encourage that. I'm only speaking
> my beliefs based on my understanding of the architectures and limitations
> involved. This is my best prediction. And I've certainly been wrong
> before. ;-) Sometimes, being wrong is my favorite thing, because you learn
> so much from it. ;-)

Good idea. I think you will see a tradeoff on the read side of the mixed read/write workload. Sync writes have higher priority than reads so the order of I/O sent to the disk will appear to be very random and not significantly coalesced.
This is the pathological worst case workload for an HDD.

OTOH, you're not trying to get high performance from an HDD, are you? That game is over.

-- richard

--
ZFS and performance consulting
http://www.RichardElling.com
illumos meetup, Jan 10, 2012, Menlo Park, CA
http://www.meetup.com/illumos-User-Group/events/41665962/
Edward Ned Harvey
2012-Jan-08 14:56 UTC
[zfs-discuss] ZIL on a dedicated HDD slice (1-2 disk systems)
> From: Richard Elling [mailto:richard.elling at gmail.com]
>
> > Also, the concept of "faster tracks of the HDD" is also incorrect. Yes,
> > there was a time when HDD speeds were limited by rotational speed and
> > magnetic density, so the outer tracks of the disk could serve up more data
> > because more magnetic material passed over the head in each rotation. But
> > nowadays, the hard drive sequential speed is limited by the head speed,
> > which is invariably right around 1Gbps. So the inner and outer sectors of
> > the HDD are equally fast - the outer sectors are actually less magnetically
> > dense because the head can't handle it. And the random IO speed is limited
> > by head seek + rotational latency, where seek is typically several times
> > longer than latency.
>
> Disagree. My data, and the vendor specs, continue to show different
> sequential media bandwidth speed for inner vs outer cylinders.

Any reference?

I know, as I sit and dd from some disk | pv > /dev/null, it will tell me something like 1.0Gbps... I periodically check its progress while it's in progress, and while it varies a little (say, sometimes 1.0, 1.1, 1.2) it goes up and down throughout the process. There is no noticeable difference between the early, mid, and late behavior, sequentially reading the whole disk.

If the performance of the outer tracks is better than the performance of the inner tracks due to limitations of magnetic density or rotation speed (not being limited by the head speed or bus speed), then the sequential performance of the drive should increase as a square function, going toward the outer tracks. c = pi * r^2

It is my belief, based on specs I've previously looked at, that manufacturers break the drive down into zones. So, something like the inner 20% of the tracks will have magnetic layout pattern A, and the next 20% will have magnetic layout pattern B, and so forth... Within a single magnetic layout pattern, jumping from individual track to individual track can yield a difference of performance, but it's not a huge step from one to the next. And when you transition from layout pattern to layout pattern, the pattern just repeats itself again. They're trying to optimize, to a first order, to ensure the performance limitations are mostly caused by head and/or bus speed. If those are the bottlenecks, let them be the bottlenecks, and at least solve all the other problems that are solvable.

So, small variations of sequential performance are possible, jumping from track to track, but based on what I've seen, the maximum performance difference from the absolute slowest track to the absolute fastest track (which may or may not have any relation to inner vs outer) ... maximum variation on-par with 10% performance difference. Not a square function.

> OTOH, you're not trying to get high performance from an HDD, are you? That
> game is over.

Lots of us still have to live with HDDs, due to capacity and cost requirements. We accept a relative definition of "high performance," and still want to get all the performance we can out of whatever device we're using. Even if there exists a faster device somewhere in the world.

Also, for sequential performance, HDDs are on-par with, and often better than, SSDs. (For now.) While many SSDs publish specs including something like "220 MB/s", which is higher than HDDs can reach... SSDs publish their maximum performance, which is not typical performance. After you use them for a month, they slow down.
Often to half, or worse, of the speed they were originally able to run at. Which is... as I say... on-par with, or worse than, the sequential speed of an HDD.

Even crappy SSDs can have random IO worse than HDDs. Just benchmark any high-cost top-tier USB3 flash memory stick, and you'll see what I mean. ;-) The only SSDs that are faster than HDDs in any way are *actual* internal sas/sata/etc SSDs, which are faster than HDDs in terms of random IOPS and maybe sequential throughput.
Jim Klimov
2012-Jan-08 16:34 UTC
[zfs-discuss] ZIL on a dedicated HDD slice (1-2 disk systems)
2012-01-08 18:56, Edward Ned Harvey wrote:

>> From: Richard Elling [mailto:richard.elling at gmail.com]
>>
>> Disagree. My data, and the vendor specs, continue to show different
>> sequential media bandwidth speed for inner vs outer cylinders.
>
> Any reference?

Well, Richard's data matches mine from tests of my HDDs at home: I read in some 10-GB blocks at different offsets (dd > /dev/null), and "linear" speeds dropped from about 150 MB/s to about 80-100 MB/s. This was tested on a relatively modern 2TB Seagate drive.

Random IOs are still crappy on mechanical drives, often under 10 MB/s ;)

//Jim
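The kind of measurement described above could be reproduced with a rough sketch like the following, using Solaris dd options (iseek counts blocks of the given block size). The raw device name and the offsets are hypothetical and sized for a ~2TB drive; adjust them for the disk actually being measured.

  # ~10GB sequential read near the start of the disk (outer cylinders)
  ptime dd if=/dev/rdsk/c0t1d0s2 of=/dev/null bs=1024k iseek=0 count=10240

  # the same amount roughly 1TB into the drive
  ptime dd if=/dev/rdsk/c0t1d0s2 of=/dev/null bs=1024k iseek=1000000 count=10240

  # and near the end of the drive (inner cylinders)
  ptime dd if=/dev/rdsk/c0t1d0s2 of=/dev/null bs=1024k iseek=1890000 count=10240

Dividing the ~10GB read by each elapsed time gives the sustained rate at the three positions.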
Bob Friesenhahn
2012-Jan-08 18:21 UTC
[zfs-discuss] ZIL on a dedicated HDD slice (1-2 disk systems)
On Sat, 7 Jan 2012, Edward Ned Harvey wrote:

> If you don't split out your ZIL separate from the storage pool, zfs already
> chooses disk blocks that it believes to be optimized for minimal access
> time. In fact, I believe, zfs will dedicate a few sectors at the low end, a
> few at the high end, and various other locations scattered throughout the
> pool, so whatever the current head position, it tries to go to the closest
> "landing zone" that's available for ZIL writes. If anything, splitting out
> your ZIL to a different partition might actually hurt your performance.

Something else to be aware of is that even if you don't have a dedicated ZIL device, zfs will create a ZIL using devices in the main pool so there is always a ZIL, even if you don't see it.

Also, the ZIL is only used to record pending small writes. Larger writes (I think 128K or more) are written to their pre-allocated final location in the main pool. This choice is made since the purpose of the ZIL is to minimize random I/O to disk, and writing large amounts of data to the ZIL would create a bandwidth bottleneck.

There are postings by Matt Ahrens to this list (and elsewhere) which provide an accurate description of how the ZIL works.

Bob
--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
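As far as I can tell, the cut-off Bob describes corresponds to the zfs_immediate_write_sz tunable in the Solaris/illumos kernel (32K by default in the sources I have seen, with logbias also influencing the decision); the exact value and behaviour may differ between releases. One way to check it on a live system might be:

  # print the current threshold (bytes) from the running kernel
  echo zfs_immediate_write_sz/D | mdb -k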
Casper.Dik at oracle.com
2012-Jan-08 18:39 UTC
[zfs-discuss] ZIL on a dedicated HDD slice (1-2 disk systems)
>If the performance of the outer tracks is better than the performance of the
>inner tracks due to limitations of magnetic density or rotation speed (not
>being limited by the head speed or bus speed), then the sequential
>performance of the drive should increase as a square function, going toward
>the outer tracks. c = pi * r^2

Decrease, because the outer tracks are the lower numbered tracks; they have the same density but they are larger.

>So, small variations of sequential performance are possible, jumping from
>track to track, but based on what I've seen, the maximum performance
>difference from the absolute slowest track to the absolute fastest track
>(which may or may not have any relation to inner vs outer) ... maximum
>variation on-par with 10% performance difference. Not a square function.

I've noticed a change of 50% in speed or more between the lower and the higher numbers (60 MB/s to 30 MB/s).

In benchmark land, they short-stroke disks for better performance; I believe the Pillar boxes do similar tricks under the covers (if you want more performance, it gives you the faster tracks).

Casper
Darren J Moffat
2012-Jan-09 10:34 UTC
[zfs-discuss] ZIL on a dedicated HDD slice (1-2 disk systems)
On 01/08/12 18:21, Bob Friesenhahn wrote:

> Something else to be aware of is that even if you don't have a dedicated
> ZIL device, zfs will create a ZIL using devices in the main pool so

Terminology nit: the log device is a SLOG. Every ZFS dataset has a ZIL. Where the ZIL writes go for a given dataset (slog or main pool devices) is determined by a combination of things including (but not limited to) the presence of a SLOG device, the logbias property, and the size of the data.

--
Darren J Moffat
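For example, logbias is set per dataset; with logbias=throughput, ZIL writes for that dataset go to the main pool devices rather than any configured log device. A quick sketch, with a hypothetical pool/dataset name:

  # default is latency: use a slog if one is present
  zfs get logbias data/db

  # send this dataset's ZIL traffic to the main pool devices instead
  zfs set logbias=throughput data/db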
Jim Klimov
2012-Jan-09 13:38 UTC
[zfs-discuss] ZIL on a dedicated HDD slice (1-2 disk systems)
2012-01-08 5:45, Richard Elling wrote:

> I think you will see a tradeoff on the read side of the mixed read/write workload.
> Sync writes have higher priority than reads so the order of I/O sent to the disk
> will appear to be very random and not significantly coalesced. This is the
> pathological worst case workload for an HDD.

I guess this is what I'm trying to combat when thinking about a dedicated ZIL (SLOG device) in order to reduce the pool's fragmentation.

It is my understanding (which may be wrong and often is) that without a dedicated SLOG:

1) Sync writes will land on disk randomly into nearest (to disk heads) available blocks, in order to have them committed ASAP;

2) Coalesced writes (at TXG sync) may have intermixed data and metadata blocks, of which metadata may soon expire due to whatever updates, snapshots or deletions involving the blocks this metadata references. If this is true, then after a while there will be many "available" cheese-holes from expired metadata among larger data blocks.

3) Now, this might be further complicated (or relieved) if the metadata blocks are stored in separate groupings from the "bulk" user-data, which I don't know about yet. In that case it would be easier for ZFS to prefetch metadata from disk in one IO (as we discussed in another thread), as well as to effectively reuse the small cheese-holes from freed older metadata blocks.

---

If any of the above is true, then it is my "blind expectation" that a dedicated ZIL/SLOG area would decrease fragmentation at least due to sync writes of metadata, and possibly of data, into nearest HDD locations. Again, this is based on my possibly wrong understanding that the blocks committed to a SLOG would be neatly recommitted to the main pool during a TXG close with coalesced writes.

I do understand the argument that if the SLOG is dedicated from a certain area on the same HDD, then in fact this would be slowing down the writes by creating more random IO and extra seeks. But as a trade-off I hope for more linear, faster reads, including pool import, scrubbing and ZDB walks, and less fragmented free space.

Is there any truth to these words? ;)

Thanks,
//Jim Klimov
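One way to get a feel for how chopped-up the free space actually is, before and after such an experiment, might be zdb's metaslab report; a rough sketch, assuming a pool named "data" (the exact flags and output format vary between builds):

  # summarize metaslab allocations; the second -m adds per-metaslab
  # space map detail (many small free segments = fragmented free space)
  zdb -mm data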
Edward Ned Harvey
2012-Jan-09 13:55 UTC
[zfs-discuss] ZIL on a dedicated HDD slice (1-2 disk systems)
> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-
> bounces at opensolaris.org] On Behalf Of Jim Klimov
>
> 1) Sync writes will land on disk randomly into nearest
> (to disk heads) available blocks, in order to have them
> committed ASAP;

This is true - but you need to make the distinction - if you don't have a dedicated slog, and you haven't disabled the zil, then the sync writes you're talking about land in dedicated zil sectors of the disk. This is write-only space; consider it temporary. The only time it will ever be read is after an ungraceful system reboot, when the system will scan these sectors to see if anything is there.

As soon as the sync writes are written to the zil, they become async writes, which are buffered in memory with all the other async writes, and they will be written *again* into permanent storage in the main pool. At that point, the previously written copy in the zil becomes irrelevant.

> If any of the above is true, then it is my "blind
> expectation" that a dedicated ZIL/SLOG area would
> decrease fragmentation at least due to sync writes

Sync writes to the zil aren't causing fragmentation, because they're only temporary writes as long as they're in sync mode. Then they become async mode, and they will be aggregated with all the other async writes.

This isn't saying fragmentation doesn't happen. It's just saying there's no special relationship between sync mode and fragmentation.
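On a pool that does have a separate log device, this two-step behaviour can be watched directly; a small sketch, with a hypothetical pool name. Sync writes show up against the log vdev first, and roughly the same data shows up again a few seconds later when the txg is committed to the main vdevs.

  # per-vdev I/O, sampled every second while a sync-write workload runs
  zpool iostat -v data 1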