2011-07-13 16:19, Paul Kraus wrote:
> On Wed, Jul 13, 2011 at 7:14 AM, <Casper.Dik at oracle.com> wrote:
>
>> The issue is most with "4K underwater" disks; unless you make sure that
>> all the partitions are on a 4K boundary. If it advertises as a 4K sector
>> size disk, then there is no issue.
> So if you hand the entire drive to ZFS you should be OK ?
>
> [Not applicable to the root zpool, will the OS installation utility do
> the right thing ?]
Well, with all the troubles I'm having I can't vouch that 4k disks
are not the culprit of something, but I don't think they are.
I used "Seagate model ST2000DL003-9VT166" disks, which have 4k
physical sectors; I am not sure whether they advertise that. At least
parted thinks they are 512 bytes per sector (see below). After much
consideration I decided to try 4k disks because if I ever need
a replacement, it would be unlikely to find a 512b model.
In my first tests I took care to create the partitions manually
and align them on a 4k boundary (basically, the starting 512-byte
sector number should be divisible by 8), but when I gave zpool the
whole disks it seemed to do the right thing by itself:
# parted /dev/dsk/c7t1d0 unit s print
Model: Generic Ide (ide)
Disk /dev/dsk/c7t1d0: 3907029168s
Sector size (logical/physical): 512B/512B
Partition Table: gpt

Number  Start        End          Size         File system  Name  Flags
 1      256s         3907012750s  3907012495s               zfs
 9      3907012751s  3907029134s  16384s
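
The "divisible by 8" rule can be checked mechanically against the
start sectors in the parted listing. A minimal Python sketch follows;
the helper name is_4k_aligned is made up for illustration, and the
4096-byte physical sector size is an assumption about the disk, not
something parted reported:

```python
LOGICAL_SECTOR = 512     # bytes per sector, as reported by parted
PHYSICAL_SECTOR = 4096   # assumed physical sector size of the disk


def is_4k_aligned(start_sector: int) -> bool:
    """Return True if a partition starting at this 512-byte sector
    begins on a 4 KiB boundary (sector number divisible by 8)."""
    return (start_sector * LOGICAL_SECTOR) % PHYSICAL_SECTOR == 0


# Start sectors copied from the parted listing.
starts = {1: 256, 9: 3907012751}
for number, start in starts.items():
    state = "aligned" if is_4k_aligned(start) else "NOT aligned"
    print(f"partition {number}: start {start}s is {state}")
```

Partition 1, which holds the pool, comes out aligned. Partition 9
does not, but as far as I know that is only the small reserved
partition and holds no pool data.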
I did force ashift=12 via a patched zpool binary published
on a blog; as was recently mentioned on the list, I might get away
with a "fake" iSCSI component which would advertise itself as
4k-sectored. Basically this matches what was advised on some
FreeBSD-oriented blogs and their toolsets.
For rpools I think your device (HDD or SSD) should report its
4KB sector size properly. Maybe you can replace the zpool
binary with a forced ashift=12 variant in the LiveCD-booted
image (not necessarily in the ISO image itself, but downloaded
or copied from a flash drive while you are in the LiveCD OS
session). I did not bother, since my boot disk is an old 80GB one.
The major implication of 4k for me now is that ZFS metadata
which normally occupied some part of a 512-byte block now
chews up a whole 4KB block. This probably reduces performance,
as less real (meta)data can be cached and more slack space
is wasted in the ARC (my guess). At the least, more kilobytes
have to be read and written than the payload data size requires.
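
To put numbers on the slack: an allocation gets rounded up to a
multiple of 2^ashift bytes. A rough Python sketch of just that
rounding (it ignores raidz parity, compression and anything else
the allocator does; allocated_bytes is a made-up helper name):

```python
def allocated_bytes(payload: int, ashift: int) -> int:
    """Round a payload size up to the minimum allocation unit, 2**ashift."""
    unit = 1 << ashift
    return -(-payload // unit) * unit  # ceiling division, then scale back


for payload in (512, 1536, 4096, 5120):
    a9 = allocated_bytes(payload, 9)
    a12 = allocated_bytes(payload, 12)
    print(f"{payload:5d} B payload: ashift=9 -> {a9:5d} B, "
          f"ashift=12 -> {a12:5d} B")
```

A 512-byte payload costs 512 bytes at ashift=9 but a full 4096
bytes at ashift=12, which is the slack described above.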
This also noticeably bumps up the average pool space usage
as compared to the same user-data stored on a 512b system.
Meaning, if your old 512b pool is full, you might not have enough
space on a 4k pool of the same raw size to migrate your data.
If migration of an archive (write once, read many) is the key,
you might want to enforce some gzip-9 compression during
transfer to the new pool. You might turn to dedup, but I still
wonder if this was a good or a bad choice that I've made ;)
Still, if you do dedup with compression, make sure to use
the same compression algorithm for any datasets that you
expect to get deduped.
Also I was hit by small blocks (i.e. when creating a volume
dataset with volblocksize=4k, or writing many small files
which fit in a minimal 4k block) - each filled 4k user-data
block required a 4k metadata block, doubling the storage
space requirement as compared to raw userdata size.
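
The doubling can be reproduced with simple arithmetic. The Python
sketch below is only a model of what I observed, not of how ZFS
really accounts for space: it assumes exactly one metadata block
per user-data block, and the 512-byte metadata payload is a
hypothetical figure picked for illustration.

```python
def on_disk_ratio(user_block: int, metadata_payload: int, ashift: int) -> float:
    """Bytes allocated per byte of user data, assuming one metadata
    block per user-data block (a simplification, see the text)."""
    unit = 1 << ashift

    def round_up(size: int) -> int:
        # Round up to the pool's minimum allocation unit.
        return -(-size // unit) * unit

    return (round_up(user_block) + round_up(metadata_payload)) / user_block


# Hypothetical 512-byte metadata payload per block.
for user_block in (4096, 131072):
    for ashift in (9, 12):
        ratio = on_disk_ratio(user_block, 512, ashift)
        print(f"user block {user_block:6d} B, ashift={ashift:2d}: "
              f"{ratio:.3f}x")
```

Under these assumptions a 4k user block at ashift=12 costs twice
its size, while the same overhead is nearly lost in the noise
with 128k blocks.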
I posted this in detail this spring (see web OpenSolaris
Forums archives, mail bridge did not work then), as well
as some findings about loss of benefits in compression,
etc. if the data block sizes are small, i.e. many small files.
I also posted some RFEs in the OpenIndiana bug tracker,
with some ideas about using adequately small blocks
for metadata (aggregated into 4KB storage "cells" by
the writing algorithm, so that the feature would not necessarily
dictate an on-disk format change) and larger 4k-aligned
blocks for userdata, but so far there has been no feedback
on any of the ideas...
Other than that loss of space and possibly performance
due to slack space, these disks work ;)
Not very fast for random IOs (up to 20 MB/s in some of
my tests), but if you happen to have a linear workload
like "dd", they can read/write about 150/130 MB/s.
Also my system is plagued by dedup tests which can
add to fragmentation and other performance implications.
One thing I do still wonder about, though: if I were to
add an L2ARC and/or ZIL SSD device (which are all or mostly
reported to have 4k or even 8k cells and work best when
aligned to that), could I set an appropriate ashift for the
cache device, and how much would I lose or gain with that
(would ZFS care and/or optimize somehow)?
HTH,
//Jim Klimov