Hello all,

I was asked if it is possible to convert a ZFS pool created explicitly with ashift=12 (via the tweaked binary) and filled with data back into ashift=9, so as to use the slack space from small blocks (BPs, file tails, etc.). The user's HDD marketing text says that it "efficiently" emulates 512b sectors while using 4 KB ones natively (that's why ashift=12 was enforced in the first place).

Questions are:

1) How bad would the performance hit be with 512b blocks used on a 4 KB drive with such "efficient emulation"? Is it possible to model/emulate the situation somehow in advance, to see if the change is worth it at all?

2) Is it possible to easily estimate the amount of "wasted" disk space in slack areas of the currently active ZFS allocation (unused portions of 4 KB blocks that might become available if the disks were reused with ashift=9)?

3) How many parts of a ZFS pool are actually affected by the ashift setting? From what I gather, it is applied at the top-level vdev level (I read that one can mix ashift=9 and ashift=12 TLVDEVs in one pool spanning several TLVDEVs). Is that a correct impression? If yes, how does the ashift size influence the number of slots in the uberblock ring (128 vs. 32 entries), which is applied at the leaf vdev level (right?) but should be consistent across the pool?

As far as I can see in the ZFS on-disk format, all sizes and offsets are in either bytes or 512b blocks, and the ashift'ed block size is not actually used anywhere except to set the minimal block size and its implicit alignment during writes. Is it wrong to think that it would be enough to forge an uberblock with ashift=9 and a matching self-checksum, place it into the pool (the leaf vdev labels), and magically have all the old, 4 KB-aligned data still available while new writes become 512b-aligned?

Thanks for helping me grasp the theory,
//Jim
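(For orientation, the ashift a pool is actually using can be read back read-only with zdb; in this minimal sketch "tank" and the device path are placeholders, not details from this thread:

  zdb -C tank | grep ashift                 # ashift recorded for each top-level vdev in the cached config
  zdb -l /dev/rdsk/c1t0d0s0 | grep ashift   # the same value as stored in the leaf vdev labels
)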
Richard Elling
2012-Mar-18 19:47 UTC
[zfs-discuss] Convert pool from ashift=12 to ashift=9
On Mar 18, 2012, at 11:16 AM, Jim Klimov wrote:

> Hello all,
>
> I was asked if it is possible to convert a ZFS pool created
> explicitly with ashift=12 (via the tweaked binary) and filled
> with data back into ashift=9 so as to use the slack space
> from small blocks (BPs, file tails, etc.)

copy out, copy in. Whether this is easy or not depends on how well
you plan your storage use...

> The user's HDD marketing text says that it "efficiently"
> emulates 512b sectors while using 4Kb ones natively (that's
> why ashift=12 was enforced in the first place).

Marketing: 2 drink minimum

> Questions are:
> 1) How bad would a performance hit be with 512b blocks used
> on a 4kb drive with such "efficient emulation"?

Depends almost exclusively on the workload and hardware. In my
experience, most folks who bite the 4KB bullet have low-cost HDDs
where one cannot reasonably expect high performance.

> Is it possible to model/emulate the situation somehow in advance
> to see if it's worth that change at all?

It will be far more cost effective to just make the change.

> 2) Is it possible to easily estimate the amount of "wasted"
> disk space in slack areas of the currently active ZFS
> allocation (unused portions of 4kb blocks that might
> become available if the disks were reused with ashift=9)?

Detailed space use is available from the zfs_blkstats mdb macro,
as previously described in such threads.

> 3) How many parts of ZFS pool are actually affected by the
> ashift setting?

Everything is impacted. But that isn't a useful answer.

> From what I gather, it is applied at the top-level vdev
> level (I read that one can mix ashift=9 and ashift=12
> TLVDEVs in one pool spanning several TLVDEVs). Is that
> a correct impression?

Yes

> If yes, how does ashift size influence the amount of
> slots in uberblock ring (128 vs. 32 entries) which is
> applied at the leaf vdev level (right?) but should be
> consistent across the pool?

It should be consistent across the top-level vdev. There is 128KB of
space available for the uberblock list. The minimum size of an uberblock
entry is 1KB. Obviously, a 4KB disk can't write only 1KB, so for 4KB
sectors, there are 32 entries in the uberblock list.

> As far as I see in ZFS on-disk format, all sizes and
> offsets are in either bytes or 512b blocks, and the
> ashift'ed block size is not actually used anywhere
> except to set the minimal block size and its implicit
> alignment during writes.

The on-disk format doc is somewhat dated and unclear here. UTSL.

> Is it wrong to think that it's enough to forge an
> uberblock with ashift=9 and a matching self-checksum
> and place that into the pool (leaf vdev labels), and
> magically have all old data 4kb-aligned still available,
> while new writes would be 512b-aligned?

Yes, it is wrong to think that.

> Thanks for helping me grasp the theory,
> //Jim

 -- richard

--
DTrace Conference, April 3, 2012, http://wiki.smartos.org/display/DOC/dtrace.conf
ZFS Performance and Training
Richard.Elling at RichardElling.com
+1-760-896-4422
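(To put "copy out, copy in" in concrete terms, a minimal sketch assuming a dedicated scratch pool with enough capacity; the pool names and disk devices are placeholders, not details from this thread:

  # "tank" and "scratch" are illustrative pool names; adjust to taste.
  zfs snapshot -r tank@migrate
  zfs send -R tank@migrate | zfs receive -duF scratch    # copy out

  zpool destroy tank
  # Recreate with a stock binary; the new ashift follows the sector size
  # the drives report, so ashift=9 only comes back if they advertise
  # 512-byte sectors.
  zpool create tank mirror c1t0d0 c1t1d0

  zfs send -R scratch@migrate | zfs receive -duF tank    # copy back in

Since zfs send -R carries snapshots and properties along, the hard part is indeed the planning: having somewhere to hold the intermediate copy while the pool is recreated.)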
2012-03-18 23:47, Richard Elling wrote:
> ...
> Yes, it is wrong to think that.

Ok, thanks, we won't try that :)

> copy out, copy in. Whether this is easy or not depends on how well
> you plan your storage use...

Home users and personal budgets do tend to have a problem with planning.
Any mistake is to be paid for personally, and many are left "as is".
It is hard enough already to justify to an average wife that "a storage
box with large X-Tb disks needs raidz3 or mirroring" and thus becomes
larger and noisier, not to mention almost a thousand bucks more expensive
just for the redundancy disks, which will become two times cheaper in a
year anyway.

Yup, it is not very easy to find another 10+Tb of backup storage (with
ZFS reliability) in a typical home I know of. Planning is not easy...

But that's a rant... Hoping that in-place BP rewrite would arrive and
magically solve many problems =)

>> Questions are:
>> 1) How bad would a performance hit be with 512b blocks used
>> on a 4kb drive with such "efficient emulation"?
>
> Depends almost exclusively on the workload and hardware. In my
> experience, most folks who bite the 4KB bullet have low-cost HDDs
> where one cannot reasonably expect high performance.
>
>> Is it possible to model/emulate the situation somehow in advance
>> to see if it's worth that change at all?
>
> It will be far more cost effective to just make the change.

Meaning altogether? That with a consumer disk, which will suck from a
performance standpoint anyway, it was not a good idea to use ashift=12,
and it would have been more cost-effective to remain at ashift=9 to
begin with?

What about real people's tests which seemed to show substantial
performance hits with misaligned large-block writes (spanning several
4k sectors at the wrong boundaries)?

I had an RFE posted sometime last year about an optimisation for both
worlds: use formal ashift=9 and allow writing of small blocks, but align
larger blocks at set boundaries (i.e. offset divisible by 4096 for blocks
sized 4096+). Perhaps writing 512b blocks near each other should be
reserved for metadata only, which is dittoed anyway, so that a
whole-sector (4kb) corruption won't be irreversible for some data.
In effect, the minimum block size for userdata would be enforced (by
config) at the same 4kb in such a case.

This is a zfs-write-only change (plus some custom pool or dataset
attributes), so the on-disk format and compatibility should not suffer
with this solution. But I had little feedback on whether the idea was
at all reasonable.

>> 2) Is it possible to easily estimate the amount of "wasted"
>> disk space in slack areas of the currently active ZFS
>> allocation (unused portions of 4kb blocks that might
>> become available if the disks were reused with ashift=9)?
>
> Detailed space use is available from the zfs_blkstats mdb macro
> as previously described in such threads.
>
>> 3) How many parts of ZFS pool are actually affected by the
>> ashift setting?
>
> Everything is impacted. But that isn't a useful answer.
>
>> From what I gather, it is applied at the top-level vdev
>> level (I read that one can mix ashift=9 and ashift=12
>> TLVDEVs in one pool spanning several TLVDEVs). Is that
>> a correct impression?
>
> Yes
>
>> If yes, how does ashift size influence the amount of
>> slots in uberblock ring (128 vs. 32 entries) which is
>> applied at the leaf vdev level (right?) but should be
>> consistent across the pool?
>
> It should be consistent across the top-level vdev.
>
> There is 128KB of space available for the uberblock list. The minimum
> size of an uberblock entry is 1KB. Obviously, a 4KB disk can't write
> only 1KB, so for 4KB sectors, there are 32 entries in the uberblock list.

So if I have ashift=12 and ashift=9 top-level vdevs mixed in the pool,
is it okay that some of them would remember 4x more of the pool's TXG
history than others?

>> As far as I see in ZFS on-disk format, all sizes and
>> offsets are in either bytes or 512b blocks, and the
>> ashift'ed block size is not actually used anywhere
>> except to set the minimal block size and its implicit
>> alignment during writes.
>
> The on-disk format doc is somewhat dated and unclear here. UTSL.

Are there any updates, or is the 2006 PDF the latest available?
For example, is there an effort in illumos/nexenta/openindiana
to publish their version of the current on-disk format? ;)

Thanks for all the answers,
//Jim
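(As for the 128-vs-32 figure, it is just the label layout spelled out: each of the four vdev labels reserves 128 KB for the uberblock ring, and a slot takes the larger of 1 KB and one ashift-sized sector. A back-of-envelope check, not a dump of real labels:

  echo $(( (128 * 1024) / (1 << 10) ))   # ashift=9: slot stays at the 1 KB minimum -> 128 entries
  echo $(( (128 * 1024) / (1 << 12) ))   # ashift=12: slot grows to 4 KB            -> 32 entries
)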
Nathan Kroenert
2012-Mar-20 12:11 UTC
[zfs-discuss] Convert pool from ashift=12 to ashift=9
Jim Klimov wrote:
>> It is hard enough already to justify to an average wife that...
<snip>

That made my night. Thanks, Jim. :)