Jim Klimov
2012-Jul-29 14:03 UTC
[zfs-discuss] Can the ZFS "copies" attribute substitute HW disk redundancy?
Hello all,

Over the past few years there have been many posts suggesting that for modern HDDs (several TB in size, around 100-200MB/s best speed) the rebuild times grow exponentially, so to build a well-protected pool with these disks one has to plan for about three disks' worth of redundancy - that is, three- or four-way mirrors, or raidz3 - just to allow systems to survive a disk outage (with acceptably high probability of success) while one is resilvering.

There were many posts on this matter from esteemed members of the list, including (but certainly not limited to) these articles:
* https://blogs.oracle.com/ahl/entry/triple_parity_raid_z
* https://blogs.oracle.com/ahl/entry/acm_triple_parity_raid
* http://queue.acm.org/detail.cfm?id=1670144
* http://blog.richardelling.com/2010/02/zfs-data-protection-comparison.html

Now, this brings me to a question: when people build a home-NAS box, they are quite constrained in the number of directly attached disks (about 4-6 bays), or, even if they use external JBODs, in the number of disks per enclosure (up to 8, which does allow a 5+3 raidz3 set in a single box - though that still seems like a large overhead to some buyers; a 4*2 mirror would give about as much space and higher performance, but may have unacceptably less redundancy). If I want considerable storage, with proper reliability, and just a handful of drives, what are my best options?

I wondered if the "copies" attribute can be considered sort of equivalent to the number of physical disks - limited by seek times, though. Namely, for the same amount of storage on a 4-HDD box I could use raidz1 and 4*1tb@copies=1, or 4*2tb@copies=2, or even 4*3tb@copies=3, for example.

To simplify matters, let's assume that this is a small box (under 10GB RAM) not using dedup, though it would likely use compression :)

Question to theorists and practitioners: is any of these options better or worse than the others, in terms of reliability and access/rebuild/scrub speeds, for either a single-sector error or for a full-disk replacement?

Would extra copies on larger disks actually provide the extra reliability, or only add overheads and complicate/degrade the situation?

Would the use of several copies cripple the write speeds?

Can the extra copies be used by the zio scheduler to optimize and speed up reads, like extra mirror sides would?

Thanks,
//Jim Klimov
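For concreteness, a minimal sketch of the kind of setup being compared - a 4-disk raidz1 pool with the copies property raised on top of it. Device and pool names here are hypothetical, and note that copies only affects data written after the property is set:

  # 4 disks in raidz1, default copies=1
  zpool create tank raidz1 c0t0d0 c0t1d0 c0t2d0 c0t3d0

  # Same layout with larger disks, storing two copies of each user-data block
  zfs set copies=2 tank

  # copies is a per-dataset property, so it can also be raised selectively
  zfs set copies=3 tank/important
  zfs get copies tank/important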
Roy Sigurd Karlsbakk
2012-Jul-29 15:36 UTC
[zfs-discuss] Can the ZFS "copies" attribute substitute HW disk redundancy?
"copies" won''t help much if the pool is unavailable. It may, however, help if, say, you have a RAIDz2, and two drives die, and htere are errors on a third drive, but not sufficiently bad for zfs to reject the pool roy ----- Opprinnelig melding -----> Hello all, > > Over the past few years there have been many posts suggesting > that for modern HDDs (several TB size, around 100-200MB/s best > speed) the rebuild times grow exponentially, so to build a well > protected pool with these disks one has to plan for about three > disk''s worth of redundancy - that is, three- or four-way mirrors, > or raidz3 - just to allow systems to survive a disk outage (with > accpetably high probability of success) while one is resilvering. > > There were many posts on this matter from esteemed members of > the list, including (but certainly not limited to) these articles: > * https://blogs.oracle.com/ahl/entry/triple_parity_raid_z > * https://blogs.oracle.com/ahl/entry/acm_triple_parity_raid > * http://queue.acm.org/detail.cfm?id=1670144 > * > http://blog.richardelling.com/2010/02/zfs-data-protection-comparison.html > > Now, this brings me to such a question: when people build a > home-NAS box, they are quite constrained in terms of the number > of directly attached disks (about 4-6 bays), or even if they > use external JBODs - to the number of disks in them (up to 8, > which does allow a 5+3 raidz3 set in a single box, which still > seems like a large overhead to some buyers - a 4*2 mirror would > give about as much space and higher performance, but may have > unacceptably less redundancy). If I want to have considerable > storage, with proper reliability, and just a handful of drives, > what are my best options? > > I wondered if the "copies" attribute can be considered sort > of equivalent to the number of physical disks - limited to seek > times though. Namely, for the same amount of storage on a 4-HDD > box I could use raidz1 and 4*1tb at copies=1 or 4*2tb at copies=2 or > even 4*3tb at copies=3, for example. > > To simplify the matters, let''s assume that this is a small > box (under 10GB RAM) not using dedup, though it would likely > use compression :) > > Question to theorists and practicians: is any of these options > better or worse than the others, in terms of reliability and > access/rebuild/scrub speeds, for either a single-sector error > or for a full-disk replacement? > > Would extra copies on larger disks actually provide the extra > reliability, or only add overheads and complicate/degrade the > situation? > > Would the use of several copies cripple the write speeds? > > Can the extra copies be used by zio scheduler to optimize and > speed up reads, like extra mirror sides would? > > Thanks, > //Jim Klimov > > _______________________________________________ > zfs-discuss mailing list > zfs-discuss at opensolaris.org > http://mail.opensolaris.org/mailman/listinfo/zfs-discuss-- Vennlige hilsener / Best regards roy -- Roy Sigurd Karlsbakk (+47) 98013356 roy at karlsbakk.net http://blogg.karlsbakk.net/ GPG Public key: http://karlsbakk.net/roysigurdkarlsbakk.pubkey.txt -- I all pedagogikk er det essensielt at pensum presenteres intelligibelt. Det er et element?rt imperativ for alle pedagoger ? unng? eksessiv anvendelse av idiomer med xenotyp etymologi. I de fleste tilfeller eksisterer adekvate og relevante synonymer p? norsk.
Bob Friesenhahn
2012-Jul-29 18:52 UTC
[zfs-discuss] Can the ZFS "copies" attribute substitute HW disk redundancy?
On Sun, 29 Jul 2012, Jim Klimov wrote:

> Would extra copies on larger disks actually provide the extra
> reliability, or only add overheads and complicate/degrade the
> situation?

My opinion is that complete hard drive failure and block-level media failure are two totally different things. Complete hard drive failure rates should not be directly related to total storage size, whereas the probability of media failure per drive is directly related to total storage size. Given this, and assuming that complete hard drive failure occurs much less often than partial media failure, using the copies feature should be pretty effective.

> Would the use of several copies cripple the write speeds?

It would cut the write rate in half, or divide it by whatever number of copies you have requested.

Bob
--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
opensolarisisdeadlongliveopensolaris
2012-Jul-29 20:12 UTC
[zfs-discuss] Can the ZFS "copies" attribute substitute HW disk redundancy?
> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-
> bounces at opensolaris.org] On Behalf Of Jim Klimov
>
> I wondered if the "copies" attribute can be considered sort
> of equivalent to the number of physical disks - limited by seek
> times, though. Namely, for the same amount of storage on a 4-HDD
> box I could use raidz1 and 4*1tb@copies=1 or 4*2tb@copies=2 or
> even 4*3tb@copies=3, for example.

The first question - reliability...

"copies" might be on the same disk. So it's not guaranteed to help if you have a disk failure.

Let's try this: take a disk, slice it into two partitions, and then make a mirror using the two partitions. That is about as useful as the copies property: half the write speed, half the usable disk capacity, improved redundancy against bad blocks, but no better redundancy against disk failure. ("copies" will actually be better, because unlike the partitioning scenario, "copies" will sometimes write the extra copies to other disks.)

Re: the assumption - lower performance with larger disks... rebuild time growing exponentially... I don't buy it, and I don't see that argument being made in any of the messages you referenced. Rebuild time depends on the amount of data in the vdev and the layout of said data, so if you compare a single mirror of 3T disks against 6 vdevs each mirroring 500G disks, then the larger disks do resilver slower - because a larger amount of data needs to resilver: you have to resilver all your data instead of 1/6th of your data.
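A minimal sketch of that thought experiment, purely to illustrate the analogy (slice names are hypothetical, and this is not a recommended layout):

  # Two slices carved out of the same physical disk, mirrored against
  # each other: every write happens twice, there is protection against
  # bad sectors, but none against the disk itself dying.
  zpool create demo mirror c0t0d0s0 c0t0d0s1
  zpool status demo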
GREGG WONDERLY
2012-Jul-30 14:11 UTC
[zfs-discuss] Can the ZFS "copies" attribute substitute HW disk redundancy?
On Jul 29, 2012, at 3:12 PM, opensolarisisdeadlongliveopensolaris <opensolarisisdeadlongliveopensolaris at nedharvey.com> wrote:

>> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-
>> bounces at opensolaris.org] On Behalf Of Jim Klimov
>>
>> I wondered if the "copies" attribute can be considered sort
>> of equivalent to the number of physical disks - limited by seek
>> times, though. Namely, for the same amount of storage on a 4-HDD
>> box I could use raidz1 and 4*1tb@copies=1 or 4*2tb@copies=2 or
>> even 4*3tb@copies=3, for example.
>
> The first question - reliability...
>
> "copies" might be on the same disk. So it's not guaranteed to help if you have a disk failure.

I thought I understood that copies would not be on the same disk; I guess I need to go read up on this again.

Gregg Wonderly
John Martin
2012-Jul-30 17:06 UTC
[zfs-discuss] Can the ZFS "copies" attribute substitute HW disk redundancy?
On 07/29/12 14:52, Bob Friesenhahn wrote:

> My opinion is that complete hard drive failure and block-level media
> failure are two totally different things.

That would depend on the recovery behavior of the drive for block-level media failure. A drive whose firmware does excessive retries of a bad sector (there are reports of up to 2 minutes) may be indistinguishable from a failed drive. See previous discussions of the firmware differences between desktop and enterprise drives.
Brandon High
2012-Jul-30 20:37 UTC
[zfs-discuss] Can the ZFS "copies" attribute substitute HW disk redundancy?
On Mon, Jul 30, 2012 at 7:11 AM, GREGG WONDERLY <greggwon at gmail.com> wrote:

> I thought I understood that copies would not be on the same disk; I guess I need to go read up on this again.

ZFS attempts to put copies on separate devices, but there's no guarantee.

-B

--
Brandon High : bhigh at freaks.com
Nico Williams
2012-Jul-30 20:48 UTC
[zfs-discuss] Can the ZFS "copies" attribute substitute HW disk redundancy?
The copies thing is really only for laptops, where the likelihood of redundancy is very low (there are some high-end laptops with multiple drives, but those are relatively rare) and where this idea is better than nothing.

It's also nice that copies can be set on a per-dataset basis (whereas RAID-Zn and mirroring are for pool-wide redundancy, not per-dataset), so you could set it > 1 on home directories but not on /.

Nico
--
opensolarisisdeadlongliveopensolaris
2012-Jul-31 13:55 UTC
[zfs-discuss] Can the ZFS "copies" attribute substitute HW disk redundancy?
> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-
> bounces at opensolaris.org] On Behalf Of Nico Williams
>
> The copies thing is really only for laptops, where the likelihood of
> redundancy is very low

ZFS also stores multiple copies of things that it considers "extra important" - I'm not sure what exactly; the uberblock, or stuff like that...

When you set the "copies" property, you're just making that apply to other stuff that would otherwise have only one copy.
Jim Klimov
2012-Aug-01 10:04 UTC
[zfs-discuss] Can the ZFS "copies" attribute substitute HW disk redundancy?
2012-07-31 17:55, opensolarisisdeadlongliveopensolaris wrote:

>> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-
>> bounces at opensolaris.org] On Behalf Of Nico Williams
>>
>> The copies thing is really only for laptops, where the likelihood of
>> redundancy is very low
>
> ZFS also stores multiple copies of things that it considers "extra important" - I'm not sure what exactly; the uberblock, or stuff like that...
>
> When you set the "copies" property, you're just making that apply to other stuff that would otherwise have only one copy.

IIRC, the "copies" defaults are:
  1 copy for userdata;
  2 copies for regular metadata (the block-pointer tree);
  3 copies for higher-level metadata (metadata tree root, dataset definitions).

The "uberblock" I am not so sure about, off the top of my head. There is a record in the ZFS labels, and that is stored 4 times on each leaf VDEV, and points to a ZFS block with the tree root for the current (newest consistent, flushed-to-pool) TXG number. Which one of these concepts is named the 00bab10c - *that* I am a bit vague about ;)

Probably the DDT is also stored with 2 or 3 copies of each block, since it is metadata. It was not in the last ZFS on-disk spec from 2006 that I found, for some apparent reason ;)

Also, I am not sure whether bumping the copies attribute to, say, "3" increases only the redundancy of userdata, or of regular metadata as well.

//Jim
Sašo Kiselkov
2012-Aug-01 12:22 UTC
[zfs-discuss] Can the ZFS "copies" attribute substitute HW disk redundancy?
On 08/01/2012 12:04 PM, Jim Klimov wrote:

> Probably the DDT is also stored with 2 or 3 copies of each block,
> since it is metadata. It was not in the last ZFS on-disk spec
> from 2006 that I found, for some apparent reason ;)

That's probably because it's extremely big (dozens, hundreds or even thousands of GB).

Cheers,
--
Saso
Jim Klimov
2012-Aug-01 12:41 UTC
[zfs-discuss] Can the ZFS "copies" attribute substitute HW disk redundancy?
2012-08-01 16:22, Sašo Kiselkov wrote:

> On 08/01/2012 12:04 PM, Jim Klimov wrote:
>> Probably the DDT is also stored with 2 or 3 copies of each block,
>> since it is metadata. It was not in the last ZFS on-disk spec
>> from 2006 that I found, for some apparent reason ;)

The idea of the pun was that the latest available full spec is over half a decade old, alas. At least I failed to find any newer one when I searched last winter. And back in 2006 there was no dedup, nor any mention of it in the spec (surprising, huh? ;)

Hopefully, with all the upcoming changes - including integration of feature flags and new checksum and compression algorithms - a consistent textual document of the "current ZFS on-disk spec in illumos(/FreeBSD/...)" will appear and be maintained up to date.

> That's probably because it's extremely big (dozens, hundreds or even
> thousands of GB).

Availability of the DDT is IMHO crucial to a deduped pool, so I won't be surprised to see it forced to triple copies. Not that it is very difficult to check with ZDB, though finding the DDT "dataset" for inspection (when I last tried) was not an obvious task.

//Jim
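For reference, a hedged sketch of the zdb invocations commonly used to peek at dedup metadata (the pool name is hypothetical, and the exact output differs between builds):

  # Summary and histogram of the dedup table for pool "tank"
  zdb -DD tank

  # Progressively more verbose dumps of the DDT objects themselves
  zdb -DDD tank
  zdb -DDDD tank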
opensolarisisdeadlongliveopensolaris
2012-Aug-01 13:35 UTC
[zfs-discuss] Can the ZFS "copies" attribute substitute HW disk redundancy?
> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-
> bounces at opensolaris.org] On Behalf Of Jim Klimov
>
> Availability of the DDT is IMHO crucial to a deduped pool, so
> I won't be surprised to see it forced to triple copies.

Agreed - although the DDT is also paramount to performance. In theory, an online dedup'd pool could be much faster than non-dedup'd pools, or offline dedup'd pools. So there's a lot of potential here - lost potential, at present.

IMHO, the more important thing for dedup moving forward is to create an option to dedicate a fast device (SSD or whatever) to the DDT, so all those little random IO operations never hit the rusty side of the pool.

Personally, I've never been supportive of the whole "copies" idea. If you need more than one redundant copy of some data, that's why you have pool redundancy. You're just hurting performance by using "copies." And protecting against failure conditions that are otherwise nearly nonexistent... and just as easily solved (without a performance penalty) via pool redundancy.
Sašo Kiselkov
2012-Aug-01 13:55 UTC
[zfs-discuss] Can the ZFS "copies" attribute substitute HW disk redundancy?
On 08/01/2012 03:35 PM, opensolarisisdeadlongliveopensolaris wrote:

>> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-
>> bounces at opensolaris.org] On Behalf Of Jim Klimov
>>
>> Availability of the DDT is IMHO crucial to a deduped pool, so
>> I won't be surprised to see it forced to triple copies.
>
> IMHO, the more important thing for dedup moving forward is to create an option to dedicate a fast device (SSD or whatever) to the DDT, so all those little random IO operations never hit the rusty side of the pool.

That's something you can already do with an L2ARC. In the future I plan on investigating implementing a set of more fine-grained ARC and L2ARC policy tuning parameters that would give more control into the hands of admins over how the ARC/L2ARC cache is used.

Cheers,
--
Saso
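As a rough sketch of the coarse controls that already exist (device and pool names here are hypothetical): a cache device can be added to a pool, and the primarycache/secondarycache dataset properties limit which kinds of blocks the ARC and L2ARC will hold.

  # Add an SSD as an L2ARC device for pool "tank"
  zpool add tank cache c2t0d0

  # Cache only metadata (which includes the DDT) on the L2ARC device,
  # so ordinary user data does not push DDT blocks out
  zfs set secondarycache=metadata tank
  zfs set primarycache=all tank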
Jim Klimov
2012-Aug-01 14:09 UTC
[zfs-discuss] Can the ZFS "copies" attribute substitute HW disk redundancy?
2012-08-01 17:35, opensolarisisdeadlongliveopensolaris wrote:

> Personally, I've never been supportive of the whole "copies" idea. If you need more than one redundant copy of some data, that's why you have pool redundancy. You're just hurting performance by using "copies." And protecting against failure conditions that are otherwise nearly nonexistent... and just as easily solved (without a performance penalty) via pool redundancy.

Well, there are at least a couple of failure scenarios where copies>1 are good:

1) A single-disk pool, as in a laptop. Noise on the bus, media degradation, or any other reason to misread or miswrite a block can result in a failed pool. One of my older test boxes has an untrustworthy 80Gb HDD for its rpool, and the system did crash into an unbootable image with just half a dozen CKSUM errors. I remade the rpool with copies=2 enforced from the start and rsynced the rootfs files back into the new pool - and this thing has worked well since then, despite finding several errors upon each weekly scrub.

2) The data pool on the same box experienced some errors where raidz2 failed to recreate a userdata block, thus invalidating a file despite having a 2-disk redundancy. There was some discussion of that on the list, and my ultimate guess is that the six disks' heads were over similar locations of the same file - i.e. during a scrub - and a power surge or some similar event caused them to scramble portions of the disks pertaining to the same ZFS block. At least, this could have induced enough errors to make raidz2 protection irrelevant. If the pool had copies=2, there would have been another replica of the same block that was not corrupted by such an assumed failure mechanism - because the disk heads were elsewhere.

Hmmm... now I wonder if ZFS checksum validation can try permutations of should-be-identical sectors from different copies of a block - in case both copies have received some non-overlapping errors, and together contain enough data to reconstruct a ZFS block (and rewrite both its copies now).

//Jim
Jim Klimov
2012-Aug-01 14:14 UTC
[zfs-discuss] Can the ZFS "copies" attribute substitute HW disk redundancy?
2012-08-01 17:55, Sašo Kiselkov wrote:

> On 08/01/2012 03:35 PM, opensolarisisdeadlongliveopensolaris wrote:
>>> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-
>>> bounces at opensolaris.org] On Behalf Of Jim Klimov
>>>
>>> Availability of the DDT is IMHO crucial to a deduped pool, so
>>> I won't be surprised to see it forced to triple copies.
>>
>> IMHO, the more important thing for dedup moving forward is to create an option to dedicate a fast device (SSD or whatever) to the DDT, so all those little random IO operations never hit the rusty side of the pool.
>
> That's something you can already do with an L2ARC. In the future I plan
> on investigating implementing a set of more fine-grained ARC and L2ARC
> policy tuning parameters that would give more control into the hands of
> admins over how the ARC/L2ARC cache is used.

Unfortunately, as of the current implementations, the L2ARC starts up cold. That is, upon every import of the pool the L2ARC is empty, and the DDT (as in the example above) would have to migrate into the cache via reads from rust into the RAM ARC and then expiration from the ARC. Getting it hot and fast again takes some time, and chances are that some blocks of userdata might be more popular than a DDT block and would push it out of the L2ARC as well...

//Jim
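A hedged way to watch the cache warming back up after an import, assuming the illumos arcstats kstat names (treat the statistic names as assumptions if your build differs):

  # Bytes currently held on L2ARC devices, and hit/miss counters
  kstat -p zfs:0:arcstats:l2_size
  kstat -p zfs:0:arcstats:l2_hits zfs:0:arcstats:l2_misses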
Sašo Kiselkov
2012-Aug-01 14:33 UTC
[zfs-discuss] Can the ZFS "copies" attribute substitute HW disk redundancy?
On 08/01/2012 04:14 PM, Jim Klimov wrote:

> 2012-08-01 17:55, Sašo Kiselkov wrote:
>> On 08/01/2012 03:35 PM, opensolarisisdeadlongliveopensolaris wrote:
>>>> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-
>>>> bounces at opensolaris.org] On Behalf Of Jim Klimov
>>>>
>>>> Availability of the DDT is IMHO crucial to a deduped pool, so
>>>> I won't be surprised to see it forced to triple copies.
>>>
>>> IMHO, the more important thing for dedup moving forward is to create
>>> an option to dedicate a fast device (SSD or whatever) to the DDT, so
>>> all those little random IO operations never hit the rusty side of the
>>> pool.
>>
>> That's something you can already do with an L2ARC. In the future I plan
>> on investigating implementing a set of more fine-grained ARC and L2ARC
>> policy tuning parameters that would give more control into the hands of
>> admins over how the ARC/L2ARC cache is used.
>
> Unfortunately, as of the current implementations, the L2ARC starts up cold.

Yes, that's by design, because the L2ARC is simply a secondary backing store for ARC blocks. If the memory pointer isn't valid, chances are you'll still be able to find the block on the L2ARC devices. You can't scan an L2ARC device and discover some usable structures, as there aren't any. It's literally just a big pile of disk blocks, and their associated ARC headers only live in RAM.

> chances are that
> some blocks of userdata might be more popular than a DDT block and
> would push it out of the L2ARC as well...

Which is why I plan on investigating implementing some tunable policy module that would allow the administrator to get around this problem. E.g. the administrator dedicates 50G of ARC space to metadata (which includes the DDT), or only to the DDT specifically. My idea is still a bit fuzzy, but it revolves primarily around allocating and policing min and max quotas for a given ARC entry type. I'll start a separate discussion thread for this later on, once I have everything organized in my mind about where I plan on taking this.

Cheers,
--
Saso
Nigel W
2012-Aug-01 15:30 UTC
[zfs-discuss] Can the ZFS "copies" attribute substitute HW disk redundancy?
On Wed, Aug 1, 2012 at 8:33 AM, Sašo Kiselkov <skiselkov.ml at gmail.com> wrote:

> On 08/01/2012 04:14 PM, Jim Klimov wrote:
>> chances are that
>> some blocks of userdata might be more popular than a DDT block and
>> would push it out of the L2ARC as well...
>
> Which is why I plan on investigating implementing some tunable policy
> module that would allow the administrator to get around this problem.
> E.g. the administrator dedicates 50G of ARC space to metadata (which
> includes the DDT), or only to the DDT specifically. My idea is still a bit
> fuzzy, but it revolves primarily around allocating and policing min and
> max quotas for a given ARC entry type. I'll start a separate discussion
> thread for this later on, once I have everything organized in my mind
> about where I plan on taking this.

Yes. +1

The L2ARC as it is currently implemented is not terribly useful for storing the DDT anyway, because each DDT entry is 376 bytes but the L2ARC reference is 176 bytes - so, best case, you get just over double the DDT entries in the L2ARC compared to what you would fit into the ARC, but then you also have no ARC left for anything else :(.

I think a fantastic idea for dealing with the DDT (and all other metadata for that matter) would be an option to put (a copy of) metadata exclusively on an SSD.
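A back-of-the-envelope check of that ratio (the 376-byte and 176-byte figures are the ones quoted in this thread, so treat them as assumptions):

  # With 1 GiB of RAM devoted to dedup metadata:
  echo $((1024*1024*1024 / 376))   # ~2.86M DDT entries held directly in ARC
  echo $((1024*1024*1024 / 176))   # ~6.10M entries addressable via L2ARC headers
  # roughly a 2.1x gain in entry count, while that same RAM is no longer
  # available to cache anything else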
opensolarisisdeadlongliveopensolaris
2012-Aug-01 18:07 UTC
[zfs-discuss] Can the ZFS "copies" attribute substitute HW disk redundancy?
> From: Sašo Kiselkov [mailto:skiselkov.ml at gmail.com]
> Sent: Wednesday, August 01, 2012 9:56 AM
>
> On 08/01/2012 03:35 PM, opensolarisisdeadlongliveopensolaris wrote:
> >> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-
> >> bounces at opensolaris.org] On Behalf Of Jim Klimov
> >>
> >> Availability of the DDT is IMHO crucial to a deduped pool, so
> >> I won't be surprised to see it forced to triple copies.
> >
> > IMHO, the more important thing for dedup moving forward is to create an
> > option to dedicate a fast device (SSD or whatever) to the DDT, so all those
> > little random IO operations never hit the rusty side of the pool.
>
> That's something you can already do with an L2ARC. In the future I plan
> on investigating implementing a set of more fine-grained ARC and L2ARC
> policy tuning parameters that would give more control into the hands of
> admins over how the ARC/L2ARC cache is used.

L2ARC is a read cache. Hence the "R" and "C" in "L2ARC."

This means two major things:
#1 Writes don't benefit, and
#2 There's no way to load the whole DDT into the cache anyway. So you're guaranteed to have performance degradation with the dedup.
opensolarisisdeadlongliveopensolaris
2012-Aug-01 18:13 UTC
[zfs-discuss] Can the ZFS "copies" attribute substitute HW disk redundancy?
> From: opensolarisisdeadlongliveopensolaris
> Sent: Wednesday, August 01, 2012 2:08 PM
>
> L2ARC is a read cache. Hence the "R" and "C" in "L2ARC."
> This means two major things:
> #1 Writes don't benefit, and
> #2 There's no way to load the whole DDT into the cache anyway. So you're
> guaranteed to have performance degradation with the dedup.

In other words, the DDT is always written to rust (written in the main pool). You gain some performance by adding arc/l2arc/log devices, but that can only reduce the problem, not solve it.

The problem would be solved if you could choose to dedicate an SSD mirror to the DDT, and either allow the pool size to be limited by the amount of DDT storage available, or overflow into the main pool if the DDT device got full.
Jim Klimov
2012-Aug-01 18:14 UTC
[zfs-discuss] Can the ZFS "copies" attribute substitute HW disk redundancy?
2012-08-01 22:07, opensolarisisdeadlongliveopensolaris wrote:

> L2ARC is a read cache. Hence the "R" and "C" in "L2ARC."

"R" is replacement, but what the hell ;)

> This means two major things:
> #1 Writes don't benefit, and
> #2 There's no way to load the whole DDT into the cache anyway. So you're guaranteed to have performance degradation with the dedup.

If the whole DDT does make it into the cache, or onto an SSD storing an extra copy of all pool metadata, then searching for a particular entry in the DDT would be faster. When you write (or delete) and need to update the counters in the DDT, or even ultimately remove an unreferenced entry, then you benefit on writes as well - you don't take as long to find the DDT entries (or determine the lack thereof) for the blocks you add or remove.

Or did I get your answer wrong? ;)

//Jim
Tomas Forsman
2012-Aug-01 18:41 UTC
[zfs-discuss] Can the ZFS "copies" attribute substitute HW disk redundancy?
On 01 August, 2012 - opensolarisisdeadlongliveopensolaris sent me these 1,8K bytes:

> > From: Sašo Kiselkov [mailto:skiselkov.ml at gmail.com]
> > Sent: Wednesday, August 01, 2012 9:56 AM
> >
> > On 08/01/2012 03:35 PM, opensolarisisdeadlongliveopensolaris wrote:
> > >> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-
> > >> bounces at opensolaris.org] On Behalf Of Jim Klimov
> > >>
> > >> Availability of the DDT is IMHO crucial to a deduped pool, so
> > >> I won't be surprised to see it forced to triple copies.
> > >
> > > IMHO, the more important thing for dedup moving forward is to create an
> > > option to dedicate a fast device (SSD or whatever) to the DDT, so all those
> > > little random IO operations never hit the rusty side of the pool.
> >
> > That's something you can already do with an L2ARC. In the future I plan
> > on investigating implementing a set of more fine-grained ARC and L2ARC
> > policy tuning parameters that would give more control into the hands of
> > admins over how the ARC/L2ARC cache is used.
>
> L2ARC is a read cache. Hence the "R" and "C" in "L2ARC."

"Adaptive Replacement Cache", right.

> This means two major things:
> #1 Writes don't benefit, and
> #2 There's no way to load the whole DDT into the cache anyway. So you're guaranteed to have performance degradation with the dedup.

/Tomas
--
Tomas Forsman, stric at acc.umu.se, http://www.acc.umu.se/~stric/
|- Student at Computing Science, University of Umeå
`- Sysadmin at {cs,acc}.umu.se
opensolarisisdeadlongliveopensolaris
2012-Aug-01 19:34 UTC
[zfs-discuss] Can the ZFS "copies" attribute substitute HW disk redundancy?
> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-
> bounces at opensolaris.org] On Behalf Of Jim Klimov
>
> Well, there are at least a couple of failure scenarios where
> copies>1 are good:
>
> 1) A single-disk pool, as in a laptop. Noise on the bus,
> media degradation, or any other reason to misread or
> miswrite a block can result in a failed pool.

How does mac/win/lin handle this situation? (Not counting btrfs.)

Such noise might result in a temporarily faulted pool (blue screen of death) that is fully recovered after reboot. Meanwhile you're always paying for it in terms of performance, and it's all solvable via pool redundancy.
opensolarisisdeadlongliveopensolaris
2012-Aug-01 19:40 UTC
[zfs-discuss] Can the ZFS "copies" attribute substitute HW disk redundancy?
> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-
> bounces at opensolaris.org] On Behalf Of Jim Klimov
>
> 2012-08-01 22:07, opensolarisisdeadlongliveopensolaris wrote:
> > L2ARC is a read cache. Hence the "R" and "C" in "L2ARC."
>
> "R" is replacement, but what the hell ;)
>
> > This means two major things:
> > #1 Writes don't benefit, and
> > #2 There's no way to load the whole DDT into the cache anyway. So you're
> > guaranteed to have performance degradation with the dedup.
>
> If the whole DDT does make it into the cache, or onto an SSD
> storing an extra copy of all pool metadata, then searching
> for a particular entry in the DDT would be faster. When you write
> (or delete) and need to update the counters in the DDT, or even
> ultimately remove an unreferenced entry, then you benefit on
> writes as well - you don't take as long to find the DDT entries
> (or determine the lack thereof) for the blocks you add or remove.
>
> Or did I get your answer wrong? ;)

Agreed, ARC/L2ARC help in finding the DDT, but whenever you've got a snapshot destroy (happens every 15 minutes) you've got a lot of entries you need to write. Those are all scattered about the pool... Even if you can find them fast, it's still a bear.
Jim Klimov
2012-Aug-01 19:51 UTC
[zfs-discuss] Can the ZFS "copies" attribute substitute HW disk redundancy?
2012-08-01 23:40, opensolarisisdeadlongliveopensolaris wrote:

> Agreed, ARC/L2ARC help in finding the DDT, but whenever you've got a snapshot destroy (happens every 15 minutes) you've got a lot of entries you need to write. Those are all scattered about the pool... Even if you can find them fast, it's still a bear.

No, the entries you need to update are scattered around your SSD (be it ARC or a hypothetical SSD-based copy of metadata, which I also "campaigned" for some time ago). We agreed (or assumed) that with SSDs in place you can find the DDT entries to update relatively fast now. The values are changed in RAM and flushed to disk as part of an upcoming TXG commit, likely in a limited number of disk head strokes (lots to coalesce), and the way I see it, the updated copy remains in the ARC instead of the obsolete DDT entry, and can make it into the L2ARC sometime in the future as well.

//Jim
Jim Klimov
2012-Aug-01 19:51 UTC
[zfs-discuss] Can the ZFS "copies" attribute substitute HW disk redundancy?
2012-08-01 23:34, opensolarisisdeadlongliveopensolaris wrote:

>> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-
>> bounces at opensolaris.org] On Behalf Of Jim Klimov
>>
>> Well, there are at least a couple of failure scenarios where
>> copies>1 are good:
>>
>> 1) A single-disk pool, as in a laptop. Noise on the bus,
>> media degradation, or any other reason to misread or
>> miswrite a block can result in a failed pool.
>
> How does mac/win/lin handle this situation? (Not counting btrfs.)
>
> Such noise might result in a temporarily faulted pool (blue screen of death) that is fully recovered after reboot.

In some of my cases I was "lucky" enough to get a corrupted /sbin/init or something like that once, and the box had no other BEs yet, so the OS could not do anything reasonable after boot. It is different from a "corrupted zpool", but it ended in a useless OS image due to one broken sector nonetheless.

> Meanwhile you're always paying for it in terms of performance, and it's all solvable via pool redundancy.

For a single-disk box, "copies" IS the redundancy. ;)

The discussion did stray off from my original question, though ;)

//Jim
Peter Jeremy
2012-Aug-01 21:41 UTC
[zfs-discuss] Can the ZFS "copies" attribute substitute HW disk redundancy?
On 2012-Aug-01 21:00:46 +0530, Nigel W <nigel.w at nosun.ca> wrote:

> I think a fantastic idea for dealing with the DDT (and all other
> metadata for that matter) would be an option to put (a copy of)
> metadata exclusively on an SSD.

This is on my wishlist as well. I believe ZEVO supports it, so possibly it'll be available in ZFS in the near future.

--
Peter Jeremy
opensolarisisdeadlongliveopensolaris
2012-Aug-02 12:55 UTC
[zfs-discuss] Can the ZFS "copies" attribute substitute HW disk redundancy?
> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-
> bounces at opensolaris.org] On Behalf Of Jim Klimov
>
> 2012-08-01 23:40, opensolarisisdeadlongliveopensolaris wrote:
>
> > Agreed, ARC/L2ARC help in finding the DDT, but whenever you've got a
> > snapshot destroy (happens every 15 minutes) you've got a lot of entries you
> > need to write. Those are all scattered about the pool... Even if you can find
> > them fast, it's still a bear.
>
> No, the entries you need to update are scattered around your
> SSD (be it ARC or a hypothetical SSD-based copy of metadata,
> which I also "campaigned" for some time ago).

If they were scattered around a hypothetical dedicated DDT SSD, I would say: no problem. But in reality, they're scattered in your main pool.

DDT writes don't get coalesced. Is that simply because they're sync writes? Or is it because they're metadata, which is at an even lower level than sync writes? I know, for example, that you can disable the ZIL on your pool, but the system is still going to flush the buffer after certain operations, such as writing the uberblock. I have not seen the code that flushes the buffer after DDT writes, but I have seen the performance evidence.
opensolarisisdeadlongliveopensolaris
2012-Aug-02 13:00 UTC
[zfs-discuss] Can the ZFS "copies" attribute substitute HW disk redundancy?
> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-
> bounces at opensolaris.org] On Behalf Of Jim Klimov
>
> In some of my cases I was "lucky" enough to get a corrupted /sbin/init
> or something like that once, and the box had no other BEs yet, so the
> OS could not do anything reasonable after boot. It is different from a
> "corrupted zpool", but it ended in a useless OS image due to one broken
> sector nonetheless.

That's very annoying, but if "copies" could have saved you, then pool redundancy could have also saved you.

> For a single-disk box, "copies" IS the redundancy. ;)

Ok, so the point is, in some cases, somebody might want redundancy on a device that has no redundancy. They're willing to pay for it by halving their performance. The only situation I'll acknowledge is the laptop situation, and I'll say that, present day, very few people would be willing to pay *that* much for this limited use-case redundancy. The solution that I as an IT person would recommend and deploy would be to run without "copies" and instead cover your bum by doing backups.
Toby Thain
2012-Aug-02 17:34 UTC
[zfs-discuss] single-disk pool - Re: Can the ZFS "copies" attribute substitute HW disk redundancy?
On 01/08/12 3:34 PM, opensolarisisdeadlongliveopensolaris wrote:

>> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-
>> bounces at opensolaris.org] On Behalf Of Jim Klimov
>>
>> Well, there are at least a couple of failure scenarios where
>> copies>1 are good:
>>
>> 1) A single-disk pool, as in a laptop. Noise on the bus,
>> media degradation, or any other reason to misread or
>> miswrite a block can result in a failed pool.
>
> How does mac/win/lin handle this situation? (Not counting btrfs.)

Is this a trick question? :)

--Toby

> Such noise might result in a temporarily faulted pool (blue screen of death) that is fully recovered after reboot. Meanwhile you're always paying for it in terms of performance, and it's all solvable via pool redundancy.
Richard Elling
2012-Aug-02 21:36 UTC
[zfs-discuss] Can the ZFS "copies" attribute substitute HW disk redundancy?
On Aug 1, 2012, at 2:41 PM, Peter Jeremy wrote:

> On 2012-Aug-01 21:00:46 +0530, Nigel W <nigel.w at nosun.ca> wrote:
>> I think a fantastic idea for dealing with the DDT (and all other
>> metadata for that matter) would be an option to put (a copy of)
>> metadata exclusively on an SSD.
>
> This is on my wishlist as well. I believe ZEVO supports it, so possibly
> it'll be available in ZFS in the near future.

ZEVO does not. The only ZFS vendor I'm aware of with a separate top-level vdev for metadata is Tegile, and it is available today.
 -- richard

--
ZFS Performance and Training
Richard.Elling at RichardElling.com
+1-760-896-4422
Richard Elling
2012-Aug-02 21:39 UTC
[zfs-discuss] Can the ZFS "copies" attribute substitute HW disk redundancy?
On Aug 1, 2012, at 8:30 AM, Nigel W wrote:

> On Wed, Aug 1, 2012 at 8:33 AM, Sašo Kiselkov <skiselkov.ml at gmail.com> wrote:
>> On 08/01/2012 04:14 PM, Jim Klimov wrote:
>>> chances are that
>>> some blocks of userdata might be more popular than a DDT block and
>>> would push it out of the L2ARC as well...
>>
>> Which is why I plan on investigating implementing some tunable policy
>> module that would allow the administrator to get around this problem.
>> E.g. the administrator dedicates 50G of ARC space to metadata (which
>> includes the DDT), or only to the DDT specifically. My idea is still a bit
>> fuzzy, but it revolves primarily around allocating and policing min and
>> max quotas for a given ARC entry type. I'll start a separate discussion
>> thread for this later on, once I have everything organized in my mind
>> about where I plan on taking this.
>
> Yes. +1
>
> The L2ARC as it is currently implemented is not terribly useful for
> storing the DDT anyway, because each DDT entry is 376 bytes but the
> L2ARC reference is 176 bytes - so, best case, you get just over double
> the DDT entries in the L2ARC compared to what you would fit into the ARC,
> but then you also have no ARC left for anything else :(.

You are making the assumption that each DDT table entry consumes one metadata update. This is not the case. The DDT is implemented as an AVL tree. As with other metadata in ZFS, the data is compressed, so you cannot make a direct correlation between the DDT entry size and the effect on the stored metadata in disk sectors.
 -- richard

--
ZFS Performance and Training
Richard.Elling at RichardElling.com
+1-760-896-4422
Peter Jeremy
2012-Aug-02 21:59 UTC
[zfs-discuss] Can the ZFS "copies" attribute substitute HW disk redundancy?
On 2012-Aug-02 18:30:01 +0530, opensolarisisdeadlongliveopensolaris <opensolarisisdeadlongliveopensolaris at nedharvey.com> wrote:

> Ok, so the point is, in some cases, somebody might want redundancy on
> a device that has no redundancy. They're willing to pay for it by
> halving their performance.

This isn't quite true - write performance will be at least halved (possibly worse due to additional seeking), but read performance could potentially improve (more copies means, on average, there should be less seeking to get a copy than if there was only one copy). And non-IO performance is unaffected.

> The only situation I'll acknowledge is
> the laptop situation, and I'll say that, present day, very few people would
> be willing to pay *that* much for this limited use-case redundancy.

My guess is that, for most people, the overall performance impact would be minimal, because disk write performance isn't the limiting factor for most laptop usage scenarios.

> The solution that I as an IT person would recommend and deploy would
> be to run without "copies" and instead cover your bum by doing backups.

You need backups in any case, but backups won't help you if you can't conveniently access them. Before giving a blanket recommendation, you need to consider how the person uses their laptop. Consider the following scenario: you're in the middle of a week-long business trip and your laptop develops a bad sector in an inconvenient spot. Do you:
a) Let ZFS automagically repair the sector thanks to copies=2.
b) Attempt to rebuild your laptop and restore from backups (left securely at home) via the dodgy hotel wifi.

--
Peter Jeremy
Nigel W
2012-Aug-03 00:40 UTC
[zfs-discuss] Can the ZFS "copies" attribute substitute HW disk redundancy?
On Thu, Aug 2, 2012 at 3:39 PM, Richard Elling <richard.elling at gmail.com> wrote:

> On Aug 1, 2012, at 8:30 AM, Nigel W wrote:
>
>> Yes. +1
>>
>> The L2ARC as it is currently implemented is not terribly useful for
>> storing the DDT anyway, because each DDT entry is 376 bytes but the
>> L2ARC reference is 176 bytes - so, best case, you get just over double
>> the DDT entries in the L2ARC compared to what you would fit into the ARC,
>> but then you also have no ARC left for anything else :(.
>
> You are making the assumption that each DDT table entry consumes one
> metadata update. This is not the case. The DDT is implemented as an AVL
> tree. As with other metadata in ZFS, the data is compressed, so you cannot
> make a direct correlation between the DDT entry size and the effect on the
> stored metadata in disk sectors.
> -- richard

It's compressed even when in the ARC?
Richard Elling
2012-Aug-03 14:55 UTC
[zfs-discuss] Can the ZFS "copies" attribute substitute HW disk redundancy?
On Aug 2, 2012, at 5:40 PM, Nigel W wrote:

> On Thu, Aug 2, 2012 at 3:39 PM, Richard Elling <richard.elling at gmail.com> wrote:
>> On Aug 1, 2012, at 8:30 AM, Nigel W wrote:
>>
>>> Yes. +1
>>>
>>> The L2ARC as it is currently implemented is not terribly useful for
>>> storing the DDT anyway, because each DDT entry is 376 bytes but the
>>> L2ARC reference is 176 bytes - so, best case, you get just over double
>>> the DDT entries in the L2ARC compared to what you would fit into the ARC,
>>> but then you also have no ARC left for anything else :(.
>>
>> You are making the assumption that each DDT table entry consumes one
>> metadata update. This is not the case. The DDT is implemented as an AVL
>> tree. As with other metadata in ZFS, the data is compressed, so you cannot
>> make a direct correlation between the DDT entry size and the effect on the
>> stored metadata in disk sectors.
>> -- richard
>
> It's compressed even when in the ARC?

That is a slightly odd question. The ARC contains ZFS blocks. DDT metadata is manipulated in memory as an AVL tree, so what you see in the ARC is the metadata blocks that were read and uncompressed from the pool, or packaged in blocks and written to the pool. Perhaps it is easier to think of them as metadata in transition? :-)
 -- richard

--
ZFS Performance and Training
Richard.Elling at RichardElling.com
+1-760-896-4422