Jim Klimov
2012-Jul-29 14:03 UTC
[zfs-discuss] Can the ZFS "copies" attribute substitute HW disk redundancy?
Hello all,

Over the past few years there have been many posts suggesting that for modern HDDs (several TB in size, around 100-200MB/s best speed) the rebuild times grow exponentially, so to build a well-protected pool with these disks one has to plan for about three disks' worth of redundancy - that is, three- or four-way mirrors, or raidz3 - just to allow systems to survive a disk outage (with acceptably high probability of success) while one is resilvering.

There were many posts on this matter from esteemed members of the list, including (but certainly not limited to) these articles:
* https://blogs.oracle.com/ahl/entry/triple_parity_raid_z
* https://blogs.oracle.com/ahl/entry/acm_triple_parity_raid
* http://queue.acm.org/detail.cfm?id=1670144
* http://blog.richardelling.com/2010/02/zfs-data-protection-comparison.html

Now, this brings me to a question: when people build a home-NAS box, they are quite constrained in the number of directly attached disks (about 4-6 bays), or, even if they use external JBODs, in the number of disks per enclosure (up to 8, which does allow a 5+3 raidz3 set in a single box - though that still seems like a large overhead to some buyers; a 4*2 mirror would give about as much space and higher performance, but may have unacceptably less redundancy). If I want considerable storage, with proper reliability, and just a handful of drives, what are my best options?

I wondered if the "copies" attribute can be considered sort of equivalent to the number of physical disks - limited by seek times, though. Namely, for the same amount of storage on a 4-HDD box I could use raidz1 and 4*1tb@copies=1, or 4*2tb@copies=2, or even 4*3tb@copies=3, for example.

To simplify matters, let's assume that this is a small box (under 10GB RAM) not using dedup, though it would likely use compression :)

Question to theorists and practitioners: is any of these options better or worse than the others, in terms of reliability and access/rebuild/scrub speeds, for either a single-sector error or for a full-disk replacement?

Would extra copies on larger disks actually provide the extra reliability, or only add overheads and complicate/degrade the situation?

Would the use of several copies cripple the write speeds?

Can the extra copies be used by the zio scheduler to optimize and speed up reads, like extra mirror sides would?

Thanks,
//Jim Klimov
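For concreteness, a minimal sketch of the kind of setup being compared - a 4-disk raidz1 pool with the copies property raised on top of it. Device and pool names here are hypothetical, and note that copies only affects data written after the property is set:

  # 4 disks in raidz1, default copies=1
  zpool create tank raidz1 c0t0d0 c0t1d0 c0t2d0 c0t3d0

  # Same layout with larger disks, storing two copies of each user-data block
  zfs set copies=2 tank

  # copies is a per-dataset property, so it can also be raised selectively
  zfs set copies=3 tank/important
  zfs get copies tank/important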
Roy Sigurd Karlsbakk
2012-Jul-29 15:36 UTC
[zfs-discuss] Can the ZFS "copies" attribute substitute HW disk redundancy?
"copies" won''t help much if the pool is unavailable. It may, however, help if, say, you have a RAIDz2, and two drives die, and htere are errors on a third drive, but not sufficiently bad for zfs to reject the pool roy ----- Opprinnelig melding -----> Hello all, > > Over the past few years there have been many posts suggesting > that for modern HDDs (several TB size, around 100-200MB/s best > speed) the rebuild times grow exponentially, so to build a well > protected pool with these disks one has to plan for about three > disk''s worth of redundancy - that is, three- or four-way mirrors, > or raidz3 - just to allow systems to survive a disk outage (with > accpetably high probability of success) while one is resilvering. > > There were many posts on this matter from esteemed members of > the list, including (but certainly not limited to) these articles: > * https://blogs.oracle.com/ahl/entry/triple_parity_raid_z > * https://blogs.oracle.com/ahl/entry/acm_triple_parity_raid > * http://queue.acm.org/detail.cfm?id=1670144 > * > http://blog.richardelling.com/2010/02/zfs-data-protection-comparison.html > > Now, this brings me to such a question: when people build a > home-NAS box, they are quite constrained in terms of the number > of directly attached disks (about 4-6 bays), or even if they > use external JBODs - to the number of disks in them (up to 8, > which does allow a 5+3 raidz3 set in a single box, which still > seems like a large overhead to some buyers - a 4*2 mirror would > give about as much space and higher performance, but may have > unacceptably less redundancy). If I want to have considerable > storage, with proper reliability, and just a handful of drives, > what are my best options? > > I wondered if the "copies" attribute can be considered sort > of equivalent to the number of physical disks - limited to seek > times though. Namely, for the same amount of storage on a 4-HDD > box I could use raidz1 and 4*1tb at copies=1 or 4*2tb at copies=2 or > even 4*3tb at copies=3, for example. > > To simplify the matters, let''s assume that this is a small > box (under 10GB RAM) not using dedup, though it would likely > use compression :) > > Question to theorists and practicians: is any of these options > better or worse than the others, in terms of reliability and > access/rebuild/scrub speeds, for either a single-sector error > or for a full-disk replacement? > > Would extra copies on larger disks actually provide the extra > reliability, or only add overheads and complicate/degrade the > situation? > > Would the use of several copies cripple the write speeds? > > Can the extra copies be used by zio scheduler to optimize and > speed up reads, like extra mirror sides would? > > Thanks, > //Jim Klimov > > _______________________________________________ > zfs-discuss mailing list > zfs-discuss at opensolaris.org > http://mail.opensolaris.org/mailman/listinfo/zfs-discuss-- Vennlige hilsener / Best regards roy -- Roy Sigurd Karlsbakk (+47) 98013356 roy at karlsbakk.net http://blogg.karlsbakk.net/ GPG Public key: http://karlsbakk.net/roysigurdkarlsbakk.pubkey.txt -- I all pedagogikk er det essensielt at pensum presenteres intelligibelt. Det er et element?rt imperativ for alle pedagoger ? unng? eksessiv anvendelse av idiomer med xenotyp etymologi. I de fleste tilfeller eksisterer adekvate og relevante synonymer p? norsk.
Bob Friesenhahn
2012-Jul-29 18:52 UTC
[zfs-discuss] Can the ZFS "copies" attribute substitute HW disk redundancy?
On Sun, 29 Jul 2012, Jim Klimov wrote:

> Would extra copies on larger disks actually provide the extra
> reliability, or only add overheads and complicate/degrade the
> situation?

My opinion is that complete hard drive failure and block-level media failure are two totally different things. Complete hard drive failure rates should not be directly related to total storage size, whereas the probability of media failure per drive is directly related to total storage size. Given this, and assuming that complete hard drive failure occurs much less often than partial media failure, using the copies feature should be pretty effective.

> Would the use of several copies cripple the write speeds?

It would cut the write rate in half, or divide it by whatever number of copies you have requested.

Bob
--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
opensolarisisdeadlongliveopensolaris
2012-Jul-29 20:12 UTC
[zfs-discuss] Can the ZFS "copies" attribute substitute HW disk redundancy?
> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-
> bounces at opensolaris.org] On Behalf Of Jim Klimov
>
> I wondered if the "copies" attribute can be considered sort
> of equivalent to the number of physical disks - limited by seek
> times, though. Namely, for the same amount of storage on a 4-HDD
> box I could use raidz1 and 4*1tb@copies=1 or 4*2tb@copies=2 or
> even 4*3tb@copies=3, for example.

The first question - reliability...

"copies" might be on the same disk. So it's not guaranteed to help if you have a disk failure.

Let's try this: take a disk, slice it into two partitions, and then make a mirror using the two partitions. That is about as useful as the copies property: half the write speed, half the usable disk capacity, improved redundancy against bad blocks, but no better redundancy against disk failure. ("copies" will actually be better, because unlike the partitioning scenario, "copies" will sometimes write the extra copies to other disks.)

Re: the assumption - lower performance with larger disks... rebuild time growing exponentially... I don't buy it, and I don't see that argument being made in any of the messages you referenced. Rebuild time depends on the amount of data in the vdev and the layout of said data, so if you compare a single mirror of 3T disks against 6 vdevs each mirroring 500G disks, then the larger disks do resilver slower - because a larger amount of data needs to resilver: you have to resilver all your data instead of 1/6th of your data.
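A minimal sketch of that thought experiment, purely to illustrate the analogy (slice names are hypothetical, and this is not a recommended layout):

  # Two slices carved out of the same physical disk, mirrored against
  # each other: every write happens twice, there is protection against
  # bad sectors, but none against the disk itself dying.
  zpool create demo mirror c0t0d0s0 c0t0d0s1
  zpool status demo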
GREGG WONDERLY
2012-Jul-30 14:11 UTC
[zfs-discuss] Can the ZFS "copies" attribute substitute HW disk redundancy?
On Jul 29, 2012, at 3:12 PM, opensolarisisdeadlongliveopensolaris <opensolarisisdeadlongliveopensolaris at nedharvey.com> wrote:

>> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-
>> bounces at opensolaris.org] On Behalf Of Jim Klimov
>>
>> I wondered if the "copies" attribute can be considered sort
>> of equivalent to the number of physical disks - limited by seek
>> times, though. Namely, for the same amount of storage on a 4-HDD
>> box I could use raidz1 and 4*1tb@copies=1 or 4*2tb@copies=2 or
>> even 4*3tb@copies=3, for example.
>
> The first question - reliability...
>
> "copies" might be on the same disk. So it's not guaranteed to help if you have a disk failure.

I thought I understood that copies would not be on the same disk; I guess I need to go read up on this again.

Gregg Wonderly
John Martin
2012-Jul-30 17:06 UTC
[zfs-discuss] Can the ZFS "copies" attribute substitute HW disk redundancy?
On 07/29/12 14:52, Bob Friesenhahn wrote:

> My opinion is that complete hard drive failure and block-level media
> failure are two totally different things.

That would depend on the recovery behavior of the drive for block-level media failure. A drive whose firmware does excessive retries of a bad sector (there are reports of up to 2 minutes) may be indistinguishable from a failed drive. See previous discussions of the firmware differences between desktop and enterprise drives.
Brandon High
2012-Jul-30 20:37 UTC
[zfs-discuss] Can the ZFS "copies" attribute substitute HW disk redundancy?
On Mon, Jul 30, 2012 at 7:11 AM, GREGG WONDERLY <greggwon at gmail.com> wrote:

> I thought I understood that copies would not be on the same disk; I guess I need to go read up on this again.

ZFS attempts to put copies on separate devices, but there's no guarantee.

-B

--
Brandon High : bhigh at freaks.com
Nico Williams
2012-Jul-30 20:48 UTC
[zfs-discuss] Can the ZFS "copies" attribute substitute HW disk redundancy?
The copies thing is really only for laptops, where the likelihood of redundancy is very low (there are some high-end laptops with multiple drives, but those are relatively rare) and where this idea is better than nothing.

It's also nice that copies can be set on a per-dataset basis (whereas RAID-Zn and mirroring are for pool-wide redundancy, not per-dataset), so you could set it > 1 on home directories but not on /.

Nico
--
opensolarisisdeadlongliveopensolaris
2012-Jul-31 13:55 UTC
[zfs-discuss] Can the ZFS "copies" attribute substitute HW disk redundancy?
> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-
> bounces at opensolaris.org] On Behalf Of Nico Williams
>
> The copies thing is really only for laptops, where the likelihood of
> redundancy is very low

ZFS also stores multiple copies of things that it considers "extra important" - I'm not sure what exactly; the uberblock, or stuff like that...

When you set the "copies" property, you're just making that apply to other stuff that would otherwise have only one copy.
Jim Klimov
2012-Aug-01 10:04 UTC
[zfs-discuss] Can the ZFS "copies" attribute substitute HW disk redundancy?
2012-07-31 17:55, opensolarisisdeadlongliveopensolaris wrote:

>> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-
>> bounces at opensolaris.org] On Behalf Of Nico Williams
>>
>> The copies thing is really only for laptops, where the likelihood of
>> redundancy is very low
>
> ZFS also stores multiple copies of things that it considers "extra important" - I'm not sure what exactly; the uberblock, or stuff like that...
>
> When you set the "copies" property, you're just making that apply to other stuff that would otherwise have only one copy.

IIRC, the "copies" defaults are:
  1 copy for userdata;
  2 copies for regular metadata (the block-pointer tree);
  3 copies for higher-level metadata (metadata tree root, dataset definitions).

The "uberblock" I am not so sure about, off the top of my head. There is a record in the ZFS labels, and that is stored 4 times on each leaf VDEV, and points to a ZFS block with the tree root for the current (newest consistent, flushed-to-pool) TXG number. Which one of these concepts is named the 00bab10c - *that* I am a bit vague about ;)

Probably the DDT is also stored with 2 or 3 copies of each block, since it is metadata. It was not in the last ZFS on-disk spec from 2006 that I found, for some apparent reason ;)

Also, I am not sure whether bumping the copies attribute to, say, "3" increases only the redundancy of userdata, or of regular metadata as well.

//Jim
Sašo Kiselkov
2012-Aug-01 12:22 UTC
[zfs-discuss] Can the ZFS "copies" attribute substitute HW disk redundancy?
On 08/01/2012 12:04 PM, Jim Klimov wrote:

> Probably the DDT is also stored with 2 or 3 copies of each block,
> since it is metadata. It was not in the last ZFS on-disk spec
> from 2006 that I found, for some apparent reason ;)

That's probably because it's extremely big (dozens, hundreds or even thousands of GB).

Cheers,
--
Saso
Jim Klimov
2012-Aug-01 12:41 UTC
[zfs-discuss] Can the ZFS "copies" attribute substitute HW disk redundancy?
2012-08-01 16:22, Sašo Kiselkov wrote:

> On 08/01/2012 12:04 PM, Jim Klimov wrote:
>> Probably the DDT is also stored with 2 or 3 copies of each block,
>> since it is metadata. It was not in the last ZFS on-disk spec
>> from 2006 that I found, for some apparent reason ;)

The idea of the pun was that the latest available full spec is over half a decade old, alas. At least I failed to find any newer one when I searched last winter. And back in 2006 there was no dedup, nor any mention of it in the spec (surprising, huh? ;)

Hopefully, with all the upcoming changes - including integration of feature flags and new checksum and compression algorithms - a consistent textual document of the "current ZFS on-disk spec in illumos(/FreeBSD/...)" will appear and be maintained up to date.

> That's probably because it's extremely big (dozens, hundreds or even
> thousands of GB).

Availability of the DDT is IMHO crucial to a deduped pool, so I won't be surprised to see it forced to triple copies. Not that it is very difficult to check with ZDB, though finding the DDT "dataset" for inspection (when I last tried) was not an obvious task.

//Jim
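For reference, a hedged sketch of the zdb invocations commonly used to peek at dedup metadata (the pool name is hypothetical, and the exact output differs between builds):

  # Summary and histogram of the dedup table for pool "tank"
  zdb -DD tank

  # Progressively more verbose dumps of the DDT objects themselves
  zdb -DDD tank
  zdb -DDDD tank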
opensolarisisdeadlongliveopensolaris
2012-Aug-01 13:35 UTC
[zfs-discuss] Can the ZFS "copies" attribute substitute HW disk redundancy?
> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-
> bounces at opensolaris.org] On Behalf Of Jim Klimov
>
> Availability of the DDT is IMHO crucial to a deduped pool, so
> I won't be surprised to see it forced to triple copies.

Agreed - although the DDT is also paramount to performance. In theory, an online dedup'd pool could be much faster than non-dedup'd pools, or offline dedup'd pools. So there's a lot of potential here - lost potential, at present.

IMHO, the more important thing for dedup moving forward is to create an option to dedicate a fast device (SSD or whatever) to the DDT, so all those little random IO operations never hit the rusty side of the pool.

Personally, I've never been supportive of the whole "copies" idea. If you need more than one redundant copy of some data, that's why you have pool redundancy. You're just hurting performance by using "copies." And protecting against failure conditions that are otherwise nearly nonexistent... and just as easily solved (without a performance penalty) via pool redundancy.
Sašo Kiselkov
2012-Aug-01 13:55 UTC
[zfs-discuss] Can the ZFS "copies" attribute substitute HW disk redundancy?
On 08/01/2012 03:35 PM, opensolarisisdeadlongliveopensolaris wrote:

>> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-
>> bounces at opensolaris.org] On Behalf Of Jim Klimov
>>
>> Availability of the DDT is IMHO crucial to a deduped pool, so
>> I won't be surprised to see it forced to triple copies.
>
> IMHO, the more important thing for dedup moving forward is to create an option to dedicate a fast device (SSD or whatever) to the DDT, so all those little random IO operations never hit the rusty side of the pool.

That's something you can already do with an L2ARC. In the future I plan on investigating implementing a set of more fine-grained ARC and L2ARC policy tuning parameters that would give more control into the hands of admins over how the ARC/L2ARC cache is used.

Cheers,
--
Saso
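As a rough sketch of the coarse controls that already exist (device and pool names here are hypothetical): a cache device can be added to a pool, and the primarycache/secondarycache dataset properties limit which kinds of blocks the ARC and L2ARC will hold.

  # Add an SSD as an L2ARC device for pool "tank"
  zpool add tank cache c2t0d0

  # Cache only metadata (which includes the DDT) on the L2ARC device,
  # so ordinary user data does not push DDT blocks out
  zfs set secondarycache=metadata tank
  zfs set primarycache=all tank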
Jim Klimov
2012-Aug-01 14:09 UTC
[zfs-discuss] Can the ZFS "copies" attribute substitute HW disk redundancy?
2012-08-01 17:35, opensolarisisdeadlongliveopensolaris wrote:

> Personally, I've never been supportive of the whole "copies" idea. If you need more than one redundant copy of some data, that's why you have pool redundancy. You're just hurting performance by using "copies." And protecting against failure conditions that are otherwise nearly nonexistent... and just as easily solved (without a performance penalty) via pool redundancy.

Well, there are at least a couple of failure scenarios where copies>1 are good:

1) A single-disk pool, as in a laptop. Noise on the bus, media degradation, or any other reason to misread or miswrite a block can result in a failed pool. One of my older test boxes has an untrustworthy 80Gb HDD for its rpool, and the system did crash into an unbootable image with just half a dozen CKSUM errors. I remade the rpool with copies=2 enforced from the start and rsynced the rootfs files back into the new pool - and this thing has worked well since then, despite finding several errors upon each weekly scrub.

2) The data pool on the same box experienced some errors where raidz2 failed to recreate a userdata block, thus invalidating a file despite having a 2-disk redundancy. There was some discussion of that on the list, and my ultimate guess is that the six disks' heads were over similar locations of the same file - i.e. during a scrub - and a power surge or some similar event caused them to scramble portions of the disks pertaining to the same ZFS block. At least, this could have induced enough errors to make raidz2 protection irrelevant. If the pool had copies=2, there would have been another replica of the same block that was not corrupted by such an assumed failure mechanism - because the disk heads were elsewhere.

Hmmm... now I wonder if ZFS checksum validation can try permutations of should-be-identical sectors from different copies of a block - in case both copies have received some non-overlapping errors, and together contain enough data to reconstruct a ZFS block (and rewrite both its copies now).

//Jim
Jim Klimov
2012-Aug-01 14:14 UTC
[zfs-discuss] Can the ZFS "copies" attribute substitute HW disk redundancy?
2012-08-01 17:55, Sašo Kiselkov wrote:

> On 08/01/2012 03:35 PM, opensolarisisdeadlongliveopensolaris wrote:
>>> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-
>>> bounces at opensolaris.org] On Behalf Of Jim Klimov
>>>
>>> Availability of the DDT is IMHO crucial to a deduped pool, so
>>> I won't be surprised to see it forced to triple copies.
>>
>> IMHO, the more important thing for dedup moving forward is to create an option to dedicate a fast device (SSD or whatever) to the DDT, so all those little random IO operations never hit the rusty side of the pool.
>
> That's something you can already do with an L2ARC. In the future I plan
> on investigating implementing a set of more fine-grained ARC and L2ARC
> policy tuning parameters that would give more control into the hands of
> admins over how the ARC/L2ARC cache is used.

Unfortunately, as of the current implementations, the L2ARC starts up cold. That is, upon every import of the pool the L2ARC is empty, and the DDT (as in the example above) would have to migrate into the cache via reads from rust into the RAM ARC and then expiration from the ARC. Getting it hot and fast again takes some time, and chances are that some blocks of userdata might be more popular than a DDT block and would push it out of the L2ARC as well...

//Jim
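A hedged way to watch the cache warming back up after an import, assuming the illumos arcstats kstat names (treat the statistic names as assumptions if your build differs):

  # Bytes currently held on L2ARC devices, and hit/miss counters
  kstat -p zfs:0:arcstats:l2_size
  kstat -p zfs:0:arcstats:l2_hits zfs:0:arcstats:l2_misses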
Sašo Kiselkov
2012-Aug-01 14:33 UTC
[zfs-discuss] Can the ZFS "copies" attribute substitute HW disk redundancy?
On 08/01/2012 04:14 PM, Jim Klimov wrote:

> 2012-08-01 17:55, Sašo Kiselkov wrote:
>> On 08/01/2012 03:35 PM, opensolarisisdeadlongliveopensolaris wrote:
>>>> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-
>>>> bounces at opensolaris.org] On Behalf Of Jim Klimov
>>>>
>>>> Availability of the DDT is IMHO crucial to a deduped pool, so
>>>> I won't be surprised to see it forced to triple copies.
>>>
>>> IMHO, the more important thing for dedup moving forward is to create
>>> an option to dedicate a fast device (SSD or whatever) to the DDT, so
>>> all those little random IO operations never hit the rusty side of the
>>> pool.
>>
>> That's something you can already do with an L2ARC. In the future I plan
>> on investigating implementing a set of more fine-grained ARC and L2ARC
>> policy tuning parameters that would give more control into the hands of
>> admins over how the ARC/L2ARC cache is used.
>
> Unfortunately, as of the current implementations, the L2ARC starts up cold.

Yes, that's by design, because the L2ARC is simply a secondary backing store for ARC blocks. If the memory pointer isn't valid, chances are you'll still be able to find the block on the L2ARC devices. You can't scan an L2ARC device and discover some usable structures, as there aren't any. It's literally just a big pile of disk blocks, and their associated ARC headers only live in RAM.

> chances are that
> some blocks of userdata might be more popular than a DDT block and
> would push it out of the L2ARC as well...

Which is why I plan on investigating implementing some tunable policy module that would allow the administrator to get around this problem. E.g. the administrator dedicates 50G of ARC space to metadata (which includes the DDT), or only to the DDT specifically. My idea is still a bit fuzzy, but it revolves primarily around allocating and policing min and max quotas for a given ARC entry type. I'll start a separate discussion thread for this later on, once I have everything organized in my mind about where I plan on taking this.

Cheers,
--
Saso
Nigel W
2012-Aug-01 15:30 UTC
[zfs-discuss] Can the ZFS "copies" attribute substitute HW disk redundancy?
On Wed, Aug 1, 2012 at 8:33 AM, Sašo Kiselkov <skiselkov.ml at gmail.com> wrote:

> On 08/01/2012 04:14 PM, Jim Klimov wrote:
>> chances are that
>> some blocks of userdata might be more popular than a DDT block and
>> would push it out of the L2ARC as well...
>
> Which is why I plan on investigating implementing some tunable policy
> module that would allow the administrator to get around this problem.
> E.g. the administrator dedicates 50G of ARC space to metadata (which
> includes the DDT), or only to the DDT specifically. My idea is still a bit
> fuzzy, but it revolves primarily around allocating and policing min and
> max quotas for a given ARC entry type. I'll start a separate discussion
> thread for this later on, once I have everything organized in my mind
> about where I plan on taking this.

Yes. +1

The L2ARC as it is currently implemented is not terribly useful for storing the DDT anyway, because each DDT entry is 376 bytes but the L2ARC reference is 176 bytes - so, best case, you get just over double the DDT entries in the L2ARC compared to what you would fit into the ARC, but then you also have no ARC left for anything else :(.

I think a fantastic idea for dealing with the DDT (and all other metadata for that matter) would be an option to put (a copy of) metadata exclusively on an SSD.
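A back-of-the-envelope check of that ratio (the 376-byte and 176-byte figures are the ones quoted in this thread, so treat them as assumptions):

  # With 1 GiB of RAM devoted to dedup metadata:
  echo $((1024*1024*1024 / 376))   # ~2.86M DDT entries held directly in ARC
  echo $((1024*1024*1024 / 176))   # ~6.10M entries addressable via L2ARC headers
  # roughly a 2.1x gain in entry count, while that same RAM is no longer
  # available to cache anything else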
opensolarisisdeadlongliveopensolaris
2012-Aug-01 18:07 UTC
[zfs-discuss] Can the ZFS "copies" attribute substitute HW disk redundancy?
> From: Sašo Kiselkov [mailto:skiselkov.ml at gmail.com]
> Sent: Wednesday, August 01, 2012 9:56 AM
>
> On 08/01/2012 03:35 PM, opensolarisisdeadlongliveopensolaris wrote:
> >> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-
> >> bounces at opensolaris.org] On Behalf Of Jim Klimov
> >>
> >> Availability of the DDT is IMHO crucial to a deduped pool, so
> >> I won't be surprised to see it forced to triple copies.
> >
> > IMHO, the more important thing for dedup moving forward is to create an
> > option to dedicate a fast device (SSD or whatever) to the DDT, so all those
> > little random IO operations never hit the rusty side of the pool.
>
> That's something you can already do with an L2ARC. In the future I plan
> on investigating implementing a set of more fine-grained ARC and L2ARC
> policy tuning parameters that would give more control into the hands of
> admins over how the ARC/L2ARC cache is used.

L2ARC is a read cache. Hence the "R" and "C" in "L2ARC."

This means two major things:
#1 Writes don't benefit, and
#2 There's no way to load the whole DDT into the cache anyway. So you're guaranteed to have performance degradation with the dedup.
opensolarisisdeadlongliveopensolaris
2012-Aug-01 18:13 UTC
[zfs-discuss] Can the ZFS "copies" attribute substitute HW disk redundancy?
> From: opensolarisisdeadlongliveopensolaris
> Sent: Wednesday, August 01, 2012 2:08 PM
>
> L2ARC is a read cache. Hence the "R" and "C" in "L2ARC."
> This means two major things:
> #1 Writes don't benefit, and
> #2 There's no way to load the whole DDT into the cache anyway. So you're
> guaranteed to have performance degradation with the dedup.

In other words, the DDT is always written to rust (written in the main pool). You gain some performance by adding arc/l2arc/log devices, but that can only reduce the problem, not solve it.

The problem would be solved if you could choose to dedicate an SSD mirror to the DDT, and either allow the pool size to be limited by the amount of DDT storage available, or overflow into the main pool if the DDT device got full.
Jim Klimov
2012-Aug-01 18:14 UTC
[zfs-discuss] Can the ZFS "copies" attribute substitute HW disk redundancy?
2012-08-01 22:07, opensolarisisdeadlongliveopensolaris wrote:

> L2ARC is a read cache. Hence the "R" and "C" in "L2ARC."

"R" is replacement, but what the hell ;)

> This means two major things:
> #1 Writes don't benefit, and
> #2 There's no way to load the whole DDT into the cache anyway. So you're guaranteed to have performance degradation with the dedup.

If the whole DDT does make it into the cache, or onto an SSD storing an extra copy of all pool metadata, then searching for a particular entry in the DDT would be faster. When you write (or delete) and need to update the counters in the DDT, or even ultimately remove an unreferenced entry, then you benefit on writes as well - you don't take as long to find the DDT entries (or determine the lack thereof) for the blocks you add or remove.

Or did I get your answer wrong? ;)

//Jim
Tomas Forsman
2012-Aug-01 18:41 UTC
[zfs-discuss] Can the ZFS "copies" attribute substitute HW disk redundancy?
On 01 August, 2012 - opensolarisisdeadlongliveopensolaris sent me these 1,8K bytes:

> > From: Sašo Kiselkov [mailto:skiselkov.ml at gmail.com]
> > Sent: Wednesday, August 01, 2012 9:56 AM
> >
> > On 08/01/2012 03:35 PM, opensolarisisdeadlongliveopensolaris wrote:
> > >> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-
> > >> bounces at opensolaris.org] On Behalf Of Jim Klimov
> > >>
> > >> Availability of the DDT is IMHO crucial to a deduped pool, so
> > >> I won't be surprised to see it forced to triple copies.
> > >
> > > IMHO, the more important thing for dedup moving forward is to create an
> > > option to dedicate a fast device (SSD or whatever) to the DDT, so all those
> > > little random IO operations never hit the rusty side of the pool.
> >
> > That's something you can already do with an L2ARC. In the future I plan
> > on investigating implementing a set of more fine-grained ARC and L2ARC
> > policy tuning parameters that would give more control into the hands of
> > admins over how the ARC/L2ARC cache is used.
>
> L2ARC is a read cache. Hence the "R" and "C" in "L2ARC."

"Adaptive Replacement Cache", right.

> This means two major things:
> #1 Writes don't benefit, and
> #2 There's no way to load the whole DDT into the cache anyway. So you're guaranteed to have performance degradation with the dedup.

/Tomas
--
Tomas Forsman, stric at acc.umu.se, http://www.acc.umu.se/~stric/
|- Student at Computing Science, University of Umeå
`- Sysadmin at {cs,acc}.umu.se
opensolarisisdeadlongliveopensolaris
2012-Aug-01 19:34 UTC
[zfs-discuss] Can the ZFS "copies" attribute substitute HW disk redundancy?
> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-
> bounces at opensolaris.org] On Behalf Of Jim Klimov
>
> Well, there are at least a couple of failure scenarios where
> copies>1 are good:
>
> 1) A single-disk pool, as in a laptop. Noise on the bus,
> media degradation, or any other reason to misread or
> miswrite a block can result in a failed pool.

How does mac/win/lin handle this situation? (Not counting btrfs.)

Such noise might result in a temporarily faulted pool (blue screen of death) that is fully recovered after reboot. Meanwhile you're always paying for it in terms of performance, and it's all solvable via pool redundancy.
opensolarisisdeadlongliveopensolaris
2012-Aug-01 19:40 UTC
[zfs-discuss] Can the ZFS "copies" attribute substitute HW disk redundancy?
> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-
> bounces at opensolaris.org] On Behalf Of Jim Klimov
>
> 2012-08-01 22:07, opensolarisisdeadlongliveopensolaris wrote:
> > L2ARC is a read cache. Hence the "R" and "C" in "L2ARC."
>
> "R" is replacement, but what the hell ;)
>
> > This means two major things:
> > #1 Writes don't benefit, and
> > #2 There's no way to load the whole DDT into the cache anyway. So you're
> > guaranteed to have performance degradation with the dedup.
>
> If the whole DDT does make it into the cache, or onto an SSD
> storing an extra copy of all pool metadata, then searching
> for a particular entry in the DDT would be faster. When you write
> (or delete) and need to update the counters in the DDT, or even
> ultimately remove an unreferenced entry, then you benefit on
> writes as well - you don't take as long to find the DDT entries
> (or determine the lack thereof) for the blocks you add or remove.
>
> Or did I get your answer wrong? ;)

Agreed, ARC/L2ARC help in finding the DDT, but whenever you've got a snapshot destroy (happens every 15 minutes) you've got a lot of entries you need to write. Those are all scattered about the pool... Even if you can find them fast, it's still a bear.
Jim Klimov
2012-Aug-01 19:51 UTC
[zfs-discuss] Can the ZFS "copies" attribute substitute HW disk redundancy?
2012-08-01 23:40, opensolarisisdeadlongliveopensolaris wrote:

> Agreed, ARC/L2ARC help in finding the DDT, but whenever you've got a snapshot destroy (happens every 15 minutes) you've got a lot of entries you need to write. Those are all scattered about the pool... Even if you can find them fast, it's still a bear.

No, the entries you need to update are scattered around your SSD (be it ARC or a hypothetical SSD-based copy of metadata, which I also "campaigned" for some time ago). We agreed (or assumed) that with SSDs in place you can find the DDT entries to update relatively fast now. The values are changed in RAM and flushed to disk as part of an upcoming TXG commit, likely in a limited number of disk head strokes (lots to coalesce), and the way I see it, the updated copy remains in the ARC instead of the obsolete DDT entry, and can make it into the L2ARC sometime in the future as well.

//Jim
Jim Klimov
2012-Aug-01 19:51 UTC
[zfs-discuss] Can the ZFS "copies" attribute substitute HW disk redundancy?
2012-08-01 23:34, opensolarisisdeadlongliveopensolaris wrote:

>> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-
>> bounces at opensolaris.org] On Behalf Of Jim Klimov
>>
>> Well, there are at least a couple of failure scenarios where
>> copies>1 are good:
>>
>> 1) A single-disk pool, as in a laptop. Noise on the bus,
>> media degradation, or any other reason to misread or
>> miswrite a block can result in a failed pool.
>
> How does mac/win/lin handle this situation? (Not counting btrfs.)
>
> Such noise might result in a temporarily faulted pool (blue screen of death) that is fully recovered after reboot.

In some of my cases I was "lucky" enough to get a corrupted /sbin/init or something like that once, and the box had no other BEs yet, so the OS could not do anything reasonable after boot. It is different from a "corrupted zpool", but it ended in a useless OS image due to one broken sector nonetheless.

> Meanwhile you're always paying for it in terms of performance, and it's all solvable via pool redundancy.

For a single-disk box, "copies" IS the redundancy. ;)

The discussion did stray off from my original question, though ;)

//Jim
Peter Jeremy
2012-Aug-01 21:41 UTC
[zfs-discuss] Can the ZFS "copies" attribute substitute HW disk redundancy?
On 2012-Aug-01 21:00:46 +0530, Nigel W <nigel.w at nosun.ca> wrote:

> I think a fantastic idea for dealing with the DDT (and all other
> metadata for that matter) would be an option to put (a copy of)
> metadata exclusively on an SSD.

This is on my wishlist as well. I believe ZEVO supports it, so possibly it'll be available in ZFS in the near future.

--
Peter Jeremy
opensolarisisdeadlongliveopensolaris
2012-Aug-02 12:55 UTC
[zfs-discuss] Can the ZFS "copies" attribute substitute HW disk redundancy?
> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-
> bounces at opensolaris.org] On Behalf Of Jim Klimov
>
> 2012-08-01 23:40, opensolarisisdeadlongliveopensolaris wrote:
>
> > Agreed, ARC/L2ARC help in finding the DDT, but whenever you've got a
> > snapshot destroy (happens every 15 minutes) you've got a lot of entries you
> > need to write. Those are all scattered about the pool... Even if you can find
> > them fast, it's still a bear.
>
> No, the entries you need to update are scattered around your
> SSD (be it ARC or a hypothetical SSD-based copy of metadata,
> which I also "campaigned" for some time ago).

If they were scattered around a hypothetical dedicated DDT SSD, I would say: no problem. But in reality, they're scattered in your main pool.

DDT writes don't get coalesced. Is that simply because they're sync writes? Or is it because they're metadata, which is at an even lower level than sync writes? I know, for example, that you can disable the ZIL on your pool, but the system is still going to flush the buffer after certain operations, such as writing the uberblock. I have not seen the code that flushes the buffer after DDT writes, but I have seen the performance evidence.
opensolarisisdeadlongliveopensolaris
2012-Aug-02 13:00 UTC
[zfs-discuss] Can the ZFS "copies" attribute substitute HW disk redundancy?
> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-
> bounces at opensolaris.org] On Behalf Of Jim Klimov
>
> In some of my cases I was "lucky" enough to get a corrupted /sbin/init
> or something like that once, and the box had no other BEs yet, so the
> OS could not do anything reasonable after boot. It is different from a
> "corrupted zpool", but it ended in a useless OS image due to one broken
> sector nonetheless.

That's very annoying, but if "copies" could have saved you, then pool redundancy could have also saved you.

> For a single-disk box, "copies" IS the redundancy. ;)

Ok, so the point is, in some cases, somebody might want redundancy on a device that has no redundancy. They're willing to pay for it by halving their performance. The only situation I'll acknowledge is the laptop situation, and I'll say that, present day, very few people would be willing to pay *that* much for this limited use-case redundancy. The solution that I as an IT person would recommend and deploy would be to run without "copies" and instead cover your bum by doing backups.
Toby Thain
2012-Aug-02 17:34 UTC
[zfs-discuss] single-disk pool - Re: Can the ZFS "copies" attribute substitute HW disk redundancy?
On 01/08/12 3:34 PM, opensolarisisdeadlongliveopensolaris wrote:

>> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-
>> bounces at opensolaris.org] On Behalf Of Jim Klimov
>>
>> Well, there are at least a couple of failure scenarios where
>> copies>1 are good:
>>
>> 1) A single-disk pool, as in a laptop. Noise on the bus,
>> media degradation, or any other reason to misread or
>> miswrite a block can result in a failed pool.
>
> How does mac/win/lin handle this situation? (Not counting btrfs.)

Is this a trick question? :)

--Toby

> Such noise might result in a temporarily faulted pool (blue screen of death) that is fully recovered after reboot. Meanwhile you're always paying for it in terms of performance, and it's all solvable via pool redundancy.
Richard Elling
2012-Aug-02 21:36 UTC
[zfs-discuss] Can the ZFS "copies" attribute substitute HW disk redundancy?
On Aug 1, 2012, at 2:41 PM, Peter Jeremy wrote:

> On 2012-Aug-01 21:00:46 +0530, Nigel W <nigel.w at nosun.ca> wrote:
>> I think a fantastic idea for dealing with the DDT (and all other
>> metadata for that matter) would be an option to put (a copy of)
>> metadata exclusively on an SSD.
>
> This is on my wishlist as well. I believe ZEVO supports it, so possibly
> it'll be available in ZFS in the near future.

ZEVO does not. The only ZFS vendor I'm aware of with a separate top-level vdev for metadata is Tegile, and it is available today.
 -- richard

--
ZFS Performance and Training
Richard.Elling at RichardElling.com
+1-760-896-4422
Richard Elling
2012-Aug-02 21:39 UTC
[zfs-discuss] Can the ZFS "copies" attribute substitute HW disk redundancy?
On Aug 1, 2012, at 8:30 AM, Nigel W wrote:

> On Wed, Aug 1, 2012 at 8:33 AM, Sašo Kiselkov <skiselkov.ml at gmail.com> wrote:
>> On 08/01/2012 04:14 PM, Jim Klimov wrote:
>>> chances are that
>>> some blocks of userdata might be more popular than a DDT block and
>>> would push it out of the L2ARC as well...
>>
>> Which is why I plan on investigating implementing some tunable policy
>> module that would allow the administrator to get around this problem.
>> E.g. the administrator dedicates 50G of ARC space to metadata (which
>> includes the DDT), or only to the DDT specifically. My idea is still a bit
>> fuzzy, but it revolves primarily around allocating and policing min and
>> max quotas for a given ARC entry type. I'll start a separate discussion
>> thread for this later on, once I have everything organized in my mind
>> about where I plan on taking this.
>
> Yes. +1
>
> The L2ARC as it is currently implemented is not terribly useful for
> storing the DDT anyway, because each DDT entry is 376 bytes but the
> L2ARC reference is 176 bytes - so, best case, you get just over double
> the DDT entries in the L2ARC compared to what you would fit into the ARC,
> but then you also have no ARC left for anything else :(.

You are making the assumption that each DDT table entry consumes one metadata update. This is not the case. The DDT is implemented as an AVL tree. As with other metadata in ZFS, the data is compressed, so you cannot make a direct correlation between the DDT entry size and the effect on the stored metadata in disk sectors.
 -- richard

--
ZFS Performance and Training
Richard.Elling at RichardElling.com
+1-760-896-4422
Peter Jeremy
2012-Aug-02 21:59 UTC
[zfs-discuss] Can the ZFS "copies" attribute substitute HW disk redundancy?
On 2012-Aug-02 18:30:01 +0530, opensolarisisdeadlongliveopensolaris <opensolarisisdeadlongliveopensolaris at nedharvey.com> wrote:

> Ok, so the point is, in some cases, somebody might want redundancy on
> a device that has no redundancy. They're willing to pay for it by
> halving their performance.

This isn't quite true - write performance will be at least halved (possibly worse due to additional seeking), but read performance could potentially improve (more copies means, on average, there should be less seeking to get a copy than if there was only one copy). And non-IO performance is unaffected.

> The only situation I'll acknowledge is
> the laptop situation, and I'll say that, present day, very few people would
> be willing to pay *that* much for this limited use-case redundancy.

My guess is that, for most people, the overall performance impact would be minimal, because disk write performance isn't the limiting factor for most laptop usage scenarios.

> The solution that I as an IT person would recommend and deploy would
> be to run without "copies" and instead cover your bum by doing backups.

You need backups in any case, but backups won't help you if you can't conveniently access them. Before giving a blanket recommendation, you need to consider how the person uses their laptop. Consider the following scenario: you're in the middle of a week-long business trip and your laptop develops a bad sector in an inconvenient spot. Do you:
a) Let ZFS automagically repair the sector thanks to copies=2.
b) Attempt to rebuild your laptop and restore from backups (left securely at home) via the dodgy hotel wifi.

--
Peter Jeremy
Nigel W
2012-Aug-03 00:40 UTC
[zfs-discuss] Can the ZFS "copies" attribute substitute HW disk redundancy?
On Thu, Aug 2, 2012 at 3:39 PM, Richard Elling <richard.elling at gmail.com> wrote:

> On Aug 1, 2012, at 8:30 AM, Nigel W wrote:
>
>> Yes. +1
>>
>> The L2ARC as it is currently implemented is not terribly useful for
>> storing the DDT anyway, because each DDT entry is 376 bytes but the
>> L2ARC reference is 176 bytes - so, best case, you get just over double
>> the DDT entries in the L2ARC compared to what you would fit into the ARC,
>> but then you also have no ARC left for anything else :(.
>
> You are making the assumption that each DDT table entry consumes one
> metadata update. This is not the case. The DDT is implemented as an AVL
> tree. As with other metadata in ZFS, the data is compressed, so you cannot
> make a direct correlation between the DDT entry size and the effect on the
> stored metadata in disk sectors.
> -- richard

It's compressed even when in the ARC?
Richard Elling
2012-Aug-03 14:55 UTC
[zfs-discuss] Can the ZFS "copies" attribute substitute HW disk redundancy?
On Aug 2, 2012, at 5:40 PM, Nigel W wrote:

> On Thu, Aug 2, 2012 at 3:39 PM, Richard Elling <richard.elling at gmail.com> wrote:
>> On Aug 1, 2012, at 8:30 AM, Nigel W wrote:
>>
>>> Yes. +1
>>>
>>> The L2ARC as it is currently implemented is not terribly useful for
>>> storing the DDT anyway, because each DDT entry is 376 bytes but the
>>> L2ARC reference is 176 bytes - so, best case, you get just over double
>>> the DDT entries in the L2ARC compared to what you would fit into the ARC,
>>> but then you also have no ARC left for anything else :(.
>>
>> You are making the assumption that each DDT table entry consumes one
>> metadata update. This is not the case. The DDT is implemented as an AVL
>> tree. As with other metadata in ZFS, the data is compressed, so you cannot
>> make a direct correlation between the DDT entry size and the effect on the
>> stored metadata in disk sectors.
>> -- richard
>
> It's compressed even when in the ARC?

That is a slightly odd question. The ARC contains ZFS blocks. DDT metadata is manipulated in memory as an AVL tree, so what you see in the ARC is the metadata blocks that were read and uncompressed from the pool, or packaged in blocks and written to the pool. Perhaps it is easier to think of them as metadata in transition? :-)
 -- richard

--
ZFS Performance and Training
Richard.Elling at RichardElling.com
+1-760-896-4422