Hi! I'm having a hard time finding out if it's possible to force ditto blocks onto different devices. This mode has many benefits, not the least being that it practically creates a fully dynamic mode of mirroring (replacing raid1 and raid10 variants), especially when combined with the upcoming vdev remove and defrag/rebalance features. Is this already available? Is it scheduled? Why not?

- Tuomas
> This mode has many benefits, not the least being that it practically
> creates a fully dynamic mode of mirroring (replacing raid1 and raid10
> variants), especially when combined with the upcoming vdev remove and
> defrag/rebalance features.

Vdev remove, that's a sure thing. I've heard about defrag before, but when I asked, no one confirmed it. The same goes for that mention of single-disk "RAID", which I think is supposed to write one parity block for n data blocks, so that disk errors can be healed without having a real redundant setup.

> Is this already available? Is it scheduled? Why not?

Actually, ZFS is already supposed to try to write the ditto copies of a block on different vdevs if multiple are available. As far as finding out goes, I suppose if you use a simple JBOD, in theory, you could try by offlining one disk. But I think in a non-redundant setup, the pool refuses to start if a disk is missing (I think that should be changed, to allow evacuation of properly dittoed data).

-mg
>> Actually, ZFS is already supposed to try to write the ditto copies of a
>> block on different vdevs if multiple are available.

*TRY* being the keyword here.

What I'm looking for is a disk-full error if a ditto cannot be written to a different disk. This would guarantee that a mirror is written on a separate disk - and the entire filesystem can be salvaged from a full disk failure.

Think about the classic case of 50M, 100M and 200M disks. Only 150M can be really mirrored, and the remaining 50M can only be used non-redundantly.

> ...But I think in a non-redundant setup, the pool refuses to start if a
> disk is missing (I think that should be changed, to allow evacuation of
> properly dittoed data).

IIRC this is already considered a bug.
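A quick way to check the 150M figure (this is just standard capacity arithmetic, nothing from the ZFS code): with two copies on distinct disks, the mirrorable amount is bounded both by half the total space and by the space outside the largest disk.

```python
def mirrorable(sizes):
    """Max data storable with 2 copies on distinct disks.

    Bounded by half the total capacity, and by the capacity of all
    disks except the largest (the second copy must land elsewhere).
    """
    total = sum(sizes)
    return min(total // 2, total - max(sizes))

disks = [50, 100, 200]
print(mirrorable(disks))                    # 150
print(sum(disks) - 2 * mirrorable(disks))   # 50 left over, non-redundant only
```

The 200M disk dominates here, so the `total - max` bound is the tight one.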
>> Actually, ZFS is already supposed to try to write the ditto copies of a
>> block on different vdevs if multiple are available.
>
> *TRY* being the keyword here.
>
> What I'm looking for is a disk-full error if a ditto cannot be written
> to a different disk. This would guarantee that a mirror is written on a
> separate disk - and the entire filesystem can be salvaged from a full
> disk failure.

If you're that bent on having maximum redundancy, I think you should consider implementing real redundancy. I'm also biting the bullet and going with mirrors (cheaper than RAID-Z for home use, with fewer disks needed to start).

The problem here is that the filesystem, especially at a considerable fill factor, can't guarantee the necessary allocation balance across the vdevs (that is, maintaining the necessary free space) to spread the ditto blocks as optimally as you'd like. Implementing the required code would increase the overhead a lot. Not to mention that ZFS might have to defrag on the fly more often than not to keep the ditto spread balanced.

And then snapshots on top of that, which are supposed to be physically and logically immovable (unless you execute commands affecting the pool, like a vdev remove, I suppose), just increase the existing complexity into which all of that would have to be hammered.

My 2c.
-mg
Tuomas Leikola wrote:
>> Actually, ZFS is already supposed to try to write the ditto copies of a
>> block on different vdevs if multiple are available.
>
> *TRY* being the keyword here.
>
> What I'm looking for is a disk-full error if a ditto cannot be written
> to a different disk. This would guarantee that a mirror is written on a
> separate disk - and the entire filesystem can be salvaged from a full
> disk failure.

We call that a "mirror" :-)

> Think about the classic case of 50M, 100M and 200M disks. Only 150M can
> be really mirrored, and the remaining 50M can only be used
> non-redundantly.

Assuming a full-disk failure mode, yes.
 -- richard
On 8/9/07, Mario Goebbels <me at tomservo.cc> wrote:
> If you're that bent on having maximum redundancy, I think you should
> consider implementing real redundancy. I'm also biting the bullet and
> going with mirrors (cheaper than RAID-Z for home use, with fewer disks
> needed to start).

Currently I am, and as I'm stuck with different-sized disks, I first have to slice them up into similarly sized chunks and .. well, you get the idea. It's a pain.

> The problem here is that the filesystem, especially at a considerable
> fill factor, can't guarantee the necessary allocation balance across the
> vdevs (that is, maintaining the necessary free space) to spread the
> ditto blocks as optimally as you'd like. ...

I feel that for most purposes this could be fixed with an allocator strategy option, like: prefer vdevs with the most free space (which is not that good a default, as it has performance implications).

> And then snapshots on top of that, which are supposed to be physically
> and logically immovable (unless you execute commands affecting the pool,
> like a vdev remove, I suppose), just increase the existing complexity
> into which all of that would have to be hammered.

I'm not that familiar with the code, but I get the feeling that if vdev remove is a given, rebalance would not be a huge step? The code to migrate data blocks would already be there.
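The suggested strategy is just greedy placement. A toy sketch of the idea (vdev names and sizes invented here; this has nothing to do with the actual ZFS metaslab allocator) shows how "most free space first" naturally keeps ditto copies on separate devices until space genuinely runs out:

```python
def place_copies(free, nblocks, ncopies=2):
    """Greedily place each copy on the vdev with the most free space,
    never putting two copies of the same block on the same vdev.
    Returns the placement, or raises when separation is impossible."""
    placement = []
    for _ in range(nblocks):
        used = []
        for _ in range(ncopies):
            # candidates: vdevs with space that don't already hold a copy
            cands = [v for v in free if v not in used and free[v] > 0]
            if not cands:
                raise RuntimeError("no space for a separate ditto copy")
            best = max(cands, key=lambda v: free[v])
            free[best] -= 1
            used.append(best)
        placement.append(tuple(used))
    return placement

free = {"vdev0": 50, "vdev1": 100, "vdev2": 200}
plan = place_copies(free, nblocks=150)   # 150 blocks, 2 copies each
print(all(a != b for a, b in plan))      # True: every ditto on a different vdev
```

With 50/100/200 units of space this fits exactly 150 dittoed blocks, matching the capacity arithmetic earlier in the thread; the 151st would raise the "disk full" error being asked for.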
On 8/9/07, Richard Elling <Richard.Elling at sun.com> wrote:
>> What I'm looking for is a disk-full error if a ditto cannot be written
>> to a different disk. This would guarantee that a mirror is written on a
>> separate disk - and the entire filesystem can be salvaged from a full
>> disk failure.
>
> We call that a "mirror" :-)

Mirror and raidz suffer from the classic block-device abstraction problem in that they need disks of equal size. Not really a problem for most people, but inconvenient for everyone. Isn't flexibility and ease of administration "the zfs way"? ;)
On August 10, 2007 12:34:23 PM +0300 Tuomas Leikola <tuomas.leikola at gmail.com> wrote:
> On 8/9/07, Richard Elling <Richard.Elling at sun.com> wrote:
>>> What I'm looking for is a disk-full error if a ditto cannot be written
>>> to a different disk. This would guarantee that a mirror is written on a
>>> separate disk - and the entire filesystem can be salvaged from a full
>>> disk failure.
>>
>> We call that a "mirror" :-)
>
> Mirror and raidz suffer from the classic block-device abstraction
> problem in that they need disks of equal size.

Not that I'm aware of. Mirror and raid-z will simply use the smallest size of your available disks.

-frank
>>> We call that a "mirror" :-)
>>
>> Mirror and raidz suffer from the classic block-device abstraction
>> problem in that they need disks of equal size.
>
> Not that I'm aware of. Mirror and raid-z will simply use the smallest
> size of your available disks.

Exactly. The rest is not usable.
On August 10, 2007 2:20:30 PM +0300 Tuomas Leikola <tuomas.leikola at gmail.com> wrote:
>>>> We call that a "mirror" :-)
>>>
>>> Mirror and raidz suffer from the classic block-device abstraction
>>> problem in that they need disks of equal size.
>>
>> Not that I'm aware of. Mirror and raid-z will simply use the smallest
>> size of your available disks.
>
> Exactly. The rest is not usable.

Well, I don't understand how you suggest to use it if you want redundancy.

-frank
Tuomas Leikola wrote:
>>>> We call that a "mirror" :-)
>>>
>>> Mirror and raidz suffer from the classic block-device abstraction
>>> problem in that they need disks of equal size.
>>
>> Not that I'm aware of. Mirror and raid-z will simply use the smallest
>> size of your available disks.
>
> Exactly. The rest is not usable.

For what you are asking, forcing ditto blocks onto separate vdevs, to work, you effectively end up with the same restriction as mirroring. For example, say you have a two-disk pool of disks sized 50 and 100: if ZFS only ever put a ditto block onto a separate vdev from the original block, you could still only use 50, not 100.

What do you do when the disk of size 50 is full yet you have more ditto blocks to write? I can see only two options:

1) Fail the write due to lack of space, which is basically the same as mirroring today.

2) Break the requirement that the ditto must be on an alternate vdev. If you break the requirement, you are back to what the current design does, which is to *try* to use an alternate vdev for the ditto.

However, I suspect you will say that unlike mirroring, only some of your datasets will have ditto blocks turned on. The only way I could see this working is if *all* datasets that have copies > 1 were "quotaed" down to the size of the smallest disk. Which basically ends up back at a real mirror, or a really hard-to-understand system, IMO.

-- Darren J Moffat
>>>>> We call that a "mirror" :-)
>>>>
>>>> Mirror and raidz suffer from the classic block-device abstraction
>>>> problem in that they need disks of equal size.
>>>
>>> Not that I'm aware of. Mirror and raid-z will simply use the smallest
>>> size of your available disks.
>>
>> Exactly. The rest is not usable.
>
> Well I don't understand how you suggest to use it if you want redundancy.

Well, it is possible if you 'slice' the disks up as Tuomas suggested previously (or do something slightly cleverer -- though equivalent -- under the hood). See my recent post on zfs-code: http://mail.opensolaris.org/pipermail/zfs-code/2007-August/000583.html

I made this work as a university project, though as it currently stands you can't replace disks with my implementation -- which I'm hoping to solve, along with adding disks to the RAID-Z, when I get some more free time. Sadly that's not going to come immediately as I'm now working full time, but it's certainly a fun project, as the ZFS guys have higher-priority items on their list.

James
On 8/10/07, Darren J Moffat <darrenm at opensolaris.org> wrote:
> For what you are asking, forcing ditto blocks onto separate vdevs, to
> work, you effectively end up with the same restriction as mirroring.

In theory, correct. In practice, administration is much simpler when there are multiple devices. Simplicity of administration really being the point here - sorry I didn't make it clear at first.

I'm skipping the two-disk example as trivial - which it is. However: administration becomes a real mess when you have multiple (say, 10) disks, all of differing sizes, and want to use all the space - think about the home user with a constrained budget, or just a huge pile of random oldish disks lying around.

It is possible to merge disks before (or after) setting up the mirrors, but it is a tedious job, especially when you start replacing small disks one by one with larger ones, etc. This can be - relatively easily - automated by zfs block allocation strategies, and this is why I consider it a worthwhile feature.

> However, I suspect you will say that unlike mirroring, only some of your
> datasets will have ditto blocks turned on.

That's one good point. Maybe I don't want to decide in advance how much mirrored storage I really need - or I want to use all the "free" mirrored space for non-mirrored temporary storage. I'd call this flexibility.

> The only way I could see this working is if *all* datasets that have
> copies > 1 were "quotaed" down to the size of the smallest disk.
Admittedly, in the two-disk scenario the benefit is relatively low, but in most multi-disk scenarios the disks can be practically full before running out of ditto locations - minus the odd block. (This holds for copies=2 if the largest disk is smaller than the sum of the others.)

> Which basically ends up back at a real mirror, or a really hard-to-
> understand system, IMO.

I find the volume manager mess hard to understand - and it is a mess in the multi-disk scenario when you start adding and removing disks.

For a real-world use case, I'll present my home fileserver: 11 disks, sizes varying between 80 and 400 gigabytes. The disks are concatenated together into 6 "stacks" that are raid6:d together - with only 40G or so of "wasted" space. I had to write a program to optimize the disk arrangement. Raid6 isn't exactly mirroring, but the administrative hurdles are the same.
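The arrangement program itself isn't shown in the thread; a toy version of the same idea (disk sizes invented here, greedy longest-first partitioning into equal-height stacks, which a real optimizer could refine) might look like:

```python
import heapq

def arrange(disk_sizes, nstacks):
    """Greedy partition: drop each disk (largest first) onto the
    currently shortest stack, so stack heights end up roughly equal.
    In a raid6 of concatenated stacks, capacity is limited by the
    shortest stack; anything above it is 'wasted'."""
    # heap entries: (current height, tie-break index, member disks)
    stacks = [(0, i, []) for i in range(nstacks)]
    heapq.heapify(stacks)
    for size in sorted(disk_sizes, reverse=True):
        height, i, members = heapq.heappop(stacks)
        members.append(size)
        heapq.heappush(stacks, (height + size, i, members))
    heights = [h for h, _, _ in stacks]
    wasted = sum(h - min(heights) for h in heights)
    return stacks, wasted

sizes = [400, 400, 320, 250, 200, 160, 120, 120, 80, 80, 80]
stacks, wasted = arrange(sizes, nstacks=6)
for height, _, members in sorted(stacks):
    print(height, members)
print("wasted:", wasted)
```

Greedy alone usually leaves more waste than the 40G quoted above; an exhaustive or heuristic search over arrangements (which is presumably what the original program did) can do better.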
> From: zfs-discuss-bounces at opensolaris.org
> [mailto:zfs-discuss-bounces at opensolaris.org] On Behalf Of Frank Cusack
> Sent: Friday, August 10, 2007 7:26 AM
> To: Tuomas Leikola
> Cc: zfs-discuss at opensolaris.org
> Subject: Re: [zfs-discuss] Force ditto block on different vdev?
>
> On August 10, 2007 2:20:30 PM +0300 Tuomas Leikola
> <tuomas.leikola at gmail.com> wrote:
>>>>> We call that a "mirror" :-)
>>>>
>>>> Mirror and raidz suffer from the classic block-device abstraction
>>>> problem in that they need disks of equal size.
>>>
>>> Not that I'm aware of. Mirror and raid-z will simply use the smallest
>>> size of your available disks.
>>
>> Exactly. The rest is not usable.
>
> Well I don't understand how you suggest to use it if you want
> redundancy.

Since copies=N is a per-filesystem setting, you fail writes to /tank/important_documents (copies=2) when you run out of ditto blocks on another vdev, but still allow /tank/torrentcache (copies=1) to use the remaining space. With disks of 100 and 50 GB mirrored, /tank/torrentcache would be "more redundant than necessary", and you run out of capacity too soon.

Wishlist: It would be nice to put the whole redundancy definition into the zfs filesystem layer (rather than the pool layer). Imagine being able to "set copies=5+2" for a filesystem... (requires a 7-vdev pool, and stripes via RAIDZ2; otherwise the zfs create/set fails).

--Joe
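The "copies=5+2" setting is purely hypothetical (Joe's wishlist syntax, not a real ZFS property); a sketch of the sanity check such a setting would imply, N data stripes plus M parity stripes needing N+M vdevs:

```python
def check_copies(setting, nvdevs):
    """Parse a hypothetical 'copies=N+M' value: N data stripes plus
    M parity stripes, needing N+M vdevs. Returns space efficiency,
    or raises if the pool is too small (the proposed failure mode)."""
    ndata, nparity = (int(x) for x in setting.split("+"))
    width = ndata + nparity
    if nvdevs < width:
        raise ValueError(
            f"copies={setting} needs {width} vdevs, pool has {nvdevs}")
    return ndata / width

print(check_copies("5+2", nvdevs=7))   # 5/7 of raw space usable
```

On a 6-vdev pool the same call would fail, mirroring the "otherwise the zfs create/set fails" rule above.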
>>>> Mirror and raidz suffer from the classic block-device abstraction
>>>> problem in that they need disks of equal size.
>>>
>>> Not that I'm aware of. Mirror and raid-z will simply use the smallest
>>> size of your available disks.
>>
>> Exactly. The rest is not usable.
>
> Well I don't understand how you suggest to use it if you want redundancy.

With more than two disks involved, you might have the space available, but not in simple 1:1 configurations. For instance, it might be nice to create a "mirror" with a 100G disk and two 50G disks. Right now someone has to create slices on the big disk manually and feed them to zpool. Letting ZFS handle everything itself might be a win for some cases.

-- 
Darren Dunham    ddunham at taos.com
Senior Technical Consultant    TAOS    http://www.taos.com/
Got some Dr Pepper?    San Francisco, CA bay area
< This line left intentionally blank to confuse you. >
On 8/10/07, Moore, Joe <jmoore at ugs.com> wrote:
> Wishlist: It would be nice to put the whole redundancy definition into
> the zfs filesystem layer (rather than the pool layer). Imagine being
> able to "set copies=5+2" for a filesystem... (requires a 7-vdev pool,
> and stripes via RAIDZ2; otherwise the zfs create/set fails).

Yes please ;) This is practically the holy grail of "dynamic raid" - the ability to dynamically use different redundancy settings on a per-directory level, and to use a mix of different-sized devices and add/remove them at will. I guess one would call this feature a ditto block setting of stripe+parity. It's doable, but probably requires large(ish) changes to the on-disk structures, as the block pointer will look different. James, did you look at this?

With vdev removal (which I suppose will be implemented with some kind of "rewrite block" -type code) in place, "reshape" and rebalance functionality would probably be relatively small improvements.

BTW, here are more wishlist items now that we're at it:
- copies=max+2 (use as many stripes as possible, with the border case of a 3-way mirror)
- minchunk=8kb (don't spread stripes smaller than this - a performance optimization)
- checksum on every disk independently (instead of the full stripe) - fixes raidz random read performance

.. And one crazy idea just popped into my head: fs-level raid could be implemented with separate parity blocks instead of the ditto mechanism. Say, when data is first written, a normal ditto block is used. Then later, asynchronously, the block is combined with some other blocks (that may be unrelated), the parity is written to a new allocation, and the ditto block(s) are freed. When data blocks are freed (by COW), the parity needs to be recalculated before the data block can actually be forgotten. This can be thought of as combining a number of ditto blocks into a parity block. That may be easier or more complicated to implement than saving the block as stripe+parity in the first place.
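The "crazy idea" above reduces to single-parity XOR over unrelated blocks. A toy sketch (block contents invented, single parity only, no relation to real ZFS on-disk structures) of combining blocks into one parity block and recovering any one of them:

```python
def xor_blocks(blocks):
    """XOR equal-sized blocks together into a single parity block."""
    parity = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            parity[i] ^= byte
    return bytes(parity)

# Three unrelated data blocks share one parity block; their ditto
# copies could then be freed, as in the proposal above.
blocks = [b"aaaaaaaa", b"bbbbbbbb", b"cccccccc"]
parity = xor_blocks(blocks)

# Recover a lost block by XORing parity with the surviving blocks.
recovered = xor_blocks([parity, blocks[0], blocks[2]])
print(recovered == blocks[1])  # True
```

The COW wrinkle is visible even in the toy: freeing blocks[1] means the parity must be recomputed over the survivors first, exactly the "recalculate before forgetting" step described above.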
Depends on the data structures, which I don't yet know intimately. Come to think of it, it's probably best to get all these ideas out there _before_ I start looking into the code - knowing the details has a tendency to kill all the crazy ideas :)
On 8/10/07, Darren Dunham <ddunham at taos.com> wrote:
> For instance, it might be nice to create a "mirror" with a 100G disk and
> two 50G disks. Right now someone has to create slices on the big disk
> manually and feed them to zpool. Letting ZFS handle everything itself
> might be a win for some cases.

Especially performance-wise. AFAIK ZFS doesn't understand that the two vdevs actually share a physical disk, and therefore they should not be used as raid0-like stripes.
> This is practically the holy grail of "dynamic raid" - the ability to
> dynamically use different redundancy settings on a per-directory
> level, and to use a mix of different-sized devices and add/remove them
> at will.

Well, I suspect that arbitrary redundancy configuration is not something we'll see anytime soon, nor is it something we should necessarily want. The main reason being that it's very difficult to see it being used effectively; it's difficult enough to reason about data loss characteristics currently. (Not even considering the implementation complexity.) If you really need different guarantees on integrity, you could create separate specialized pools of mirrors -- or use the ditto blocks feature.

As Richard Elling points out (http://blogs.sun.com/relling/entry/raid_recommendations_space_vs_mttdl), some redundancy (RAID-Z) is much better than none, and mirroring your data increases the MTTDL by another 5/6 orders of magnitude (though ditto'ing isn't quite doing that), and interestingly the RAID data points clump together.

I think the important thing is that the system should be a little more flexible than it is currently (allow variably sized disks, adding/removing them), but not so much so that it's a completely different system.

James