Hi! I'm having a hard time finding out if it's possible to force ditto blocks onto different devices. This mode has many benefits, not the least being that it practically creates a fully dynamic mode of mirroring (replacing raid1 and raid10 variants), especially when combined with the upcoming vdev remove and defrag/rebalance features. Is this already available? Is it scheduled? Why not?

- Tuomas
> This mode has many benefits, not the least being that it practically
> creates a fully dynamic mode of mirroring (replacing raid1 and raid10
> variants), especially when combined with the upcoming vdev remove and
> defrag/rebalance features.

Vdev remove, that's a sure thing. I've heard about defrag before, but when I asked, no one confirmed it. The same goes for that mention of single-disk "RAID", which I think is supposed to write one parity block for n data blocks, so that disk errors can be healed without having a real redundant setup.

> Is this already available? Is it scheduled? Why not?

Actually, ZFS is already supposed to try to write the ditto copies of a block on different vdevs if multiple are available. As far as finding out goes, I suppose if you use a simple JBOD, in theory, you could try by offlining one disk. But I think in a non-redundant setup, the pool refuses to start if a disk is missing (I think that should be changed, to allow evacuation of properly dittoed data).

-mg
>> Actually, ZFS is already supposed to try to write the ditto copies of a
>> block on different vdevs if multiple are available.

*TRY* being the keyword here.

What I'm looking for is a disk-full error if a ditto cannot be written to a different disk. This would guarantee that a mirror is written on a separate disk - and the entire filesystem can be salvaged from a full disk failure.

Think about the classic case of 50M, 100M and 200M disks. Only 150M can be really mirrored, and the remaining 50M can only be used non-redundantly.

> ...But I think in a non-redundant setup, the pool refuses to start if a
> disk is missing (I think that should be changed, to allow evacuation of
> properly dittoed data).

IIRC this is already considered a bug.
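A quick way to check the 150M figure (this is just standard capacity arithmetic, nothing from the ZFS code): with two copies on distinct disks, the mirrorable amount is bounded both by half the total space and by the space outside the largest disk.

```python
def mirrorable(sizes):
    """Max data storable with 2 copies on distinct disks.

    Bounded by half the total capacity, and by the capacity of all
    disks except the largest (the second copy must land elsewhere).
    """
    total = sum(sizes)
    return min(total // 2, total - max(sizes))

disks = [50, 100, 200]
print(mirrorable(disks))                    # 150
print(sum(disks) - 2 * mirrorable(disks))   # 50 left over, non-redundant only
```

The 200M disk dominates here, so the `total - max` bound is the tight one.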
>> Actually, ZFS is already supposed to try to write the ditto copies of a
>> block on different vdevs if multiple are available.
>
> *TRY* being the keyword here.
>
> What I'm looking for is a disk-full error if a ditto cannot be written
> to a different disk. This would guarantee that a mirror is written on a
> separate disk - and the entire filesystem can be salvaged from a full
> disk failure.

If you're that bent on having maximum redundancy, I think you should consider implementing real redundancy. I'm also biting the bullet and going with mirrors (cheaper than RAID-Z for home use, with fewer disks needed to start).

The problem here is that the filesystem, especially at a considerable fill factor, can't guarantee the necessary allocation balance across the vdevs (that is, maintaining the necessary free space) to spread the ditto blocks as optimally as you'd like. Implementing the required code would increase the overhead a lot. Not to mention that ZFS might have to defrag on the fly more often than not to keep the ditto spread balanced.

And then snapshots on top of that, which are supposed to be physically and logically immovable (unless you execute commands affecting the pool, like a vdev remove, I suppose), just increase the existing complexity into which all of that would have to be hammered.

My 2c.
-mg
Tuomas Leikola wrote:
>> Actually, ZFS is already supposed to try to write the ditto copies of a
>> block on different vdevs if multiple are available.
>
> *TRY* being the keyword here.
>
> What I'm looking for is a disk-full error if a ditto cannot be written
> to a different disk. This would guarantee that a mirror is written on a
> separate disk - and the entire filesystem can be salvaged from a full
> disk failure.

We call that a "mirror" :-)

> Think about the classic case of 50M, 100M and 200M disks. Only 150M can
> be really mirrored, and the remaining 50M can only be used
> non-redundantly.

Assuming a full-disk failure mode, yes.
 -- richard
On 8/9/07, Mario Goebbels <me at tomservo.cc> wrote:
> If you're that bent on having maximum redundancy, I think you should
> consider implementing real redundancy. I'm also biting the bullet and
> going with mirrors (cheaper than RAID-Z for home use, with fewer disks
> needed to start).

Currently I am, and as I'm stuck with different-sized disks, I first have to slice them up into similarly sized chunks and .. well, you get the idea. It's a pain.

> The problem here is that the filesystem, especially at a considerable
> fill factor, can't guarantee the necessary allocation balance across the
> vdevs (that is, maintaining the necessary free space) to spread the
> ditto blocks as optimally as you'd like. ...

I feel that for most purposes this could be fixed with an allocator strategy option, like: prefer vdevs with the most free space (which is not that good a default, as it has performance implications).

> And then snapshots on top of that, which are supposed to be physically
> and logically immovable (unless you execute commands affecting the pool,
> like a vdev remove, I suppose), just increase the existing complexity
> into which all of that would have to be hammered.

I'm not that familiar with the code, but I get the feeling that if vdev remove is a given, rebalance would not be a huge step? The code to migrate data blocks would already be there.
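The suggested strategy is just greedy placement. A toy sketch of the idea (vdev names and sizes invented here; this has nothing to do with the actual ZFS metaslab allocator) shows how "most free space first" naturally keeps ditto copies on separate devices until space genuinely runs out:

```python
def place_copies(free, nblocks, ncopies=2):
    """Greedily place each copy on the vdev with the most free space,
    never putting two copies of the same block on the same vdev.
    Returns the placement, or raises when separation is impossible."""
    placement = []
    for _ in range(nblocks):
        used = []
        for _ in range(ncopies):
            # candidates: vdevs with space that don't already hold a copy
            cands = [v for v in free if v not in used and free[v] > 0]
            if not cands:
                raise RuntimeError("no space for a separate ditto copy")
            best = max(cands, key=lambda v: free[v])
            free[best] -= 1
            used.append(best)
        placement.append(tuple(used))
    return placement

free = {"vdev0": 50, "vdev1": 100, "vdev2": 200}
plan = place_copies(free, nblocks=150)   # 150 blocks, 2 copies each
print(all(a != b for a, b in plan))      # True: every ditto on a different vdev
```

With 50/100/200 units of space this fits exactly 150 dittoed blocks, matching the capacity arithmetic earlier in the thread; the 151st would raise the "disk full" error being asked for.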
On 8/9/07, Richard Elling <Richard.Elling at sun.com> wrote:
>> What I'm looking for is a disk-full error if a ditto cannot be written
>> to a different disk. This would guarantee that a mirror is written on a
>> separate disk - and the entire filesystem can be salvaged from a full
>> disk failure.
>
> We call that a "mirror" :-)

Mirror and raidz suffer from the classic block-device abstraction problem in that they need disks of equal size. Not really a problem for most people, but inconvenient for everyone. Isn't flexibility and ease of administration "the zfs way"? ;)
On August 10, 2007 12:34:23 PM +0300 Tuomas Leikola <tuomas.leikola at gmail.com> wrote:
> On 8/9/07, Richard Elling <Richard.Elling at sun.com> wrote:
>>> What I'm looking for is a disk-full error if a ditto cannot be written
>>> to a different disk. This would guarantee that a mirror is written on a
>>> separate disk - and the entire filesystem can be salvaged from a full
>>> disk failure.
>>
>> We call that a "mirror" :-)
>
> Mirror and raidz suffer from the classic block-device abstraction
> problem in that they need disks of equal size.

Not that I'm aware of. Mirror and raid-z will simply use the smallest size of your available disks.

-frank
>>> We call that a "mirror" :-)
>>
>> Mirror and raidz suffer from the classic block-device abstraction
>> problem in that they need disks of equal size.
>
> Not that I'm aware of. Mirror and raid-z will simply use the smallest
> size of your available disks.

Exactly. The rest is not usable.
On August 10, 2007 2:20:30 PM +0300 Tuomas Leikola <tuomas.leikola at gmail.com> wrote:
>>>> We call that a "mirror" :-)
>>>
>>> Mirror and raidz suffer from the classic block-device abstraction
>>> problem in that they need disks of equal size.
>>
>> Not that I'm aware of. Mirror and raid-z will simply use the smallest
>> size of your available disks.
>
> Exactly. The rest is not usable.

Well, I don't understand how you suggest to use it if you want redundancy.

-frank
Tuomas Leikola wrote:
>>>> We call that a "mirror" :-)
>>>
>>> Mirror and raidz suffer from the classic block-device abstraction
>>> problem in that they need disks of equal size.
>>
>> Not that I'm aware of. Mirror and raid-z will simply use the smallest
>> size of your available disks.
>
> Exactly. The rest is not usable.

For what you are asking, forcing ditto blocks onto separate vdevs, to work, you effectively end up with the same restriction as mirroring. For example, say you have a two-disk pool of disks sized 50 and 100: if ZFS only ever put a ditto block onto a separate vdev from the original block, you could still only use 50, not 100.

What do you do when the disk of size 50 is full yet you have more ditto blocks to write? I can see only two options:

1) Fail the write due to lack of space, which is basically the same as mirroring today.

2) Break the requirement that the ditto must be on an alternate vdev. If you break the requirement, you are back to what the current design does, which is to *try* to use an alternate vdev for the ditto.

However, I suspect you will say that unlike mirroring, only some of your datasets will have ditto blocks turned on. The only way I could see this working is if *all* datasets that have copies > 1 were "quotaed" down to the size of the smallest disk. Which basically ends up back at a real mirror, or a really hard-to-understand system, IMO.

-- Darren J Moffat
>>>>> We call that a "mirror" :-)
>>>>
>>>> Mirror and raidz suffer from the classic block-device abstraction
>>>> problem in that they need disks of equal size.
>>>
>>> Not that I'm aware of. Mirror and raid-z will simply use the smallest
>>> size of your available disks.
>>
>> Exactly. The rest is not usable.
>
> Well I don't understand how you suggest to use it if you want redundancy.

Well, it is possible if you 'slice' the disks up as Tuomas suggested previously (or do something slightly cleverer -- though equivalent -- under the hood). See my recent post on zfs-code: http://mail.opensolaris.org/pipermail/zfs-code/2007-August/000583.html

I made this work as a university project, though as it currently stands you can't replace disks with my implementation -- which I'm hoping to solve, along with adding disks to the RAID-Z, when I get some more free time. Sadly that's not going to come immediately as I'm now working full time, but it's certainly a fun project, as the ZFS guys have higher-priority items on their list.

James
On 8/10/07, Darren J Moffat <darrenm at opensolaris.org> wrote:
> For what you are asking, forcing ditto blocks onto separate vdevs, to
> work, you effectively end up with the same restriction as mirroring.

In theory, correct. In practice, administration is much simpler when there are multiple devices. Simplicity of administration really being the point here - sorry I didn't make it clear at first.

I'm skipping the two-disk example as trivial - which it is. However: administration becomes a real mess when you have multiple (say, 10) disks, all of differing sizes, and want to use all the space - think about the home user with a constrained budget, or just a huge pile of random oldish disks lying around.

It is possible to merge disks before (or after) setting up the mirrors, but it is a tedious job, especially when you start replacing small disks one by one with larger ones, etc. This can be - relatively easily - automated by zfs block allocation strategies, and this is why I consider it a worthwhile feature.

> However, I suspect you will say that unlike mirroring, only some of your
> datasets will have ditto blocks turned on.

That's one good point. Maybe I don't want to decide in advance how much mirrored storage I really need - or I want to use all the "free" mirrored space for non-mirrored temporary storage. I'd call this flexibility.

> The only way I could see this working is if *all* datasets that have
> copies > 1 were "quotaed" down to the size of the smallest disk.
Admittedly, in the two-disk scenario the benefit is relatively low, but in most multi-disk scenarios the disks can be practically full before running out of ditto locations - minus the odd block. (This holds for copies=2 if the largest disk is smaller than the sum of the others.)

> Which basically ends up back at a real mirror, or a really hard-to-
> understand system, IMO.

I find the volume manager mess hard to understand - and it is a mess in the multi-disk scenario when you start adding and removing disks.

For a real-world use case, I'll present my home fileserver: 11 disks, sizes varying between 80 and 400 gigabytes. The disks are concatenated together into 6 "stacks" that are raid6:d together - with only 40G or so of "wasted" space. I had to write a program to optimize the disk arrangement. Raid6 isn't exactly mirroring, but the administrative hurdles are the same.
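The arrangement program itself isn't shown in the thread; a toy version of the same idea (disk sizes invented here, greedy longest-first partitioning into equal-height stacks, which a real optimizer could refine) might look like:

```python
import heapq

def arrange(disk_sizes, nstacks):
    """Greedy partition: drop each disk (largest first) onto the
    currently shortest stack, so stack heights end up roughly equal.
    In a raid6 of concatenated stacks, capacity is limited by the
    shortest stack; anything above it is 'wasted'."""
    # heap entries: (current height, tie-break index, member disks)
    stacks = [(0, i, []) for i in range(nstacks)]
    heapq.heapify(stacks)
    for size in sorted(disk_sizes, reverse=True):
        height, i, members = heapq.heappop(stacks)
        members.append(size)
        heapq.heappush(stacks, (height + size, i, members))
    heights = [h for h, _, _ in stacks]
    wasted = sum(h - min(heights) for h in heights)
    return stacks, wasted

sizes = [400, 400, 320, 250, 200, 160, 120, 120, 80, 80, 80]
stacks, wasted = arrange(sizes, nstacks=6)
for height, _, members in sorted(stacks):
    print(height, members)
print("wasted:", wasted)
```

Greedy alone usually leaves more waste than the 40G quoted above; an exhaustive or heuristic search over arrangements (which is presumably what the original program did) can do better.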
> From: zfs-discuss-bounces at opensolaris.org
> [mailto:zfs-discuss-bounces at opensolaris.org] On Behalf Of Frank Cusack
> Sent: Friday, August 10, 2007 7:26 AM
> To: Tuomas Leikola
> Cc: zfs-discuss at opensolaris.org
> Subject: Re: [zfs-discuss] Force ditto block on different vdev?
>
> On August 10, 2007 2:20:30 PM +0300 Tuomas Leikola
> <tuomas.leikola at gmail.com> wrote:
>>>>> We call that a "mirror" :-)
>>>>
>>>> Mirror and raidz suffer from the classic block-device abstraction
>>>> problem in that they need disks of equal size.
>>>
>>> Not that I'm aware of. Mirror and raid-z will simply use the smallest
>>> size of your available disks.
>>
>> Exactly. The rest is not usable.
>
> Well I don't understand how you suggest to use it if you want
> redundancy.

Since copies=N is a per-filesystem setting, you fail writes to /tank/important_documents (copies=2) when you run out of ditto blocks on another vdev, but still allow /tank/torrentcache (copies=1) to use the remaining space. With disks of 100 and 50 GB mirrored, /tank/torrentcache would be "more redundant than necessary", and you run out of capacity too soon.

Wishlist: It would be nice to put the whole redundancy definition into the zfs filesystem layer (rather than the pool layer). Imagine being able to "set copies=5+2" for a filesystem... (requires a 7-vdev pool, and stripes via RAIDZ2; otherwise the zfs create/set fails).

--Joe
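The "copies=5+2" setting is purely hypothetical (Joe's wishlist syntax, not a real ZFS property); a sketch of the sanity check such a setting would imply, N data stripes plus M parity stripes needing N+M vdevs:

```python
def check_copies(setting, nvdevs):
    """Parse a hypothetical 'copies=N+M' value: N data stripes plus
    M parity stripes, needing N+M vdevs. Returns space efficiency,
    or raises if the pool is too small (the proposed failure mode)."""
    ndata, nparity = (int(x) for x in setting.split("+"))
    width = ndata + nparity
    if nvdevs < width:
        raise ValueError(
            f"copies={setting} needs {width} vdevs, pool has {nvdevs}")
    return ndata / width

print(check_copies("5+2", nvdevs=7))   # 5/7 of raw space usable
```

On a 6-vdev pool the same call would fail, mirroring the "otherwise the zfs create/set fails" rule above.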
>>>> Mirror and raidz suffer from the classic block-device abstraction
>>>> problem in that they need disks of equal size.
>>>
>>> Not that I'm aware of. Mirror and raid-z will simply use the smallest
>>> size of your available disks.
>>
>> Exactly. The rest is not usable.
>
> Well I don't understand how you suggest to use it if you want redundancy.

With more than two disks involved, you might have the space available, but not in simple 1:1 configurations. For instance, it might be nice to create a "mirror" with a 100G disk and two 50G disks. Right now someone has to create slices on the big disk manually and feed them to zpool. Letting ZFS handle everything itself might be a win for some cases.

-- 
Darren Dunham    ddunham at taos.com
Senior Technical Consultant    TAOS    http://www.taos.com/
Got some Dr Pepper?    San Francisco, CA bay area
< This line left intentionally blank to confuse you. >
On 8/10/07, Moore, Joe <jmoore at ugs.com> wrote:
> Wishlist: It would be nice to put the whole redundancy definition into
> the zfs filesystem layer (rather than the pool layer). Imagine being
> able to "set copies=5+2" for a filesystem... (requires a 7-vdev pool,
> and stripes via RAIDZ2; otherwise the zfs create/set fails).

Yes please ;) This is practically the holy grail of "dynamic raid" - the ability to dynamically use different redundancy settings on a per-directory level, and to use a mix of different-sized devices and add/remove them at will. I guess one would call this feature a ditto block setting of stripe+parity. It's doable, but probably requires large(ish) changes to the on-disk structures, as the block pointer will look different. James, did you look at this?

With vdev removal (which I suppose will be implemented with some kind of "rewrite block" -type code) in place, "reshape" and rebalance functionality would probably be relatively small improvements.

BTW, here are more wishlist items now that we're at it:
- copies=max+2 (use as many stripes as possible, with the border case of a 3-way mirror)
- minchunk=8kb (don't spread stripes smaller than this - a performance optimization)
- checksum on every disk independently (instead of the full stripe) - fixes raidz random read performance

.. And one crazy idea just popped into my head: fs-level raid could be implemented with separate parity blocks instead of the ditto mechanism. Say, when data is first written, a normal ditto block is used. Then later, asynchronously, the block is combined with some other blocks (that may be unrelated), the parity is written to a new allocation, and the ditto block(s) are freed. When data blocks are freed (by COW), the parity needs to be recalculated before the data block can actually be forgotten. This can be thought of as combining a number of ditto blocks into a parity block. That may be easier or more complicated to implement than saving the block as stripe+parity in the first place.
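The "crazy idea" above reduces to single-parity XOR over unrelated blocks. A toy sketch (block contents invented, single parity only, no relation to real ZFS on-disk structures) of combining blocks into one parity block and recovering any one of them:

```python
def xor_blocks(blocks):
    """XOR equal-sized blocks together into a single parity block."""
    parity = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            parity[i] ^= byte
    return bytes(parity)

# Three unrelated data blocks share one parity block; their ditto
# copies could then be freed, as in the proposal above.
blocks = [b"aaaaaaaa", b"bbbbbbbb", b"cccccccc"]
parity = xor_blocks(blocks)

# Recover a lost block by XORing parity with the surviving blocks.
recovered = xor_blocks([parity, blocks[0], blocks[2]])
print(recovered == blocks[1])  # True
```

The COW wrinkle is visible even in the toy: freeing blocks[1] means the parity must be recomputed over the survivors first, exactly the "recalculate before forgetting" step described above.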
Depends on the data structures, which I don't yet know intimately. Come to think of it, it's probably best to get all these ideas out there _before_ I start looking into the code - knowing the details has a tendency to kill all the crazy ideas :)
On 8/10/07, Darren Dunham <ddunham at taos.com> wrote:
> For instance, it might be nice to create a "mirror" with a 100G disk and
> two 50G disks. Right now someone has to create slices on the big disk
> manually and feed them to zpool. Letting ZFS handle everything itself
> might be a win for some cases.

Especially performance-wise. AFAIK ZFS doesn't understand that the two vdevs actually share a physical disk, and therefore they should not be used as raid0-like stripes.
> This is practically the holy grail of "dynamic raid" - the ability to
> dynamically use different redundancy settings on a per-directory
> level, and to use a mix of different-sized devices and add/remove them
> at will.

Well, I suspect that arbitrary redundancy configuration is not something we'll see anytime soon, nor is it something we should necessarily want. The main reason being that it's very difficult to see it being used effectively; it's difficult enough to reason about data loss characteristics currently. (Not even considering the implementation complexity.) If you really need different guarantees on integrity, you could create separate specialized pools of mirrors -- or use the ditto blocks feature.

As Richard Elling points out (http://blogs.sun.com/relling/entry/raid_recommendations_space_vs_mttdl), some redundancy (RAID-Z) is much better than none, and mirroring your data increases the MTTDL by another 5/6 orders of magnitude (though ditto'ing isn't quite doing that), and interestingly the RAID data points clump together.

I think the important thing is that the system should be a little more flexible than it is currently (allow variably sized disks, adding/removing them), but not so much so that it's a completely different system.

James