Here is a proposal for a new 'copies' property which would allow
different levels of replication for different filesystems.

Your comments are appreciated!

--matt

A. INTRODUCTION

ZFS stores multiple copies of all metadata. This is accomplished by
storing up to three DVAs (Disk Virtual Addresses) in each block pointer.
This feature is known as "Ditto Blocks". When possible, the copies are
stored on different disks.

See bug 6410698 "ZFS metadata needs to be more highly replicated (ditto
blocks)" for details on ditto blocks.

This case will extend this feature to allow system administrators to
store multiple copies of user data as well, on a per-filesystem basis.
These copies are in addition to any redundancy provided at the pool
level (mirroring, raid-z, etc.).

B. DESCRIPTION

A new property will be added, 'copies', which specifies how many copies
of the given filesystem will be stored. Its value must be 1, 2, or 3.
Like other properties (eg. checksum, compression), it only affects
newly-written data. As such, it is recommended that the 'copies'
property be set at filesystem-creation time
(eg. 'zfs create -o copies=2 pool/fs').

The pool must be at least on-disk version 2 to use this feature (see
'zfs upgrade').

By default (copies=1), only two copies of most filesystem metadata are
stored. However, if we are storing multiple copies of user data, then 3
copies (the maximum) of filesystem metadata will be stored.

This feature is similar to using mirroring, but differs in several
important ways:

* Different filesystems in the same pool can have different numbers of
  copies.
* The storage configuration is not constrained as it is with mirroring
  (eg. you can have multiple copies even on a single disk).
* Mirroring offers slightly better performance, because only one DVA
  needs to be allocated.
* Mirroring offers slightly better redundancy, because one disk from
  each mirror can fail without data loss.

It is important to note that the copies provided by this feature are in
addition to any redundancy provided by the pool configuration or the
underlying storage. For example:

* In a pool with 2-way mirrors, a filesystem with copies=1 (the default)
  will be stored with 2 * 1 = 2 copies. The filesystem can tolerate any
  1 disk failing without data loss.
* In a pool with 2-way mirrors, a filesystem with copies=3 will be
  stored with 2 * 3 = 6 copies. The filesystem can tolerate any 5 disks
  failing without data loss (assuming that there are at least ncopies=3
  mirror groups).
* In a pool with single-parity raid-z, a filesystem with copies=2 will
  be stored with 2 copies, each copy protected by its own parity block.
  The filesystem can tolerate any 3 disks failing without data loss
  (assuming that there are at least ncopies=2 raid-z groups).

C. MANPAGE CHANGES

*** zfs.man4    Tue Jun 13 10:15:38 2006
--- zfs.man5    Mon Sep 11 16:34:37 2006
***************
*** 708,714 ****
--- 708,725 ----
          they are inherited.

+     copies=1 | 2 | 3
+       Controls the number of copies of data stored for this dataset.
+       These copies are in addition to any redundancy provided by the
+       pool (eg. mirroring or raid-z). The copies will be stored on
+       different disks if possible.
+
+       Changing this property only affects newly-written data.
+       Therefore, it is recommended that this property be set at
+       filesystem creation time, using the '-o copies=' option.
+
+
      Temporary Mountpoint Properties
          When a file system is mounted, either through mount(1M) for
          legacy mounts or the "zfs mount" command for normal file

D. REFERENCES
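(For concreteness, a minimal sketch of how the proposed property would be used from the command line; the pool name, device name, and dataset names below are hypothetical and only the commands are shown.)

    # sketch of the proposed 'copies' property; 'tank' and c0t0d0 are hypothetical
    zpool create tank c0t0d0                  # single-disk pool, no pool-level redundancy
    zfs create tank/scratch                   # default: copies=1
    zfs create -o copies=2 tank/docs          # user data stored twice, on different
                                              # disks or disk regions where possible
    zfs get copies tank/scratch tank/docs     # verify the per-filesystem setting
    zfs set copies=3 tank/docs                # affects only newly-written data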
On 9/11/06, Matthew Ahrens <Matthew.Ahrens at sun.com> wrote:
> Here is a proposal for a new 'copies' property which would allow
> different levels of replication for different filesystems.
>
> Your comments are appreciated!
>
> --matt
>
> A. INTRODUCTION
>
> ZFS stores multiple copies of all metadata. This is accomplished by
> storing up to three DVAs (Disk Virtual Addresses) in each block pointer.
> This feature is known as "Ditto Blocks". When possible, the copies are
> stored on different disks.
>
> See bug 6410698 "ZFS metadata needs to be more highly replicated (ditto
> blocks)" for details on ditto blocks.
>
> This case will extend this feature to allow system administrators to
> store multiple copies of user data as well, on a per-filesystem basis.
> These copies are in addition to any redundancy provided at the pool
> level (mirroring, raid-z, etc).
>
> B. DESCRIPTION
>
> A new property will be added, 'copies', which specifies how many copies
> of the given filesystem will be stored. Its value must be 1, 2, or 3.
> Like other properties (eg. checksum, compression), it only affects
> newly-written data. As such, it is recommended that the 'copies'
> property be set at filesystem-creation time
> (eg. 'zfs create -o copies=2 pool/fs').

Would the user be held accountable for the space used by the extra
copies? So if a user has a 1GB quota and stores one 512MB file with
two copies activated, all his space will be used? What happens if the
same user stores a 756MB file on a filesystem with multiple copies
enabled and a 1GB quota: does the save fail? How would the user tell
that his filesystem is full, since all the tools he is used to would
report that he is using only 1/2 the space?

Is there a way for the sysadmin to get rid of the excess copies should
disk space needs require it? If I start out with 2 copies and later
change it to only 1 copy, do the files created before keep their 2
copies?

What happens if root needs to store a copy of an important file and
there is no space, but there would be space if extra copies were
reclaimed? Will this be configurable behavior?

James Dickens
uadmin.blogpsot.com
On 9/11/06, Matthew Ahrens <Matthew.Ahrens at sun.com> wrote:> B. DESCRIPTION > > A new property will be added, ''copies'', which specifies how many copies > of the given filesystem will be stored. Its value must be 1, 2, or 3. > Like other properties (eg. checksum, compression), it only affects > newly-written data. As such, it is recommended that the ''copies'' > property be set at filesystem-creation time > (eg. ''zfs create -o copies=2 pool/fs'').Is there anything in the works to compress (or encrypt) existing data after the fact? For example, a special option to scrub that causes the data to be re-written with the new properties could potentially do this. If so, this feature should subscribe to any generic framework provided by such an effort.> This feature is similar to using mirroring, but differs in several > important ways: > > * Mirroring offers slightly better redundancy, because one disk from > each mirror can fail without data loss.Is this use of slightly based upon disk failure modes? That is, when disks fail do they tend to get isolated areas of badness compared to complete loss? I would suggest that complete loss should include someone tripping over the power cord to the external array that houses the disk.> It is important to note that the copies provided by this feature are in > addition to any redundancy provided by the pool configuration or the > underlying storage. For example:All of these examples seem to assume that there six disks.> * In a pool with 2-way mirrors, a filesystem with copies=1 (the default) > will be stored with 2 * 1 = 2 copies. The filesystem can tolerate any > 1 disk failing without data loss. > * In a pool with 2-way mirrors, a filesystem with copies=3 > will be stored with 2 * 3 = 6 copies. The filesystem can tolerate any > 5 disks failing without data loss (assuming that there are at least > ncopies=3 mirror groups).This one assumes best case scenario with 6 disks. Suppose you had 4 x 72 GB and 2 x 36 GB disks. You could end up with multiple copies on the 72 GB disks.> * In a pool with single-parity raid-z a filesystem with copies=2 > will be stored with 2 copies, each copy protected by its own parity > block. The filesystem can tolerate any 3 disks failing without data > loss (assuming that there are at least ncopies=2 raid-z groups). > > > C. MANPAGE CHANGES > *** zfs.man4 Tue Jun 13 10:15:38 2006 > --- zfs.man5 Mon Sep 11 16:34:37 2006 > *************** > *** 708,714 **** > --- 708,725 ---- > they are inherited. > > > + copies=1 | 2 | 3 > > + Controls the number of copies of data stored for this dataset. > + These copies are in addition to any redundancy provided by the > + pool (eg. mirroring or raid-z). The copies will be stored on > + different disks if possible.Any statement about physical location on the disk? It would seem as though locating two copies sequentially on the disk would not provide nearly the amount of protection as having them fairly distant from each other. -- Mike Gerdts http://mgerdts.blogspot.com/
James Dickens wrote:> On 9/11/06, Matthew Ahrens <Matthew.Ahrens at sun.com> wrote: >> B. DESCRIPTION >> >> A new property will be added, ''copies'', which specifies how many copies >> of the given filesystem will be stored. Its value must be 1, 2, or 3. >> Like other properties (eg. checksum, compression), it only affects >> newly-written data. As such, it is recommended that the ''copies'' >> property be set at filesystem-creation time >> (eg. ''zfs create -o copies=2 pool/fs''). >> > would the user be held acountable for the space used by the extra > copies?Doh! Sorry I forgot to address that. I''ll amend the proposal and manpage to include this information... Yes, the space used by the extra copies will be accounted for, eg. in stat(2), ls -s, df(1m), du(1), zfs list, and count against their quota.> so if a user has a 1GB quota and stores one 512MB file with > two copies activated, all his space will be used?Yes, and as mentioned this will be reflected in all the space accounting tools.> what happens if the > same user stores a file that is 756MB on the filesystem with multiple > copies enabled an a 1GB quota, does the save fail?Yes, they will get ENOSPC and see that their filesystem is full.> How would the user > tell that his filesystem is full since all the tools he is used to > report he is using only 1/2 the space?Any tool will report that in fact all space is being used.> Is there a way for the sysdmin to get rid of the excess copies should > disk space needs require it?No, not without rewriting them. (This is the same behavior we have today with the ''compression'' and ''checksum'' properties. It''s a long-term goal of ours to be able to go back and change these things after the fact ("scrub them in", so to say), but with snapshots, this is extremely nontrivial to do efficiently and without increasing the amount of space used.)> If I start out 2 copies and later change it to on 1 copy, do the > files created before keep there 2 copies?Yep, the property only affects newly-written data.> what happens if root needs to store a copy of an important file and > there is no space but there is space if extra copies are reclaimed?They will get ENOSPC.> Will this be configurable behavior?No. --matt
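(A sketch of the space accounting just described, reusing the 1GB-quota example from the question; the dataset name is hypothetical and mkfile(1M) is only used to create files of a given size.)

    zfs create -o copies=2 -o quota=1g tank/home/james   # hypothetical dataset
    mkfile 512m /tank/home/james/a          # charges roughly 2 x 512MB = 1GB of quota
    zfs list -o name,used,avail tank/home/james          # 'used' reflects both copies
    mkfile 756m /tank/home/james/b          # needs another ~1.5GB; fails with ENOSPC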
Mike Gerdts wrote:> Is there anything in the works to compress (or encrypt) existing data > after the fact? For example, a special option to scrub that causes > the data to be re-written with the new properties could potentially do > this.This is a long-term goal of ours, but with snapshots, this is extremely nontrivial to do efficiently and without increasing the amount of space used.) . > If so, this feature should subscribe to any generic framework> provided by such an effort.Yep, absolutely.>> * Mirroring offers slightly better redundancy, because one disk from >> each mirror can fail without data loss. > > Is this use of slightly based upon disk failure modes? That is, when > disks fail do they tend to get isolated areas of badness compared to > complete loss? I would suggest that complete loss should include > someone tripping over the power cord to the external array that houses > the disk.I''m basing this "slightly better" call on a model of random, complete-disk failures. I know that this is only an approximation. With many mirrors, most (but not all) 2-disk failures can be tolerated. With copies=2, almost no 2-top-level-vdev failures will be tolerated, because it''s likely that *some* block will have both its copies on those 2 disks. With mirrors, you can arrange to mirror across cabinets, not within them, which you can''t do with copies.>> It is important to note that the copies provided by this feature are in >> addition to any redundancy provided by the pool configuration or the >> underlying storage. For example: > > All of these examples seem to assume that there six disks.Not really. There could be any number of mirrors or raid-z groups (although I note, you need at least ''copies'' groups to survive the max whole-disk failures).>> * In a pool with 2-way mirrors, a filesystem with copies=1 (the default) >> will be stored with 2 * 1 = 2 copies. The filesystem can tolerate any >> 1 disk failing without data loss. >> * In a pool with 2-way mirrors, a filesystem with copies=3 >> will be stored with 2 * 3 = 6 copies. The filesystem can tolerate any >> 5 disks failing without data loss (assuming that there are at least >> ncopies=3 mirror groups). > > This one assumes best case scenario with 6 disks. Suppose you had 4 x > 72 GB and 2 x 36 GB disks. You could end up with multiple copies on > the 72 GB disks.Yes, all these examples assume that our "putting the copies on different disks when possible" actually worked out. It will almost certainly work out unless you have a small number of different-sized devices, or are running with very little free space. If you need hard guarantees, you need to use actual mirroring.> Any statement about physical location on the disk? It would seem as > though locating two copies sequentially on the disk would not provide > nearly the amount of protection as having them fairly distant from > each other.Yep, if the copies can''t be stored on different disks, they will be stored spread-out on the same disk if possible (I think we aim for one on each quarter of the disk). --matt
On 9/11/06, Matthew Ahrens <Matthew.Ahrens at sun.com> wrote:
> > would the user be held accountable for the space used by the extra
> > copies?
>
> Doh! Sorry I forgot to address that. I'll amend the proposal and
> manpage to include this information...
>
> Yes, the space used by the extra copies will be accounted for, eg. in
> stat(2), ls -s, df(1m), du(1), zfs list, and count against their quota.
>
> > so if a user has a 1GB quota and stores one 512MB file with
> > two copies activated, all his space will be used?
>
> Yes, and as mentioned this will be reflected in all the space accounting
> tools.

Yuck. This would be terribly confusing for typical end-users. I would say that statvfs() should munge the numbers such that f_bfree and f_bavail are divided by ncopies. Otherwise, applications that need this information will need to know *way* too much about the file system.

For example, consider the checks performed by setup_install_server that comes with the Solaris media. That script does a du on the media that it came from, followed by a df on the target. Should that script really need to be modified for the case where the source and/or target are on zfs with ncopies != 1?

This part of the feature would keep me from using it anywhere that there is any chance of being space constrained and I have one or more users who can't read the man page for zfs and then explain how it is different from at least one competing file system.

Mike

-- 
Mike Gerdts
http://mgerdts.blogspot.com/
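(The kind of check being described can be sketched as follows; the paths are made up and this is not the actual setup_install_server code. With copies=2, the target consumes roughly twice what du reports for the source, so a comparison against the raw df figure can pass even though the copy will later run out of space, which is the argument for dividing the statvfs numbers by ncopies.)

    need=`du -sk /cdrom/cdrom0/Solaris_11/Tools | awk '{print $1}'`   # KB used by the source
    avail=`df -k /export/install | awk 'NR==2 {print $4}'`            # free KB on the target
    if [ "$need" -gt "$avail" ]; then
            echo "ERROR: not enough space on /export/install"
            exit 1
    fi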
On 9/11/06, Matthew Ahrens <Matthew.Ahrens at sun.com> wrote:
[...]
> > what happens if root needs to store a copy of an important file and
> > there is no space but there is space if extra copies are reclaimed?
>
> They will get ENOSPC.

Though I think this is a cool feature, I think it needs more work. I
think there should be an option to make extra copies expendable, so the
extra copies are a request: if the space is available, make them; if
not, complete the write and log the event. If the user really requires
guaranteed extra copies, then use mirrored or raided disks.

It just seems to be a nightmare for the administrator: you start with 3
copies and then change to 2 copies, and you will have phantom copies
that are only known to exist to the OS. They won't show in any reports,
zfs list doesn't have an option to show which files have multiple
clones and which don't, and there is no way to destroy multiple clones
without rewriting every file on the disk.

James

> > Will this be configurable behavior?
>
> No.
>
> --matt
James Dickens wrote:> though I think this is a cool feature, I think i needs more work. I > think there sould be an option to make extra copies expendible. So the > extra copies are a request, if the space is availible make them, if > not complete the write, and log the event.Are you asking for the extra copies that have already been written to be dynamically freed up when we are running low on space? That could be useful, but it isn''t the problem I''m trying to solve with the ''copies'' property (not to mention it would be extremely difficult to implement).> It the user really requires guaranteed extra copies, then use mirrored > or raided disks.Right, if you want everything to have extra redundancy, that use case is handled just fine today by mirrors or RAIDZ. The case where ''copies'' is useful is when you want some data to be stored with more redundancy than others, without the burden of setting up different pools.> It seems just to be a nightmare for the administrator, you start with > 3 copies and then change to 2 copies, you will have phantom copies > that are only known to exist to the OS, it won''t show in any reports, > zfs list doesn''t have an option to show which files have multiple > clones and which dont. There is no way to destroy multiple clones > without rewriting every file on the disk.(I''m assuming you mean copies, not clones.) So would you prefer that the property be restricted to only being set at filesystem creation time, and not changed later? That way the number of copies of all files in the filesystem is always the same. It seems like the issue of knowing how many copies there are would be much worse in the system you''re asking for where the extra copies are freed up as needed... --matt
William D. Hathaway
2006-Sep-12 02:30 UTC
[zfs-discuss] Re: Proposal: multiple copies of user data
Hi Matt,

Interesting proposal. Has there been any consideration of whether the free space reported for a ZFS filesystem would take the copies setting into account?

Example:

    zfs create mypool/nonredundant_data
    zfs create mypool/redundant_data
    df -h /mypool/nonredundant_data /mypool/redundant_data
      (shows same amount of free space)
    zfs set copies=3 mypool/redundant_data

Would a new df of /mypool/redundant_data now show a different amount of free space (presumably 1/3, if different) than /mypool/nonredundant_data?
Darren J Moffat
2006-Sep-12 09:36 UTC
[zfs-discuss] Proposal: multiple copies of user data
Mike Gerdts wrote:> On 9/11/06, Matthew Ahrens <Matthew.Ahrens at sun.com> wrote: >> B. DESCRIPTION >> >> A new property will be added, ''copies'', which specifies how many copies >> of the given filesystem will be stored. Its value must be 1, 2, or 3. >> Like other properties (eg. checksum, compression), it only affects >> newly-written data. As such, it is recommended that the ''copies'' >> property be set at filesystem-creation time >> (eg. ''zfs create -o copies=2 pool/fs''). > > Is there anything in the works to compress (or encrypt) existing data > after the fact? For example, a special option to scrub that causes > the data to be re-written with the new properties could potentially do > this. If so, this feature should subscribe to any generic framework > provided by such an effort.While encryption of existing data is not in scope for the first ZFS crypto phase I am being careful in the design to ensure that it can be done later if such a ZFS "framework" becomes available. The biggest problem I see with this is one of observability, if not all of the data is encrypted yet what should the encryption property say ? If it says encryption is on then the admin might think the data is "safe", but if it says it is off that isn''t the truth either because some of it maybe in encrypted. -- Darren J Moffat
On 12/09/06, Matthew Ahrens <Matthew.Ahrens at sun.com> wrote:
> Here is a proposal for a new 'copies' property which would allow
> different levels of replication for different filesystems.
>
> Your comments are appreciated!

Flexibility is always nice, but this seems to greatly complicate things, both technically and conceptually (sometimes, good design is about what is left out :) ).

Seems to me this lets you say 'files in this directory are x times more valuable than files elsewhere'. Others have covered some of my concerns (guarantees, cleanup, etc.). In addition:

* if I move a file somewhere else, does it become less important?
* zpools let you do that already (admittedly with less granularity, but *much* *much* more simply - and disk is cheap in my world)
* I don't need to do that :)

The only real use I'd see would be for redundant copies on a single disk, but then why wouldn't I just add a disk?

* disks are cheap, and creating a mirror from a single disk is very easy (and conceptually simple)
* *removing* a disk from a mirror pair is simple too - I make mistakes sometimes
* in my experience, disks fail. When you get bad errors on part of a disk, the disk is about to die.
* you can already create a/several zpools using disk partitions as vdevs. That's not all that safe, and I don't see this being any safer.

Sorry to be negative, but to me ZFS' simplicity is one of its major features. I think this provides a cool feature, but I question its usefulness. Quite possibly I just don't have the particular itch this is intended to scratch - is this a much requested feature?

-- 
Rasputin :: Jack of All Trades - Master of Nuns
http://number9.hellooperator.net/
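(The single-disk alternatives mentioned above can be sketched as follows; the slice and file names are hypothetical.)

    # mirror across two slices of the same physical disk
    zpool create tank mirror c0t0d0s3 c0t0d0s4

    # or, for experimentation, mirror across two file-backed vdevs
    mkfile 2g /export/vdev0 /export/vdev1
    zpool create demo mirror /export/vdev0 /export/vdev1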
Ceri Davies
2006-Sep-12 10:32 UTC
[zfs-discuss] Re: Proposal: multiple copies of user data
> Hi Matt, > Interesting proposal. Has there been any > consideration if free space being reported for a ZFS > filesystem would take into account the copies > setting? > > Example: > zfs create mypool/nonredundant_data > zfs create mypool/redundant_data > df -h /mypool/nonredundant_data > /mypool/redundant_data > (shows same amount of free space) > zfs set copies=3 mypool/redundant_data > > Would a new df of /mypool/redundant_data now show a > different amount of free space (presumably 1/3 if > different) than /mypool/nonredundant_data?As I understand the proposal, there''s nothing new to do here. The filesystem might be 25% full, and it would be 25% full no matter how many copies of the filesystem there are. Similarly with quotas, I''d argue that the extra copies should not count towards a user''s quota, since a quota is set on the filesystem. If I''m using 500M on a filesystem, I only have 500M of data no matter how many copies of it the administrator has decided to keep (cf. RAID1). I also don''t see why a copy can''t just be dropped if the "copies" value is decreased. Having said this, I don''t see any value in the proposal at all, to be honest. This message posted from opensolaris.org
Darren J Moffat
2006-Sep-12 10:52 UTC
[zfs-discuss] Proposal: multiple copies of user data
Dick Davies wrote:> The only real use I''d see would be for redundant copies > on a single disk, but then why wouldn''t I just add a disk?Some systems have physical space for only a single drive - think most laptops! -- Darren J Moffat
On 12/09/06, Darren J Moffat <Darren.Moffat at sun.com> wrote:> Dick Davies wrote: > > > The only real use I''d see would be for redundant copies > > on a single disk, but then why wouldn''t I just add a disk? > > Some systems have physical space for only a single drive - think most > laptops!True - I''m a laptop user myself. But as I said, I''d assume the whole disk would fail (it does in my experience). If your hardware craps differently to mine, you could do a similar thing with partitions (or even files) as vdevs. Wouldn''t be any less reliable. I''m still not Feeling the Magic on this one :) -- Rasputin :: Jack of All Trades - Master of Nuns http://number9.hellooperator.net/
Darren J Moffat
2006-Sep-12 12:29 UTC
[zfs-discuss] Proposal: multiple copies of user data
Dick Davies wrote:> On 12/09/06, Darren J Moffat <Darren.Moffat at sun.com> wrote: >> Dick Davies wrote: >> >> > The only real use I''d see would be for redundant copies >> > on a single disk, but then why wouldn''t I just add a disk? >> >> Some systems have physical space for only a single drive - think most >> laptops! > > True - I''m a laptop user myself. But as I said, I''d assume the whole disk > would fail (it does in my experience).Indeed and that is the failure I had recently - complete death of disk no longer visible even by the BIOS.> If your hardware craps differently to mine, you could do a similar thing > with partitions (or even files) as vdevs. Wouldn''t be any less reliable.One downside to that being that ZFS can perform better, due to write cache IIRC, when given the whole disk - though in the laptop case this isn''t applicable until ZFS boot is ready. -- Darren J Moffat
Darren J Moffat
2006-Sep-12 12:31 UTC
[zfs-discuss] Proposal: multiple copies of user data
The multiple copies needs to be thought out carefully for interactions with ZFS crypto since. I''m not sure what the impact is yet, it would help to know at what layer in the ZIO pipeline this is done - eg today before or after compression. -- Darren J Moffat
This proposal would benefit greatly from a "problem statement." As it stands, it feels like a solution looking for a problem. The Introduction mentions a different problem and solution, but then pretends that there is value to this solution. The Description section mentions some benefits of 'copies' relative to the existing situation, but requires that the reader piece together the whole picture. And IMO there aren't enough pieces :-) , i.e. so far I haven't seen sufficient justification for the added administrative complexity and potential for confusion, both administrative and user.

Matthew Ahrens wrote:
> Here is a proposal for a new 'copies' property which would allow
> different levels of replication for different filesystems.
[...]

-- 
--------------------------------------------------------------------------
Jeff VICTOR              Sun Microsystems             jeff.victor @ sun.com
OS Ambassador            Sr. Technical Specialist
Solaris 10 Zones FAQ: http://www.opensolaris.org/os/community/zones/faq
--------------------------------------------------------------------------
Anton B. Rang
2006-Sep-12 14:54 UTC
[zfs-discuss] Re: Proposal: multiple copies of user data
>The biggest problem I see with this is one of observability, if not all >of the data is encrypted yet what should the encryption property say ? >If it says encryption is on then the admin might think the data is >"safe", but if it says it is off that isn''t the truth either because >some of it maybe still encrypted.>From a user interface perspective, I''d expect something likeEncryption: Being enabled, 75% complete or Encryption: Being disabled, 25% complete, about 2h23m remaining I''m not sure how you''d map this into a property (or several), but it seems like "on"/"off" ought to be paired with "transitioning to on"/"transitioning to off" for any changes which aren''t instantaneous. This message posted from opensolaris.org
Darren J Moffat
2006-Sep-12 14:59 UTC
[zfs-discuss] Re: Proposal: multiple copies of user data
Anton B. Rang wrote:>> The biggest problem I see with this is one of observability, if not all >> of the data is encrypted yet what should the encryption property say ? >> If it says encryption is on then the admin might think the data is >> "safe", but if it says it is off that isn''t the truth either because >> some of it maybe still encrypted. > >>From a user interface perspective, I''d expect something like > > Encryption: Being enabled, 75% complete > or > Encryption: Being disabled, 25% complete, about 2h23m remainingand if we are still writing to the file systems at that time ? Maybe this really does need to be done with the file system locked.> I''m not sure how you''d map this into a property (or several), but it seems like "on"/"off" ought to be paired with "transitioning to on"/"transitioning to off" for any changes which aren''t instantaneous.Agreed, and checksum and compression would have the same issue if there was a mechanism to rewrite with the new checksums or compression settings. -- Darren J Moffat
Anton B. Rang
2006-Sep-12 14:59 UTC
[zfs-discuss] Re: Proposal: multiple copies of user data
>True - I''m a laptop user myself. But as I said, I''d assume the whole disk >would fail (it does in my experience).That''s usually the case, but single-block failures can occur as well. They''re rare (check the "uncorrectable bit error rate" specifications) but if they happen to hit a critical file, they''re painful. On the other hand, multiple copies seems (to me) like a really expensive way to deal with this. ZFS is already using relatively large blocks, so it could add an erasure code on top of them and have far less storage overhead. If the assumed problem is multi-block failures in one area of the disk, I''d wonder how common this failure mode is; in my experience, multi-block failures are generally due to the head having touched the platter, in which case the whole drive will shortly fail. (In any case, multi-block failures could be addressed by spreading the data from a large block and using an erasure code.) This message posted from opensolaris.org
Anton B. Rang
2006-Sep-12 15:04 UTC
[zfs-discuss] Re: Re: Proposal: multiple copies of user data
>And if we are still writing to the file systems at that time ?New writes should be done according to the new state (if encryption is being enabled, all new writes are encrypted), since the goal is that eventually the whole disk will be in the new state. The completion percentage should probably reflect the existing data at the time that the state change is initiated, since new writes won''t affect how much data has to be replaced.>Maybe this really does need to be done with the file system locked.I don''t see any technical reason to require that, and users expect better from us these days. :-) As you point out, checksum & compression will have the same issue once we have on-line changes for those as well. The framework ought to take care of this. This message posted from opensolaris.org
David Dyer-Bennet
2006-Sep-12 15:12 UTC
[zfs-discuss] Proposal: multiple copies of user data
On 9/11/06, Matthew Ahrens <Matthew.Ahrens at sun.com> wrote:> Here is a proposal for a new ''copies'' property which would allow > different levels of replication for different filesystems. > > Your comments are appreciated!I''ve read the proposal, and followed the discussion so far. I have to say that I don''t see any particular need for this feature. Possibly there is a need for a different feature, in which the entire control of redundancy is moved away from the pool level and to the file or filesystem level. I definitely see the attraction of being able to specify by file and directory different degrees of reliability needed. However, the details of the feature actually proposed don''t seem to satisfy the need for extra reliability at the level that drives people to employ redundancy; it doesn''t provide a guaranty. I see no need for additional non-guaranteed reliability on top of the levels of guaranty provided by use of redundancy at the pool level. Furthermore, as others have pointed out, this feature would add a high degree of user-visible complexity.>From what I''ve seen here so far, I think this is a bad idea and shouldnot be added. -- David Dyer-Bennet, <mailto:dd-b at dd-b.net>, <http://www.dd-b.net/dd-b/> RKBA: <http://www.dd-b.net/carry/> Pics: <http://www.dd-b.net/dd-b/SnapshotAlbum/> Dragaera/Steven Brust: <http://dragaera.info/>
Darren J Moffat wrote:> While encryption of existing data is not in scope for the first ZFS > crypto phase I am being careful in the design to ensure that it can be > done later if such a ZFS "framework" becomes available. > > The biggest problem I see with this is one of observability, if not all > of the data is encrypted yet what should the encryption property say ? > If it says encryption is on then the admin might think the data is > "safe", but if it says it is off that isn''t the truth either because > some of it maybe in encrypted.I would also think that there''s a significant problem around what to do about the previously unencrypted data. I assume that when performing a "scrub" to encrypt the data, the encrypted data will not be written on the same blocks previously used to hold the unencrypted data. As such, there''s a very good chance that the unencrypted data would still be there for quite some time. You may not be able to access it through the filesystem, but someone with access to the raw disks may be able to recover at least parts of it. In this case, the "scrub" would not only have to write the encrypted data but also overwrite the unencrypted data (multiple times?). Neil
Darren J Moffat
2006-Sep-12 16:17 UTC
[zfs-discuss] Proposal: multiple copies of user data
Neil A. Wilson wrote:> Darren J Moffat wrote: >> While encryption of existing data is not in scope for the first ZFS >> crypto phase I am being careful in the design to ensure that it can be >> done later if such a ZFS "framework" becomes available. >> >> The biggest problem I see with this is one of observability, if not >> all of the data is encrypted yet what should the encryption property >> say ? If it says encryption is on then the admin might think the data >> is "safe", but if it says it is off that isn''t the truth either >> because some of it maybe in encrypted. > > I would also think that there''s a significant problem around what to do > about the previously unencrypted data. I assume that when performing a > "scrub" to encrypt the data, the encrypted data will not be written on > the same blocks previously used to hold the unencrypted data. As such, > there''s a very good chance that the unencrypted data would still be > there for quite some time. You may not be able to access it through the > filesystem, but someone with access to the raw disks may be able to > recover at least parts of it. In this case, the "scrub" would not only > have to write the encrypted data but also overwrite the unencrypted data > (multiple times?).Right, that is a very important issue. Would a ZFS "scrub" framework do copy on write ? As you point out if it doesn''t then we still need to do something about the old clear text blocks because strings(1) over the raw disk will show them. I see the desire to have a knob that says "make this encrypted now" but I personally believe that it is actually better if you can make this choice at the time you create the ZFS data set. -- Darren J Moffat
Nicolas Williams
2006-Sep-12 16:48 UTC
[zfs-discuss] Proposal: multiple copies of user data
On Tue, Sep 12, 2006 at 10:36:30AM +0100, Darren J Moffat wrote:> Mike Gerdts wrote: > >Is there anything in the works to compress (or encrypt) existing data > >after the fact? For example, a special option to scrub that causes > >the data to be re-written with the new properties could potentially do > >this. If so, this feature should subscribe to any generic framework > >provided by such an effort. > > While encryption of existing data is not in scope for the first ZFS > crypto phase I am being careful in the design to ensure that it can be > done later if such a ZFS "framework" becomes available. > > The biggest problem I see with this is one of observability, if not all > of the data is encrypted yet what should the encryption property say ? > If it says encryption is on then the admin might think the data is > "safe", but if it says it is off that isn''t the truth either because > some of it maybe in encrypted.I agree -- there needs to be a filesystem re-write option, something like a "scrub" but at the filesystem level. Things that might be accomplished through it: - record size changes - compression toggling / compression algorithm changes - encryption/re-keying/alg. changes - checksum alg. changes - ditto blocking What else? To me it''s important that such "scrubs" not happen simply as a result of setting/changing a filesystem property, but it''s also important that the user/admin be told that changing the property requires scrubbing in order to take effect for data/meta-data written before the change. Nico --
Nicolas Williams
2006-Sep-12 16:53 UTC
[zfs-discuss] Proposal: multiple copies of user data
On Tue, Sep 12, 2006 at 05:17:16PM +0100, Darren J Moffat wrote:> I see the desire to have a knob that says "make this encrypted now" but > I personally believe that it is actually better if you can make this > choice at the time you create the ZFS data set.Including when creating the dataset through zfs receive. I definitely want re-keying/cipher and MAC/checksum algorithm changes to be supported at zfs send/receive time. Nico --
On Tue, 12 Sep 2006, Anton B. Rang wrote: .... reformatted ....> >True - I''m a laptop user myself. But as I said, I''d assume the whole disk > >would fail (it does in my experience).Usually a laptop disk suffers a mechanical failure - and the failure rate is a lot higher than disks in a fixed location environment.> That''s usually the case, but single-block failures can occur as well. > They''re rare (check the "uncorrectable bit error rate" specifications) > but if they happen to hit a critical file, they''re painful. > > On the other hand, multiple copies seems (to me) like a really expensive > way to deal with this. ZFS is already using relatively large blocks, so > it could add an erasure code on top of them and have far less storage > overhead. If the assumed problem is multi-block failures in one area of > the disk, I''d wonder how common this failure mode is; in my experience, > multi-block failures are generally due to the head having touched the > platter, in which case the whole drive will shortly fail. (In any case,The following is based on dated knowledge from personal experience and I can''t say if its (still) accurate information today. Drive failures in a localized area are generally caused by the heads being positioned in the same (general) cylinder position for long periods of time. The heads ride on a air bearing - but there is still a lot of friction caused by the movement of air under the heads. This is turn generates heat. Localized heat buildup can cause some of the material coated on the disk to break free. The drive is designed for this eventuality - since it is equipped with a very fine filter which will catch and trap anything that breaks free and the airflow is designed to constantly circulate the air through the filter. However, some of the material might get trapped between the head and the disk and possibly stick to the disk. In this case, the neighbouring disk cylinders in this general area will probably be damaged and, if enough material accumulates, so might the head(s). In the old days people wrote their own head "floater" programs - to ensure that the head was moved randomly across the disk surface from time to time. I don''t know if this is still relevant today - since the amount of firmware a disk drive executes, continues to increase every day. But in a typical usage scenario, where a user does, for example, a find operation in a home directory - and the directory caches are not sized large enough, there is a good probability that the heads will end up in the same general area of the disk, after the find op completes. Assuming that the box has enough memory, the disk may not be accessed again for a long time - and possibly only during another find op (wash, rinse, repeat). Continuing: a buildup of heat in a localized cylinder area, will cause the disk platter to expand and shift, relative to the heads. The disk platter has one surface dedicated to storing servo information - and from this the disk can "decide" that it is on the wrong cylinder after a head movement. In which case the drive will recalibrate itself (thermal recalibration) and store a table of offsets for different cylinder ranges. So when the head it told, for example, to move to cylinder 1000, the correction table will tell it to move to where physical cylinder 1000 should be and then add the correction delta (plus or minus) for that cylinder range to figure out where to the actually move the heads to. Now the heads are positioned on the correct cylinder and should be centered on it. 
If the drive gets a bad CRC after reading a cylinder it can use the CRC to correct the data or it can command that the data be re-read, until a correctable read is obtained. Last I heard, the number of retries is of the order of 100 to 200 or more(??). So this will be noticable - since 100 reads will require 100 revolutions of the disk. Retries like this will probably continue to provide correctable data to the user and the disk drive will ignore the fact that there is an area of disk where retries are constantly required. This is what Steve Gibson picked up on for his SpinRite product. If he runs code that can determine that CRC corrections or re-reads are required to retrieve good data, then he "knows" this is a likely area of the disk to fail in the (possibly near) future. So he relocates the data in this area, marks the area "bad", and the drive avoids it. Given what I wrote earlier, that there could be some physical damage in this general area - having the heads avoid it is a Good Thing. So the question is, how relevant is storing multiple copies of data on a disk in terms of the mechanics of modern disk drive failure modes. Without some "SpinRite" like functionality in the code, the drive will continue to access the deteriorating disk cylinders, now a localized failure, and eventually it will deteriorate further and cause enough material to break free to take out the head(s). At which time the drive is toast.> multi-block failures could be addressed by spreading the data from a > large block and using an erasure code.)Regards, Al Hopper Logical Approach Inc, Plano, TX. al at logical-approach.com Voice: 972.379.2133 Fax: 972.379.2134 Timezone: US CDT OpenSolaris.Org Community Advisory Board (CAB) Member - Apr 2005 OpenSolaris Governing Board (OGB) Member - Feb 2006
Darren said:
> Right, that is a very important issue. Would a
> ZFS "scrub" framework do copy on write?
> As you point out, if it doesn't then we still need
> to do something about the old clear text blocks
> because strings(1) over the raw disk will show them.
>
> I see the desire to have a knob that says "make this
> encrypted now" but I personally believe that it is
> actually better if you can make this choice at the
> time you create the ZFS data set.

I'm not sure that that gets rid of the problem at all. If I have an existing filesystem that I want to encrypt, but I need to create a new dataset to do so, I'm going to create my new, encrypted dataset, then copy my data onto it, then (maybe) delete the old one. If both datasets are in the same pool (which is likely), I'll still not be able to securely erase the blocks that have all my cleartext data on them. The only way to do the job properly would be to overwrite the entire pool, which is likely to be pretty inconvenient in most cases.

So, how about some way to securely erase freed blocks? It could be implemented as a one-off operation that acts on an entire pool, e.g.

    zfs shred tank

which would walk the free block list and overwrite with random data some number of times. Or it might be more useful to have it as a per-dataset option:

    zfs set shred=32 tank/secure

which could overwrite blocks with random data as they are freed.

I have no idea how expensive this might be (both in development time and in performance hit), but its use might be a bit wider than just dealing with encryption and/or rekeying. I guess that deletion of a snapshot might get a bit expensive, but maybe there's some way that blocks awaiting shredding could be queued up and dealt with at a lower priority...

Steve.
Take this for what it is: the opinion on someone who knows less about zfs than probably anyone else on this thread ,but... I would like to add my support for this proposal. As I understand it, the reason for using ditto blocks on metadata, is that maintaining their integrity is vital for the health of the filesystem, even if the zpool isn''t mirrored or redundant in any way ie laptops, or people who just don''t or can''t add another drive. One of the great things about zfs, is that it protects not just against mechanical failure, but against silent data corruption. Having this available to laptop owners seems to me to be important to making zfs even more attractive. Granted, if you are running a enterprise based fileserver, this probably isn''t going to be your first choice for data protection. You will probably be using the other features of zfs like mirroring, raidz raidz2 etc. Am I correct in assuming that having say 2 copies of your "documents" filesystem means should silent data corruption occur, your data can be reconstructed. So that you can leave your os and base applications with 1 copy, but your important data can be protected. In a way, this reminds me of intel''s "matrix raid" but much cooler (it doesn''t rely on a specific motherboard for one thing). I would also agree that utilities like ''ls'' and quotas should report both and count against peoples quotas. It just doesn''t seem to hard to me to understand that because you have 2 copies, you halve the amount of available space. Just to reiterate, I think this would be an awesome feature! Celso. PS. Please feel free to correct me on any technical inaccuracies. I am trying to learn about zfs and Solaris 10 in general. This message posted from opensolaris.org
Dick Davies
2006-Sep-12 20:45 UTC
[zfs-discuss] Re: Proposal: multiple copies of user data
On 12/09/06, Celso <celsouk at gmail.com> wrote:> One of the great things about zfs, is that it protects not just against mechanical failure, but against silent data corruption. Having this available to laptop owners seems to me to be important to making zfs even more attractive.I''m not arguing against that. I was just saying that *if* this was useful to you (and you were happy with the dubious resilience/performance benefits) you can already create mirrors/raidz on a single disk by using partitions as building blocks. There''s no need to implement the proposal to gain that.> Am I correct in assuming that having say 2 copies of your "documents" filesystem means should silent data corruption occur, your data can be reconstructed. So that you can leave your os and base applications with 1 copy, but your important data can be protected.Yes. -- Rasputin :: Jack of All Trades - Master of Nuns http://number9.hellooperator.net/
> On 12/09/06, Celso <celsouk at gmail.com> wrote: > > > One of the great things about zfs, is that it > protects not just against mechanical failure, but > against silent data corruption. Having this available > to laptop owners seems to me to be important to > making zfs even more attractive. > > I''m not arguing against that. I was just saying that > *if* this was useful to you > (and you were happy with the dubious > resilience/performance benefits) you can > already create mirrors/raidz on a single disk by > using partitions as > building blocks. > There''s no need to implement the proposal to gain > that. > >It''s not as granular though is it? In the situation you describe: ...you split one disk in two. you then have effectively two partitions which you can then create a new mirrored zpool with. Then everything is mirrored. Correct? With ditto blocks, you can selectively add copies (seeing as how filesystem are so easy to create on zfs). If you are only concerned with copies of your important documents and email, why should /usr/bin be mirrored. That''s my opinion anyway. I always enjoy choice, and I really believe this is a useful and flexible one. Celso This message posted from opensolaris.org
Dick Davies
2006-Sep-12 22:08 UTC
[zfs-discuss] Re: Re: Proposal: multiple copies of user data
On 12/09/06, Celso <celsouk at gmail.com> wrote:

> ...you split one disk in two. You then have effectively two partitions with which you can create a new mirrored zpool. Then everything is mirrored. Correct?

Everything in the filesystems in the pool, yes.

> With ditto blocks, you can selectively add copies (seeing as how filesystems are so easy to create on zfs). If you are only concerned with copies of your important documents and email, why should /usr/bin be mirrored?

So my machine will boot if a disk fails. Which happened the other day :)

-- 
Rasputin :: Jack of All Trades - Master of Nuns
http://number9.hellooperator.net/
Celso
2006-Sep-12 22:24 UTC
[zfs-discuss] Re: Re: Re: Proposal: multiple copies of user data
> On 12/09/06, Celso <celsouk at gmail.com> wrote:
>
> > ...you split one disk in two. You then have effectively two partitions with which you can create a new mirrored zpool. Then everything is mirrored. Correct?
>
> Everything in the filesystems in the pool, yes.
>
> > With ditto blocks, you can selectively add copies (seeing as how filesystems are so easy to create on zfs). If you are only concerned with copies of your important documents and email, why should /usr/bin be mirrored?
>
> So my machine will boot if a disk fails. Which happened the other day :)
>
> --
> Rasputin :: Jack of All Trades - Master of Nuns
> http://number9.hellooperator.net/

OK, cool. I think it has already been said that in many people''s experience, when a disk fails, it completely fails. Especially on laptops. Of course ditto blocks wouldn''t help you in this situation either!

I still think that silent data corruption is a valid concern, one that ditto blocks would solve. Also, I am not thrilled about losing that much space for duplication of unnecessary data (caused by partitioning a disk in two). I also echo Darren''s comments on zfs performing better when it has the whole disk.

Hopefully we can agree that you lose nothing by adding this feature, even if you personally don''t see a need for it.

Celso

This message posted from opensolaris.org
Torrey McMahon
2006-Sep-12 22:37 UTC
[zfs-discuss] Re: Re: Re: Proposal: multiple copies of user data
Celso wrote:

> Hopefully we can agree that you lose nothing by adding this feature, even if you personally don''t see a need for it.

If I read correctly, user tools will show more space in use when adding copies, quotas are impacted, etc. One could argue that the added confusion outweighs the benefit of the feature. As others have asked, I''d like to see the problem that this feature is designed to solve.
Matthew Ahrens wrote:

> Here is a proposal for a new ''copies'' property which would allow different levels of replication for different filesystems.

Thanks everyone for your input.

The problem that this feature attempts to address is when you have some data that is more important (and thus needs a higher level of redundancy) than other data. Of course in some situations you can use multiple pools, but that is antithetical to ZFS''s pooled storage model. (You have to divide up your storage, you''ll end up with stranded storage and bandwidth, etc.)

Given the overwhelming criticism of this feature, I''m going to shelve it for now.

Out of curiosity, what would you guys think about addressing this same problem by having the option to store some filesystems unreplicated on a mirrored (or raid-z) pool? This would have the same issues of unexpected space usage, but since it would be *less* than expected, that might be more acceptable. There are no plans to implement anything like this right now, but I just wanted to get a read on it.

--matt
> Matthew Ahrens wrote: > > Here is a proposal for a new ''copies'' property > which would allow > > different levels of replication for different > filesystems. > > Thanks everyone for your input. > > The problem that this feature attempts to address is > when you have some > data that is more important (and thus needs a higher > level of > redundancy) than other data. Of course in some > situations you can use > multiple pools, but that is antithetical to ZFS''s > pooled storage model. > (You have to divide up your storage, you''ll end up > with stranded > torage and bandwidth, etc.) > > Given the overwhelming criticism of this feature, I''m > going to shelve it > for now.Damn! That''s a real shame! I was really starting to look forward to that. Please reconsider??!> --matt > _______________________________________________ > zfs-discuss mailing list > zfs-discuss at opensolaris.org > http://mail.opensolaris.org/mailman/listinfo/zfs-discu > ss >Celso This message posted from opensolaris.org
David Dyer-Bennet
2006-Sep-12 23:09 UTC
[zfs-discuss] Proposal: multiple copies of user data
On 9/12/06, Matthew Ahrens <Matthew.Ahrens at sun.com> wrote:> Matthew Ahrens wrote: > > Here is a proposal for a new ''copies'' property which would allow > > different levels of replication for different filesystems. > > Thanks everyone for your input. > > The problem that this feature attempts to address is when you have some > data that is more important (and thus needs a higher level of > redundancy) than other data. Of course in some situations you can use > multiple pools, but that is antithetical to ZFS''s pooled storage model. > (You have to divide up your storage, you''ll end up with stranded > storage and bandwidth, etc.) > > Given the overwhelming criticism of this feature, I''m going to shelve it > for now.I think it''s a valid problem. My understanding was that this didn''t give a *guaranteed* solution, though. I think most people, when committing to the point of replication (spending actual money), need a guarantee at some level (not of course of total safety; but that the data actually does exist on separate disks, and will survive the destruction of one disk). A good solution to this problem would be valuable. (And I''d accept a non-guarantee on a single disk; or rather a guarantee that said "if enough blocks to find the data exist, and a copy of each data block exists, we can retrieve the data"; but that guarantee *does* exist I think).> Out of curiosity, what would you guys think about addressing this same > problem by having the option to store some filesystems unreplicated on > an mirrored (or raid-z) pool? This would have the same issues of > unexpected space usage, but since it would be *less* than expected, that > might be more acceptable. There are no plans to implement anything like > this right now, but I just wanted to get a read on it.I was never concerned at the free space issues (though I was concerned by some of the proposed solutions to what I saw as a non-issue). I''d be happy if the free space described how many bytes of default files you could add to the pool, and the user would have to understand that results would differ if they used non-default parameters. You''re probably right that fewer people would mind having *more* space than an unthinking reading would show than less. -- David Dyer-Bennet, <mailto:dd-b at dd-b.net>, <http://www.dd-b.net/dd-b/> RKBA: <http://www.dd-b.net/carry/> Pics: <http://www.dd-b.net/dd-b/SnapshotAlbum/> Dragaera/Steven Brust: <http://dragaera.info/>
Dick Davies
2006-Sep-12 23:13 UTC
[zfs-discuss] Re: Re: Re: Proposal: multiple copies of user data
On 12/09/06, Celso <celsouk at gmail.com> wrote:

> I think it has already been said that in many people''s experience, when a disk fails, it completely fails. Especially on laptops. Of course ditto blocks wouldn''t help you in this situation either!

Exactly.

> I still think that silent data corruption is a valid concern, one that ditto blocks would solve. Also, I am not thrilled about losing that much space for duplication of unnecessary data (caused by partitioning a disk in two).

Well, you''d only be duplicating the data on the mirror. If you don''t want to mirror the base OS, no one''s saying you have to.

For the sake of argument, let''s assume:

1. disk is expensive
2. someone is keeping valuable files on a non-redundant zpool
3. they can''t scrape enough vdevs to make a redundant zpool (remembering you can build vdevs out of *flat files*)

Even then, to my mind: to the user, the *file* (screenplay, movie of a child''s birth, civ3 saved game, etc.) is the logical entity to have a ''duplication level'' attached to it, and the only person who can score that is the author of the file. This proposal says the filesystem creator/admin scores the filesystem. Your argument against unnecessary data duplication applies to all ''non-special'' files in the ''special'' filesystem. They''re wasting space too.

If the user wants to make sure the file is ''safer'' than others, he can just make multiple copies. Either to a USB disk/flashdrive, cdrw, dvd, ftp server, whatever. The redundancy you''re talking about is what you''d get from ''cp /foo/bar.jpg /foo/bar.jpg.ok'', except it''s hidden from the user and causing headaches for anyone trying to comprehend, port or extend the codebase in the future.

> I also echo Darren''s comments on zfs performing better when it has the whole disk.

Me too, but a lot of laptop users dual-boot, which makes it a moot point.

> Hopefully we can agree that you lose nothing by adding this feature, even if you personally don''t see a need for it.

Sorry, I don''t think we''re going to agree on this one :)

I''ve seen dozens of project proposals in the few months I''ve been lurking around opensolaris. Most of them have been of no use to me, but each to their own. I''m afraid I honestly think this greatly complicates the conceptual model (not to mention the technical implementation) of ZFS, and I haven''t seen a convincing use case.

All the best
Dick.

-- 
Rasputin :: Jack of All Trades - Master of Nuns
http://number9.hellooperator.net/
Matthew Ahrens wrote:> Matthew Ahrens wrote: >> Here is a proposal for a new ''copies'' property which would allow >> different levels of replication for different filesystems. > > Thanks everyone for your input. > > The problem that this feature attempts to address is when you have some > data that is more important (and thus needs a higher level of > redundancy) than other data. Of course in some situations you can use > multiple pools, but that is antithetical to ZFS''s pooled storage model. > (You have to divide up your storage, you''ll end up with stranded > storage and bandwidth, etc.) > > Given the overwhelming criticism of this feature, I''m going to shelve it > for now.This is unfortunate. As a laptop user with only a single drive, I was looking forward to it since I''ve been bitten in the past by data loss caused by a bad area on the disk. I don''t care about the space consumption because I generally don''t come anywhere close to filling up the available space. It may not be the primary market for ZFS, but it could be a very useful side benefit.> Out of curiosity, what would you guys think about addressing this same > problem by having the option to store some filesystems unreplicated on > an mirrored (or raid-z) pool? This would have the same issues of > unexpected space usage, but since it would be *less* than expected, that > might be more acceptable. There are no plans to implement anything like > this right now, but I just wanted to get a read on it.I don''t see much need for this in any area that I would use ZFS (either my own personal use or for any case in which I would recommend it for production use). However, if you think that it''s OK to under-report free space, then why not just do that for the "data ditto" blocks. If one or more of my filesystems are configured to keep two copies of the data, then simply report only half of the available space. If duplication isn''t enabled for the entire pool but only for certain filesystems, then perhaps you could even take advantage of quotas for those filesystems to make a more accurate calculation.> > --matt > _______________________________________________ > zfs-discuss mailing list > zfs-discuss at opensolaris.org > http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Matthew Ahrens
2006-Sep-12 23:38 UTC
[zfs-discuss] Re: Re: Re: Proposal: multiple copies of user data
Dick Davies wrote:

> For the sake of argument, let''s assume:
>
> 1. disk is expensive
> 2. someone is keeping valuable files on a non-redundant zpool
> 3. they can''t scrape enough vdevs to make a redundant zpool (remembering you can build vdevs out of *flat files*)

Given those assumptions, I think that the proposed feature is the perfect solution. Simply put those files in a filesystem that has copies>1. Also note that using files to back vdevs is not a recommended solution.

> If the user wants to make sure the file is ''safer'' than others, he can just make multiple copies. Either to a USB disk/flashdrive, cdrw, dvd, ftp server, whatever.

It seems to me that asking the user to solve this problem by manually making copies of all his files puts all the burden on the user/administrator and is a poor solution. For one, they have to remember to do it pretty often. For two, when they do experience some data loss, they have to manually reconstruct the files! They could have one file which has part of it missing from copy A and part of it missing from copy B. I''d hate to have to reconstruct that manually from two different files, but the proposed solution would do this transparently.

> The redundancy you''re talking about is what you''d get from ''cp /foo/bar.jpg /foo/bar.jpg.ok'', except it''s hidden from the user and causing headaches for anyone trying to comprehend, port or extend the codebase in the future.

Whether it''s hard to understand is debatable, but this feature integrates very smoothly with the existing infrastructure and wouldn''t cause any trouble when extending or porting ZFS.

> I''m afraid I honestly think this greatly complicates the conceptual model (not to mention the technical implementation) of ZFS, and I haven''t seen a convincing use case.

Just for the record, these changes are pretty trivial to implement; less than 50 lines of code changed.

--matt
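To make Matt''s suggestion concrete, a minimal sketch of the workflow under the proposal (pool and dataset names are made up for illustration; remember that ''copies'' only affects data written after it is set):

    # create a filesystem for the important data with two copies
    zfs create -o copies=2 tank/precious

    # confirm the setting
    zfs get copies tank/precious

    # copy the valuable files in so they are written with two copies
    cp -r /export/home/me/documents /tank/precious/

Data left in other filesystems is unaffected; only blocks written into the copies=2 filesystem get the extra copy.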
Celso
2006-Sep-12 23:39 UTC
[zfs-discuss] Re: Re: Re: Re: Proposal: multiple copies of user data
> On 12/09/06, Celso <celsouk at gmail.com> wrote: > > > I think it has already been said that in many > peoples experience, when a disk fails, it completely > fails. Especially on laptops. Of course ditto blocks > wouldn''t help you in this situation either! > > Exactly. > > > I still think that silent data corruption is a > valid concern, one that ditto blocks would solve. > > Also, I am not thrilled about losing that much space > for duplication of unneccessary data (caused by > partitioning a disk in two). > > Well, you''d only be duplicating the data on the > mirror. If you don''t want to > mirror the base OS, no one''s saying you have to. >Yikes! that sounds like even more partitioning!> For the sake of argument, let''s assume: > > 1. disk is expensive > 2. someone is keeping valuable files on a > non-redundant zpool > 3. they can''t scrape enough vdevs to make a redundant > zpool > (remembering you can build vdevs out of *flat > files*) > Even then, to my mind: > > to the user, the *file* (screenplay, movie of childs > birth, civ3 saved > game, etc.) > is the logical entity to have a ''duplication level'' > attached to it, > and the only person who can score that is the author > of the file. > > This proposal says the filesystem creator/admin > scores the filesystem. > Your argument against unneccessary data duplication > applies to all ''non-special'' > files in the ''special'' filesystem. They''re wasting > space too. > > If the user wants to make sure the file is ''safer'' > than others, he can > just make > multiple copies. Either to a USB disk/flashdrive, > cdrw, dvd, ftp > server, whatever. > > The redundancy you''re talking about is what you''d get > from ''cp /foo/bar.jpg /foo/bar.jpg.ok'', except it''s > hidden from the > user and causing > headaches for anyone trying to comprehend, port or > extend the codebase in > the future.the proposed solution differs in one important aspect: it automatically detects data corruption.> > I also echo Darren''s comments on zfs performing > better when it has the whole disk. > > Me too, but a lot of laptop users dual-boot, which > makes it a moot point. > > > Hopefully we can agree that you lose nothing by > adding this feature, > > even if you personally don''t see a need for it. > > Sorry, I don''t think we''re going to agree on this one > :)No worries, that''s cool.> All the best > Dick. > > -- > Rasputin :: Jack of All Trades - Master of Nuns > http://number9.hellooperator.net/ > _______________________________________________ > zfs-discuss mailing list > zfs-discuss at opensolaris.org > http://mail.opensolaris.org/mailman/listinfo/zfs-discu > ss >Celso This message posted from opensolaris.org
Chad Lewis
2006-Sep-12 23:51 UTC
[zfs-discuss] Re: Re: Re: Re: Proposal: multiple copies of user data
On Sep 12, 2006, at 4:39 PM, Celso wrote:>> On 12/09/06, Celso <celsouk at gmail.com> wrote: >> >>> I think it has already been said that in many >> peoples experience, when a disk fails, it completely >> fails. Especially on laptops. Of course ditto blocks >> wouldn''t help you in this situation either! >> >> Exactly. >> >>> I still think that silent data corruption is a >> valid concern, one that ditto blocks would solve. > >> Also, I am not thrilled about losing that much space >> for duplication of unneccessary data (caused by >> partitioning a disk in two). >> >> Well, you''d only be duplicating the data on the >> mirror. If you don''t want to >> mirror the base OS, no one''s saying you have to. >> > > Yikes! that sounds like even more partitioning! > >> >> The redundancy you''re talking about is what you''d get >> from ''cp /foo/bar.jpg /foo/bar.jpg.ok'', except it''s >> hidden from the >> user and causing >> headaches for anyone trying to comprehend, port or >> extend the codebase in >> the future. > > the proposed solution differs in one important aspect: it > automatically detects data corruption. > >Detecting data corruption is a function of the ZFS checksumming feature. The proposed solution has _nothing_ to do with detecting corruption. The difference is in what happens when/if such bad data is detected. Without a duplicate copy, via some RAID level or the proposed ditto block copies, the file is corrupted.
Matthew Ahrens wrote:

> Matthew Ahrens wrote:
> > Here is a proposal for a new ''copies'' property which would allow different levels of replication for different filesystems.
>
> Thanks everyone for your input.
>
> The problem that this feature attempts to address is when you have some data that is more important (and thus needs a higher level of redundancy) than other data. Of course in some situations you can use multiple pools, but that is antithetical to ZFS''s pooled storage model. (You have to divide up your storage, you''ll end up with stranded storage and bandwidth, etc.)
>
> Given the overwhelming criticism of this feature, I''m going to shelve it for now.

So it seems to me that having this feature per-file is really useful. Say I have a presentation to give in Pleasanton, and the presentation lives on my single-disk laptop - I want all the meta-data and the actual presentation to be replicated. We already use ditto blocks for the meta-data. Now we could have an extra copy of the actual data. When I get back from the presentation I can turn off the extra copies.

Doing it for the filesystem is just one step higher (and makes it administratively easier as I don''t have to type the same command for each file that''s important). Mirroring is just like another step above that - though it''s possibly replicating stuff you just don''t care about.

Now placing extra copies of the data doesn''t guarantee that data will survive multiple disk failures; but neither does having a mirrored pool guarantee the data will be there either (2 disk failures). Both methods are about increasing your chances of having your valuable data around.

I for one would have loved to have multiple-copy filesystems + ZFS on my powerbook when I was travelling in Australia for a month - think of all the digital pictures you take and how pissed you would be if the one with the wild wombat didn''t survive. It''s maybe not an enterprise solution, but it seems like a consumer solution.

Ensuring that the space accounting tools make sense is definitely a valid point though.

eric

> Out of curiosity, what would you guys think about addressing this same problem by having the option to store some filesystems unreplicated on a mirrored (or raid-z) pool? This would have the same issues of unexpected space usage, but since it would be *less* than expected, that might be more acceptable. There are no plans to implement anything like this right now, but I just wanted to get a read on it.
>
> --matt
Celso
2006-Sep-13 00:01 UTC
[zfs-discuss] Re: Re: Re: Re: Proposal: multiple copies of user data
> It seems to me that asking the user to solve this problem by manually making copies of all his files puts all the burden on the user/administrator and is a poor solution.

I completely agree.

> For one, they have to remember to do it pretty often. For two, when they do experience some data loss, they have to manually reconstruct the files! They could have one file which has part of it missing from copy A and part of it missing from copy B. I''d hate to have to reconstruct that manually from two different files, but the proposed solution would do this transparently.

Again, I agree.

> > The redundancy you''re talking about is what you''d get from ''cp /foo/bar.jpg /foo/bar.jpg.ok'', except it''s hidden from the user and causing headaches for anyone trying to comprehend, port or extend the codebase in the future.
>
> Whether it''s hard to understand is debatable, but this feature integrates very smoothly with the existing infrastructure and wouldn''t cause any trouble when extending or porting ZFS.

OK, given this statement...

> Just for the record, these changes are pretty trivial to implement; less than 50 lines of code changed.

...and this statement, I can''t see any reason not to include it. If the changes are easy to do, don''t require any more of the zfs team''s valuable time, and don''t hinder other things, I would plead with you to include them, as I think they are genuinely valuable and would make zfs not only the best enterprise-level filesystem, but also the best filesystem for laptops/home computers.

celso

This message posted from opensolaris.org
Jeff Victor
2006-Sep-13 00:47 UTC
[zfs-discuss] Re: Re: Re: Re: Proposal: multiple copies of user data
Chad Lewis wrote:> > On Sep 12, 2006, at 4:39 PM, Celso wrote: > >> the proposed solution differs in one important aspect: it automatically >> detects data corruption. > > Detecting data corruption is a function of the ZFS checksumming feature. The > proposed solution has _nothing_ to do with detecting corruption. The difference > is in what happens when/if such bad data is detected. Without a duplicate copy, > via some RAID level or the proposed ditto block copies, the file is corrupted.With a mirrored ZFS pool, what are the odds of losing all copies of the [meta]data, for N disks (where N = 1, 2, etc)? I thought we understood this pretty well, and that the answer was extremely small. -------------------------------------------------------------------------- Jeff VICTOR Sun Microsystems jeff.victor @ sun.com OS Ambassador Sr. Technical Specialist Solaris 10 Zones FAQ: http://www.opensolaris.org/os/community/zones/faq --------------------------------------------------------------------------
eric kustarz wrote:> Matthew Ahrens wrote: > >> Matthew Ahrens wrote: >> >>> Here is a proposal for a new ''copies'' property which would allow >>> different levels of replication for different filesystems. >> >> >> Thanks everyone for your input. >> >> The problem that this feature attempts to address is when you have >> some data that is more important (and thus needs a higher level of >> redundancy) than other data. Of course in some situations you can >> use multiple pools, but that is antithetical to ZFS''s pooled storage >> model. (You have to divide up your storage, you''ll end up with >> stranded storage and bandwidth, etc.) >> >> Given the overwhelming criticism of this feature, I''m going to shelve >> it for now. > > > So it seems to me that having this feature per-file is really useful. > Say i have a presentation to give in Pleasanton, and the presentation > lives on my single-disk laptop - I want all the meta-data and the > actual presentation to be replicated. We already use ditto blocks for > the meta-data. Now we could have an extra copy of the actual data. > When i get back from the presentation i can turn off the extra copies.Under what failure nodes would your data still be accessible? What things can go wrong that still allow you to access the data because some event has removed one copy but left the others?
David Dyer-Bennet
2006-Sep-13 03:38 UTC
[zfs-discuss] Proposal: multiple copies of user data
On 9/12/06, eric kustarz <eric.kustarz at sun.com> wrote:

> So it seems to me that having this feature per-file is really useful. Say I have a presentation to give in Pleasanton, and the presentation lives on my single-disk laptop - I want all the meta-data and the actual presentation to be replicated. We already use ditto blocks for the meta-data. Now we could have an extra copy of the actual data. When I get back from the presentation I can turn off the extra copies.

Yes, you could do that.

*I* would make a copy on a CD, which I would carry in a separate case from the laptop. I think my presentation is a lot safer than your presentation.

Similarly for your digital images example; I don''t consider it safe until I have two or more *independent* copies. Two copies on a single hard drive doesn''t come even close to passing the test for me; as many people have pointed out, those tend to fail all at once. And I will also point out that laptops get stolen a lot. And of course all the accidents involving fumble-fingers, OS bugs, and driver bugs won''t be helped by the data duplication either. (Those will mostly be helped by sensible use of snapshots, though, which is another argument for ZFS on *any* disk you work on a lot.)

The more I look at it, the more I think that a second copy on the same disk doesn''t protect against very much real-world risk. Am I wrong here? Are partial (small) disk corruptions more common than I think? I don''t have a good statistical view of disk failures.

-- 
David Dyer-Bennet, <mailto:dd-b at dd-b.net>, <http://www.dd-b.net/dd-b/>
RKBA: <http://www.dd-b.net/carry/> Pics: <http://www.dd-b.net/dd-b/SnapshotAlbum/>
Dragaera/Steven Brust: <http://dragaera.info/>
David Dyer-Bennet
2006-Sep-13 03:43 UTC
[zfs-discuss] Re: Re: Re: Re: Proposal: multiple copies of user data
On 9/12/06, Celso <celsouk at gmail.com> wrote:

> > Whether it''s hard to understand is debatable, but this feature integrates very smoothly with the existing infrastructure and wouldn''t cause any trouble when extending or porting ZFS.
>
> OK, given this statement...
>
> > Just for the record, these changes are pretty trivial to implement; less than 50 lines of code changed.
>
> ...and this statement, I can''t see any reason not to include it. If the changes are easy to do, don''t require any more of the zfs team''s valuable time, and don''t hinder other things, I would plead with you to include them, as I think they are genuinely valuable and would make zfs not only the best enterprise-level filesystem, but also the best filesystem for laptops/home computers.

While I''m not a big fan of this feature, if the work is that well understood and that small, I have no objection to it. (Boy, that sounds snotty; apologies, not what I intend here. Those of you reading this know how much you care about my opinion; that''s up to you.)

I do pity the people who count on the ZFS redundancy to protect their presentation on an important sales trip -- and then have their laptop stolen. But those people might well be the same ones who would have *no* redundancy otherwise. And nothing about this feature prevents the paranoids like me from still making our backup CD and carrying it separately.

I''m not prepared to go so far as to argue that it''s bad to make them feel safer :-). At least, to make them feel safer *by making them actually safer*.

-- 
David Dyer-Bennet, <mailto:dd-b at dd-b.net>, <http://www.dd-b.net/dd-b/>
RKBA: <http://www.dd-b.net/carry/> Pics: <http://www.dd-b.net/dd-b/SnapshotAlbum/>
Dragaera/Steven Brust: <http://dragaera.info/>
David Dyer-Bennet wrote:

> The more I look at it, the more I think that a second copy on the same disk doesn''t protect against very much real-world risk. Am I wrong here? Are partial (small) disk corruptions more common than I think? I don''t have a good statistical view of disk failures.

I don''t have hard data at hand, but you see entire drives go bad much more often than a single section... and when a section does go bad, it is usually accompanied by a notice of a block re-allocation from the disk drive firmware. Often you''ll see a bunch of those, sometimes over the course of a month and sometimes over the course of a minute, and then the entire drive goes. In some cases a raid array will watch for those messages and automagically swap the drive with a hot spare after X amount of notifications. Al Hopper recently posted some more detailed examples of how this can happen.

However, let''s move to a different example and say you''ve got six drives in a raidZ pool. What failure modes - this time I used the ''m'' instead of the ''n'' - allow your data to survive that can''t already be taken care of with underlying raid configurations within the pool?
Torrey McMahon
2006-Sep-13 03:55 UTC
[zfs-discuss] Re: Re: Re: Re: Proposal: multiple copies of user data
David Dyer-Bennet wrote:

> While I''m not a big fan of this feature, if the work is that well understood and that small, I have no objection to it. (Boy, that sounds snotty; apologies, not what I intend here. Those of you reading this know how much you care about my opinion; that''s up to you.)

One could make the argument that the feature could cause enough confusion to not warrant its inclusion. If I''m a typical user and I write a file to a filesystem where the admin set three copies but didn''t tell me, it might throw me into a tizzy trying to figure out why my quota usage is 3X what I expect it to be.
Matthew Ahrens wrote:

> Matthew Ahrens wrote:
> > Here is a proposal for a new ''copies'' property which would allow different levels of replication for different filesystems.
>
> Thanks everyone for your input.
>
> The problem that this feature attempts to address is when you have some data that is more important (and thus needs a higher level of redundancy) than other data. Of course in some situations you can use multiple pools, but that is antithetical to ZFS''s pooled storage model. (You have to divide up your storage, you''ll end up with stranded storage and bandwidth, etc.)

Can you expand? I can think of some examples where using multiple pools - even on the same host - is quite useful given the current feature set of the product. Or are you only discussing the specific case where a host would want more reliability for a certain set of data than another? If that''s the case, I''m still confused as to what failure cases would still allow you to retrieve your data if there is more than one copy in the fs or pool... but I''ll gladly take some enlightenment. :)
A couple of points:

> One could make the argument that the feature could cause enough confusion to not warrant its inclusion. If I''m a typical user and I write a file to a filesystem where the admin set three copies but didn''t tell me, it might throw me into a tizzy trying to figure out why my quota usage is 3X what I expect it to be.

I don''t think anybody is saying it is going to be the default setup. If someone is not comfortable with a feature, surely they can choose to ignore it. An admin can use actual mirroring, raidz etc, and carry on as before.

There are many potentially confusing features of almost any computer system. Computers are complex things. I admin a couple of schools with a total of about 2000 kids. I really doubt that any of them would have a problem understanding it.

More importantly, is an institution utilizing quotas really the main market for this feature? It seems to me that it is clearly aimed at people in control of their own machines (even though I can see uses for this in pretty much any environment). I doubt anyone capable of installing and running Solaris on their laptop would be confused by this issue.

I don''t think anyone is saying that ditto blocks are a complete, never-lose-data solution. Sure, the whole disk can (and probably will) die on you. If you partitioned the disk, mirrored it, and it died, you would still be in trouble. If the disk doesn''t die, but for whatever reason you get silent data corruption, the checksums pick up the problem, and the ditto blocks allow recovery.

Given a situation where you:

a) have a laptop or home computer which you have important data on, and
b) for whatever reason, you can''t add another disk to utilize mirroring (and you are between backups),

this seems to me to be a very valid solution. Especially since, as has already been said, it takes very little to implement, and doesn''t hinder anything else within zfs. I think that people can benefit from this.

This message posted from opensolaris.org
Nicolas Williams
2006-Sep-13 05:21 UTC
[zfs-discuss] Proposal: multiple copies of user data
On Tue, Sep 12, 2006 at 03:56:00PM -0700, Matthew Ahrens wrote:

> The problem that this feature attempts to address is when you have some data that is more important (and thus needs a higher level of redundancy) than other data. Of course in some situations you can use multiple pools, but that is antithetical to ZFS''s pooled storage model. (You have to divide up your storage, you''ll end up with stranded storage and bandwidth, etc.)

For me this feature is something I would use on a laptop. I''d set copies = 2. The idea is: bad blocks happen, but I don''t want to lose data because of random bad blocks. Dead disks happen also. This feature won''t help with that. To deal with bad disks I''d like laptops to have two disks. Alternatively, I would (but don''t) plug in a USB disk from time to time as a mirror vdev.

> Given the overwhelming criticism of this feature, I''m going to shelve it for now.

I don''t see why. The used/free space UI issues need to be worked out. And you need to give guidance for when to use this (e.g., "ditto blocking is primarily intended for use on laptops" -- if you have mirroring/raid-5/raid-Z then you probably wouldn''t care for ditto blocking).

Beyond that, you do need to introduce a generic filesystem-level "scrub" or re-write option for dealing with changes to fs properties that would otherwise only apply to data/meta-data created/changed after the property change. But this was needed before this case, so I don''t see why this case should have to add that feature. Plus, I can imagine that such scrubbing could be difficult to implement, since it must appear to leave everything untouched, including snapshots and clones, yet actually have COWed everything that existed prior to the scrub.

Nico
--
Celso wrote:

> A couple of points:
>
> > One could make the argument that the feature could cause enough confusion to not warrant its inclusion. If I''m a typical user and I write a file to a filesystem where the admin set three copies but didn''t tell me, it might throw me into a tizzy trying to figure out why my quota usage is 3X what I expect it to be.
>
> I don''t think anybody is saying it is going to be the default setup. If someone is not comfortable with a feature, surely they can choose to ignore it. An admin can use actual mirroring, raidz etc, and carry on as before.
>
> There are many potentially confusing features of almost any computer system. Computers are complex things.
>
> I admin a couple of schools with a total of about 2000 kids. I really doubt that any of them would have a problem understanding it.
>
> More importantly, is an institution utilizing quotas really the main market for this feature? It seems to me that it is clearly aimed at people in control of their own machines (even though I can see uses for this in pretty much any environment). I doubt anyone capable of installing and running Solaris on their laptop would be confused by this issue.

It''s not the smart people I would be worried about. It''s the ones where you would get into endless loops of conversation around "But I only wrote 1MB, how come it says 2MB?" that worry me. Especially when it impacts a lot of user-level tools and could be a surprise if set by a BOFH type. That said, I was worried about that type of effect when the change itself seemed to have low value. However, you and Richard have pointed to at least one example where this would be useful at the file level....

> Given a situation where you:
>
> a) have a laptop or home computer which you have important data on, and
> b) for whatever reason, you can''t add another disk to utilize mirroring (and you are between backups),
>
> this seems to me to be a very valid solution.

... and though I see that as a valid solution to the issue, does it really cover enough ground to warrant inclusion of this feature, given some of the other issues that have been brought up? In the above case I think people would be more concerned with the entire system going down, a drive crashing, etc. than the possibility of a checksum error or data corruption requiring the lookup of a ditto block, if one exists. In that case they would create a copy on an independent system, like a USB disk, some sort of archiving media, like a CD-R, or even place a copy on a remote system, to maintain the data in case of a failure. Hell, I''ve been known to do all three to meet my own paranoia level.

IMHO, it''s more ammo to include the feature but I''m not sure it''s enough. Perhaps Richard''s late-breaking data concerning drive failures will add some more weight?
Dick Davies
2006-Sep-13 06:20 UTC
[zfs-discuss] Re: Re: Re: Proposal: multiple copies of user data
On 13/09/06, Matthew Ahrens <Matthew.Ahrens at sun.com> wrote:

> Dick Davies wrote:
> > For the sake of argument, let''s assume:
> >
> > 1. disk is expensive
> > 2. someone is keeping valuable files on a non-redundant zpool
> > 3. they can''t scrape enough vdevs to make a redundant zpool (remembering you can build vdevs out of *flat files*)
>
> Given those assumptions, I think that the proposed feature is the perfect solution. Simply put those files in a filesystem that has copies>1.

I don''t think we disagree that multiple copies in ZFS are a good idea, I just think the zpool is the right place to do that. To clarify, I was addressing Celso''s laptop scenario here - especially the idea that you can make a single disk redundant without any risks. (For bigger systems I''d just mirror at the zpool and have done.)

> Also note that using files to back vdevs is not a recommended solution.

Understood. But neither is mirroring on a single disk (which is what is effectively being suggested for laptop users using this solution).

> > If the user wants to make sure the file is ''safer'' than others, he can just make multiple copies. Either to a USB disk/flashdrive, cdrw, dvd, ftp server, whatever.
>
> It seems to me that asking the user to solve this problem by manually making copies of all his files puts all the burden on the user/administrator and is a poor solution.

You''ll be backing up your laptop anyway, won''t you?

> For one, they have to remember to do it pretty often. For two, when they do experience some data loss, they have to manually reconstruct the files! They could have one file which has part of it missing from copy A and part of it missing from copy B. I''d hate to have to reconstruct that manually from two different files, but the proposed solution would do this transparently.

Are you likely to lose parts of both files at the same time, though? I''d say you''re more likely to have one crap file and one good one. And you know which file is crap due to checksumming already.

> > I''m afraid I honestly think this greatly complicates the conceptual model (not to mention the technical implementation) of ZFS, and I haven''t seen a convincing use case.
>
> Just for the record, these changes are pretty trivial to implement; less than 50 lines of code changed.

But they raise a lot of administrative issues (how many copies do I really have? Where are they? Have they all been deleted? If I set this property, how many copies do I have now? How much disk will I get back if I delete fileX? How much disk do I bill zone admin foo for this month? How much disk io are ops on this filesystem likely to cause? How do I dtrace this?)

I appreciate the effort and thought that''s gone into it, not to mention the request for feedback. If I''ve not made that clear, I apologize. I''m just worried that it muddies the waters for everybody. The users (me too!) want mirror-level reliability on their laptops. I don''t think this is the right way to get that feature, that''s all.

-- 
Rasputin :: Jack of All Trades - Master of Nuns
http://number9.hellooperator.net/
Torrey McMahon wrote:> Matthew Ahrens wrote: >> The problem that this feature attempts to address is when you have >> some data that is more important (and thus needs a higher level of >> redundancy) than other data. Of course in some situations you can use >> multiple pools, but that is antithetical to ZFS''s pooled storage >> model. (You have to divide up your storage, you''ll end up with >> stranded storage and bandwidth, etc.) > > Can you expand? I can think of some examples where using multiple pools > - even on the same host - is quite useful given the current feature set > of the product. Or are you only discussing the specific case where a > host would want more reliability for a certain set of data then an > other? If that''s the case I''m still confused as to what failure cases > would still allow you to retrieve your data if there are more then one > copy in the fs or pool.....but I''ll gladly take some enlightenment. :)(My apologies for the length of this response, I''ll try to address most of the issues brought up recently...) When I wrote this proposal, I was only seriously thinking about the case where you want different amounts of redundancy for different data. Perhaps because I failed to make this clear, discussion has concentrated on laptop reliability issues. It is true that there would be some benefit to using multiple copies on a single-disk (eg. laptop) pool, but of course it would not protect against the most common failure mode (whole disk failure). One case where this feature would be useful is if you have a pool with no redundancy (ie. no mirroring or raid-z), because most of the data in the pool is not very important. However, the pool may have a bunch of disks in it (say, four). The administrator/user may realize (perhaps later on) that some of their data really *is* important and they would like some protection against losing it if a disk fails. They may not have the option of adding more disks to mirror all of their data (cost or physical space constraints may apply here). Their problem is solved by creating a new filesystem with copies=2 and putting the important data there. Now, if a disk fails, then the data in the copies=2 filesystem will not be lost. Approximately 1/4 of the data in other filesystems will be lost. (There is a small chance that some tiny fraction of the data in the copies=2 filesystem will still be lost if we were forced to put both copies on the disk that failed.) Another plausible use case would be where you have some level of redundancy, say you have a Thumper (X4500) with its 48 disks configured into 9 5-wide single-parity raid-z groups (with 3 spares). If a single disk fails, there will be no data loss. However, if two disks within the same raid-z group fail, data will be lost. In this scenario, imagine that this data loss probability is acceptable for most of the data stored here, but there is some extremely important data for which this is unacceptable. Rather than reconfiguring the entire pool for higher redundancy (say, double-parity raid-z) and less usable storage, you can simply create a filesystem with copies=2 within the raid-z storage pool. Data within that filesystem will not be lost even if any three disks fail. I believe that these use cases, while not being extremely common, do occur. The extremely low amount of engineering effort required to implement the feature (modulo the space accounting issues) seems justified. 
The fact that this feature does not solve all problems (eg, it is not intended to be a replacement for mirroring) is not a downside; not all features need to be used in all situations :-) The real problem with this proposal is the confusion surrounding disk space accounting with copies>1. While the same issues are present when using compression, people are understandably less upset when files take up less space than expected. Given the current lack of interest in this feature, the effort required to address the space accounting issue does not seem justified at this time. --matt
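To make the two scenarios described above concrete, the setup would look roughly like this (device, pool and filesystem names are hypothetical, and ''copies'' is of course the property being proposed):

    # scenario 1: a non-redundant four-disk pool, with one
    # protected filesystem for the important data
    zpool create tank c0t0d0 c0t1d0 c0t2d0 c0t3d0
    zfs create -o copies=2 tank/important

    # scenario 2: raid-z groups as on a Thumper (only two of the
    # nine 5-wide groups shown), again with one copies=2 filesystem
    zpool create bigtank raidz c1t0d0 c1t1d0 c1t2d0 c1t3d0 c1t4d0 \
                         raidz c2t0d0 c2t1d0 c2t2d0 c2t3d0 c2t4d0
    zfs create -o copies=2 bigtank/critical

In the first pool a single disk failure loses roughly a quarter of the unprotected data but (with the tiny exception noted above) none of tank/important; in the second, bigtank/critical survives any three disks failing, as described above.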
[dang, this thread started on the one week this quarter that I don''t have any spare time... please accept this one comment, more later...]

Mike Gerdts wrote:

> On 9/11/06, Matthew Ahrens <Matthew.Ahrens at sun.com> wrote:
> > B. DESCRIPTION
> >
> > A new property will be added, ''copies'', which specifies how many copies of the given filesystem will be stored. Its value must be 1, 2, or 3. Like other properties (eg. checksum, compression), it only affects newly-written data. As such, it is recommended that the ''copies'' property be set at filesystem-creation time (eg. ''zfs create -o copies=2 pool/fs'').
>
> Is there anything in the works to compress (or encrypt) existing data after the fact? For example, a special option to scrub that causes the data to be re-written with the new properties could potentially do this. If so, this feature should subscribe to any generic framework provided by such an effort.
>
> > This feature is similar to using mirroring, but differs in several important ways:
> >
> > * Mirroring offers slightly better redundancy, because one disk from each mirror can fail without data loss.
>
> Is this use of slightly based upon disk failure modes? That is, when disks fail do they tend to get isolated areas of badness compared to complete loss? I would suggest that complete loss should include someone tripping over the power cord to the external array that houses the disk.

The field data I have says that complete disk failures are the exception. I hate to leave this as a teaser, I''ll expand my comments later.

BTW, this feature will be very welcome on my laptop! I can''t wait :-)

 -- richard
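On the quoted question about applying compression (or, under this proposal, copies) to existing data: there is no rewrite-in-place option; these properties only affect newly written blocks. A rough workaround, sketched here with made-up dataset names and not as an official recommendation, is to copy the data into a dataset that already has the desired property:

    zfs create -o compression=on tank/docs2
    zfs snapshot tank/docs@move
    zfs send tank/docs@move | zfs receive tank/docs2/old
    # ...or simply cp/rsync the files across, then retire the old dataset

Either way the data goes back through the normal write path, so it picks up whatever properties are in effect on the destination.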
David Dyer-Bennet wrote:

> On 9/12/06, eric kustarz <eric.kustarz at sun.com> wrote:
>
> > So it seems to me that having this feature per-file is really useful. Say I have a presentation to give in Pleasanton, and the presentation lives on my single-disk laptop - I want all the meta-data and the actual presentation to be replicated. We already use ditto blocks for the meta-data. Now we could have an extra copy of the actual data. When I get back from the presentation I can turn off the extra copies.
>
> Yes, you could do that.
>
> *I* would make a copy on a CD, which I would carry in a separate case from the laptop.

Do you back up the presentation to CD every time you make an edit?

> I think my presentation is a lot safer than your presentation.

I''m sure both of our presentations would be equally safe, as we would know not to have the only copy (or copies) on our person.

> Similarly for your digital images example; I don''t consider it safe until I have two or more *independent* copies. Two copies on a single hard drive doesn''t come even close to passing the test for me; as many people have pointed out, those tend to fail all at once. And I will also point out that laptops get stolen a lot. And of course all the accidents involving fumble-fingers, OS bugs, and driver bugs won''t be helped by the data duplication either. (Those will mostly be helped by sensible use of snapshots, though, which is another argument for ZFS on *any* disk you work on a lot.)

Well, of course you would have a separate, independent copy if it really mattered.

> The more I look at it, the more I think that a second copy on the same disk doesn''t protect against very much real-world risk. Am I wrong here? Are partial (small) disk corruptions more common than I think? I don''t have a good statistical view of disk failures.

Well, let''s see - my friend accompanied me on a trip and saved her photos daily onto her laptop. Near the end of the trip her hard drive started having problems. The hard drive was not dead, as it was bootable and you could access certain data. Upon returning home she was able to retrieve some of her photos, but not all. She would have been much happier having ZFS + "copies". And yes, you could back up to CD/DVD every night, but it''s a pain and people don''t do it (as much as they should).

Side note: it would have cost hundreds of dollars for data recovery to have just the *possibility* of getting the other photos.

eric
Torrey McMahon wrote:> eric kustarz wrote: > >> Matthew Ahrens wrote: >> >>> Matthew Ahrens wrote: >>> >>>> Here is a proposal for a new ''copies'' property which would allow >>>> different levels of replication for different filesystems. >>> >>> >>> >>> Thanks everyone for your input. >>> >>> The problem that this feature attempts to address is when you have >>> some data that is more important (and thus needs a higher level of >>> redundancy) than other data. Of course in some situations you can >>> use multiple pools, but that is antithetical to ZFS''s pooled storage >>> model. (You have to divide up your storage, you''ll end up with >>> stranded storage and bandwidth, etc.) >>> >>> Given the overwhelming criticism of this feature, I''m going to >>> shelve it for now. >> >> >> >> So it seems to me that having this feature per-file is really >> useful. Say i have a presentation to give in Pleasanton, and the >> presentation lives on my single-disk laptop - I want all the >> meta-data and the actual presentation to be replicated. We already >> use ditto blocks for the meta-data. Now we could have an extra copy >> of the actual data. When i get back from the presentation i can turn >> off the extra copies. > > > Under what failure nodes would your data still be accessible? What > things can go wrong that still allow you to access the data because > some event has removed one copy but left the others? >Silent data corruption of one of the copies.
Matthew Ahrens
2006-Sep-13 06:40 UTC
[zfs-discuss] Re: Re: Re: Proposal: multiple copies of user data
Dick Davies wrote:

> On 13/09/06, Matthew Ahrens <Matthew.Ahrens at sun.com> wrote:
> > Dick Davies wrote:
> > > For the sake of argument, let''s assume:
> > >
> > > 1. disk is expensive
> > > 2. someone is keeping valuable files on a non-redundant zpool
> > > 3. they can''t scrape enough vdevs to make a redundant zpool (remembering you can build vdevs out of *flat files*)
> >
> > Given those assumptions, I think that the proposed feature is the perfect solution. Simply put those files in a filesystem that has copies>1.
>
> I don''t think we disagree that multiple copies in ZFS are a good idea, I just think the zpool is the right place to do that.

Sure, if you want *everything* in your pool to be mirrored, there is no real need for this feature (you could argue that setting up the pool would be easier if you didn''t have to slice up the disk, though).

> > Also note that using files to back vdevs is not a recommended solution.
>
> Understood. But neither is mirroring on a single disk (which is what is effectively being suggested for laptop users using this solution).

It could be recommended in some situations. If you want to protect against disk firmware errors, bit flips, or part of the disk getting scrogged, then mirroring on a single disk (whether via a mirror vdev or copies=2) solves your problem. Admittedly, these problems are probably less common than whole-disk failure, which mirroring on a single disk does not address.

> But they raise a lot of administrative issues

Sure, especially if you choose to change the copies property on an existing filesystem. However, if you only set it at filesystem creation time (which is the recommended way), then it''s pretty easy to address your issues:

> how many copies do I really have?

Whatever you set the ''copies'' property to.

> Where are they?

If you have multiple disks, they are almost certainly on different disks. Some tiny fraction of the blocks may have both their copies on the same disk if enough disks were nearly full.

> Have they all been deleted?

No.

> If I set this property, how many copies do I have now?

Whatever you set the ''copies'' property to.

> How much disk will I get back if I delete fileX?

The space you get back is always the space used by the file, as specified by st_blocks, ls -s, or du. Note that this applies even if you use compression, or change the ''copies'' property after creating the filesystem.

> How much disk do I bill zone admin foo for this month?

That would be up to your policy. You could bill them for the space used, or divide by the number of copies.

> How much disk io are ops on this filesystem likely to cause?

copies * mirror_width * raidz_stripe / (raidz_stripe-1). (This applies even if you change the ''copies'' property after creating the filesystem.)

> How do I dtrace this?

The same way you''d dtrace ditto blocks today. Do you have a specific event in mind that you''d like to trace?

> The users (me too!) want mirror-level reliability on their laptops. I don''t think this is the right way to get that feature, that''s all.

I agree; there is no magic. If you want to survive a drive failure in your laptop, you need two drives in there.

--matt
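To put rough numbers on the I/O multiplier Matt gives above (an illustrative reading of the formula, ignoring metadata and assuming no compression):

    copies=1 on a 2-way mirror:              1 * 2   = 2x the logical write size
    copies=2 on a 2-way mirror:              2 * 2   = 4x
    copies=2 on 5-wide single-parity raid-z: 2 * 5/4 = 2.5x
    copies=3 on 5-wide single-parity raid-z: 3 * 5/4 = 3.75x

Reads normally touch only one copy, so the multiplier mainly matters for writes.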
Dick Davies
2006-Sep-13 07:05 UTC
[zfs-discuss] Re: Re: Re: Proposal: multiple copies of user data
On 13/09/06, Matthew Ahrens <Matthew.Ahrens at sun.com> wrote:

> Dick Davies wrote:
> > But they raise a lot of administrative issues
>
> Sure, especially if you choose to change the copies property on an existing filesystem. However, if you only set it at filesystem creation time (which is the recommended way), then it''s pretty easy to address your issues:

You''re right, that would prevent getting into some nasty messes (I see this as closer to encryption than compression in that respect). I still feel we''d be doing the same job in several places. But I''m sure anyone who cares has a pretty good idea of my opinion, so I''ll shut up now :)

Thanks for taking the time to give feedback on the feedback.

-- 
Rasputin :: Jack of All Trades - Master of Nuns
http://number9.hellooperator.net/
Darren J Moffat
2006-Sep-13 09:42 UTC
[zfs-discuss] Proposal: multiple copies of user data
eric kustarz wrote:

> So it seems to me that having this feature per-file is really useful.

Per-file with a POSIX filesystem is often not that useful. That is because many applications don''t update the file in place (since you mentioned a presentation: StarOffice is one I know does this). Instead they write a temporary file on the same file system in the same directory, then do an unlink(2) and rename(2). So what you really need to say is per-directory, which for ZFS you may as well implement as per-dataset.

-- 
Darren J Moffat
On 9/13/06, Richard Elling <Richard.Elling at sun.com> wrote:
> >> * Mirroring offers slightly better redundancy, because one disk from
> >> each mirror can fail without data loss.
> >
> > Is this use of slightly based upon disk failure modes? That is, when
> > disks fail do they tend to get isolated areas of badness compared to
> > complete loss? I would suggest that complete loss should include
> > someone tripping over the power cord to the external array that houses
> > the disk.
>
> The field data I have says that complete disk failures are the exception.
> I hate to leave this as a teaser, I''ll expand my comments later.
>
> BTW, this feature will be very welcome on my laptop! I can''t wait :-)

On servers and stationary desktops, I just don''t care whether it is a
whole disk failure or a few bad blocks. In that case I have the
resources to mirror, RAID5, perform daily backups, etc.

The laptop disk failures that I have seen have typically been limited
to a few bad blocks. As Torrey McMahon mentioned, they tend to start
out with some warning signs followed by a full failure. I would
*really* like to have that window between warning signs and full
failure as my opportunity to back up my data and replace my
non-redundant hard drive with no data loss.

The only part of the proposal I don''t like is space accounting. Double
or triple charging for data will only confuse those apps and users that
check for free space or block usage. If this is worked out, it would be
a great feature for those times when mirroring just isn''t an option.

Mike

--
Mike Gerdts
http://mgerdts.blogspot.com/
On 9/13/06, Mike Gerdts <mgerdts at gmail.com> wrote:
> The only part of the proposal I don''t like is space accounting.
> Double or triple charging for data will only confuse those apps and
> users that check for free space or block usage.

Why exactly isn''t reporting the free space divided by the "copies"
value on that particular file system an easy solution for this? Did I
miss something?

Tobias
On Tue, 12 Sep 2006, Matthew Ahrens wrote:
> Torrey McMahon wrote:
> > Matthew Ahrens wrote:
> >> The problem that this feature attempts to address is when you have
> >> some data that is more important (and thus needs a higher level of
> >> redundancy) than other data. Of course in some situations you can use
> >> multiple pools, but that is antithetical to ZFS''s pooled storage
> >> model. (You have to divide up your storage, you''ll end up with
> >> stranded storage and bandwidth, etc.)
> >
> > Can you expand? I can think of some examples where using multiple pools
> > - even on the same host - is quite useful given the current feature set
> > of the product. Or are you only discussing the specific case where a
> > host would want more reliability for a certain set of data then an
> > other? If that''s the case I''m still confused as to what failure cases
> > would still allow you to retrieve your data if there are more then one
> > copy in the fs or pool.....but I''ll gladly take some enlightenment. :)
>
> (My apologies for the length of this response, I''ll try to address most
> of the issues brought up recently...)
>
> When I wrote this proposal, I was only seriously thinking about the case
> where you want different amounts of redundancy for different data.
> Perhaps because I failed to make this clear, discussion has concentrated
> on laptop reliability issues. It is true that there would be some
> benefit to using multiple copies on a single-disk (eg. laptop) pool, but
> of course it would not protect against the most common failure mode
> (whole disk failure).

.... lots of Good Stuff elided ....

Soon Samsung will release a 100% flash memory based drive (32Gb) in a
laptop form factor. But flash memory chips have a limited number of
write cycles available, and when exceeded, this usually results in data
corruption. Some people have already encountered this issue with USB
thumb drives. It''s especially annoying if you were using the thumb
drive as what you thought was a 100% _reliable_ backup mechanism.

This is a perfect application for ZFS copies=2. Also, consider that
there is no time penalty for positioning the "heads" on a flash drive.
So now you would have 2 options in a laptop type application with a
single flash based drive:

a) create a mirrored pool using 2 slices - expensive in terms of
   storage utilization

b) create a pool with no redundancy; create a filesystem called
   "importantPresentationData" within that pool with copies=2 (or more).

Matthew - "build it and they will come"!

Regards,

Al Hopper  Logical Approach Inc, Plano, TX.  al at logical-approach.com
           Voice: 972.379.2133 Fax: 972.379.2134  Timezone: US CDT
OpenSolaris.Org Community Advisory Board (CAB) Member - Apr 2005
             OpenSolaris Governing Board (OGB) Member - Feb 2006
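For concreteness, the two options above might look roughly like this. The device and pool names are invented, and option (b) uses the ''copies'' property exactly as proposed in this thread:

    # (a) mirror two slices of the single flash device
    $ zpool create flashpool mirror c1d0s3 c1d0s4

    # (b) one non-redundant pool, with extra copies only where they matter
    $ zpool create flashpool c1d0s2
    $ zfs create flashpool/importantPresentationData
    $ zfs set copies=2 flashpool/importantPresentationData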
Darren J Moffat wrote:
> eric kustarz wrote:
>
>> So it seems to me that having this feature per-file is really useful.
>
> Per-file with a POSIX filesystem is often not that useful. That is
> because many applications (since you mentioned a presentation
> StarOffice I know does this) don''t update the file in place. Instead
> they write a temporary file on the same file system in the same
> directory then do an unlink(2) and rename(2).

That''s too bad, but I guess it''s the best StarOffice can do.

> So that means what you really need to say is per directory, which for
> ZFS you may as well implement as per data set.

I want per pool, per dataset, and per file - where all are done by the
filesystem (ZFS), not the application. I was talking about a further
enhancement to "copies" than what Matt is currently proposing - per
file "copies", but its more work (one thing being we don''t have
administrative control over files per se).

eric
Bill Sommerfeld
2006-Sep-13 15:42 UTC
[zfs-discuss] Proposal: multiple copies of user data
On Wed, 2006-09-13 at 02:30, Richard Elling wrote:
> The field data I have says that complete disk failures are the exception.
> I hate to leave this as a teaser, I''ll expand my comments later.

That matches my anecdotal experience with laptop drives; maybe I''m just
lucky, or maybe I''m just paying more attention than most to the sounds
they start to make when they''re having a bad hair day, but so far
they''ve always given *me* significant advance warning of impending
doom, generally by failing to read a bunch of disk sectors.

That said, I think the best use case for the copies > 1 config would be
in systems with exactly two disks -- which covers most of the 1U boxes
out there.

One question for Matt: when ditto blocks are used with raidz1, how well
does this handle the case where you encounter one or more single-sector
read errors on other drive(s) while reconstructing a failed drive?

For a concrete example:

    A0 B0 C0 D0 P0
    A1 B1 C1 D1 P1

(A0==A1, B0==B1, ...; A^B^C^D==P)

Does the current implementation of raidz + ditto blocks cope with the
case where all of "A", C0, and D1 are unavailable?

					- Bill
eric kustarz wrote:
>
> I want per pool, per dataset, and per file - where all are done by the
> filesystem (ZFS), not the application. I was talking about a further
> enhancement to "copies" than what Matt is currently proposing - per
> file "copies", but its more work (one thing being we don''t have
> administrative control over files per se).

Now if you could do that and make it something that can be set at
install time it would get a lot more interesting. When you install
Solaris to that single laptop drive you can select files or even
directories that have more than one copy in case of a problem down the
road.
Gregory Shaw
2006-Sep-13 17:05 UTC
[zfs-discuss] Re: Re: Proposal: multiple copies of user data
On Sep 12, 2006, at 2:55 PM, Celso wrote:
>> On 12/09/06, Celso <celsouk at gmail.com> wrote:
>>
>>> One of the great things about zfs, is that it protects not just
>>> against mechanical failure, but against silent data corruption.
>>> Having this available to laptop owners seems to me to be important
>>> to making zfs even more attractive.
>>
>> I''m not arguing against that. I was just saying that *if* this was
>> useful to you (and you were happy with the dubious
>> resilience/performance benefits) you can already create mirrors/raidz
>> on a single disk by using partitions as building blocks.
>> There''s no need to implement the proposal to gain that.
>
> It''s not as granular though is it?
>
> In the situation you describe:
>
> ...you split one disk in two. you then have effectively two
> partitions which you can then create a new mirrored zpool with.
> Then everything is mirrored. Correct?
>
> With ditto blocks, you can selectively add copies (seeing as how
> filesystem are so easy to create on zfs). If you are only concerned
> with copies of your important documents and email, why should /usr/
> bin be mirrored.
>
> That''s my opinion anyway. I always enjoy choice, and I really
> believe this is a useful and flexible one.
>
> Celso

One item missed in the discussion is the idea that individual ZFS
filesystems can be created in a pool that will have the duplicate block
behavior. The idea being that only a small subset of your data may be
critical.

This allows additional flexibility in a single disk configuration.
Rather than sacrificing 1/2 of the pool storage, I can say that my
critical documents will reside in a filesystem that will keep two
copies on disk.

I think it''s a great idea. It may not be for everybody, but I think
the ability to treat some of my files as critical is an excellent
feature.

-----
Gregory Shaw, IT Architect
Phone: (303) 673-8273        Fax: (303) 673-2773
ITCTO Group, Sun Microsystems Inc.
500 Eldorado Blvd, UBRM02-401    greg.shaw at sun.com (work)
Broomfield, CO 80021             shaw at fmsoft.com (home)
"When Microsoft writes an application for Linux, I''ve Won." - Linus Torvalds
Torrey McMahon wrote:
> eric kustarz wrote:
>>
>> I want per pool, per dataset, and per file - where all are done by the
>> filesystem (ZFS), not the application. I was talking about a further
>> enhancement to "copies" than what Matt is currently proposing - per
>> file "copies", but its more work (one thing being we don''t have
>> administrative control over files per se).
>
> Now if you could do that and make it something that can be set at
> install time it would get a lot more interesting. When you install
> Solaris to that single laptop drive you can select files or even
> directories that have more than one copy in case of a problem down the
> road.

Actually, this is a perfect use case for setting the copies=2
property after installation. The original binaries are
quite replaceable; the customizations and personal files
created later on are not.

- Bart

--
Bart Smaalders			Solaris Kernel Performance
barts at cyber.eng.sun.com		http://blogs.sun.com/barts
Bart Smaalders wrote:
> Torrey McMahon wrote:
>> eric kustarz wrote:
>>>
>>> I want per pool, per dataset, and per file - where all are done by
>>> the filesystem (ZFS), not the application. I was talking about a
>>> further enhancement to "copies" than what Matt is currently
>>> proposing - per file "copies", but its more work (one thing being we
>>> don''t have administrative control over files per se).
>>
>> Now if you could do that and make it something that can be set at
>> install time it would get a lot more interesting. When you install
>> Solaris to that single laptop drive you can select files or even
>> directories that have more than one copy in case of a problem down
>> the road.
>
> Actually, this is a perfect use case for setting the copies=2
> property after installation. The original binaries are
> quite replaceable; the customizations and personal files
> created later on are not.

We''ve been talking about user data but the chance of corrupting
something on disk and then detecting a bad checksum on something in
/kernel is also possible. (Disk drives do weird things from time to
time.) If I was sufficiently paranoid I would want everything required
to get into single-user mode, some other stuff, and then my user data,
duplicated to avoid any issues.
Bill Sommerfeld wrote:
> One question for Matt: when ditto blocks are used with raidz1, how well
> does this handle the case where you encounter one or more single-sector
> read errors on other drive(s) while reconstructing a failed drive?
>
> For a concrete example:
>
>     A0 B0 C0 D0 P0
>     A1 B1 C1 D1 P1
>
> (A0==A1, B0==B1, ...; A^B^C^D==P)
>
> Does the current implementation of raidz + ditto blocks cope with the
> case where all of "A", C0, and D1 are unavailable?
>
> 					- Bill

I''ll answer for Matt. Yes, ''A'' will be reconstructed from B[0 or 1],
C1, D0, and P[0 or 1].

-Mark
Anton B. Rang
2006-Sep-13 20:37 UTC
[zfs-discuss] Re: Proposal: multiple copies of user data
Is this true for single-sector, vs. single-ZFS-block, errors? (Yes, it''s pathological and probably nobody really cares.) I didn''t see anything in the code which falls back on single-sector reads. (It''s slightly annoying that the interface to the block device drivers loses the SCSI error status, which tells you the first sector which was bad.) This message posted from opensolaris.org
Wee Yeh Tan
2006-Sep-14 01:37 UTC
[zfs-discuss] Re: Re: Re: Proposal: multiple copies of user data
On 9/13/06, Matthew Ahrens <Matthew.Ahrens at sun.com> wrote:
> Sure, if you want *everything* in your pool to be mirrored, there is no
> real need for this feature (you could argue that setting up the pool
> would be easier if you didn''t have to slice up the disk though).

Not necessarily. Implementing this on the FS level will still allow
the administrator to turn on copies on the entire pool, since the pool
is technically also a FS and the property is inherited by child FS''s.
Of course, this will allow the admin to turn off copies to the FS
containing junk.

> It could be recommended in some situations. If you want to protect
> against disk firmware errors, bit flips, part of the disk getting
> scrogged, then mirroring on a single disk (whether via a mirror vdev or
> copies=2) solves your problem. Admittedly, these problems are probably
> less common than whole-disk failure, which mirroring on a single disk
> does not address.

I beg to differ from experience that the above errors are more common
than whole disk failures. It''s just that we do not notice the disks
are developing problems but panic when they finally fail completely.
That''s what happens to most of my disks anyway.

Disks are much smarter nowadays with hiding bad sectors but it doesn''t
mean that there are none. If your precious data happens to sit on one,
you''ll be crying for copies.

--
Just me,
Wire ...
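A sketch of the inheritance being described here, again using the proposed property and invented pool and dataset names:

    # set the policy on the top-level dataset of the pool...
    $ zfs set copies=2 tank

    # ...every child filesystem inherits it unless it is overridden
    $ zfs set copies=1 tank/junk
    $ zfs get -r copies tank    # tank and most children report 2, tank/junk reports 1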
Matthew Ahrens wrote:
> Out of curiosity, what would you guys think about addressing this same
> problem by having the option to store some filesystems unreplicated on
> an mirrored (or raid-z) pool? This would have the same issues of
> unexpected space usage, but since it would be *less* than expected, that
> might be more acceptable. There are no plans to implement anything like
> this right now, but I just wanted to get a read on it.

+1, especially in a two disk (mirrored) configuration.

Currently I use two ZFS pools: one mirrored and the other unmirrored,
spread over two disks (each disk partitioned with SVM). And I''m
constantly fighting the fill-up of one pool while the other is empty.
My current setup has the same space-balance problem as a traditional
two *static* partition setup.

--
Jesus Cea Avion                         jcea at argo.es
http://www.argo.es/~jcea/               jabber / xmpp:jcea at jabber.org
Neil A. Wilson wrote:
> This is unfortunate. As a laptop user with only a single drive, I was
> looking forward to it since I''ve been bitten in the past by data loss
> caused by a bad area on the disk. I don''t care about the space
> consumption because I generally don''t come anywhere close to filling up
> the available space. It may not be the primary market for ZFS, but it
> could be a very useful side benefit.

I feel your pain.

Although your harddrive will suffer by the extra seeks, I would suggest
you partition your HD in two slices and build a two-way ZFS mirror
between them.

If space is an issue, you can use N partitions to build a raid-z, but
your performance will suffer a lot because any data read would require
N seeks.

--
Jesus Cea Avion                         jcea at argo.es
http://www.argo.es/~jcea/               jabber / xmpp:jcea at jabber.org
can you guess?
2006-Sep-15 08:23 UTC
[zfs-discuss] Re: Re: Re: Re: Proposal: multiple copies of user data
> On 9/13/06, Matthew Ahrens <Matthew.Ahrens at sun.com> > wrote: > > Sure, if you want *everything* in your pool to be > mirrored, there is no > > real need for this feature (you could argue that > setting up the pool > > would be easier if you didn''t have to slice up the > disk though). > > Not necessarily. Implementing this on the FS level > will still allow > the administrator to turn on copies on the entire > pool if since the > pool is technically also a FS and the property is > inherited by child > FS''s. Of course, this will allow the admin to turn > off copies to the > FS containing junk.Implementing it at the directory and file levels would be even more flexible: redundancy strategy would no longer be tightly tied to path location, but directories and files could themselves still inherit defaults from the filesystem and pool when appropriate (but could be individually handled when desirable). I''ve never understood why redundancy was a pool characteristic in ZFS - and the addition of ''ditto blocks'' and now this new proposal (both of which introduce completely new forms of redundancy to compensate for the fact that pool-level redundancy doesn''t satisfy some needs) just makes me more skeptical about it. (Not that I intend in any way to minimize the effort it might take to change that decision now.)> > > It could be recommended in some situations. If you > want to protect > > against disk firmware errors, bit flips, part of > the disk getting > > scrogged, then mirroring on a single disk (whether > via a mirror vdev or > > copies=2) solves your problem. Admittedly, these > problems are probably > > less common that whole-disk failure, which > mirroring on a single disk > > does not address. > > I beg to differ from experience that the above errors > are more common > than whole disk failures. It''s just that we do not > notice the disks > are developing problems but panic when they finally > fail completely.It would be interesting to know whether that would still be your experience in environments that regularly scrub active data as ZFS does (assuming that said experience was accumulated in environments that don''t). The theory behind scrubbing is that all data areas will be hit often enough that they won''t have time to deteriorate (gradually) to the point where they can''t be read at all, and early deterioration encountered during the scrub pass (or other access) in which they have only begun to become difficult to read will result in immediate revectoring (by the disk or, if not, by the file system) to healthier locations. Since ZFS-style scrubbing detects even otherwise-indetectible ''silent corruption'' missed by the disk''s own ECC mechanisms, that lower-probability event is also covered (though my impression is that the probability of even a single such sector may be significantly lower than that of whole-disk failure, especially in laptop environments). All that being said, keeping multiple copies on a single disk of most metadata (the loss of which could lead to wide-spread data loss) definitely makes sense (especially given its typically negligible size), and it probably makes sense for some files as well. - bill This message posted from opensolaris.org
Bill Moore
2006-Sep-15 16:25 UTC
[zfs-discuss] Re: Re: Re: Re: Proposal: multiple copies of user data
On Fri, Sep 15, 2006 at 01:23:31AM -0700, can you guess? wrote:> Implementing it at the directory and file levels would be even more > flexible: redundancy strategy would no longer be tightly tied to path > location, but directories and files could themselves still inherit > defaults from the filesystem and pool when appropriate (but could be > individually handled when desirable).The problem boils down to not having a way to express your intent that works over NFS (where you''re basically limited by POSIX) that you can use from any platform (esp. ones where ZFS isn''t installed). If you have some ideas, this is something we''d love to hear about.> I''ve never understood why redundancy was a pool characteristic in ZFS > - and the addition of ''ditto blocks'' and now this new proposal (both > of which introduce completely new forms of redundancy to compensate > for the fact that pool-level redundancy doesn''t satisfy some needs) > just makes me more skeptical about it.We have thought long and hard about this problem and even know how to implement it (the name we''ve been using is Metaslab Grids, which isn''t terribly descriptive, or as Matt put it "a bag o'' disks"). There are two main problems with it, though. One is failures. The problem is that you want the set of disks implementing redundancy (mirror, RAID-Z, etc.) to be spread across fault domains (controller, cable, fans, power supplies, geographic sites) as much as possible. There is no generic mechanism to obtain this information and act upon it. We could ask the administrator to supply it somehow, but such a description takes effort, is not easy, and prone to error. That''s why we have the model right now where the administrator specifies how they want the disks spread out across fault groups (vdevs). The second problem comes back to accounting. If you can specify, on a per-file or per-directory basis, what kind of replication you want, how do you answer the statvfs() question? I think the recent "discussions" on this list illustrate the complexity and passion on both sides of the argument.> (Not that I intend in any way to minimize the effort it might take to > change that decision now.)The effort is not actually that great. All the hard problems we needed to solve in order to implement this were basically solved when we did the RAID-Z code. As a matter of fact, you can see it in the on-disk specification as well. In the DVA, you''ll notice an 8-bit field labeled "GRID". These are the bits that would describe, on a per-block basis, what kind of redundancy we used. --Bill
(I looked at my email before checking here, so I''ll just cut-and-paste the email response in here rather than send it. By the way, is there a way to view just the responses that have accumulated in this forum since I last visited - or just those I''ve never looked at before?) Bill Moore wrote:> On Fri, Sep 15, 2006 at 01:23:31AM -0700, can you guess? wrote: >> Implementing it at the directory and file levels would be even more >> flexible: redundancy strategy would no longer be tightly tied to path >> location, but directories and files could themselves still inherit >> defaults from the filesystem and pool when appropriate (but could be >> individually handled when desirable). > > The problem boils down to not having a way to express your intent that > works over NFS (where you''re basically limited by POSIX) that you can > use from any platform (esp. ones where ZFS isn''t installed). If you > have some ideas, this is something we''d love to hear about.Well, one idea is that it seems downright silly to gate ZFS facilities on the basis of two-decade-old network file access technology: sure, it''s important to be able to *access* ZFS files using NFS, but does anyone really care if NFS can''t express the full range of ZFS features - at least to the degree that they think such features should be suppressed as a result (rather than made available to local users plus any remote users employing a possibly future mechanism that *can* support them)? That being said, you could always adopt the ReiserFS approach of allowing access to file/directory metadata via extended path specifications in environments like NFS where richer forms of interaction aren''t available: yes, it may feel a bit kludgey, but it gets the job done. And, of course, even if you did nothing to help NFS its users would still benefit from inheriting whatever arbitrarily fine-grained redundancy levels had been established via more comprehensive means: they just wouldn''t be able to tweak redundancy levels themselves (any more, or any less, than they can do so today).> >> I''ve never understood why redundancy was a pool characteristic in ZFS >> - and the addition of ''ditto blocks'' and now this new proposal (both >> of which introduce completely new forms of redundancy to compensate >> for the fact that pool-level redundancy doesn''t satisfy some needs) >> just makes me more skeptical about it. > > We have thought long and hard about this problem and even know how to > implement it (the name we''ve been using is Metaslab Grids, which isn''t > terribly descriptive, or as Matt put it "a bag o'' disks").Yes, ''a bag o'' disks'' - used intelligently at a higher level - is pretty much what I had in mind. There are> two main problems with it, though. One is failures. The problem is > that you want the set of disks implementing redundancy (mirror, RAID-Z, > etc.) to be spread across fault domains (controller, cable, fans, power > supplies, geographic sites) as much as possible. There is no generic > mechanism to obtain this information and act upon it. We could ask the > administrator to supply it somehow, but such a description takes effort, > is not easy, and prone to error. That''s why we have the model right now > where the administrator specifies how they want the disks spread out > across fault groups (vdevs).Without having looked at the code I may be missing something here. 
Even with your current implementation, if there''s indeed no automated way to obtain such information the administrator has to exercise manual control over disk groupings if they''re going to attain higher availability by avoiding other single points of failure instead of just guard against unrecoverable data loss from disk failure. Once that information has been made available to the system, letting it make use of it at a higher level rather than just aggregating entire physical disks should not entail additional administrator effort. I admit that I haven''t considered the problem in great detail, since my bias is toward solutions that employ redundant arrays of inexpensive nodes to scale up rather than a small number of very large nodes (in part because a single large node itself can often be a single point of failure even if many of its subsystems carefully avoid being so in the manner that you suggest). Each such small node has a relatively low disk count and little or no internal redundancy, and thus comprises its own little fault-containment environment, avoiding most such issues; as a plus, such node sizes mesh well with the bandwidth available from very inexpensive Gigabit Ethernet interconnects and switches (even when streaming data sequentially, such as video on demand) and allow fine-grained incremental system scaling (by the time faster interconnects become inexpensive, disk bandwidth should have increased enough that such a balance will still be fairly good). Still, if you can group whole disks intelligently in a large system with respect to supplementing simple redundancy with higher overall subsystem availability, then you ought to be able to use exactly the same information to allow higher-level decisions about where to place redundant data at other than whole-disk granularity.> > The second problem comes back to accounting. If you can specify, on a > per-file or per-directory basis, what kind of replication you want, how > do you answer the statvfs() question? I think the recent "discussions" > on this list illustrate the complexity and passion on both sides of the > argument.I rather liked the idea of using the filesystem *default* redundancy level as the basis for providing free space information, though in environments where different users were set up with different defaults using the per-user default might make sense (then, only if that was manually changed, presumably by that user, would less obvious things happen). Overall, I think perhaps free space should be reported on the basis of things that the user does *not* have control over, such as the default flavor of redundancy established by an administrator (i.e., as the number of bytes the user could write using that default flavor - which is what I was starting to converge on just above). Then the user will mostly see only discrepancies caused by changes in that default that s/he has made, and should be able to understand them (well, if the user has personal ''temp'' space the admin might have special-cased that for them by making it non-redundant, I suppose). Then again, whenever one traverses a mount point today (not always all that obvious a transition) the whole world of free space (and I''d expect quota) changes anyway, and users don''t seem to find that an insurmountable obstacle. 
So I find it difficult to see free-space reporting as being any real show-stopper in this area regardless of how it''s done (though like most people who contributed to that topic I think I have a preference).> >> (Not that I intend in any way to minimize the effort it might take to >> change that decision now.) > > The effort is not actually that great. All the hard problems we needed > to solve in order to implement this were basically solved when we did > the RAID-Z code. As a matter of fact, you can see it in the on-disk > specification as well. In the DVA, you''ll notice an 8-bit field labeled > "GRID". These are the bits that would describe, on a per-block basis, > what kind of redundancy we used.The only reason I can think of for establishing that per block (rather than per object) would be if you kept per-block access-rate information around so that you could distribute really hot blocks more widely. And given that such blocks would normally be in cache anyway, that only seems to make sense in a distributed environment (where you''re trying to spread the load over multiple nodes more because of interconnect bandwidth limitations than disk bandwidth limitations - though even here you could do this at the cache level rather than the on-disk level based on dynamic needs). - bill This message posted from opensolaris.org
On September 15, 2006 3:49:14 PM -0700 "can you guess?" <billtodd at metrocast.net> wrote:> (I looked at my email before checking here, so I''ll just cut-and-paste > the email response in here rather than send it. By the way, is there a > way to view just the responses that have accumulated in this forum since > I last visited - or just those I''ve never looked at before?)subscribe via email instead of reading it as a forum
>By the way, is there a way to view just the responses that have accumulated in this forum since I >last visited - or just those I''ve never looked at before?Not through the web interface itself, as far as I can tell, but there''s an RSS feed of messages that might do the trick. Unfortunately it points to the whole thread, rather than the individual messages. http://opensolaris.org/jive/rss/rssmessages.jspa?forumID=80> it seems downright silly to gate ZFS facilities > on the basis of two-decade-old network file access technology: sure, > it''s important to be able to *access* ZFS files using NFS, but does > anyone really care if NFS can''t express the full range of ZFS features -Personally, I don''t think it''s critical. After all, you can''t create a snapshot via NFS either, but we have snapshots. Concepts such as administrative ID, inherited directory characteristics, etc. have had great success in file systems such as IBM''s GPFS and Sun''s QFS, as well as on NetApp''s systems. For that matter, quotas aren''t really in NFSv3, but nobody seems to mind that UFS implements them. Anton This message posted from opensolaris.org
Wee Yeh Tan
2006-Sep-16 07:03 UTC
[zfs-discuss] Re: Re: Re: Re: Proposal: multiple copies of user data
On 9/15/06, can you guess? <billtodd at metrocast.net> wrote:
> Implementing it at the directory and file levels would be even more
> flexible: redundancy strategy would no longer be tightly tied to path
> location, but directories and files could themselves still inherit
> defaults from the filesystem and pool when appropriate (but could be
> individually handled when desirable).

Ideally so. FS (or dataset) level is sufficiently fine grained for my
use. If I take the trouble to specify copies for a directory, I really
do not mind the trouble of creating a new dataset for it at the same
time. File-level, however, is really pushing it. You might end up with
an administrative nightmare deciphering which files have how many
copies. I just do not see it being useful to my environment.

> It would be interesting to know whether that would still be your
> experience in environments that regularly scrub active data as ZFS
> does (assuming that said experience was accumulated in environments
> that don''t). The theory behind scrubbing is that all data areas will
> be hit often enough that they won''t have time to deteriorate
> (gradually) to the point where they can''t be read at all, and early
> deterioration encountered during the scrub pass (or other access) in
> which they have only begun to become difficult to read will result in
> immediate revectoring (by the disk or, if not, by the file system) to
> healthier locations.

Scrubbing exercises the disk area to prevent bit-rot. I do not think
ZFS''s scrubbing changes the failure mode of the raw devices. OTOH, I
really have no such experience to speak of *fingers crossed*. I failed
to locate the code where the relocation of files happens but assume
that copies would make this process more reliable.

> Since ZFS-style scrubbing detects even otherwise-indetectible ''silent
> corruption'' missed by the disk''s own ECC mechanisms, that
> lower-probability event is also covered (though my impression is that
> the probability of even a single such sector may be significantly
> lower than that of whole-disk failure, especially in laptop
> environments).

I do not have any data to support nor dismiss that. Matt was right
that probability of failure modes is a huge can of worms that can drag
on forever.

--
Just me,
Wire ...
> On 9/15/06, can you guess? <billtodd at metrocast.net> > wrote:... file-level, however, is really pushing> it. You might end > up with an administrative nightmare deciphering which > files have how > many copies.\I''m not sure what you mean: the level of redundancy would be a per-file attribute that could be examined, and would be normally just be defaulted to a common value. ...> > It would be interesting to know whether that would > still be your experience in environments that > regularly scrub active data as ZFS does (assuming > that said experience was accumulated in environments > that don''t). The theory behind scrubbing is that all > data areas will be hit often enough that they won''t > have time to deteriorate (gradually) to the point > where they can''t be read at all, and early > deterioration encountered during the scrub pass (or > other access) in which they have only begun to become > difficult to read will result in immediate > revectoring (by the disk or, if not, by the file > system) to healthier locations. > > Scrubbing exercises the disk area to prevent bit-rot. > I do not think > FS''s scrubbing changes the failure mode of the raw > devices.It doesn''t change the failure rate (if anything, it might accelerate it marginally due to the extra disk activity), but it *does* change, potentially radically, the frequency with which sectors containing user data become unreadable - because it allows them to be detected *before* that happens such that the data can be moved to a good sector (often by the disk itself, else by higher-level software) and the failing sector marked bad. OTOH, I> really have no such experience to speak of *fingers > crossed*. I > failed to locate the code where the relocation of > files happens but > assume that copies would make this process more > reliable.Sort of: while they don''t make any difference when you catch a failing sector while it''s still readable, they certainly help if you only catch it after it''s become unreadable (or has been ''silently'' corrupted).> > > Since ZFS-style scrubbing detects even > otherwise-indetectible ''silent corruption'' missed by > the disk''s own ECC mechanisms, that lower-probability > event is also covered (though my impression is that > the probability of even a single such sector may be > significantly lower than that of whole-disk failure, > especially in laptop environments). > > I do not any data to support nor dismiss that.Quite a few years ago Seagate still published such data, but of course I didn''t copy it down (because it was ''always available'' when I wanted it - as I said, it was quite a while ago and I was not nearly as well-acquainted with the volatility of Internet data as I would subsequently become). But to the best of my recollection their enterprise disks at that time were specced to have no worse than 1 uncorrectable error for every petabit read and no worse than 1 undetected error for every exabit read. A fairly recent paper by people who still have access to such data suggests that the frequency of uncorrectable errors in enterprise drives is still about the same, but that the frequency of undetected errors may have increased markedly (to perhaps once in every 10 petabits read) - possibly a result of ever-increasing on-disk bit densities and the more aggressive error correction required to handle them (perhaps this is part of the reason they don''t make error rates public any more...). 
They claim that SATA drives have error rates around 10x that of enterprise drives (or an undetected error rate of around once per petabit). Figure out a laptop drive''s average data rate and that gives you a mean time to encountering undetected corruption. Compare that to the drive''s in-use MTBF rating and there you go! If I haven''t dropped a decimal place or three doing this in my head, then even if laptop drives have nominal MTBFs equal to desktop SATA drives it looks as if it would take an average data rate of 60 - 70 KB/sec (24/7, year-in, year-out) for the likelihood of an undetected error to be comparable in likelihood to a whole-disk failure: that''s certainly nothing much for a fairly well-loaded server in constant (or even just 40 hour/week) use, but for a laptop?. - bill This message posted from opensolaris.org
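The closing arithmetic roughly checks out. A back-of-envelope version, assuming one undetected error per petabit (10^15 bits) read and a nominal 500,000-hour drive MTBF (the MTBF figure is an assumption added here, not a number taken from the post):

    # bytes read, on average, per undetected error, divided by the MTBF
    # in seconds, gives the sustained data rate at which the two failure
    # modes become roughly equally likely
    $ echo 'scale=0; (10^15 / 8) / (500000 * 3600)' | bc
    69444
    # about 69 KB/sec, in line with the 60-70 KB/sec figure above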
Richard Elling - PAE
2006-Sep-19 00:32 UTC
[zfs-discuss] Proposal: multiple copies of user data
[apologies for being away from my data last week]

David Dyer-Bennet wrote:
> The more I look at it the more I think that a second copy on the same
> disk doesn''t protect against very much real-world risk. Am I wrong
> here? Are partial (small) disk corruptions more common than I think?
> I don''t have a good statistical view of disk failures.

This question was asked many times in this thread. IMHO, it is the
single biggest reason we should implement ditto blocks for data.

We did a study of disk failures in an enterprise RAID array a few
years ago. One failure mode stands heads and shoulders above the
others: non-recoverable reads. A short summary:

    2,919          total errors reported
    1,926 (66.0%)  operations succeeded (eg. write failed, auto reallocated)
      961 (32.9%)  unrecovered errors (of all types)
       32 ( 1.1%)  other (eg. device not ready)
      707 (24.2%)  non-recoverable reads

In other words, non-recoverable reads represent 73.6% of the non-
recoverable failures that occur, including complete drive failures.
Boo! Did that scare you? Halloween is next month! :-) Seagate said
today that in a few years 3.5" disks will store 2.5 TBytes. Boo!

While I don''t have data on laptop disk failures, I would not be
surprised to see a similar distribution, though with a larger
mechanical damage count. My laptops run hotter inside than my other
systems and, as a rule of thumb, your disk failure rate increases by 2x
for every 15C change in temperature. Is your laptop disk hot?

The case for ditto data is clear to me. Many people are using
single-disk systems, and many more people would really like to use
single-disk systems but they really can''t.

Beyond spinning rust systems, there are other forms of non-volatile
storage which would apply here. For example, those people who suggested
that you should back up your presentation to a CD fail to note that a
speck of dust on the CD could lead you to lose one block of data. In my
CD/DVD experience, such losses are blissfully ignored by the system and
you may blame the resulting crash on the cheap hardware you bought from
your brother-in-law. Beyond CDs, I can see this as being a nice
enhancement to limited endurance devices such as flash.

While it is true that I could slice my disk up into multiple vdevs and
mirror them, I''d much rather set a policy at a finer granularity: my
files are more important than most of the other, mostly read-only and
easily reconstructed, files on my system.

When ditto blocks for metadata was introduced, I took a look at the
code and was pleasantly surprised. The code does an admirable job of
ensuring spatial diversity in the face of multiple policies, even in
the single disk case. IMHO, this is the right way to implement this
and allows you to mix policies with ease.

As a RAS guy, I''m biased to not wanting to lose data via easy-to-use
interfaces. I don''t see how this feature has any downside, but lots of
upside.
 -- richard
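For anyone wondering how the 24.2% line squares with the 73.6% claim: the 73.6% figure is non-recoverable reads as a share of the 961 unrecovered errors, not of all 2,919 reports. A quick check:

    $ echo 'scale=4; 707 / 961' | bc
    .7356
    # i.e. roughly 73.6% of the unrecovered errors were non-recoverable reads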
David Dyer-Bennet
2006-Sep-19 02:42 UTC
[zfs-discuss] Proposal: multiple copies of user data
On 9/18/06, Richard Elling - PAE <Richard.Elling at sun.com> wrote:> [appologies for being away from my data last week] > > David Dyer-Bennet wrote: > > The more I look at it the more I think that a second copy on the same > > disk doesn''t protect against very much real-world risk. Am I wrong > > here? Are partial(small) disk corruptions more common than I think? > > I don''t have a good statistical view of disk failures. > > This question was asked many times in this thread. IMHO, it is the > single biggest reason we should implement ditto blocks for data. > > We did a study of disk failures in an enterprise RAID array a few > years ago. One failure mode stands heads and shoulders above the > others: non-recoverable reads. A short summary: > > 2,919 total errors reported > 1,926 (66.0%) operations succeeded (eg. write failed, auto reallocated) > 961 (32.9%) unrecovered errors (of all types) > 32 (1.1%) other (eg. device not ready) > 707 (24.2%) non-recoverable reads > > In other words, non-recoverable reads represent 73.6% of the non- > recoverable failures that occur, including complete drive failures.I don''t see anything addressing complete drive failures vs. block failures here anywhere. Is there some way to read something about that out of this data? I''m thinking the "operations succeeded" also occurs read errors recovered by retries and such, as well as the write failure cited as an example? I guess I can conclude that the 66% for errors successfully recovered means that a lot of errors are not, in fact, entire-drive failures. So that''s good (for ditto-data). So a maximum of 34% are whole-drive failures (and in reality I''m sure far lower). Anyway, facts on actual failures in the real world are *definitely* the useful way to conduct this discussion! [snip]> While it is true that I could slice my disk up into multiple vdevs and > mirror them, I''d much rather set a policy at a finer grainularity: my > files are more important than most of the other, mostly read-only and > easily reconstructed, files on my system.I definitely like the idea of setting policy at a finer granularity; I really want it to be at the file level, even per-directory doesn''t fit reality very well in my view.> When ditto blocks for metadata was introduced, I took a look at the > code and was pleasantly suprised. The code does an admirable job of > ensuring spatial diversity in the face of multiple policies, even in > the single disk case. IMHO, this is the right way to implement this > and allows you to mix policies with ease.That''s very good to hear. -- David Dyer-Bennet, <mailto:dd-b at dd-b.net>, <http://www.dd-b.net/dd-b/> RKBA: <http://www.dd-b.net/carry/> Pics: <http://www.dd-b.net/dd-b/SnapshotAlbum/> Dragaera/Steven Brust: <http://dragaera.info/>
Richard Elling - PAE
2006-Sep-19 03:16 UTC
[zfs-discuss] Proposal: multiple copies of user data
more below... David Dyer-Bennet wrote:> On 9/18/06, Richard Elling - PAE <Richard.Elling at sun.com> wrote: >> [appologies for being away from my data last week] >> >> David Dyer-Bennet wrote: >> > The more I look at it the more I think that a second copy on the same >> > disk doesn''t protect against very much real-world risk. Am I wrong >> > here? Are partial(small) disk corruptions more common than I think? >> > I don''t have a good statistical view of disk failures. >> >> This question was asked many times in this thread. IMHO, it is the >> single biggest reason we should implement ditto blocks for data. >> >> We did a study of disk failures in an enterprise RAID array a few >> years ago. One failure mode stands heads and shoulders above the >> others: non-recoverable reads. A short summary: >> >> 2,919 total errors reported >> 1,926 (66.0%) operations succeeded (eg. write failed, auto >> reallocated) >> 961 (32.9%) unrecovered errors (of all types) >> 32 (1.1%) other (eg. device not ready) >> 707 (24.2%) non-recoverable reads >> >> In other words, non-recoverable reads represent 73.6% of the non- >> recoverable failures that occur, including complete drive failures. > > I don''t see anything addressing complete drive failures vs. block > failures here anywhere. Is there some way to read something about > that out of this data?Complete failures are a non-zero category, but there is more than one error code which would result in the recommendation to replace the drive. Their counts are included in the 961-707=254 (26.4%) of other non- recoverable errors. In some cases a non-recoverable error can be corrected by a retry, and those also fall into the 26.4% bucket. Interestingly, the operation may succeed and yet we will get an error which recommends replacing the drive. For example, if the failure prediction threshold is exceeded. You might also want to replace the drive when there are no spare defect sectors available. Life would be easier if they really did simply die.> I''m thinking the "operations succeeded" also occurs read errors > recovered by retries and such, as well as the write failure cited as > an example?Yes.> I guess I can conclude that the 66% for errors successfully recovered > means that a lot of errors are not, in fact, entire-drive failures. > So that''s good (for ditto-data). So a maximum of 34% are whole-drive > failures (and in reality I''m sure far lower).I agree.> Anyway, facts on actual failures in the real world are *definitely* > the useful way to conduct this discussion! > > [snip] > >> While it is true that I could slice my disk up into multiple vdevs and >> mirror them, I''d much rather set a policy at a finer grainularity: my >> files are more important than most of the other, mostly read-only and >> easily reconstructed, files on my system. > > I definitely like the idea of setting policy at a finer granularity; I > really want it to be at the file level, even per-directory doesn''t fit > reality very well in my view. > >> When ditto blocks for metadata was introduced, I took a look at the >> code and was pleasantly suprised. The code does an admirable job of >> ensuring spatial diversity in the face of multiple policies, even in >> the single disk case. IMHO, this is the right way to implement this >> and allows you to mix policies with ease. > > That''s very good to hear.-- richard
David Dyer-Bennet
2006-Sep-19 03:29 UTC
[zfs-discuss] Proposal: multiple copies of user data
On 9/18/06, Richard Elling - PAE <Richard.Elling at sun.com> wrote:> Interestingly, the operation may succeed and yet we will get an error > which recommends replacing the drive. For example, if the failure > prediction threshold is exceeded. You might also want to replace the > drive when there are no spare defect sectors available. Life would be > easier if they really did simply die.For one thing, people wouldn''t be interested in doing ditto-block data! So, with ditto-block data, you survive any single-block failure, and "most" double-block failures, etc. What it doesn''t lend itself to is simple computation of simple answers :-). In theory, and with an infinite budget, I''d approach this analagously to cpu architecture design based on large volumes of instruction trace data. If I had a large volume of disk operation traces with the hardware failures indicated, I could run this against the ZFS simulator and see what strategies produced the most robust single-disk results. -- David Dyer-Bennet, <mailto:dd-b at dd-b.net>, <http://www.dd-b.net/dd-b/> RKBA: <http://www.dd-b.net/carry/> Pics: <http://www.dd-b.net/dd-b/SnapshotAlbum/> Dragaera/Steven Brust: <http://dragaera.info/>
Richard Elling - PAE
2006-Sep-19 18:07 UTC
[zfs-discuss] Proposal: multiple copies of user data
[pardon the digression] David Dyer-Bennet wrote:> On 9/18/06, Richard Elling - PAE <Richard.Elling at sun.com> wrote: > >> Interestingly, the operation may succeed and yet we will get an error >> which recommends replacing the drive. For example, if the failure >> prediction threshold is exceeded. You might also want to replace the >> drive when there are no spare defect sectors available. Life would be >> easier if they really did simply die. > > For one thing, people wouldn''t be interested in doing ditto-block data! > > So, with ditto-block data, you survive any single-block failure, and > "most" double-block failures, etc. What it doesn''t lend itself to is > simple computation of simple answers :-). > > In theory, and with an infinite budget, I''d approach this analagously > to cpu architecture design based on large volumes of instruction trace > data. If I had a large volume of disk operation traces with the > hardware failures indicated, I could run this against the ZFS > simulator and see what strategies produced the most robust single-disk > results.There is a significant difference. The functionality of logic part is deterministic and discrete. The wear-out rate of a mechanical device is continuous and probabilistic. In the middle are discrete events with probabilities associated with them, but they are handled separately. In other words, we can use probability and statistics tools to analyze data loss in disk drives. This will be much faster and less expensive than running a bunch of traces. In fact, there has already been much written about disk drives, their failure modes, and factors which contribute to their failure rates. We use such data to predict the probability of events such as non-recoverable reads (which is often specified in the data sheet). -- richard
David Dyer-Bennet
2006-Sep-19 19:32 UTC
[zfs-discuss] Proposal: multiple copies of user data
On 9/19/06, Richard Elling - PAE <Richard.Elling at sun.com> wrote:> [pardon the digression] > > David Dyer-Bennet wrote: > > On 9/18/06, Richard Elling - PAE <Richard.Elling at sun.com> wrote: > > > >> Interestingly, the operation may succeed and yet we will get an error > >> which recommends replacing the drive. For example, if the failure > >> prediction threshold is exceeded. You might also want to replace the > >> drive when there are no spare defect sectors available. Life would be > >> easier if they really did simply die. > > > > For one thing, people wouldn''t be interested in doing ditto-block data! > > > > So, with ditto-block data, you survive any single-block failure, and > > "most" double-block failures, etc. What it doesn''t lend itself to is > > simple computation of simple answers :-). > > > > In theory, and with an infinite budget, I''d approach this analagously > > to cpu architecture design based on large volumes of instruction trace > > data. If I had a large volume of disk operation traces with the > > hardware failures indicated, I could run this against the ZFS > > simulator and see what strategies produced the most robust single-disk > > results. > > There is a significant difference. The functionality of logic part is > deterministic and discrete. The wear-out rate of a mechanical device > is continuous and probabilistic. In the middle are discrete events > with probabilities associated with them, but they are handled separately. > In other words, we can use probability and statistics tools to analyze > data loss in disk drives. This will be much faster and less expensive > than running a bunch of traces. In fact, there has already been much > written about disk drives, their failure modes, and factors which > contribute to their failure rates. We use such data to predict the > probability of events such as non-recoverable reads (which is often > specified in the data sheet).Oh, I know there''s a difference. It''s not as big as it looks, though, if you remember that the instruction or disk operation traces are just *representative* of the workload, not the actual workload that has to run. So, yes, disk failures are certainly non-deterministic, but the actual instruction stream run by customers isn''t the same one designed against, either. In both cases the design has to take the trace as a general guideline for types of things that will happen, rather than as a strict workload to optimize for. -- David Dyer-Bennet, <mailto:dd-b at dd-b.net>, <http://www.dd-b.net/dd-b/> RKBA: <http://www.dd-b.net/carry/> Pics: <http://www.dd-b.net/dd-b/SnapshotAlbum/> Dragaera/Steven Brust: <http://dragaera.info/>
Richard Elling - PAE wrote:
>
> This question was asked many times in this thread. IMHO, it is the
> single biggest reason we should implement ditto blocks for data.
>
> We did a study of disk failures in an enterprise RAID array a few
> years ago. One failure mode stands heads and shoulders above the
> others: non-recoverable reads. A short summary:
>
>     2,919          total errors reported
>     1,926 (66.0%)  operations succeeded (eg. write failed, auto reallocated)
>       961 (32.9%)  unrecovered errors (of all types)
>        32 ( 1.1%)  other (eg. device not ready)
>       707 (24.2%)  non-recoverable reads
>
> In other words, non-recoverable reads represent 73.6% of the non-
> recoverable failures that occur, including complete drive failures.

Does this take cascading failures into account? How often do you get an
unrecoverable read and yet are still able to perform operation on the
target media? Thats where ditto blocks could come in handy modulo the
concerns around utilities and quotas.
Richard Elling - PAE
2006-Sep-19 23:51 UTC
[zfs-discuss] Proposal: multiple copies of user data
reply below... Torrey McMahon wrote:> Richard Elling - PAE wrote: >> >> This question was asked many times in this thread. IMHO, it is the >> single biggest reason we should implement ditto blocks for data. >> >> We did a study of disk failures in an enterprise RAID array a few >> years ago. One failure mode stands heads and shoulders above the >> others: non-recoverable reads. A short summary: >> >> 2,919 total errors reported >> 1,926 (66.0%) operations succeeded (eg. write failed, auto reallocated) >> 961 (32.9%) unrecovered errors (of all types) >> 32 (1.1%) other (eg. device not ready) >> 707 (24.2%) non-recoverable reads >> >> In other words, non-recoverable reads represent 73.6% of the non- >> recoverable failures that occur, including complete drive failures. > > > Does this take cascading failures into account? How often do you get an > unrecoverable read and yet are still able to perform operation on the > target media? Thats where ditto blocks could come in handy modulo the > concerns around utilities and quotas.No event analysis is done here, though we do have the data, the task is time consuming. Non-recoverable reads may not represent permanent failures. In the case of a RAID array, the data should be reconstructed and a rewrite + verify attempted with the possibility of sparing the sector. ZFS can reconstruct the data and relocate the block. I have some (volumous) data on disk error rates as reported though kstat. I plan to attempt to get a better sense of the failure rates from that data. The disk vendors specify non-recoverable read error rates, but we think they are overly pessimistic for the first few years of life. We''d like to have a better sense of how to model this, for a variety of applications which are concerned with archival periods. -- richard
Richard Elling - PAE wrote:
>
> Non-recoverable reads may not represent permanent failures.  In the case
> of a RAID array, the data should be reconstructed and a rewrite + verify
> attempted, with the possibility of sparing the sector.  ZFS can
> reconstruct the data and relocate the block.
>

True, but if you're using a HW RAID array or some sort of protection
within a zpool, then you're already protected to a large degree.  I'm
looking for the number of cases where you get a permanent unrecoverable
read error and yet can recover because you've got a ditto block someplace.
Richard Elling - PAE
2006-Sep-20 03:20 UTC
[zfs-discuss] Proposal: multiple copies of user data
Torrey McMahon wrote:
> Richard Elling - PAE wrote:
>>
>> Non-recoverable reads may not represent permanent failures.  In the case
>> of a RAID array, the data should be reconstructed and a rewrite + verify
>> attempted, with the possibility of sparing the sector.  ZFS can
>> reconstruct the data and relocate the block.
>>
>
> True, but if you're using a HW RAID array or some sort of protection
> within a zpool, then you're already protected to a large degree.  I'm
> looking for the number of cases where you get a permanent unrecoverable
> read error and yet can recover because you've got a ditto block someplace.

Agree.  Non-recoverable reads are largely a JBOD problem.
 -- richard
Just a "me too" mail:

On 13 Sep 2006, at 08:30, Richard Elling wrote:
>> Is this use of slightly based upon disk failure modes?  That is, when
>> disks fail do they tend to get isolated areas of badness compared to
>> complete loss?  I would suggest that complete loss should include
>> someone tripping over the power cord to the external array that
>> houses the disk.
>
> The field data I have says that complete disk failures are the
> exception.

It's the same here.  In our 100-laptop population over the last 2 years,
we had 2 dead drives and 10 or so with I/O errors.

> BTW, this feature will be very welcome on my laptop!  I can't wait :-)

I, too, would love having two copies of my important data on my laptop
drive.  Laptop drives are small enough as they are; there's no point in
storing the OS, tmp and swap files twice as well.

So if ditto-data blocks aren't hard to implement, they would be welcome.
Otherwise there's still the mirror-split-your-drive approach.

Wout.
Victor Latushkin
2006-Sep-23 14:54 UTC
[zfs-discuss] Proposal: multiple copies of user data
David Dyer-Bennet wrote:
> On 9/11/06, Matthew Ahrens <Matthew.Ahrens at sun.com> wrote:
>> Here is a proposal for a new 'copies' property which would allow
>> different levels of replication for different filesystems.
>>
>> Your comments are appreciated!
>
> I've read the proposal, and followed the discussion so far.  I have to
> say that I don't see any particular need for this feature.
>
> Possibly there is a need for a different feature, in which the entire
> control of redundancy is moved away from the pool level and to the
> file or filesystem level.  I definitely see the attraction of being
> able to specify by file and directory different degrees of reliability
> needed.  However, the details of the feature actually proposed don't
> seem to satisfy the need for extra reliability at the level that
> drives people to employ redundancy; it doesn't provide a guarantee.

I think this is easy to solve and could make the feature more useful: we
need a way to specify a policy for placing the duplicate copies, e.g. if
we want a guarantee, we specify that copies must strictly be put on
different disks.  In this way we can get a level of protection close to
that of a mirror, with much greater flexibility.  For example, on a
two-disk system we might keep two copies of the boot environment and
critical data, and only one copy of data that is temporary.

This could be a first step towards implementing redundancy on a
filesystem, directory or file basis.

> I see no need for additional non-guaranteed reliability on top of the
> levels of guarantee provided by use of redundancy at the pool level.
>
> Furthermore, as others have pointed out, this feature would add a high
> degree of user-visible complexity.
>
> From what I've seen here so far, I think this is a bad idea and should
> not be added.

With a way to specify the placement policy for copies, I think this is a
good and useful idea, and it's a pity that it has been shelved for now.
Hope it will not stay on the shelf forever ;-)

Victor
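One way to picture the placement policy asked for above is a small allocator
that either insists on distinct disks or merely prefers them.  This is purely
an illustrative sketch: the function, the policy names ("strict",
"best-effort") and the device names are invented for the example and do not
correspond to ZFS's actual allocator.

    # Toy copy-placement policy, assuming a flat list of disks.
    def place_copies(disks, ncopies, policy="best-effort"):
        """Return the disk chosen for each copy of a block.

        strict      -- fail unless every copy lands on a different disk
                       (mirror-like guarantee)
        best-effort -- spread across different disks when possible, but
                       allow several copies on one disk
        """
        if policy == "strict" and ncopies > len(disks):
            raise ValueError("cannot guarantee %d copies on %d disks"
                             % (ncopies, len(disks)))
        chosen = []
        for i in range(ncopies):
            # Round-robin: distinct disks while they last, then wrap.
            chosen.append(disks[i % len(disks)])
        return chosen

    print(place_copies(["c0t0d0", "c0t1d0"], 2, policy="strict"))  # two disks
    print(place_copies(["c0t0d0"], 2))                             # both on one disk
    try:
        place_copies(["c0t0d0"], 2, policy="strict")
    except ValueError as e:
        print("strict policy refused:", e)

The "strict" mode is what would give the mirror-like guarantee; the
"best-effort" mode corresponds to the behaviour described in the proposal,
where copies are placed on different disks only when possible.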
Pawel Jakub Dawidek
2006-Sep-27 11:22 UTC
[zfs-discuss] Proposal: multiple copies of user data
On Tue, Sep 12, 2006 at 03:56:00PM -0700, Matthew Ahrens wrote:
> Matthew Ahrens wrote:
[...]
> Given the overwhelming criticism of this feature, I'm going to shelve
> it for now.

I'd really like to see this feature.

You say ZFS should change our view on filesystems; I say be consistent.
In the ZFS world we create one big pool out of all our disks and create
filesystems on top of it.  This way we don't have to care about resizing
them, etc.  But this way we also define redundancy at the pool level for
all our filesystems.

It is quite common to have data we don't really care about as well as
data we care about a lot in the same pool.  Before ZFS, I'd just create
RAID0 for the former and RAID1 for the latter, but that is not the ZFS
way, right?

My question is: how can I express my intent of defining the redundancy
level based on the importance of my data, while still following the ZFS
way, without the 'copies' feature?

Please reconsider your choice.

-- 
Pawel Jakub Dawidek                       http://www.wheel.pl
pjd at FreeBSD.org                           http://www.FreeBSD.org
FreeBSD committer                         Am I Evil? Yes, I Am!