Here is a proposal for a new 'copies' property which would allow
different levels of replication for different filesystems.

Your comments are appreciated!

--matt

A. INTRODUCTION

ZFS stores multiple copies of all metadata. This is accomplished by
storing up to three DVAs (Disk Virtual Addresses) in each block pointer.
This feature is known as "Ditto Blocks". When possible, the copies are
stored on different disks.

See bug 6410698 "ZFS metadata needs to be more highly replicated (ditto
blocks)" for details on ditto blocks.

This case will extend this feature to allow system administrators to
store multiple copies of user data as well, on a per-filesystem basis.
These copies are in addition to any redundancy provided at the pool
level (mirroring, raid-z, etc.).

B. DESCRIPTION

A new property will be added, 'copies', which specifies how many copies
of the given filesystem will be stored. Its value must be 1, 2, or 3.
Like other properties (eg. checksum, compression), it only affects
newly-written data. As such, it is recommended that the 'copies'
property be set at filesystem-creation time
(eg. 'zfs create -o copies=2 pool/fs').

The pool must be at least on-disk version 2 to use this feature (see
'zfs upgrade').

By default (copies=1), only two copies of most filesystem metadata are
stored. However, if we are storing multiple copies of user data, then 3
copies (the maximum) of filesystem metadata will be stored.

This feature is similar to using mirroring, but differs in several
important ways:

* Different filesystems in the same pool can have different numbers of
  copies.
* The storage configuration is not constrained as it is with mirroring
  (eg. you can have multiple copies even on a single disk).
* Mirroring offers slightly better performance, because only one DVA
  needs to be allocated.
* Mirroring offers slightly better redundancy, because one disk from
  each mirror can fail without data loss.

It is important to note that the copies provided by this feature are in
addition to any redundancy provided by the pool configuration or the
underlying storage. For example:

* In a pool with 2-way mirrors, a filesystem with copies=1 (the default)
  will be stored with 2 * 1 = 2 copies. The filesystem can tolerate any
  1 disk failing without data loss.
* In a pool with 2-way mirrors, a filesystem with copies=3 will be
  stored with 2 * 3 = 6 copies. The filesystem can tolerate any 5 disks
  failing without data loss (assuming that there are at least ncopies=3
  mirror groups).
* In a pool with single-parity raid-z, a filesystem with copies=2 will
  be stored with 2 copies, each copy protected by its own parity block.
  The filesystem can tolerate any 3 disks failing without data loss
  (assuming that there are at least ncopies=2 raid-z groups).

C. MANPAGE CHANGES

*** zfs.man4    Tue Jun 13 10:15:38 2006
--- zfs.man5    Mon Sep 11 16:34:37 2006
***************
*** 708,714 ****
--- 708,725 ----
          they are inherited.

+     copies=1 | 2 | 3
+       Controls the number of copies of data stored for this dataset.
+       These copies are in addition to any redundancy provided by the
+       pool (eg. mirroring or raid-z). The copies will be stored on
+       different disks if possible.
+
+       Changing this property only affects newly-written data.
+       Therefore, it is recommended that this property be set at
+       filesystem creation time, using the '-o copies=' option.
+
+
      Temporary Mountpoint Properties
          When a file system is mounted, either through mount(1M) for
          legacy mounts or the "zfs mount" command for normal file

D. REFERENCES
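(For concreteness, a minimal sketch of how the proposed property would be used from the command line; the pool name, device name, and dataset names below are hypothetical and only the commands are shown.)

    # sketch of the proposed 'copies' property; 'tank' and c0t0d0 are hypothetical
    zpool create tank c0t0d0                  # single-disk pool, no pool-level redundancy
    zfs create tank/scratch                   # default: copies=1
    zfs create -o copies=2 tank/docs          # user data stored twice, on different
                                              # disks or disk regions where possible
    zfs get copies tank/scratch tank/docs     # verify the per-filesystem setting
    zfs set copies=3 tank/docs                # affects only newly-written data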
On 9/11/06, Matthew Ahrens <Matthew.Ahrens at sun.com> wrote:
> Here is a proposal for a new 'copies' property which would allow
> different levels of replication for different filesystems.
>
> Your comments are appreciated!
>
> --matt
>
> A. INTRODUCTION
>
> ZFS stores multiple copies of all metadata. This is accomplished by
> storing up to three DVAs (Disk Virtual Addresses) in each block pointer.
> This feature is known as "Ditto Blocks". When possible, the copies are
> stored on different disks.
>
> See bug 6410698 "ZFS metadata needs to be more highly replicated (ditto
> blocks)" for details on ditto blocks.
>
> This case will extend this feature to allow system administrators to
> store multiple copies of user data as well, on a per-filesystem basis.
> These copies are in addition to any redundancy provided at the pool
> level (mirroring, raid-z, etc).
>
> B. DESCRIPTION
>
> A new property will be added, 'copies', which specifies how many copies
> of the given filesystem will be stored. Its value must be 1, 2, or 3.
> Like other properties (eg. checksum, compression), it only affects
> newly-written data. As such, it is recommended that the 'copies'
> property be set at filesystem-creation time
> (eg. 'zfs create -o copies=2 pool/fs').

Would the user be held accountable for the space used by the extra
copies? So if a user has a 1GB quota and stores one 512MB file with
two copies activated, all his space will be used? What happens if the
same user stores a 756MB file on a filesystem with multiple copies
enabled and a 1GB quota: does the save fail? How would the user tell
that his filesystem is full, since all the tools he is used to would
report that he is using only 1/2 the space?

Is there a way for the sysadmin to get rid of the excess copies should
disk space needs require it? If I start out with 2 copies and later
change it to only 1 copy, do the files created before keep their 2
copies?

What happens if root needs to store a copy of an important file and
there is no space, but there would be space if extra copies were
reclaimed? Will this be configurable behavior?

James Dickens
uadmin.blogpsot.com
On 9/11/06, Matthew Ahrens <Matthew.Ahrens at sun.com> wrote:> B. DESCRIPTION > > A new property will be added, ''copies'', which specifies how many copies > of the given filesystem will be stored. Its value must be 1, 2, or 3. > Like other properties (eg. checksum, compression), it only affects > newly-written data. As such, it is recommended that the ''copies'' > property be set at filesystem-creation time > (eg. ''zfs create -o copies=2 pool/fs'').Is there anything in the works to compress (or encrypt) existing data after the fact? For example, a special option to scrub that causes the data to be re-written with the new properties could potentially do this. If so, this feature should subscribe to any generic framework provided by such an effort.> This feature is similar to using mirroring, but differs in several > important ways: > > * Mirroring offers slightly better redundancy, because one disk from > each mirror can fail without data loss.Is this use of slightly based upon disk failure modes? That is, when disks fail do they tend to get isolated areas of badness compared to complete loss? I would suggest that complete loss should include someone tripping over the power cord to the external array that houses the disk.> It is important to note that the copies provided by this feature are in > addition to any redundancy provided by the pool configuration or the > underlying storage. For example:All of these examples seem to assume that there six disks.> * In a pool with 2-way mirrors, a filesystem with copies=1 (the default) > will be stored with 2 * 1 = 2 copies. The filesystem can tolerate any > 1 disk failing without data loss. > * In a pool with 2-way mirrors, a filesystem with copies=3 > will be stored with 2 * 3 = 6 copies. The filesystem can tolerate any > 5 disks failing without data loss (assuming that there are at least > ncopies=3 mirror groups).This one assumes best case scenario with 6 disks. Suppose you had 4 x 72 GB and 2 x 36 GB disks. You could end up with multiple copies on the 72 GB disks.> * In a pool with single-parity raid-z a filesystem with copies=2 > will be stored with 2 copies, each copy protected by its own parity > block. The filesystem can tolerate any 3 disks failing without data > loss (assuming that there are at least ncopies=2 raid-z groups). > > > C. MANPAGE CHANGES > *** zfs.man4 Tue Jun 13 10:15:38 2006 > --- zfs.man5 Mon Sep 11 16:34:37 2006 > *************** > *** 708,714 **** > --- 708,725 ---- > they are inherited. > > > + copies=1 | 2 | 3 > > + Controls the number of copies of data stored for this dataset. > + These copies are in addition to any redundancy provided by the > + pool (eg. mirroring or raid-z). The copies will be stored on > + different disks if possible.Any statement about physical location on the disk? It would seem as though locating two copies sequentially on the disk would not provide nearly the amount of protection as having them fairly distant from each other. -- Mike Gerdts http://mgerdts.blogspot.com/
James Dickens wrote:> On 9/11/06, Matthew Ahrens <Matthew.Ahrens at sun.com> wrote: >> B. DESCRIPTION >> >> A new property will be added, ''copies'', which specifies how many copies >> of the given filesystem will be stored. Its value must be 1, 2, or 3. >> Like other properties (eg. checksum, compression), it only affects >> newly-written data. As such, it is recommended that the ''copies'' >> property be set at filesystem-creation time >> (eg. ''zfs create -o copies=2 pool/fs''). >> > would the user be held acountable for the space used by the extra > copies?Doh! Sorry I forgot to address that. I''ll amend the proposal and manpage to include this information... Yes, the space used by the extra copies will be accounted for, eg. in stat(2), ls -s, df(1m), du(1), zfs list, and count against their quota.> so if a user has a 1GB quota and stores one 512MB file with > two copies activated, all his space will be used?Yes, and as mentioned this will be reflected in all the space accounting tools.> what happens if the > same user stores a file that is 756MB on the filesystem with multiple > copies enabled an a 1GB quota, does the save fail?Yes, they will get ENOSPC and see that their filesystem is full.> How would the user > tell that his filesystem is full since all the tools he is used to > report he is using only 1/2 the space?Any tool will report that in fact all space is being used.> Is there a way for the sysdmin to get rid of the excess copies should > disk space needs require it?No, not without rewriting them. (This is the same behavior we have today with the ''compression'' and ''checksum'' properties. It''s a long-term goal of ours to be able to go back and change these things after the fact ("scrub them in", so to say), but with snapshots, this is extremely nontrivial to do efficiently and without increasing the amount of space used.)> If I start out 2 copies and later change it to on 1 copy, do the > files created before keep there 2 copies?Yep, the property only affects newly-written data.> what happens if root needs to store a copy of an important file and > there is no space but there is space if extra copies are reclaimed?They will get ENOSPC.> Will this be configurable behavior?No. --matt
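(A sketch of the space accounting just described, reusing the 1GB-quota example from the question; the dataset name is hypothetical and mkfile(1M) is only used to create files of a given size.)

    zfs create -o copies=2 -o quota=1g tank/home/james   # hypothetical dataset
    mkfile 512m /tank/home/james/a          # charges roughly 2 x 512MB = 1GB of quota
    zfs list -o name,used,avail tank/home/james          # 'used' reflects both copies
    mkfile 756m /tank/home/james/b          # needs another ~1.5GB; fails with ENOSPC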
Mike Gerdts wrote:> Is there anything in the works to compress (or encrypt) existing data > after the fact? For example, a special option to scrub that causes > the data to be re-written with the new properties could potentially do > this.This is a long-term goal of ours, but with snapshots, this is extremely nontrivial to do efficiently and without increasing the amount of space used.) . > If so, this feature should subscribe to any generic framework> provided by such an effort.Yep, absolutely.>> * Mirroring offers slightly better redundancy, because one disk from >> each mirror can fail without data loss. > > Is this use of slightly based upon disk failure modes? That is, when > disks fail do they tend to get isolated areas of badness compared to > complete loss? I would suggest that complete loss should include > someone tripping over the power cord to the external array that houses > the disk.I''m basing this "slightly better" call on a model of random, complete-disk failures. I know that this is only an approximation. With many mirrors, most (but not all) 2-disk failures can be tolerated. With copies=2, almost no 2-top-level-vdev failures will be tolerated, because it''s likely that *some* block will have both its copies on those 2 disks. With mirrors, you can arrange to mirror across cabinets, not within them, which you can''t do with copies.>> It is important to note that the copies provided by this feature are in >> addition to any redundancy provided by the pool configuration or the >> underlying storage. For example: > > All of these examples seem to assume that there six disks.Not really. There could be any number of mirrors or raid-z groups (although I note, you need at least ''copies'' groups to survive the max whole-disk failures).>> * In a pool with 2-way mirrors, a filesystem with copies=1 (the default) >> will be stored with 2 * 1 = 2 copies. The filesystem can tolerate any >> 1 disk failing without data loss. >> * In a pool with 2-way mirrors, a filesystem with copies=3 >> will be stored with 2 * 3 = 6 copies. The filesystem can tolerate any >> 5 disks failing without data loss (assuming that there are at least >> ncopies=3 mirror groups). > > This one assumes best case scenario with 6 disks. Suppose you had 4 x > 72 GB and 2 x 36 GB disks. You could end up with multiple copies on > the 72 GB disks.Yes, all these examples assume that our "putting the copies on different disks when possible" actually worked out. It will almost certainly work out unless you have a small number of different-sized devices, or are running with very little free space. If you need hard guarantees, you need to use actual mirroring.> Any statement about physical location on the disk? It would seem as > though locating two copies sequentially on the disk would not provide > nearly the amount of protection as having them fairly distant from > each other.Yep, if the copies can''t be stored on different disks, they will be stored spread-out on the same disk if possible (I think we aim for one on each quarter of the disk). --matt
On 9/11/06, Matthew Ahrens <Matthew.Ahrens at sun.com> wrote:
> > would the user be held accountable for the space used by the extra
> > copies?
>
> Doh! Sorry I forgot to address that. I'll amend the proposal and
> manpage to include this information...
>
> Yes, the space used by the extra copies will be accounted for, eg. in
> stat(2), ls -s, df(1m), du(1), zfs list, and count against their quota.
>
> > so if a user has a 1GB quota and stores one 512MB file with
> > two copies activated, all his space will be used?
>
> Yes, and as mentioned this will be reflected in all the space accounting
> tools.

Yuck. This would be terribly confusing for typical end-users. I would say that statvfs() should munge the numbers such that f_bfree and f_bavail are divided by ncopies. Otherwise, applications that need this information will need to know *way* too much about the file system.

For example, consider the checks performed by setup_install_server that comes with the Solaris media. That script does a du on the media that it came from, followed by a df on the target. Should that script really need to be modified for the case where the source and/or target are on zfs with ncopies != 1?

This part of the feature would keep me from using it anywhere that there is any chance of being space constrained and I have one or more users who can't read the man page for zfs and then explain how it is different from at least one competing file system.

Mike

-- 
Mike Gerdts
http://mgerdts.blogspot.com/
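(The kind of check being described can be sketched as follows; the paths are made up and this is not the actual setup_install_server code. With copies=2, the target consumes roughly twice what du reports for the source, so a comparison against the raw df figure can pass even though the copy will later run out of space, which is the argument for dividing the statvfs numbers by ncopies.)

    need=`du -sk /cdrom/cdrom0/Solaris_11/Tools | awk '{print $1}'`   # KB used by the source
    avail=`df -k /export/install | awk 'NR==2 {print $4}'`            # free KB on the target
    if [ "$need" -gt "$avail" ]; then
            echo "ERROR: not enough space on /export/install"
            exit 1
    fi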
On 9/11/06, Matthew Ahrens <Matthew.Ahrens at sun.com> wrote:
[...]
> > what happens if root needs to store a copy of an important file and
> > there is no space but there is space if extra copies are reclaimed?
>
> They will get ENOSPC.

Though I think this is a cool feature, I think it needs more work. I
think there should be an option to make extra copies expendable, so the
extra copies are a request: if the space is available, make them; if
not, complete the write and log the event. If the user really requires
guaranteed extra copies, then use mirrored or raided disks.

It just seems to be a nightmare for the administrator: you start with 3
copies and then change to 2 copies, and you will have phantom copies
that are only known to exist to the OS. They won't show in any reports,
zfs list doesn't have an option to show which files have multiple
clones and which don't, and there is no way to destroy multiple clones
without rewriting every file on the disk.

James

> > Will this be configurable behavior?
>
> No.
>
> --matt
James Dickens wrote:> though I think this is a cool feature, I think i needs more work. I > think there sould be an option to make extra copies expendible. So the > extra copies are a request, if the space is availible make them, if > not complete the write, and log the event.Are you asking for the extra copies that have already been written to be dynamically freed up when we are running low on space? That could be useful, but it isn''t the problem I''m trying to solve with the ''copies'' property (not to mention it would be extremely difficult to implement).> It the user really requires guaranteed extra copies, then use mirrored > or raided disks.Right, if you want everything to have extra redundancy, that use case is handled just fine today by mirrors or RAIDZ. The case where ''copies'' is useful is when you want some data to be stored with more redundancy than others, without the burden of setting up different pools.> It seems just to be a nightmare for the administrator, you start with > 3 copies and then change to 2 copies, you will have phantom copies > that are only known to exist to the OS, it won''t show in any reports, > zfs list doesn''t have an option to show which files have multiple > clones and which dont. There is no way to destroy multiple clones > without rewriting every file on the disk.(I''m assuming you mean copies, not clones.) So would you prefer that the property be restricted to only being set at filesystem creation time, and not changed later? That way the number of copies of all files in the filesystem is always the same. It seems like the issue of knowing how many copies there are would be much worse in the system you''re asking for where the extra copies are freed up as needed... --matt
William D. Hathaway
2006-Sep-12 02:30 UTC
[zfs-discuss] Re: Proposal: multiple copies of user data
Hi Matt,

Interesting proposal. Has there been any consideration of whether the free space reported for a ZFS filesystem would take the copies setting into account?

Example:

    zfs create mypool/nonredundant_data
    zfs create mypool/redundant_data
    df -h /mypool/nonredundant_data /mypool/redundant_data
      (shows same amount of free space)
    zfs set copies=3 mypool/redundant_data

Would a new df of /mypool/redundant_data now show a different amount of free space (presumably 1/3, if different) than /mypool/nonredundant_data?
Darren J Moffat
2006-Sep-12 09:36 UTC
[zfs-discuss] Proposal: multiple copies of user data
Mike Gerdts wrote:> On 9/11/06, Matthew Ahrens <Matthew.Ahrens at sun.com> wrote: >> B. DESCRIPTION >> >> A new property will be added, ''copies'', which specifies how many copies >> of the given filesystem will be stored. Its value must be 1, 2, or 3. >> Like other properties (eg. checksum, compression), it only affects >> newly-written data. As such, it is recommended that the ''copies'' >> property be set at filesystem-creation time >> (eg. ''zfs create -o copies=2 pool/fs''). > > Is there anything in the works to compress (or encrypt) existing data > after the fact? For example, a special option to scrub that causes > the data to be re-written with the new properties could potentially do > this. If so, this feature should subscribe to any generic framework > provided by such an effort.While encryption of existing data is not in scope for the first ZFS crypto phase I am being careful in the design to ensure that it can be done later if such a ZFS "framework" becomes available. The biggest problem I see with this is one of observability, if not all of the data is encrypted yet what should the encryption property say ? If it says encryption is on then the admin might think the data is "safe", but if it says it is off that isn''t the truth either because some of it maybe in encrypted. -- Darren J Moffat
On 12/09/06, Matthew Ahrens <Matthew.Ahrens at sun.com> wrote:
> Here is a proposal for a new 'copies' property which would allow
> different levels of replication for different filesystems.
>
> Your comments are appreciated!

Flexibility is always nice, but this seems to greatly complicate things, both technically and conceptually (sometimes, good design is about what is left out :) ).

Seems to me this lets you say 'files in this directory are x times more valuable than files elsewhere'. Others have covered some of my concerns (guarantees, cleanup, etc.). In addition:

* if I move a file somewhere else, does it become less important?
* zpools let you do that already (admittedly with less granularity, but *much* *much* more simply - and disk is cheap in my world)
* I don't need to do that :)

The only real use I'd see would be for redundant copies on a single disk, but then why wouldn't I just add a disk?

* disks are cheap, and creating a mirror from a single disk is very easy (and conceptually simple)
* *removing* a disk from a mirror pair is simple too - I make mistakes sometimes
* in my experience, disks fail. When you get bad errors on part of a disk, the disk is about to die.
* you can already create a/several zpools using disk partitions as vdevs. That's not all that safe, and I don't see this being any safer.

Sorry to be negative, but to me ZFS' simplicity is one of its major features. I think this provides a cool feature, but I question its usefulness. Quite possibly I just don't have the particular itch this is intended to scratch - is this a much requested feature?

-- 
Rasputin :: Jack of All Trades - Master of Nuns
http://number9.hellooperator.net/
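(The single-disk alternatives mentioned above can be sketched as follows; the slice and file names are hypothetical.)

    # mirror across two slices of the same physical disk
    zpool create tank mirror c0t0d0s3 c0t0d0s4

    # or, for experimentation, mirror across two file-backed vdevs
    mkfile 2g /export/vdev0 /export/vdev1
    zpool create demo mirror /export/vdev0 /export/vdev1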
Ceri Davies
2006-Sep-12 10:32 UTC
[zfs-discuss] Re: Proposal: multiple copies of user data
> Hi Matt, > Interesting proposal. Has there been any > consideration if free space being reported for a ZFS > filesystem would take into account the copies > setting? > > Example: > zfs create mypool/nonredundant_data > zfs create mypool/redundant_data > df -h /mypool/nonredundant_data > /mypool/redundant_data > (shows same amount of free space) > zfs set copies=3 mypool/redundant_data > > Would a new df of /mypool/redundant_data now show a > different amount of free space (presumably 1/3 if > different) than /mypool/nonredundant_data?As I understand the proposal, there''s nothing new to do here. The filesystem might be 25% full, and it would be 25% full no matter how many copies of the filesystem there are. Similarly with quotas, I''d argue that the extra copies should not count towards a user''s quota, since a quota is set on the filesystem. If I''m using 500M on a filesystem, I only have 500M of data no matter how many copies of it the administrator has decided to keep (cf. RAID1). I also don''t see why a copy can''t just be dropped if the "copies" value is decreased. Having said this, I don''t see any value in the proposal at all, to be honest. This message posted from opensolaris.org
Darren J Moffat
2006-Sep-12 10:52 UTC
[zfs-discuss] Proposal: multiple copies of user data
Dick Davies wrote:> The only real use I''d see would be for redundant copies > on a single disk, but then why wouldn''t I just add a disk?Some systems have physical space for only a single drive - think most laptops! -- Darren J Moffat
On 12/09/06, Darren J Moffat <Darren.Moffat at sun.com> wrote:> Dick Davies wrote: > > > The only real use I''d see would be for redundant copies > > on a single disk, but then why wouldn''t I just add a disk? > > Some systems have physical space for only a single drive - think most > laptops!True - I''m a laptop user myself. But as I said, I''d assume the whole disk would fail (it does in my experience). If your hardware craps differently to mine, you could do a similar thing with partitions (or even files) as vdevs. Wouldn''t be any less reliable. I''m still not Feeling the Magic on this one :) -- Rasputin :: Jack of All Trades - Master of Nuns http://number9.hellooperator.net/
Darren J Moffat
2006-Sep-12 12:29 UTC
[zfs-discuss] Proposal: multiple copies of user data
Dick Davies wrote:> On 12/09/06, Darren J Moffat <Darren.Moffat at sun.com> wrote: >> Dick Davies wrote: >> >> > The only real use I''d see would be for redundant copies >> > on a single disk, but then why wouldn''t I just add a disk? >> >> Some systems have physical space for only a single drive - think most >> laptops! > > True - I''m a laptop user myself. But as I said, I''d assume the whole disk > would fail (it does in my experience).Indeed and that is the failure I had recently - complete death of disk no longer visible even by the BIOS.> If your hardware craps differently to mine, you could do a similar thing > with partitions (or even files) as vdevs. Wouldn''t be any less reliable.One downside to that being that ZFS can perform better, due to write cache IIRC, when given the whole disk - though in the laptop case this isn''t applicable until ZFS boot is ready. -- Darren J Moffat
Darren J Moffat
2006-Sep-12 12:31 UTC
[zfs-discuss] Proposal: multiple copies of user data
The multiple copies needs to be thought out carefully for interactions with ZFS crypto since. I''m not sure what the impact is yet, it would help to know at what layer in the ZIO pipeline this is done - eg today before or after compression. -- Darren J Moffat
This proposal would benefit greatly from a "problem statement." As it stands, it feels like a solution looking for a problem. The Introduction mentions a different problem and solution, but then pretends that there is value to this solution. The Description section mentions some benefits of 'copies' relative to the existing situation, but requires that the reader piece together the whole picture. And IMO there aren't enough pieces :-) , i.e. so far I haven't seen sufficient justification for the added administrative complexity and potential for confusion, both administrative and user.

Matthew Ahrens wrote:
> Here is a proposal for a new 'copies' property which would allow
> different levels of replication for different filesystems.
[...]

-- 
--------------------------------------------------------------------------
Jeff VICTOR              Sun Microsystems             jeff.victor @ sun.com
OS Ambassador            Sr. Technical Specialist
Solaris 10 Zones FAQ: http://www.opensolaris.org/os/community/zones/faq
--------------------------------------------------------------------------
Anton B. Rang
2006-Sep-12 14:54 UTC
[zfs-discuss] Re: Proposal: multiple copies of user data
>The biggest problem I see with this is one of observability, if not all >of the data is encrypted yet what should the encryption property say ? >If it says encryption is on then the admin might think the data is >"safe", but if it says it is off that isn''t the truth either because >some of it maybe still encrypted.>From a user interface perspective, I''d expect something likeEncryption: Being enabled, 75% complete or Encryption: Being disabled, 25% complete, about 2h23m remaining I''m not sure how you''d map this into a property (or several), but it seems like "on"/"off" ought to be paired with "transitioning to on"/"transitioning to off" for any changes which aren''t instantaneous. This message posted from opensolaris.org
Darren J Moffat
2006-Sep-12 14:59 UTC
[zfs-discuss] Re: Proposal: multiple copies of user data
Anton B. Rang wrote:>> The biggest problem I see with this is one of observability, if not all >> of the data is encrypted yet what should the encryption property say ? >> If it says encryption is on then the admin might think the data is >> "safe", but if it says it is off that isn''t the truth either because >> some of it maybe still encrypted. > >>From a user interface perspective, I''d expect something like > > Encryption: Being enabled, 75% complete > or > Encryption: Being disabled, 25% complete, about 2h23m remainingand if we are still writing to the file systems at that time ? Maybe this really does need to be done with the file system locked.> I''m not sure how you''d map this into a property (or several), but it seems like "on"/"off" ought to be paired with "transitioning to on"/"transitioning to off" for any changes which aren''t instantaneous.Agreed, and checksum and compression would have the same issue if there was a mechanism to rewrite with the new checksums or compression settings. -- Darren J Moffat
Anton B. Rang
2006-Sep-12 14:59 UTC
[zfs-discuss] Re: Proposal: multiple copies of user data
>True - I''m a laptop user myself. But as I said, I''d assume the whole disk >would fail (it does in my experience).That''s usually the case, but single-block failures can occur as well. They''re rare (check the "uncorrectable bit error rate" specifications) but if they happen to hit a critical file, they''re painful. On the other hand, multiple copies seems (to me) like a really expensive way to deal with this. ZFS is already using relatively large blocks, so it could add an erasure code on top of them and have far less storage overhead. If the assumed problem is multi-block failures in one area of the disk, I''d wonder how common this failure mode is; in my experience, multi-block failures are generally due to the head having touched the platter, in which case the whole drive will shortly fail. (In any case, multi-block failures could be addressed by spreading the data from a large block and using an erasure code.) This message posted from opensolaris.org
Anton B. Rang
2006-Sep-12 15:04 UTC
[zfs-discuss] Re: Re: Proposal: multiple copies of user data
>And if we are still writing to the file systems at that time ?New writes should be done according to the new state (if encryption is being enabled, all new writes are encrypted), since the goal is that eventually the whole disk will be in the new state. The completion percentage should probably reflect the existing data at the time that the state change is initiated, since new writes won''t affect how much data has to be replaced.>Maybe this really does need to be done with the file system locked.I don''t see any technical reason to require that, and users expect better from us these days. :-) As you point out, checksum & compression will have the same issue once we have on-line changes for those as well. The framework ought to take care of this. This message posted from opensolaris.org
David Dyer-Bennet
2006-Sep-12 15:12 UTC
[zfs-discuss] Proposal: multiple copies of user data
On 9/11/06, Matthew Ahrens <Matthew.Ahrens at sun.com> wrote:> Here is a proposal for a new ''copies'' property which would allow > different levels of replication for different filesystems. > > Your comments are appreciated!I''ve read the proposal, and followed the discussion so far. I have to say that I don''t see any particular need for this feature. Possibly there is a need for a different feature, in which the entire control of redundancy is moved away from the pool level and to the file or filesystem level. I definitely see the attraction of being able to specify by file and directory different degrees of reliability needed. However, the details of the feature actually proposed don''t seem to satisfy the need for extra reliability at the level that drives people to employ redundancy; it doesn''t provide a guaranty. I see no need for additional non-guaranteed reliability on top of the levels of guaranty provided by use of redundancy at the pool level. Furthermore, as others have pointed out, this feature would add a high degree of user-visible complexity.>From what I''ve seen here so far, I think this is a bad idea and shouldnot be added. -- David Dyer-Bennet, <mailto:dd-b at dd-b.net>, <http://www.dd-b.net/dd-b/> RKBA: <http://www.dd-b.net/carry/> Pics: <http://www.dd-b.net/dd-b/SnapshotAlbum/> Dragaera/Steven Brust: <http://dragaera.info/>
Darren J Moffat wrote:> While encryption of existing data is not in scope for the first ZFS > crypto phase I am being careful in the design to ensure that it can be > done later if such a ZFS "framework" becomes available. > > The biggest problem I see with this is one of observability, if not all > of the data is encrypted yet what should the encryption property say ? > If it says encryption is on then the admin might think the data is > "safe", but if it says it is off that isn''t the truth either because > some of it maybe in encrypted.I would also think that there''s a significant problem around what to do about the previously unencrypted data. I assume that when performing a "scrub" to encrypt the data, the encrypted data will not be written on the same blocks previously used to hold the unencrypted data. As such, there''s a very good chance that the unencrypted data would still be there for quite some time. You may not be able to access it through the filesystem, but someone with access to the raw disks may be able to recover at least parts of it. In this case, the "scrub" would not only have to write the encrypted data but also overwrite the unencrypted data (multiple times?). Neil
Darren J Moffat
2006-Sep-12 16:17 UTC
[zfs-discuss] Proposal: multiple copies of user data
Neil A. Wilson wrote:> Darren J Moffat wrote: >> While encryption of existing data is not in scope for the first ZFS >> crypto phase I am being careful in the design to ensure that it can be >> done later if such a ZFS "framework" becomes available. >> >> The biggest problem I see with this is one of observability, if not >> all of the data is encrypted yet what should the encryption property >> say ? If it says encryption is on then the admin might think the data >> is "safe", but if it says it is off that isn''t the truth either >> because some of it maybe in encrypted. > > I would also think that there''s a significant problem around what to do > about the previously unencrypted data. I assume that when performing a > "scrub" to encrypt the data, the encrypted data will not be written on > the same blocks previously used to hold the unencrypted data. As such, > there''s a very good chance that the unencrypted data would still be > there for quite some time. You may not be able to access it through the > filesystem, but someone with access to the raw disks may be able to > recover at least parts of it. In this case, the "scrub" would not only > have to write the encrypted data but also overwrite the unencrypted data > (multiple times?).Right, that is a very important issue. Would a ZFS "scrub" framework do copy on write ? As you point out if it doesn''t then we still need to do something about the old clear text blocks because strings(1) over the raw disk will show them. I see the desire to have a knob that says "make this encrypted now" but I personally believe that it is actually better if you can make this choice at the time you create the ZFS data set. -- Darren J Moffat
Nicolas Williams
2006-Sep-12 16:48 UTC
[zfs-discuss] Proposal: multiple copies of user data
On Tue, Sep 12, 2006 at 10:36:30AM +0100, Darren J Moffat wrote:> Mike Gerdts wrote: > >Is there anything in the works to compress (or encrypt) existing data > >after the fact? For example, a special option to scrub that causes > >the data to be re-written with the new properties could potentially do > >this. If so, this feature should subscribe to any generic framework > >provided by such an effort. > > While encryption of existing data is not in scope for the first ZFS > crypto phase I am being careful in the design to ensure that it can be > done later if such a ZFS "framework" becomes available. > > The biggest problem I see with this is one of observability, if not all > of the data is encrypted yet what should the encryption property say ? > If it says encryption is on then the admin might think the data is > "safe", but if it says it is off that isn''t the truth either because > some of it maybe in encrypted.I agree -- there needs to be a filesystem re-write option, something like a "scrub" but at the filesystem level. Things that might be accomplished through it: - record size changes - compression toggling / compression algorithm changes - encryption/re-keying/alg. changes - checksum alg. changes - ditto blocking What else? To me it''s important that such "scrubs" not happen simply as a result of setting/changing a filesystem property, but it''s also important that the user/admin be told that changing the property requires scrubbing in order to take effect for data/meta-data written before the change. Nico --
Nicolas Williams
2006-Sep-12 16:53 UTC
[zfs-discuss] Proposal: multiple copies of user data
On Tue, Sep 12, 2006 at 05:17:16PM +0100, Darren J Moffat wrote:> I see the desire to have a knob that says "make this encrypted now" but > I personally believe that it is actually better if you can make this > choice at the time you create the ZFS data set.Including when creating the dataset through zfs receive. I definitely want re-keying/cipher and MAC/checksum algorithm changes to be supported at zfs send/receive time. Nico --
On Tue, 12 Sep 2006, Anton B. Rang wrote: .... reformatted ....> >True - I''m a laptop user myself. But as I said, I''d assume the whole disk > >would fail (it does in my experience).Usually a laptop disk suffers a mechanical failure - and the failure rate is a lot higher than disks in a fixed location environment.> That''s usually the case, but single-block failures can occur as well. > They''re rare (check the "uncorrectable bit error rate" specifications) > but if they happen to hit a critical file, they''re painful. > > On the other hand, multiple copies seems (to me) like a really expensive > way to deal with this. ZFS is already using relatively large blocks, so > it could add an erasure code on top of them and have far less storage > overhead. If the assumed problem is multi-block failures in one area of > the disk, I''d wonder how common this failure mode is; in my experience, > multi-block failures are generally due to the head having touched the > platter, in which case the whole drive will shortly fail. (In any case,The following is based on dated knowledge from personal experience and I can''t say if its (still) accurate information today. Drive failures in a localized area are generally caused by the heads being positioned in the same (general) cylinder position for long periods of time. The heads ride on a air bearing - but there is still a lot of friction caused by the movement of air under the heads. This is turn generates heat. Localized heat buildup can cause some of the material coated on the disk to break free. The drive is designed for this eventuality - since it is equipped with a very fine filter which will catch and trap anything that breaks free and the airflow is designed to constantly circulate the air through the filter. However, some of the material might get trapped between the head and the disk and possibly stick to the disk. In this case, the neighbouring disk cylinders in this general area will probably be damaged and, if enough material accumulates, so might the head(s). In the old days people wrote their own head "floater" programs - to ensure that the head was moved randomly across the disk surface from time to time. I don''t know if this is still relevant today - since the amount of firmware a disk drive executes, continues to increase every day. But in a typical usage scenario, where a user does, for example, a find operation in a home directory - and the directory caches are not sized large enough, there is a good probability that the heads will end up in the same general area of the disk, after the find op completes. Assuming that the box has enough memory, the disk may not be accessed again for a long time - and possibly only during another find op (wash, rinse, repeat). Continuing: a buildup of heat in a localized cylinder area, will cause the disk platter to expand and shift, relative to the heads. The disk platter has one surface dedicated to storing servo information - and from this the disk can "decide" that it is on the wrong cylinder after a head movement. In which case the drive will recalibrate itself (thermal recalibration) and store a table of offsets for different cylinder ranges. So when the head it told, for example, to move to cylinder 1000, the correction table will tell it to move to where physical cylinder 1000 should be and then add the correction delta (plus or minus) for that cylinder range to figure out where to the actually move the heads to. Now the heads are positioned on the correct cylinder and should be centered on it. 
If the drive gets a bad CRC after reading a cylinder it can use the CRC to correct the data or it can command that the data be re-read, until a correctable read is obtained. Last I heard, the number of retries is of the order of 100 to 200 or more(??). So this will be noticable - since 100 reads will require 100 revolutions of the disk. Retries like this will probably continue to provide correctable data to the user and the disk drive will ignore the fact that there is an area of disk where retries are constantly required. This is what Steve Gibson picked up on for his SpinRite product. If he runs code that can determine that CRC corrections or re-reads are required to retrieve good data, then he "knows" this is a likely area of the disk to fail in the (possibly near) future. So he relocates the data in this area, marks the area "bad", and the drive avoids it. Given what I wrote earlier, that there could be some physical damage in this general area - having the heads avoid it is a Good Thing. So the question is, how relevant is storing multiple copies of data on a disk in terms of the mechanics of modern disk drive failure modes. Without some "SpinRite" like functionality in the code, the drive will continue to access the deteriorating disk cylinders, now a localized failure, and eventually it will deteriorate further and cause enough material to break free to take out the head(s). At which time the drive is toast.> multi-block failures could be addressed by spreading the data from a > large block and using an erasure code.)Regards, Al Hopper Logical Approach Inc, Plano, TX. al at logical-approach.com Voice: 972.379.2133 Fax: 972.379.2134 Timezone: US CDT OpenSolaris.Org Community Advisory Board (CAB) Member - Apr 2005 OpenSolaris Governing Board (OGB) Member - Feb 2006
Darren said:
> Right, that is a very important issue. Would a
> ZFS "scrub" framework do copy on write?
> As you point out, if it doesn't then we still need
> to do something about the old clear text blocks
> because strings(1) over the raw disk will show them.
>
> I see the desire to have a knob that says "make this
> encrypted now" but I personally believe that it is
> actually better if you can make this choice at the
> time you create the ZFS data set.

I'm not sure that that gets rid of the problem at all. If I have an existing filesystem that I want to encrypt, but I need to create a new dataset to do so, I'm going to create my new, encrypted dataset, then copy my data onto it, then (maybe) delete the old one. If both datasets are in the same pool (which is likely), I'll still not be able to securely erase the blocks that have all my cleartext data on them. The only way to do the job properly would be to overwrite the entire pool, which is likely to be pretty inconvenient in most cases.

So, how about some way to securely erase freed blocks? It could be implemented as a one-off operation that acts on an entire pool, e.g.

    zfs shred tank

which would walk the free block list and overwrite with random data some number of times. Or it might be more useful to have it as a per-dataset option:

    zfs set shred=32 tank/secure

which could overwrite blocks with random data as they are freed.

I have no idea how expensive this might be (both in development time and in performance hit), but its use might be a bit wider than just dealing with encryption and/or rekeying. I guess that deletion of a snapshot might get a bit expensive, but maybe there's some way that blocks awaiting shredding could be queued up and dealt with at a lower priority...

Steve.
Take this for what it is: the opinion on someone who knows less about zfs than probably anyone else on this thread ,but... I would like to add my support for this proposal. As I understand it, the reason for using ditto blocks on metadata, is that maintaining their integrity is vital for the health of the filesystem, even if the zpool isn''t mirrored or redundant in any way ie laptops, or people who just don''t or can''t add another drive. One of the great things about zfs, is that it protects not just against mechanical failure, but against silent data corruption. Having this available to laptop owners seems to me to be important to making zfs even more attractive. Granted, if you are running a enterprise based fileserver, this probably isn''t going to be your first choice for data protection. You will probably be using the other features of zfs like mirroring, raidz raidz2 etc. Am I correct in assuming that having say 2 copies of your "documents" filesystem means should silent data corruption occur, your data can be reconstructed. So that you can leave your os and base applications with 1 copy, but your important data can be protected. In a way, this reminds me of intel''s "matrix raid" but much cooler (it doesn''t rely on a specific motherboard for one thing). I would also agree that utilities like ''ls'' and quotas should report both and count against peoples quotas. It just doesn''t seem to hard to me to understand that because you have 2 copies, you halve the amount of available space. Just to reiterate, I think this would be an awesome feature! Celso. PS. Please feel free to correct me on any technical inaccuracies. I am trying to learn about zfs and Solaris 10 in general. This message posted from opensolaris.org
Dick Davies
2006-Sep-12 20:45 UTC
[zfs-discuss] Re: Proposal: multiple copies of user data
On 12/09/06, Celso <celsouk at gmail.com> wrote:> One of the great things about zfs, is that it protects not just against mechanical failure, but against silent data corruption. Having this available to laptop owners seems to me to be important to making zfs even more attractive.I''m not arguing against that. I was just saying that *if* this was useful to you (and you were happy with the dubious resilience/performance benefits) you can already create mirrors/raidz on a single disk by using partitions as building blocks. There''s no need to implement the proposal to gain that.> Am I correct in assuming that having say 2 copies of your "documents" filesystem means should silent data corruption occur, your data can be reconstructed. So that you can leave your os and base applications with 1 copy, but your important data can be protected.Yes. -- Rasputin :: Jack of All Trades - Master of Nuns http://number9.hellooperator.net/
> On 12/09/06, Celso <celsouk at gmail.com> wrote: > > > One of the great things about zfs, is that it > protects not just against mechanical failure, but > against silent data corruption. Having this available > to laptop owners seems to me to be important to > making zfs even more attractive. > > I''m not arguing against that. I was just saying that > *if* this was useful to you > (and you were happy with the dubious > resilience/performance benefits) you can > already create mirrors/raidz on a single disk by > using partitions as > building blocks. > There''s no need to implement the proposal to gain > that. > >It''s not as granular though is it? In the situation you describe: ...you split one disk in two. you then have effectively two partitions which you can then create a new mirrored zpool with. Then everything is mirrored. Correct? With ditto blocks, you can selectively add copies (seeing as how filesystem are so easy to create on zfs). If you are only concerned with copies of your important documents and email, why should /usr/bin be mirrored. That''s my opinion anyway. I always enjoy choice, and I really believe this is a useful and flexible one. Celso This message posted from opensolaris.org
Dick Davies
2006-Sep-12 22:08 UTC
[zfs-discuss] Re: Re: Proposal: multiple copies of user data
On 12/09/06, Celso <celsouk at gmail.com> wrote:

> ...you split one disk in two. You then have effectively two partitions with which you can create a new mirrored zpool. Then everything is mirrored. Correct?

Everything in the filesystems in the pool, yes.

> With ditto blocks, you can selectively add copies (seeing as how filesystems are so easy to create on zfs). If you are only concerned with copies of your important documents and email, why should /usr/bin be mirrored?

So my machine will boot if a disk fails. Which happened the other day :)

-- 
Rasputin :: Jack of All Trades - Master of Nuns
http://number9.hellooperator.net/
Celso
2006-Sep-12 22:24 UTC
[zfs-discuss] Re: Re: Re: Proposal: multiple copies of user data
> On 12/09/06, Celso <celsouk at gmail.com> wrote:
>
> > ...you split one disk in two. You then have effectively two partitions with which you can create a new mirrored zpool. Then everything is mirrored. Correct?
>
> Everything in the filesystems in the pool, yes.
>
> > With ditto blocks, you can selectively add copies (seeing as how filesystems are so easy to create on zfs). If you are only concerned with copies of your important documents and email, why should /usr/bin be mirrored?
>
> So my machine will boot if a disk fails. Which happened the other day :)
>
> --
> Rasputin :: Jack of All Trades - Master of Nuns
> http://number9.hellooperator.net/

OK, cool. I think it has already been said that in many people''s experience, when a disk fails, it completely fails. Especially on laptops. Of course ditto blocks wouldn''t help you in this situation either!

I still think that silent data corruption is a valid concern, one that ditto blocks would solve. Also, I am not thrilled about losing that much space for duplication of unnecessary data (caused by partitioning a disk in two). I also echo Darren''s comments on zfs performing better when it has the whole disk.

Hopefully we can agree that you lose nothing by adding this feature, even if you personally don''t see a need for it.

Celso

This message posted from opensolaris.org
Torrey McMahon
2006-Sep-12 22:37 UTC
[zfs-discuss] Re: Re: Re: Proposal: multiple copies of user data
Celso wrote:

> Hopefully we can agree that you lose nothing by adding this feature, even if you personally don''t see a need for it.

If I read correctly, user tools will show more space in use when adding copies, quotas are impacted, etc. One could argue that the added confusion outweighs the benefit of the feature. As others have asked, I''d like to see the problem that this feature is designed to solve.
Matthew Ahrens wrote:

> Here is a proposal for a new ''copies'' property which would allow different levels of replication for different filesystems.

Thanks everyone for your input.

The problem that this feature attempts to address is when you have some data that is more important (and thus needs a higher level of redundancy) than other data. Of course in some situations you can use multiple pools, but that is antithetical to ZFS''s pooled storage model. (You have to divide up your storage, you''ll end up with stranded storage and bandwidth, etc.)

Given the overwhelming criticism of this feature, I''m going to shelve it for now.

Out of curiosity, what would you guys think about addressing this same problem by having the option to store some filesystems unreplicated on a mirrored (or raid-z) pool? This would have the same issues of unexpected space usage, but since it would be *less* than expected, that might be more acceptable. There are no plans to implement anything like this right now, but I just wanted to get a read on it.

--matt
> Matthew Ahrens wrote: > > Here is a proposal for a new ''copies'' property > which would allow > > different levels of replication for different > filesystems. > > Thanks everyone for your input. > > The problem that this feature attempts to address is > when you have some > data that is more important (and thus needs a higher > level of > redundancy) than other data. Of course in some > situations you can use > multiple pools, but that is antithetical to ZFS''s > pooled storage model. > (You have to divide up your storage, you''ll end up > with stranded > torage and bandwidth, etc.) > > Given the overwhelming criticism of this feature, I''m > going to shelve it > for now.Damn! That''s a real shame! I was really starting to look forward to that. Please reconsider??!> --matt > _______________________________________________ > zfs-discuss mailing list > zfs-discuss at opensolaris.org > http://mail.opensolaris.org/mailman/listinfo/zfs-discu > ss >Celso This message posted from opensolaris.org
David Dyer-Bennet
2006-Sep-12 23:09 UTC
[zfs-discuss] Proposal: multiple copies of user data
On 9/12/06, Matthew Ahrens <Matthew.Ahrens at sun.com> wrote:> Matthew Ahrens wrote: > > Here is a proposal for a new ''copies'' property which would allow > > different levels of replication for different filesystems. > > Thanks everyone for your input. > > The problem that this feature attempts to address is when you have some > data that is more important (and thus needs a higher level of > redundancy) than other data. Of course in some situations you can use > multiple pools, but that is antithetical to ZFS''s pooled storage model. > (You have to divide up your storage, you''ll end up with stranded > storage and bandwidth, etc.) > > Given the overwhelming criticism of this feature, I''m going to shelve it > for now.I think it''s a valid problem. My understanding was that this didn''t give a *guaranteed* solution, though. I think most people, when committing to the point of replication (spending actual money), need a guarantee at some level (not of course of total safety; but that the data actually does exist on separate disks, and will survive the destruction of one disk). A good solution to this problem would be valuable. (And I''d accept a non-guarantee on a single disk; or rather a guarantee that said "if enough blocks to find the data exist, and a copy of each data block exists, we can retrieve the data"; but that guarantee *does* exist I think).> Out of curiosity, what would you guys think about addressing this same > problem by having the option to store some filesystems unreplicated on > an mirrored (or raid-z) pool? This would have the same issues of > unexpected space usage, but since it would be *less* than expected, that > might be more acceptable. There are no plans to implement anything like > this right now, but I just wanted to get a read on it.I was never concerned at the free space issues (though I was concerned by some of the proposed solutions to what I saw as a non-issue). I''d be happy if the free space described how many bytes of default files you could add to the pool, and the user would have to understand that results would differ if they used non-default parameters. You''re probably right that fewer people would mind having *more* space than an unthinking reading would show than less. -- David Dyer-Bennet, <mailto:dd-b at dd-b.net>, <http://www.dd-b.net/dd-b/> RKBA: <http://www.dd-b.net/carry/> Pics: <http://www.dd-b.net/dd-b/SnapshotAlbum/> Dragaera/Steven Brust: <http://dragaera.info/>
Dick Davies
2006-Sep-12 23:13 UTC
[zfs-discuss] Re: Re: Re: Proposal: multiple copies of user data
On 12/09/06, Celso <celsouk at gmail.com> wrote:

> I think it has already been said that in many people''s experience, when a disk fails, it completely fails. Especially on laptops. Of course ditto blocks wouldn''t help you in this situation either!

Exactly.

> I still think that silent data corruption is a valid concern, one that ditto blocks would solve. Also, I am not thrilled about losing that much space for duplication of unnecessary data (caused by partitioning a disk in two).

Well, you''d only be duplicating the data on the mirror. If you don''t want to mirror the base OS, no one''s saying you have to.

For the sake of argument, let''s assume:

1. disk is expensive
2. someone is keeping valuable files on a non-redundant zpool
3. they can''t scrape enough vdevs to make a redundant zpool (remembering you can build vdevs out of *flat files*)

Even then, to my mind: to the user, the *file* (screenplay, movie of a child''s birth, civ3 saved game, etc.) is the logical entity to have a ''duplication level'' attached to it, and the only person who can score that is the author of the file. This proposal says the filesystem creator/admin scores the filesystem. Your argument against unnecessary data duplication applies to all ''non-special'' files in the ''special'' filesystem. They''re wasting space too.

If the user wants to make sure the file is ''safer'' than others, he can just make multiple copies. Either to a USB disk/flashdrive, cdrw, dvd, ftp server, whatever. The redundancy you''re talking about is what you''d get from ''cp /foo/bar.jpg /foo/bar.jpg.ok'', except it''s hidden from the user and causing headaches for anyone trying to comprehend, port or extend the codebase in the future.

> I also echo Darren''s comments on zfs performing better when it has the whole disk.

Me too, but a lot of laptop users dual-boot, which makes it a moot point.

> Hopefully we can agree that you lose nothing by adding this feature, even if you personally don''t see a need for it.

Sorry, I don''t think we''re going to agree on this one :)

I''ve seen dozens of project proposals in the few months I''ve been lurking around opensolaris. Most of them have been of no use to me, but each to their own. I''m afraid I honestly think this greatly complicates the conceptual model (not to mention the technical implementation) of ZFS, and I haven''t seen a convincing use case.

All the best
Dick.

-- 
Rasputin :: Jack of All Trades - Master of Nuns
http://number9.hellooperator.net/
Matthew Ahrens wrote:> Matthew Ahrens wrote: >> Here is a proposal for a new ''copies'' property which would allow >> different levels of replication for different filesystems. > > Thanks everyone for your input. > > The problem that this feature attempts to address is when you have some > data that is more important (and thus needs a higher level of > redundancy) than other data. Of course in some situations you can use > multiple pools, but that is antithetical to ZFS''s pooled storage model. > (You have to divide up your storage, you''ll end up with stranded > storage and bandwidth, etc.) > > Given the overwhelming criticism of this feature, I''m going to shelve it > for now.This is unfortunate. As a laptop user with only a single drive, I was looking forward to it since I''ve been bitten in the past by data loss caused by a bad area on the disk. I don''t care about the space consumption because I generally don''t come anywhere close to filling up the available space. It may not be the primary market for ZFS, but it could be a very useful side benefit.> Out of curiosity, what would you guys think about addressing this same > problem by having the option to store some filesystems unreplicated on > an mirrored (or raid-z) pool? This would have the same issues of > unexpected space usage, but since it would be *less* than expected, that > might be more acceptable. There are no plans to implement anything like > this right now, but I just wanted to get a read on it.I don''t see much need for this in any area that I would use ZFS (either my own personal use or for any case in which I would recommend it for production use). However, if you think that it''s OK to under-report free space, then why not just do that for the "data ditto" blocks. If one or more of my filesystems are configured to keep two copies of the data, then simply report only half of the available space. If duplication isn''t enabled for the entire pool but only for certain filesystems, then perhaps you could even take advantage of quotas for those filesystems to make a more accurate calculation.> > --matt > _______________________________________________ > zfs-discuss mailing list > zfs-discuss at opensolaris.org > http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Matthew Ahrens
2006-Sep-12 23:38 UTC
[zfs-discuss] Re: Re: Re: Proposal: multiple copies of user data
Dick Davies wrote:

> For the sake of argument, let''s assume:
>
> 1. disk is expensive
> 2. someone is keeping valuable files on a non-redundant zpool
> 3. they can''t scrape enough vdevs to make a redundant zpool (remembering you can build vdevs out of *flat files*)

Given those assumptions, I think that the proposed feature is the perfect solution. Simply put those files in a filesystem that has copies>1. Also note that using files to back vdevs is not a recommended solution.

> If the user wants to make sure the file is ''safer'' than others, he can just make multiple copies. Either to a USB disk/flashdrive, cdrw, dvd, ftp server, whatever.

It seems to me that asking the user to solve this problem by manually making copies of all his files puts all the burden on the user/administrator and is a poor solution. For one, they have to remember to do it pretty often. For two, when they do experience some data loss, they have to manually reconstruct the files! They could have one file which has part of it missing from copy A and part of it missing from copy B. I''d hate to have to reconstruct that manually from two different files, but the proposed solution would do this transparently.

> The redundancy you''re talking about is what you''d get from ''cp /foo/bar.jpg /foo/bar.jpg.ok'', except it''s hidden from the user and causing headaches for anyone trying to comprehend, port or extend the codebase in the future.

Whether it''s hard to understand is debatable, but this feature integrates very smoothly with the existing infrastructure and wouldn''t cause any trouble when extending or porting ZFS.

> I''m afraid I honestly think this greatly complicates the conceptual model (not to mention the technical implementation) of ZFS, and I haven''t seen a convincing use case.

Just for the record, these changes are pretty trivial to implement; less than 50 lines of code changed.

--matt
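To make Matt''s suggestion concrete, a minimal sketch of the workflow under the proposal (pool and dataset names are made up for illustration; remember that ''copies'' only affects data written after it is set):

    # create a filesystem for the important data with two copies
    zfs create -o copies=2 tank/precious

    # confirm the setting
    zfs get copies tank/precious

    # copy the valuable files in so they are written with two copies
    cp -r /export/home/me/documents /tank/precious/

Data left in other filesystems is unaffected; only blocks written into the copies=2 filesystem get the extra copy.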
Celso
2006-Sep-12 23:39 UTC
[zfs-discuss] Re: Re: Re: Re: Proposal: multiple copies of user data
> On 12/09/06, Celso <celsouk at gmail.com> wrote: > > > I think it has already been said that in many > peoples experience, when a disk fails, it completely > fails. Especially on laptops. Of course ditto blocks > wouldn''t help you in this situation either! > > Exactly. > > > I still think that silent data corruption is a > valid concern, one that ditto blocks would solve. > > Also, I am not thrilled about losing that much space > for duplication of unneccessary data (caused by > partitioning a disk in two). > > Well, you''d only be duplicating the data on the > mirror. If you don''t want to > mirror the base OS, no one''s saying you have to. >Yikes! that sounds like even more partitioning!> For the sake of argument, let''s assume: > > 1. disk is expensive > 2. someone is keeping valuable files on a > non-redundant zpool > 3. they can''t scrape enough vdevs to make a redundant > zpool > (remembering you can build vdevs out of *flat > files*) > Even then, to my mind: > > to the user, the *file* (screenplay, movie of childs > birth, civ3 saved > game, etc.) > is the logical entity to have a ''duplication level'' > attached to it, > and the only person who can score that is the author > of the file. > > This proposal says the filesystem creator/admin > scores the filesystem. > Your argument against unneccessary data duplication > applies to all ''non-special'' > files in the ''special'' filesystem. They''re wasting > space too. > > If the user wants to make sure the file is ''safer'' > than others, he can > just make > multiple copies. Either to a USB disk/flashdrive, > cdrw, dvd, ftp > server, whatever. > > The redundancy you''re talking about is what you''d get > from ''cp /foo/bar.jpg /foo/bar.jpg.ok'', except it''s > hidden from the > user and causing > headaches for anyone trying to comprehend, port or > extend the codebase in > the future.the proposed solution differs in one important aspect: it automatically detects data corruption.> > I also echo Darren''s comments on zfs performing > better when it has the whole disk. > > Me too, but a lot of laptop users dual-boot, which > makes it a moot point. > > > Hopefully we can agree that you lose nothing by > adding this feature, > > even if you personally don''t see a need for it. > > Sorry, I don''t think we''re going to agree on this one > :)No worries, that''s cool.> All the best > Dick. > > -- > Rasputin :: Jack of All Trades - Master of Nuns > http://number9.hellooperator.net/ > _______________________________________________ > zfs-discuss mailing list > zfs-discuss at opensolaris.org > http://mail.opensolaris.org/mailman/listinfo/zfs-discu > ss >Celso This message posted from opensolaris.org
Chad Lewis
2006-Sep-12 23:51 UTC
[zfs-discuss] Re: Re: Re: Re: Proposal: multiple copies of user data
On Sep 12, 2006, at 4:39 PM, Celso wrote:>> On 12/09/06, Celso <celsouk at gmail.com> wrote: >> >>> I think it has already been said that in many >> peoples experience, when a disk fails, it completely >> fails. Especially on laptops. Of course ditto blocks >> wouldn''t help you in this situation either! >> >> Exactly. >> >>> I still think that silent data corruption is a >> valid concern, one that ditto blocks would solve. > >> Also, I am not thrilled about losing that much space >> for duplication of unneccessary data (caused by >> partitioning a disk in two). >> >> Well, you''d only be duplicating the data on the >> mirror. If you don''t want to >> mirror the base OS, no one''s saying you have to. >> > > Yikes! that sounds like even more partitioning! > >> >> The redundancy you''re talking about is what you''d get >> from ''cp /foo/bar.jpg /foo/bar.jpg.ok'', except it''s >> hidden from the >> user and causing >> headaches for anyone trying to comprehend, port or >> extend the codebase in >> the future. > > the proposed solution differs in one important aspect: it > automatically detects data corruption. > >Detecting data corruption is a function of the ZFS checksumming feature. The proposed solution has _nothing_ to do with detecting corruption. The difference is in what happens when/if such bad data is detected. Without a duplicate copy, via some RAID level or the proposed ditto block copies, the file is corrupted.
Matthew Ahrens wrote:

> Matthew Ahrens wrote:
> > Here is a proposal for a new ''copies'' property which would allow different levels of replication for different filesystems.
>
> Thanks everyone for your input.
>
> The problem that this feature attempts to address is when you have some data that is more important (and thus needs a higher level of redundancy) than other data. Of course in some situations you can use multiple pools, but that is antithetical to ZFS''s pooled storage model. (You have to divide up your storage, you''ll end up with stranded storage and bandwidth, etc.)
>
> Given the overwhelming criticism of this feature, I''m going to shelve it for now.

So it seems to me that having this feature per-file is really useful. Say I have a presentation to give in Pleasanton, and the presentation lives on my single-disk laptop - I want all the meta-data and the actual presentation to be replicated. We already use ditto blocks for the meta-data. Now we could have an extra copy of the actual data. When I get back from the presentation I can turn off the extra copies.

Doing it for the filesystem is just one step higher (and makes it administratively easier as I don''t have to type the same command for each file that''s important). Mirroring is just like another step above that - though it''s possibly replicating stuff you just don''t care about.

Now placing extra copies of the data doesn''t guarantee that data will survive multiple disk failures; but neither does having a mirrored pool guarantee the data will be there either (2 disk failures). Both methods are about increasing your chances of having your valuable data around.

I for one would have loved to have multiple-copy filesystems + ZFS on my powerbook when I was travelling in Australia for a month - think of all the digital pictures you take and how pissed you would be if the one with the wild wombat didn''t survive. It''s maybe not an enterprise solution, but it seems like a consumer solution.

Ensuring that the space accounting tools make sense is definitely a valid point though.

eric

> Out of curiosity, what would you guys think about addressing this same problem by having the option to store some filesystems unreplicated on a mirrored (or raid-z) pool? This would have the same issues of unexpected space usage, but since it would be *less* than expected, that might be more acceptable. There are no plans to implement anything like this right now, but I just wanted to get a read on it.
>
> --matt
Celso
2006-Sep-13 00:01 UTC
[zfs-discuss] Re: Re: Re: Re: Proposal: multiple copies of user data
> It seems to me that asking the user to solve this problem by manually making copies of all his files puts all the burden on the user/administrator and is a poor solution.

I completely agree.

> For one, they have to remember to do it pretty often. For two, when they do experience some data loss, they have to manually reconstruct the files! They could have one file which has part of it missing from copy A and part of it missing from copy B. I''d hate to have to reconstruct that manually from two different files, but the proposed solution would do this transparently.

Again, I agree.

> > The redundancy you''re talking about is what you''d get from ''cp /foo/bar.jpg /foo/bar.jpg.ok'', except it''s hidden from the user and causing headaches for anyone trying to comprehend, port or extend the codebase in the future.
>
> Whether it''s hard to understand is debatable, but this feature integrates very smoothly with the existing infrastructure and wouldn''t cause any trouble when extending or porting ZFS.

OK, given this statement...

> Just for the record, these changes are pretty trivial to implement; less than 50 lines of code changed.

...and this statement, I can''t see any reason not to include it. If the changes are easy to do, don''t require any more of the zfs team''s valuable time, and don''t hinder other things, I would plead with you to include them, as I think they are genuinely valuable and would make zfs not only the best enterprise-level filesystem, but also the best filesystem for laptops/home computers.

celso

This message posted from opensolaris.org
Jeff Victor
2006-Sep-13 00:47 UTC
[zfs-discuss] Re: Re: Re: Re: Proposal: multiple copies of user data
Chad Lewis wrote:> > On Sep 12, 2006, at 4:39 PM, Celso wrote: > >> the proposed solution differs in one important aspect: it automatically >> detects data corruption. > > Detecting data corruption is a function of the ZFS checksumming feature. The > proposed solution has _nothing_ to do with detecting corruption. The difference > is in what happens when/if such bad data is detected. Without a duplicate copy, > via some RAID level or the proposed ditto block copies, the file is corrupted.With a mirrored ZFS pool, what are the odds of losing all copies of the [meta]data, for N disks (where N = 1, 2, etc)? I thought we understood this pretty well, and that the answer was extremely small. -------------------------------------------------------------------------- Jeff VICTOR Sun Microsystems jeff.victor @ sun.com OS Ambassador Sr. Technical Specialist Solaris 10 Zones FAQ: http://www.opensolaris.org/os/community/zones/faq --------------------------------------------------------------------------
eric kustarz wrote:> Matthew Ahrens wrote: > >> Matthew Ahrens wrote: >> >>> Here is a proposal for a new ''copies'' property which would allow >>> different levels of replication for different filesystems. >> >> >> Thanks everyone for your input. >> >> The problem that this feature attempts to address is when you have >> some data that is more important (and thus needs a higher level of >> redundancy) than other data. Of course in some situations you can >> use multiple pools, but that is antithetical to ZFS''s pooled storage >> model. (You have to divide up your storage, you''ll end up with >> stranded storage and bandwidth, etc.) >> >> Given the overwhelming criticism of this feature, I''m going to shelve >> it for now. > > > So it seems to me that having this feature per-file is really useful. > Say i have a presentation to give in Pleasanton, and the presentation > lives on my single-disk laptop - I want all the meta-data and the > actual presentation to be replicated. We already use ditto blocks for > the meta-data. Now we could have an extra copy of the actual data. > When i get back from the presentation i can turn off the extra copies.Under what failure nodes would your data still be accessible? What things can go wrong that still allow you to access the data because some event has removed one copy but left the others?
David Dyer-Bennet
2006-Sep-13 03:38 UTC
[zfs-discuss] Proposal: multiple copies of user data
On 9/12/06, eric kustarz <eric.kustarz at sun.com> wrote:

> So it seems to me that having this feature per-file is really useful. Say I have a presentation to give in Pleasanton, and the presentation lives on my single-disk laptop - I want all the meta-data and the actual presentation to be replicated. We already use ditto blocks for the meta-data. Now we could have an extra copy of the actual data. When I get back from the presentation I can turn off the extra copies.

Yes, you could do that.

*I* would make a copy on a CD, which I would carry in a separate case from the laptop. I think my presentation is a lot safer than your presentation.

Similarly for your digital images example; I don''t consider it safe until I have two or more *independent* copies. Two copies on a single hard drive doesn''t come even close to passing the test for me; as many people have pointed out, those tend to fail all at once. And I will also point out that laptops get stolen a lot. And of course all the accidents involving fumble-fingers, OS bugs, and driver bugs won''t be helped by the data duplication either. (Those will mostly be helped by sensible use of snapshots, though, which is another argument for ZFS on *any* disk you work on a lot.)

The more I look at it, the more I think that a second copy on the same disk doesn''t protect against very much real-world risk. Am I wrong here? Are partial (small) disk corruptions more common than I think? I don''t have a good statistical view of disk failures.

-- 
David Dyer-Bennet, <mailto:dd-b at dd-b.net>, <http://www.dd-b.net/dd-b/>
RKBA: <http://www.dd-b.net/carry/> Pics: <http://www.dd-b.net/dd-b/SnapshotAlbum/>
Dragaera/Steven Brust: <http://dragaera.info/>
David Dyer-Bennet
2006-Sep-13 03:43 UTC
[zfs-discuss] Re: Re: Re: Re: Proposal: multiple copies of user data
On 9/12/06, Celso <celsouk at gmail.com> wrote:

> > Whether it''s hard to understand is debatable, but this feature integrates very smoothly with the existing infrastructure and wouldn''t cause any trouble when extending or porting ZFS.
>
> OK, given this statement...
>
> > Just for the record, these changes are pretty trivial to implement; less than 50 lines of code changed.
>
> ...and this statement, I can''t see any reason not to include it. If the changes are easy to do, don''t require any more of the zfs team''s valuable time, and don''t hinder other things, I would plead with you to include them, as I think they are genuinely valuable and would make zfs not only the best enterprise-level filesystem, but also the best filesystem for laptops/home computers.

While I''m not a big fan of this feature, if the work is that well understood and that small, I have no objection to it. (Boy, that sounds snotty; apologies, not what I intend here. Those of you reading this know how much you care about my opinion; that''s up to you.)

I do pity the people who count on the ZFS redundancy to protect their presentation on an important sales trip -- and then have their laptop stolen. But those people might well be the same ones who would have *no* redundancy otherwise. And nothing about this feature prevents the paranoids like me from still making our backup CD and carrying it separately.

I''m not prepared to go so far as to argue that it''s bad to make them feel safer :-). At least, to make them feel safer *by making them actually safer*.

-- 
David Dyer-Bennet, <mailto:dd-b at dd-b.net>, <http://www.dd-b.net/dd-b/>
RKBA: <http://www.dd-b.net/carry/> Pics: <http://www.dd-b.net/dd-b/SnapshotAlbum/>
Dragaera/Steven Brust: <http://dragaera.info/>
David Dyer-Bennet wrote:

> The more I look at it, the more I think that a second copy on the same disk doesn''t protect against very much real-world risk. Am I wrong here? Are partial (small) disk corruptions more common than I think? I don''t have a good statistical view of disk failures.

I don''t have hard data at hand, but you see entire drives go bad much more often than a single section... and when a section does go bad, it is usually accompanied by a notice of a block re-allocation from the disk drive firmware. Often you''ll see a bunch of those, sometimes over the course of a month and sometimes over the course of a minute, and then the entire drive goes. In some cases a raid array will watch for those messages and automagically swap the drive with a hot spare after X amount of notifications. Al Hopper recently posted some more detailed examples of how this can happen.

However, let''s move to a different example and say you''ve got six drives in a raidZ pool. What failure modes - this time I used the ''m'' instead of the ''n'' - allow your data to survive that can''t already be taken care of with underlying raid configurations within the pool?
Torrey McMahon
2006-Sep-13 03:55 UTC
[zfs-discuss] Re: Re: Re: Re: Proposal: multiple copies of user data
David Dyer-Bennet wrote:

> While I''m not a big fan of this feature, if the work is that well understood and that small, I have no objection to it. (Boy, that sounds snotty; apologies, not what I intend here. Those of you reading this know how much you care about my opinion; that''s up to you.)

One could make the argument that the feature could cause enough confusion to not warrant its inclusion. If I''m a typical user and I write a file to a filesystem where the admin set three copies but didn''t tell me, it might throw me into a tizzy trying to figure out why my quota usage is 3X what I expect it to be.
Matthew Ahrens wrote:

> Matthew Ahrens wrote:
> > Here is a proposal for a new ''copies'' property which would allow different levels of replication for different filesystems.
>
> Thanks everyone for your input.
>
> The problem that this feature attempts to address is when you have some data that is more important (and thus needs a higher level of redundancy) than other data. Of course in some situations you can use multiple pools, but that is antithetical to ZFS''s pooled storage model. (You have to divide up your storage, you''ll end up with stranded storage and bandwidth, etc.)

Can you expand? I can think of some examples where using multiple pools - even on the same host - is quite useful given the current feature set of the product. Or are you only discussing the specific case where a host would want more reliability for a certain set of data than another? If that''s the case, I''m still confused as to what failure cases would still allow you to retrieve your data if there is more than one copy in the fs or pool... but I''ll gladly take some enlightenment. :)
A couple of points:

> One could make the argument that the feature could cause enough confusion to not warrant its inclusion. If I''m a typical user and I write a file to a filesystem where the admin set three copies but didn''t tell me, it might throw me into a tizzy trying to figure out why my quota usage is 3X what I expect it to be.

I don''t think anybody is saying it is going to be the default setup. If someone is not comfortable with a feature, surely they can choose to ignore it. An admin can use actual mirroring, raidz etc, and carry on as before.

There are many potentially confusing features of almost any computer system. Computers are complex things. I admin a couple of schools with a total of about 2000 kids. I really doubt that any of them would have a problem understanding it.

More importantly, is an institution utilizing quotas really the main market for this feature? It seems to me that it is clearly aimed at people in control of their own machines (even though I can see uses for this in pretty much any environment). I doubt anyone capable of installing and running Solaris on their laptop would be confused by this issue.

I don''t think anyone is saying that ditto blocks are a complete, never-lose-data solution. Sure, the whole disk can (and probably will) die on you. If you partitioned the disk, mirrored it, and it died, you would still be in trouble. If the disk doesn''t die, but for whatever reason you get silent data corruption, the checksums pick up the problem, and the ditto blocks allow recovery.

Given a situation where you:

a) have a laptop or home computer which you have important data on, and
b) for whatever reason, you can''t add another disk to utilize mirroring (and you are between backups),

this seems to me to be a very valid solution. Especially since, as has already been said, it takes very little to implement, and doesn''t hinder anything else within zfs. I think that people can benefit from this.

This message posted from opensolaris.org
Nicolas Williams
2006-Sep-13 05:21 UTC
[zfs-discuss] Proposal: multiple copies of user data
On Tue, Sep 12, 2006 at 03:56:00PM -0700, Matthew Ahrens wrote:

> The problem that this feature attempts to address is when you have some data that is more important (and thus needs a higher level of redundancy) than other data. Of course in some situations you can use multiple pools, but that is antithetical to ZFS''s pooled storage model. (You have to divide up your storage, you''ll end up with stranded storage and bandwidth, etc.)

For me this feature is something I would use on a laptop. I''d set copies = 2. The idea is: bad blocks happen, but I don''t want to lose data because of random bad blocks. Dead disks happen also. This feature won''t help with that. To deal with bad disks I''d like laptops to have two disks. Alternatively, I would (but don''t) plug in a USB disk from time to time as a mirror vdev.

> Given the overwhelming criticism of this feature, I''m going to shelve it for now.

I don''t see why. The used/free space UI issues need to be worked out. And you need to give guidance for when to use this (e.g., "ditto blocking is primarily intended for use on laptops" -- if you have mirroring/raid-5/raid-Z then you probably wouldn''t care for ditto blocking).

Beyond that, you do need to introduce a generic filesystem-level "scrub" or re-write option for dealing with changes to fs properties that would otherwise only apply to data/meta-data created/changed after the property change. But this was needed before this case, so I don''t see why this case should have to add that feature. Plus, I can imagine that such scrubbing could be difficult to implement, since it must appear to leave everything untouched, including snapshots and clones, yet actually have COWed everything that existed prior to the scrub.

Nico
--
Celso wrote:

> A couple of points:
>
> > One could make the argument that the feature could cause enough confusion to not warrant its inclusion. If I''m a typical user and I write a file to a filesystem where the admin set three copies but didn''t tell me, it might throw me into a tizzy trying to figure out why my quota usage is 3X what I expect it to be.
>
> I don''t think anybody is saying it is going to be the default setup. If someone is not comfortable with a feature, surely they can choose to ignore it. An admin can use actual mirroring, raidz etc, and carry on as before.
>
> There are many potentially confusing features of almost any computer system. Computers are complex things.
>
> I admin a couple of schools with a total of about 2000 kids. I really doubt that any of them would have a problem understanding it.
>
> More importantly, is an institution utilizing quotas really the main market for this feature? It seems to me that it is clearly aimed at people in control of their own machines (even though I can see uses for this in pretty much any environment). I doubt anyone capable of installing and running Solaris on their laptop would be confused by this issue.

It''s not the smart people I would be worried about. It''s the ones where you would get into endless loops of conversation around "But I only wrote 1MB, how come it says 2MB?" that worry me. Especially when it impacts a lot of user-level tools and could be a surprise if set by a BOFH type. That said, I was worried about that type of effect when the change itself seemed to have low value. However, you and Richard have pointed to at least one example where this would be useful at the file level....

> Given a situation where you:
>
> a) have a laptop or home computer which you have important data on, and
> b) for whatever reason, you can''t add another disk to utilize mirroring (and you are between backups),
>
> this seems to me to be a very valid solution.

... and though I see that as a valid solution to the issue, does it really cover enough ground to warrant inclusion of this feature, given some of the other issues that have been brought up? In the above case I think people would be more concerned with the entire system going down, a drive crashing, etc. than the possibility of a checksum error or data corruption requiring the lookup of a ditto block, if one exists. In that case they would create a copy on an independent system, like a USB disk, some sort of archiving media, like a CD-R, or even place a copy on a remote system, to maintain the data in case of a failure. Hell, I''ve been known to do all three to meet my own paranoia level.

IMHO, it''s more ammo to include the feature but I''m not sure it''s enough. Perhaps Richard''s late-breaking data concerning drive failures will add some more weight?
Dick Davies
2006-Sep-13 06:20 UTC
[zfs-discuss] Re: Re: Re: Proposal: multiple copies of user data
On 13/09/06, Matthew Ahrens <Matthew.Ahrens at sun.com> wrote:

> Dick Davies wrote:
> > For the sake of argument, let''s assume:
> >
> > 1. disk is expensive
> > 2. someone is keeping valuable files on a non-redundant zpool
> > 3. they can''t scrape enough vdevs to make a redundant zpool (remembering you can build vdevs out of *flat files*)
>
> Given those assumptions, I think that the proposed feature is the perfect solution. Simply put those files in a filesystem that has copies>1.

I don''t think we disagree that multiple copies in ZFS are a good idea, I just think the zpool is the right place to do that. To clarify, I was addressing Celso''s laptop scenario here - especially the idea that you can make a single disk redundant without any risks. (For bigger systems I''d just mirror at the zpool and have done.)

> Also note that using files to back vdevs is not a recommended solution.

Understood. But neither is mirroring on a single disk (which is what is effectively being suggested for laptop users using this solution).

> > If the user wants to make sure the file is ''safer'' than others, he can just make multiple copies. Either to a USB disk/flashdrive, cdrw, dvd, ftp server, whatever.
>
> It seems to me that asking the user to solve this problem by manually making copies of all his files puts all the burden on the user/administrator and is a poor solution.

You''ll be backing up your laptop anyway, won''t you?

> For one, they have to remember to do it pretty often. For two, when they do experience some data loss, they have to manually reconstruct the files! They could have one file which has part of it missing from copy A and part of it missing from copy B. I''d hate to have to reconstruct that manually from two different files, but the proposed solution would do this transparently.

Are you likely to lose parts of both files at the same time, though? I''d say you''re more likely to have one crap file and one good one. And you know which file is crap due to checksumming already.

> > I''m afraid I honestly think this greatly complicates the conceptual model (not to mention the technical implementation) of ZFS, and I haven''t seen a convincing use case.
>
> Just for the record, these changes are pretty trivial to implement; less than 50 lines of code changed.

But they raise a lot of administrative issues (how many copies do I really have? Where are they? Have they all been deleted? If I set this property, how many copies do I have now? How much disk will I get back if I delete fileX? How much disk do I bill zone admin foo for this month? How much disk io are ops on this filesystem likely to cause? How do I dtrace this?)

I appreciate the effort and thought that''s gone into it, not to mention the request for feedback. If I''ve not made that clear, I apologize. I''m just worried that it muddies the waters for everybody. The users (me too!) want mirror-level reliability on their laptops. I don''t think this is the right way to get that feature, that''s all.

-- 
Rasputin :: Jack of All Trades - Master of Nuns
http://number9.hellooperator.net/
Torrey McMahon wrote:> Matthew Ahrens wrote: >> The problem that this feature attempts to address is when you have >> some data that is more important (and thus needs a higher level of >> redundancy) than other data. Of course in some situations you can use >> multiple pools, but that is antithetical to ZFS''s pooled storage >> model. (You have to divide up your storage, you''ll end up with >> stranded storage and bandwidth, etc.) > > Can you expand? I can think of some examples where using multiple pools > - even on the same host - is quite useful given the current feature set > of the product. Or are you only discussing the specific case where a > host would want more reliability for a certain set of data then an > other? If that''s the case I''m still confused as to what failure cases > would still allow you to retrieve your data if there are more then one > copy in the fs or pool.....but I''ll gladly take some enlightenment. :)(My apologies for the length of this response, I''ll try to address most of the issues brought up recently...) When I wrote this proposal, I was only seriously thinking about the case where you want different amounts of redundancy for different data. Perhaps because I failed to make this clear, discussion has concentrated on laptop reliability issues. It is true that there would be some benefit to using multiple copies on a single-disk (eg. laptop) pool, but of course it would not protect against the most common failure mode (whole disk failure). One case where this feature would be useful is if you have a pool with no redundancy (ie. no mirroring or raid-z), because most of the data in the pool is not very important. However, the pool may have a bunch of disks in it (say, four). The administrator/user may realize (perhaps later on) that some of their data really *is* important and they would like some protection against losing it if a disk fails. They may not have the option of adding more disks to mirror all of their data (cost or physical space constraints may apply here). Their problem is solved by creating a new filesystem with copies=2 and putting the important data there. Now, if a disk fails, then the data in the copies=2 filesystem will not be lost. Approximately 1/4 of the data in other filesystems will be lost. (There is a small chance that some tiny fraction of the data in the copies=2 filesystem will still be lost if we were forced to put both copies on the disk that failed.) Another plausible use case would be where you have some level of redundancy, say you have a Thumper (X4500) with its 48 disks configured into 9 5-wide single-parity raid-z groups (with 3 spares). If a single disk fails, there will be no data loss. However, if two disks within the same raid-z group fail, data will be lost. In this scenario, imagine that this data loss probability is acceptable for most of the data stored here, but there is some extremely important data for which this is unacceptable. Rather than reconfiguring the entire pool for higher redundancy (say, double-parity raid-z) and less usable storage, you can simply create a filesystem with copies=2 within the raid-z storage pool. Data within that filesystem will not be lost even if any three disks fail. I believe that these use cases, while not being extremely common, do occur. The extremely low amount of engineering effort required to implement the feature (modulo the space accounting issues) seems justified. 
The fact that this feature does not solve all problems (eg, it is not intended to be a replacement for mirroring) is not a downside; not all features need to be used in all situations :-) The real problem with this proposal is the confusion surrounding disk space accounting with copies>1. While the same issues are present when using compression, people are understandably less upset when files take up less space than expected. Given the current lack of interest in this feature, the effort required to address the space accounting issue does not seem justified at this time. --matt
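To make the two scenarios described above concrete, the setup would look roughly like this (device, pool and filesystem names are hypothetical, and ''copies'' is of course the property being proposed):

    # scenario 1: a non-redundant four-disk pool, with one
    # protected filesystem for the important data
    zpool create tank c0t0d0 c0t1d0 c0t2d0 c0t3d0
    zfs create -o copies=2 tank/important

    # scenario 2: raid-z groups as on a Thumper (only two of the
    # nine 5-wide groups shown), again with one copies=2 filesystem
    zpool create bigtank raidz c1t0d0 c1t1d0 c1t2d0 c1t3d0 c1t4d0 \
                         raidz c2t0d0 c2t1d0 c2t2d0 c2t3d0 c2t4d0
    zfs create -o copies=2 bigtank/critical

In the first pool a single disk failure loses roughly a quarter of the unprotected data but (with the tiny exception noted above) none of tank/important; in the second, bigtank/critical survives any three disks failing, as described above.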
[dang, this thread started on the one week this quarter that I don''t have any spare time... please accept this one comment, more later...]

Mike Gerdts wrote:

> On 9/11/06, Matthew Ahrens <Matthew.Ahrens at sun.com> wrote:
> > B. DESCRIPTION
> >
> > A new property will be added, ''copies'', which specifies how many copies of the given filesystem will be stored. Its value must be 1, 2, or 3. Like other properties (eg. checksum, compression), it only affects newly-written data. As such, it is recommended that the ''copies'' property be set at filesystem-creation time (eg. ''zfs create -o copies=2 pool/fs'').
>
> Is there anything in the works to compress (or encrypt) existing data after the fact? For example, a special option to scrub that causes the data to be re-written with the new properties could potentially do this. If so, this feature should subscribe to any generic framework provided by such an effort.
>
> > This feature is similar to using mirroring, but differs in several important ways:
> >
> > * Mirroring offers slightly better redundancy, because one disk from each mirror can fail without data loss.
>
> Is this use of slightly based upon disk failure modes? That is, when disks fail do they tend to get isolated areas of badness compared to complete loss? I would suggest that complete loss should include someone tripping over the power cord to the external array that houses the disk.

The field data I have says that complete disk failures are the exception. I hate to leave this as a teaser, I''ll expand my comments later.

BTW, this feature will be very welcome on my laptop! I can''t wait :-)

 -- richard
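On the quoted question about applying compression (or, under this proposal, copies) to existing data: there is no rewrite-in-place option; these properties only affect newly written blocks. A rough workaround, sketched here with made-up dataset names and not as an official recommendation, is to copy the data into a dataset that already has the desired property:

    zfs create -o compression=on tank/docs2
    zfs snapshot tank/docs@move
    zfs send tank/docs@move | zfs receive tank/docs2/old
    # ...or simply cp/rsync the files across, then retire the old dataset

Either way the data goes back through the normal write path, so it picks up whatever properties are in effect on the destination.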
David Dyer-Bennet wrote:

> On 9/12/06, eric kustarz <eric.kustarz at sun.com> wrote:
>
> > So it seems to me that having this feature per-file is really useful. Say I have a presentation to give in Pleasanton, and the presentation lives on my single-disk laptop - I want all the meta-data and the actual presentation to be replicated. We already use ditto blocks for the meta-data. Now we could have an extra copy of the actual data. When I get back from the presentation I can turn off the extra copies.
>
> Yes, you could do that.
>
> *I* would make a copy on a CD, which I would carry in a separate case from the laptop.

Do you back up the presentation to CD every time you make an edit?

> I think my presentation is a lot safer than your presentation.

I''m sure both of our presentations would be equally safe, as we would know not to have the only copy (or copies) on our person.

> Similarly for your digital images example; I don''t consider it safe until I have two or more *independent* copies. Two copies on a single hard drive doesn''t come even close to passing the test for me; as many people have pointed out, those tend to fail all at once. And I will also point out that laptops get stolen a lot. And of course all the accidents involving fumble-fingers, OS bugs, and driver bugs won''t be helped by the data duplication either. (Those will mostly be helped by sensible use of snapshots, though, which is another argument for ZFS on *any* disk you work on a lot.)

Well, of course you would have a separate, independent copy if it really mattered.

> The more I look at it, the more I think that a second copy on the same disk doesn''t protect against very much real-world risk. Am I wrong here? Are partial (small) disk corruptions more common than I think? I don''t have a good statistical view of disk failures.

Well, let''s see - my friend accompanied me on a trip and saved her photos daily onto her laptop. Near the end of the trip her hard drive started having problems. The hard drive was not dead, as it was bootable and you could access certain data. Upon returning home she was able to retrieve some of her photos, but not all. She would have been much happier having ZFS + "copies". And yes, you could back up to CD/DVD every night, but it''s a pain and people don''t do it (as much as they should).

Side note: it would have cost hundreds of dollars for data recovery to have just the *possibility* of getting the other photos.

eric
Torrey McMahon wrote:> eric kustarz wrote: > >> Matthew Ahrens wrote: >> >>> Matthew Ahrens wrote: >>> >>>> Here is a proposal for a new ''copies'' property which would allow >>>> different levels of replication for different filesystems. >>> >>> >>> >>> Thanks everyone for your input. >>> >>> The problem that this feature attempts to address is when you have >>> some data that is more important (and thus needs a higher level of >>> redundancy) than other data. Of course in some situations you can >>> use multiple pools, but that is antithetical to ZFS''s pooled storage >>> model. (You have to divide up your storage, you''ll end up with >>> stranded storage and bandwidth, etc.) >>> >>> Given the overwhelming criticism of this feature, I''m going to >>> shelve it for now. >> >> >> >> So it seems to me that having this feature per-file is really >> useful. Say i have a presentation to give in Pleasanton, and the >> presentation lives on my single-disk laptop - I want all the >> meta-data and the actual presentation to be replicated. We already >> use ditto blocks for the meta-data. Now we could have an extra copy >> of the actual data. When i get back from the presentation i can turn >> off the extra copies. > > > Under what failure nodes would your data still be accessible? What > things can go wrong that still allow you to access the data because > some event has removed one copy but left the others? >Silent data corruption of one of the copies.
Matthew Ahrens
2006-Sep-13 06:40 UTC
[zfs-discuss] Re: Re: Re: Proposal: multiple copies of user data
Dick Davies wrote:

> On 13/09/06, Matthew Ahrens <Matthew.Ahrens at sun.com> wrote:
> > Dick Davies wrote:
> > > For the sake of argument, let''s assume:
> > >
> > > 1. disk is expensive
> > > 2. someone is keeping valuable files on a non-redundant zpool
> > > 3. they can''t scrape enough vdevs to make a redundant zpool (remembering you can build vdevs out of *flat files*)
> >
> > Given those assumptions, I think that the proposed feature is the perfect solution. Simply put those files in a filesystem that has copies>1.
>
> I don''t think we disagree that multiple copies in ZFS are a good idea, I just think the zpool is the right place to do that.

Sure, if you want *everything* in your pool to be mirrored, there is no real need for this feature (you could argue that setting up the pool would be easier if you didn''t have to slice up the disk, though).

> > Also note that using files to back vdevs is not a recommended solution.
>
> Understood. But neither is mirroring on a single disk (which is what is effectively being suggested for laptop users using this solution).

It could be recommended in some situations. If you want to protect against disk firmware errors, bit flips, or part of the disk getting scrogged, then mirroring on a single disk (whether via a mirror vdev or copies=2) solves your problem. Admittedly, these problems are probably less common than whole-disk failure, which mirroring on a single disk does not address.

> But they raise a lot of administrative issues

Sure, especially if you choose to change the copies property on an existing filesystem. However, if you only set it at filesystem creation time (which is the recommended way), then it''s pretty easy to address your issues:

> how many copies do I really have?

Whatever you set the ''copies'' property to.

> Where are they?

If you have multiple disks, they are almost certainly on different disks. Some tiny fraction of the blocks may have both their copies on the same disk if enough disks were nearly full.

> Have they all been deleted?

No.

> If I set this property, how many copies do I have now?

Whatever you set the ''copies'' property to.

> How much disk will I get back if I delete fileX?

The space you get back is always the space used by the file, as specified by st_blocks, ls -s, or du. Note that this applies even if you use compression, or change the ''copies'' property after creating the filesystem.

> How much disk do I bill zone admin foo for this month?

That would be up to your policy. You could bill them for the space used, or divide by the number of copies.

> How much disk io are ops on this filesystem likely to cause?

copies * mirror_width * raidz_stripe / (raidz_stripe-1). (This applies even if you change the ''copies'' property after creating the filesystem.)

> How do I dtrace this?

The same way you''d dtrace ditto blocks today. Do you have a specific event in mind that you''d like to trace?

> The users (me too!) want mirror-level reliability on their laptops. I don''t think this is the right way to get that feature, that''s all.

I agree; there is no magic. If you want to survive a drive failure in your laptop, you need two drives in there.

--matt
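To put rough numbers on the I/O multiplier Matt gives above (an illustrative reading of the formula, ignoring metadata and assuming no compression):

    copies=1 on a 2-way mirror:              1 * 2   = 2x the logical write size
    copies=2 on a 2-way mirror:              2 * 2   = 4x
    copies=2 on 5-wide single-parity raid-z: 2 * 5/4 = 2.5x
    copies=3 on 5-wide single-parity raid-z: 3 * 5/4 = 3.75x

Reads normally touch only one copy, so the multiplier mainly matters for writes.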
Dick Davies
2006-Sep-13 07:05 UTC
[zfs-discuss] Re: Re: Re: Proposal: multiple copies of user data
On 13/09/06, Matthew Ahrens <Matthew.Ahrens at sun.com> wrote:

> Dick Davies wrote:
> > But they raise a lot of administrative issues
>
> Sure, especially if you choose to change the copies property on an existing filesystem. However, if you only set it at filesystem creation time (which is the recommended way), then it''s pretty easy to address your issues:

You''re right, that would prevent getting into some nasty messes (I see this as closer to encryption than compression in that respect). I still feel we''d be doing the same job in several places. But I''m sure anyone who cares has a pretty good idea of my opinion, so I''ll shut up now :)

Thanks for taking the time to give feedback on the feedback.

-- 
Rasputin :: Jack of All Trades - Master of Nuns
http://number9.hellooperator.net/
Darren J Moffat
2006-Sep-13 09:42 UTC
[zfs-discuss] Proposal: multiple copies of user data
eric kustarz wrote:

> So it seems to me that having this feature per-file is really useful.

Per-file with a POSIX filesystem is often not that useful. That is because many applications don''t update the file in place (since you mentioned a presentation: StarOffice is one I know does this). Instead they write a temporary file on the same file system in the same directory, then do an unlink(2) and rename(2). So what you really need to say is per-directory, which for ZFS you may as well implement as per-dataset.

-- 
Darren J Moffat
On 9/13/06, Richard Elling <Richard.Elling at sun.com> wrote:
> >> * Mirroring offers slightly better redundancy, because one disk from
> >> each mirror can fail without data loss.
> >
> > Is this use of slightly based upon disk failure modes? That is, when
> > disks fail do they tend to get isolated areas of badness compared to
> > complete loss? I would suggest that complete loss should include
> > someone tripping over the power cord to the external array that houses
> > the disk.
>
> The field data I have says that complete disk failures are the exception.
> I hate to leave this as a teaser, I''ll expand my comments later.
>
> BTW, this feature will be very welcome on my laptop! I can''t wait :-)

On servers and stationary desktops, I just don''t care whether it is a
whole disk failure or a few bad blocks. In that case I have the
resources to mirror, RAID5, perform daily backups, etc.

The laptop disk failures that I have seen have typically been limited
to a few bad blocks. As Torrey McMahon mentioned, they tend to start
out with some warning signs followed by a full failure. I would
*really* like to have that window between warning signs and full
failure as my opportunity to back up my data and replace my
non-redundant hard drive with no data loss.

The only part of the proposal I don''t like is space accounting. Double
or triple charging for data will only confuse those apps and users that
check for free space or block usage. If this is worked out, it would be
a great feature for those times when mirroring just isn''t an option.

Mike

--
Mike Gerdts
http://mgerdts.blogspot.com/
On 9/13/06, Mike Gerdts <mgerdts at gmail.com> wrote:
> The only part of the proposal I don''t like is space accounting.
> Double or triple charging for data will only confuse those apps and
> users that check for free space or block usage.

Why exactly isn''t reporting the free space divided by the "copies"
value on that particular file system an easy solution for this? Did I
miss something?

Tobias
On Tue, 12 Sep 2006, Matthew Ahrens wrote:
> Torrey McMahon wrote:
> > Matthew Ahrens wrote:
> >> The problem that this feature attempts to address is when you have
> >> some data that is more important (and thus needs a higher level of
> >> redundancy) than other data. Of course in some situations you can use
> >> multiple pools, but that is antithetical to ZFS''s pooled storage
> >> model. (You have to divide up your storage, you''ll end up with
> >> stranded storage and bandwidth, etc.)
> >
> > Can you expand? I can think of some examples where using multiple pools
> > - even on the same host - is quite useful given the current feature set
> > of the product. Or are you only discussing the specific case where a
> > host would want more reliability for a certain set of data then an
> > other? If that''s the case I''m still confused as to what failure cases
> > would still allow you to retrieve your data if there are more then one
> > copy in the fs or pool.....but I''ll gladly take some enlightenment. :)
>
> (My apologies for the length of this response, I''ll try to address most
> of the issues brought up recently...)
>
> When I wrote this proposal, I was only seriously thinking about the case
> where you want different amounts of redundancy for different data.
> Perhaps because I failed to make this clear, discussion has concentrated
> on laptop reliability issues. It is true that there would be some
> benefit to using multiple copies on a single-disk (eg. laptop) pool, but
> of course it would not protect against the most common failure mode
> (whole disk failure).

.... lots of Good Stuff elided ....

Soon Samsung will release a 100% flash memory based drive (32Gb) in a
laptop form factor. But flash memory chips have a limited number of
write cycles available, and when exceeded, this usually results in data
corruption. Some people have already encountered this issue with USB
thumb drives. It''s especially annoying if you were using the thumb
drive as what you thought was a 100% _reliable_ backup mechanism.

This is a perfect application for ZFS copies=2. Also, consider that
there is no time penalty for positioning the "heads" on a flash drive.
So now you would have 2 options in a laptop type application with a
single flash based drive:

a) create a mirrored pool using 2 slices - expensive in terms of
   storage utilization

b) create a pool with no redundancy; create a filesystem called
   "importantPresentationData" within that pool with copies=2 (or more).

Matthew - "build it and they will come"!

Regards,

Al Hopper  Logical Approach Inc, Plano, TX.  al at logical-approach.com
           Voice: 972.379.2133 Fax: 972.379.2134  Timezone: US CDT
OpenSolaris.Org Community Advisory Board (CAB) Member - Apr 2005
             OpenSolaris Governing Board (OGB) Member - Feb 2006
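For concreteness, the two options above might look roughly like this. The device and pool names are invented, and option (b) uses the ''copies'' property exactly as proposed in this thread:

    # (a) mirror two slices of the single flash device
    $ zpool create flashpool mirror c1d0s3 c1d0s4

    # (b) one non-redundant pool, with extra copies only where they matter
    $ zpool create flashpool c1d0s2
    $ zfs create flashpool/importantPresentationData
    $ zfs set copies=2 flashpool/importantPresentationData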
Darren J Moffat wrote:
> eric kustarz wrote:
>
>> So it seems to me that having this feature per-file is really useful.
>
> Per-file with a POSIX filesystem is often not that useful. That is
> because many applications (since you mentioned a presentation
> StarOffice I know does this) don''t update the file in place. Instead
> they write a temporary file on the same file system in the same
> directory then do an unlink(2) and rename(2).

That''s too bad, but I guess it''s the best StarOffice can do.

> So that means what you really need to say is per directory, which for
> ZFS you may as well implement as per data set.

I want per pool, per dataset, and per file - where all are done by the
filesystem (ZFS), not the application. I was talking about a further
enhancement to "copies" than what Matt is currently proposing - per
file "copies", but its more work (one thing being we don''t have
administrative control over files per se).

eric
Bill Sommerfeld
2006-Sep-13 15:42 UTC
[zfs-discuss] Proposal: multiple copies of user data
On Wed, 2006-09-13 at 02:30, Richard Elling wrote:
> The field data I have says that complete disk failures are the exception.
> I hate to leave this as a teaser, I''ll expand my comments later.

That matches my anecdotal experience with laptop drives; maybe I''m just
lucky, or maybe I''m just paying more attention than most to the sounds
they start to make when they''re having a bad hair day, but so far
they''ve always given *me* significant advance warning of impending
doom, generally by failing to read a bunch of disk sectors.

That said, I think the best use case for the copies > 1 config would be
in systems with exactly two disks -- which covers most of the 1U boxes
out there.

One question for Matt: when ditto blocks are used with raidz1, how well
does this handle the case where you encounter one or more single-sector
read errors on other drive(s) while reconstructing a failed drive?

For a concrete example:

    A0 B0 C0 D0 P0
    A1 B1 C1 D1 P1

(A0==A1, B0==B1, ...; A^B^C^D==P)

Does the current implementation of raidz + ditto blocks cope with the
case where all of "A", C0, and D1 are unavailable?

					- Bill
eric kustarz wrote:
>
> I want per pool, per dataset, and per file - where all are done by the
> filesystem (ZFS), not the application. I was talking about a further
> enhancement to "copies" than what Matt is currently proposing - per
> file "copies", but its more work (one thing being we don''t have
> administrative control over files per se).

Now if you could do that and make it something that can be set at
install time it would get a lot more interesting. When you install
Solaris to that single laptop drive you can select files or even
directories that have more than one copy in case of a problem down the
road.
Gregory Shaw
2006-Sep-13 17:05 UTC
[zfs-discuss] Re: Re: Proposal: multiple copies of user data
On Sep 12, 2006, at 2:55 PM, Celso wrote:
>> On 12/09/06, Celso <celsouk at gmail.com> wrote:
>>
>>> One of the great things about zfs, is that it protects not just
>>> against mechanical failure, but against silent data corruption.
>>> Having this available to laptop owners seems to me to be important
>>> to making zfs even more attractive.
>>
>> I''m not arguing against that. I was just saying that *if* this was
>> useful to you (and you were happy with the dubious
>> resilience/performance benefits) you can already create mirrors/raidz
>> on a single disk by using partitions as building blocks.
>> There''s no need to implement the proposal to gain that.
>
> It''s not as granular though is it?
>
> In the situation you describe:
>
> ...you split one disk in two. you then have effectively two
> partitions which you can then create a new mirrored zpool with.
> Then everything is mirrored. Correct?
>
> With ditto blocks, you can selectively add copies (seeing as how
> filesystem are so easy to create on zfs). If you are only concerned
> with copies of your important documents and email, why should /usr/
> bin be mirrored.
>
> That''s my opinion anyway. I always enjoy choice, and I really
> believe this is a useful and flexible one.
>
> Celso

One item missed in the discussion is the idea that individual ZFS
filesystems can be created in a pool that will have the duplicate block
behavior. The idea being that only a small subset of your data may be
critical.

This allows additional flexibility in a single disk configuration.
Rather than sacrificing 1/2 of the pool storage, I can say that my
critical documents will reside in a filesystem that will keep two
copies on disk.

I think it''s a great idea. It may not be for everybody, but I think
the ability to treat some of my files as critical is an excellent
feature.

-----
Gregory Shaw, IT Architect
Phone: (303) 673-8273        Fax: (303) 673-2773
ITCTO Group, Sun Microsystems Inc.
500 Eldorado Blvd, UBRM02-401    greg.shaw at sun.com (work)
Broomfield, CO 80021             shaw at fmsoft.com (home)
"When Microsoft writes an application for Linux, I''ve Won." - Linus Torvalds
Torrey McMahon wrote:
> eric kustarz wrote:
>>
>> I want per pool, per dataset, and per file - where all are done by the
>> filesystem (ZFS), not the application. I was talking about a further
>> enhancement to "copies" than what Matt is currently proposing - per
>> file "copies", but its more work (one thing being we don''t have
>> administrative control over files per se).
>
> Now if you could do that and make it something that can be set at
> install time it would get a lot more interesting. When you install
> Solaris to that single laptop drive you can select files or even
> directories that have more than one copy in case of a problem down the
> road.

Actually, this is a perfect use case for setting the copies=2
property after installation. The original binaries are
quite replaceable; the customizations and personal files
created later on are not.

- Bart

--
Bart Smaalders			Solaris Kernel Performance
barts at cyber.eng.sun.com		http://blogs.sun.com/barts
Bart Smaalders wrote:
> Torrey McMahon wrote:
>> eric kustarz wrote:
>>>
>>> I want per pool, per dataset, and per file - where all are done by
>>> the filesystem (ZFS), not the application. I was talking about a
>>> further enhancement to "copies" than what Matt is currently
>>> proposing - per file "copies", but its more work (one thing being we
>>> don''t have administrative control over files per se).
>>
>> Now if you could do that and make it something that can be set at
>> install time it would get a lot more interesting. When you install
>> Solaris to that single laptop drive you can select files or even
>> directories that have more than one copy in case of a problem down
>> the road.
>
> Actually, this is a perfect use case for setting the copies=2
> property after installation. The original binaries are
> quite replaceable; the customizations and personal files
> created later on are not.

We''ve been talking about user data but the chance of corrupting
something on disk and then detecting a bad checksum on something in
/kernel is also possible. (Disk drives do weird things from time to
time.) If I was sufficiently paranoid I would want everything required
to get into single-user mode, some other stuff, and then my user data,
duplicated to avoid any issues.
Bill Sommerfeld wrote:
> One question for Matt: when ditto blocks are used with raidz1, how well
> does this handle the case where you encounter one or more single-sector
> read errors on other drive(s) while reconstructing a failed drive?
>
> For a concrete example:
>
>     A0 B0 C0 D0 P0
>     A1 B1 C1 D1 P1
>
> (A0==A1, B0==B1, ...; A^B^C^D==P)
>
> Does the current implementation of raidz + ditto blocks cope with the
> case where all of "A", C0, and D1 are unavailable?
>
> 					- Bill

I''ll answer for Matt. Yes, ''A'' will be reconstructed from B[0 or 1],
C1, D0, and P[0 or 1].

-Mark
Anton B. Rang
2006-Sep-13 20:37 UTC
[zfs-discuss] Re: Proposal: multiple copies of user data
Is this true for single-sector, vs. single-ZFS-block, errors? (Yes, it''s pathological and probably nobody really cares.) I didn''t see anything in the code which falls back on single-sector reads. (It''s slightly annoying that the interface to the block device drivers loses the SCSI error status, which tells you the first sector which was bad.) This message posted from opensolaris.org
Wee Yeh Tan
2006-Sep-14 01:37 UTC
[zfs-discuss] Re: Re: Re: Proposal: multiple copies of user data
On 9/13/06, Matthew Ahrens <Matthew.Ahrens at sun.com> wrote:
> Sure, if you want *everything* in your pool to be mirrored, there is no
> real need for this feature (you could argue that setting up the pool
> would be easier if you didn''t have to slice up the disk though).

Not necessarily. Implementing this on the FS level will still allow
the administrator to turn on copies on the entire pool, since the pool
is technically also a FS and the property is inherited by child FS''s.
Of course, this will allow the admin to turn off copies to the FS
containing junk.

> It could be recommended in some situations. If you want to protect
> against disk firmware errors, bit flips, part of the disk getting
> scrogged, then mirroring on a single disk (whether via a mirror vdev or
> copies=2) solves your problem. Admittedly, these problems are probably
> less common than whole-disk failure, which mirroring on a single disk
> does not address.

I beg to differ from experience that the above errors are more common
than whole disk failures. It''s just that we do not notice the disks
are developing problems but panic when they finally fail completely.
That''s what happens to most of my disks anyway.

Disks are much smarter nowadays with hiding bad sectors but it doesn''t
mean that there are none. If your precious data happens to sit on one,
you''ll be crying for copies.

--
Just me,
Wire ...
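A sketch of the inheritance being described here, again using the proposed property and invented pool and dataset names:

    # set the policy on the top-level dataset of the pool...
    $ zfs set copies=2 tank

    # ...every child filesystem inherits it unless it is overridden
    $ zfs set copies=1 tank/junk
    $ zfs get -r copies tank    # tank and most children report 2, tank/junk reports 1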
Matthew Ahrens wrote:
> Out of curiosity, what would you guys think about addressing this same
> problem by having the option to store some filesystems unreplicated on
> an mirrored (or raid-z) pool? This would have the same issues of
> unexpected space usage, but since it would be *less* than expected, that
> might be more acceptable. There are no plans to implement anything like
> this right now, but I just wanted to get a read on it.

+1, especially in a two disk (mirrored) configuration.

Currently I use two ZFS pools: one mirrored and the other unmirrored,
spread over two disks (each disk partitioned with SVM). And I''m
constantly fighting the fill-up of one pool while the other is empty.
My current setup has the same space-balance problem as a traditional
two *static* partition setup.

--
Jesus Cea Avion                         jcea at argo.es
http://www.argo.es/~jcea/               jabber / xmpp:jcea at jabber.org
Neil A. Wilson wrote:
> This is unfortunate. As a laptop user with only a single drive, I was
> looking forward to it since I''ve been bitten in the past by data loss
> caused by a bad area on the disk. I don''t care about the space
> consumption because I generally don''t come anywhere close to filling up
> the available space. It may not be the primary market for ZFS, but it
> could be a very useful side benefit.

I feel your pain.

Although your harddrive will suffer by the extra seeks, I would suggest
you partition your HD in two slices and build a two-way ZFS mirror
between them.

If space is an issue, you can use N partitions to build a raid-z, but
your performance will suffer a lot because any data read would require
N seeks.

--
Jesus Cea Avion                         jcea at argo.es
http://www.argo.es/~jcea/               jabber / xmpp:jcea at jabber.org
can you guess?
2006-Sep-15 08:23 UTC
[zfs-discuss] Re: Re: Re: Re: Proposal: multiple copies of user data
> On 9/13/06, Matthew Ahrens <Matthew.Ahrens at sun.com> > wrote: > > Sure, if you want *everything* in your pool to be > mirrored, there is no > > real need for this feature (you could argue that > setting up the pool > > would be easier if you didn''t have to slice up the > disk though). > > Not necessarily. Implementing this on the FS level > will still allow > the administrator to turn on copies on the entire > pool if since the > pool is technically also a FS and the property is > inherited by child > FS''s. Of course, this will allow the admin to turn > off copies to the > FS containing junk.Implementing it at the directory and file levels would be even more flexible: redundancy strategy would no longer be tightly tied to path location, but directories and files could themselves still inherit defaults from the filesystem and pool when appropriate (but could be individually handled when desirable). I''ve never understood why redundancy was a pool characteristic in ZFS - and the addition of ''ditto blocks'' and now this new proposal (both of which introduce completely new forms of redundancy to compensate for the fact that pool-level redundancy doesn''t satisfy some needs) just makes me more skeptical about it. (Not that I intend in any way to minimize the effort it might take to change that decision now.)> > > It could be recommended in some situations. If you > want to protect > > against disk firmware errors, bit flips, part of > the disk getting > > scrogged, then mirroring on a single disk (whether > via a mirror vdev or > > copies=2) solves your problem. Admittedly, these > problems are probably > > less common that whole-disk failure, which > mirroring on a single disk > > does not address. > > I beg to differ from experience that the above errors > are more common > than whole disk failures. It''s just that we do not > notice the disks > are developing problems but panic when they finally > fail completely.It would be interesting to know whether that would still be your experience in environments that regularly scrub active data as ZFS does (assuming that said experience was accumulated in environments that don''t). The theory behind scrubbing is that all data areas will be hit often enough that they won''t have time to deteriorate (gradually) to the point where they can''t be read at all, and early deterioration encountered during the scrub pass (or other access) in which they have only begun to become difficult to read will result in immediate revectoring (by the disk or, if not, by the file system) to healthier locations. Since ZFS-style scrubbing detects even otherwise-indetectible ''silent corruption'' missed by the disk''s own ECC mechanisms, that lower-probability event is also covered (though my impression is that the probability of even a single such sector may be significantly lower than that of whole-disk failure, especially in laptop environments). All that being said, keeping multiple copies on a single disk of most metadata (the loss of which could lead to wide-spread data loss) definitely makes sense (especially given its typically negligible size), and it probably makes sense for some files as well. - bill This message posted from opensolaris.org
Bill Moore
2006-Sep-15 16:25 UTC
[zfs-discuss] Re: Re: Re: Re: Proposal: multiple copies of user data
On Fri, Sep 15, 2006 at 01:23:31AM -0700, can you guess? wrote:> Implementing it at the directory and file levels would be even more > flexible: redundancy strategy would no longer be tightly tied to path > location, but directories and files could themselves still inherit > defaults from the filesystem and pool when appropriate (but could be > individually handled when desirable).The problem boils down to not having a way to express your intent that works over NFS (where you''re basically limited by POSIX) that you can use from any platform (esp. ones where ZFS isn''t installed). If you have some ideas, this is something we''d love to hear about.> I''ve never understood why redundancy was a pool characteristic in ZFS > - and the addition of ''ditto blocks'' and now this new proposal (both > of which introduce completely new forms of redundancy to compensate > for the fact that pool-level redundancy doesn''t satisfy some needs) > just makes me more skeptical about it.We have thought long and hard about this problem and even know how to implement it (the name we''ve been using is Metaslab Grids, which isn''t terribly descriptive, or as Matt put it "a bag o'' disks"). There are two main problems with it, though. One is failures. The problem is that you want the set of disks implementing redundancy (mirror, RAID-Z, etc.) to be spread across fault domains (controller, cable, fans, power supplies, geographic sites) as much as possible. There is no generic mechanism to obtain this information and act upon it. We could ask the administrator to supply it somehow, but such a description takes effort, is not easy, and prone to error. That''s why we have the model right now where the administrator specifies how they want the disks spread out across fault groups (vdevs). The second problem comes back to accounting. If you can specify, on a per-file or per-directory basis, what kind of replication you want, how do you answer the statvfs() question? I think the recent "discussions" on this list illustrate the complexity and passion on both sides of the argument.> (Not that I intend in any way to minimize the effort it might take to > change that decision now.)The effort is not actually that great. All the hard problems we needed to solve in order to implement this were basically solved when we did the RAID-Z code. As a matter of fact, you can see it in the on-disk specification as well. In the DVA, you''ll notice an 8-bit field labeled "GRID". These are the bits that would describe, on a per-block basis, what kind of redundancy we used. --Bill
(I looked at my email before checking here, so I''ll just cut-and-paste the email response in here rather than send it. By the way, is there a way to view just the responses that have accumulated in this forum since I last visited - or just those I''ve never looked at before?) Bill Moore wrote:> On Fri, Sep 15, 2006 at 01:23:31AM -0700, can you guess? wrote: >> Implementing it at the directory and file levels would be even more >> flexible: redundancy strategy would no longer be tightly tied to path >> location, but directories and files could themselves still inherit >> defaults from the filesystem and pool when appropriate (but could be >> individually handled when desirable). > > The problem boils down to not having a way to express your intent that > works over NFS (where you''re basically limited by POSIX) that you can > use from any platform (esp. ones where ZFS isn''t installed). If you > have some ideas, this is something we''d love to hear about.Well, one idea is that it seems downright silly to gate ZFS facilities on the basis of two-decade-old network file access technology: sure, it''s important to be able to *access* ZFS files using NFS, but does anyone really care if NFS can''t express the full range of ZFS features - at least to the degree that they think such features should be suppressed as a result (rather than made available to local users plus any remote users employing a possibly future mechanism that *can* support them)? That being said, you could always adopt the ReiserFS approach of allowing access to file/directory metadata via extended path specifications in environments like NFS where richer forms of interaction aren''t available: yes, it may feel a bit kludgey, but it gets the job done. And, of course, even if you did nothing to help NFS its users would still benefit from inheriting whatever arbitrarily fine-grained redundancy levels had been established via more comprehensive means: they just wouldn''t be able to tweak redundancy levels themselves (any more, or any less, than they can do so today).> >> I''ve never understood why redundancy was a pool characteristic in ZFS >> - and the addition of ''ditto blocks'' and now this new proposal (both >> of which introduce completely new forms of redundancy to compensate >> for the fact that pool-level redundancy doesn''t satisfy some needs) >> just makes me more skeptical about it. > > We have thought long and hard about this problem and even know how to > implement it (the name we''ve been using is Metaslab Grids, which isn''t > terribly descriptive, or as Matt put it "a bag o'' disks").Yes, ''a bag o'' disks'' - used intelligently at a higher level - is pretty much what I had in mind. There are> two main problems with it, though. One is failures. The problem is > that you want the set of disks implementing redundancy (mirror, RAID-Z, > etc.) to be spread across fault domains (controller, cable, fans, power > supplies, geographic sites) as much as possible. There is no generic > mechanism to obtain this information and act upon it. We could ask the > administrator to supply it somehow, but such a description takes effort, > is not easy, and prone to error. That''s why we have the model right now > where the administrator specifies how they want the disks spread out > across fault groups (vdevs).Without having looked at the code I may be missing something here. 
Even with your current implementation, if there''s indeed no automated way to obtain such information the administrator has to exercise manual control over disk groupings if they''re going to attain higher availability by avoiding other single points of failure instead of just guard against unrecoverable data loss from disk failure. Once that information has been made available to the system, letting it make use of it at a higher level rather than just aggregating entire physical disks should not entail additional administrator effort. I admit that I haven''t considered the problem in great detail, since my bias is toward solutions that employ redundant arrays of inexpensive nodes to scale up rather than a small number of very large nodes (in part because a single large node itself can often be a single point of failure even if many of its subsystems carefully avoid being so in the manner that you suggest). Each such small node has a relatively low disk count and little or no internal redundancy, and thus comprises its own little fault-containment environment, avoiding most such issues; as a plus, such node sizes mesh well with the bandwidth available from very inexpensive Gigabit Ethernet interconnects and switches (even when streaming data sequentially, such as video on demand) and allow fine-grained incremental system scaling (by the time faster interconnects become inexpensive, disk bandwidth should have increased enough that such a balance will still be fairly good). Still, if you can group whole disks intelligently in a large system with respect to supplementing simple redundancy with higher overall subsystem availability, then you ought to be able to use exactly the same information to allow higher-level decisions about where to place redundant data at other than whole-disk granularity.> > The second problem comes back to accounting. If you can specify, on a > per-file or per-directory basis, what kind of replication you want, how > do you answer the statvfs() question? I think the recent "discussions" > on this list illustrate the complexity and passion on both sides of the > argument.I rather liked the idea of using the filesystem *default* redundancy level as the basis for providing free space information, though in environments where different users were set up with different defaults using the per-user default might make sense (then, only if that was manually changed, presumably by that user, would less obvious things happen). Overall, I think perhaps free space should be reported on the basis of things that the user does *not* have control over, such as the default flavor of redundancy established by an administrator (i.e., as the number of bytes the user could write using that default flavor - which is what I was starting to converge on just above). Then the user will mostly see only discrepancies caused by changes in that default that s/he has made, and should be able to understand them (well, if the user has personal ''temp'' space the admin might have special-cased that for them by making it non-redundant, I suppose). Then again, whenever one traverses a mount point today (not always all that obvious a transition) the whole world of free space (and I''d expect quota) changes anyway, and users don''t seem to find that an insurmountable obstacle. 
So I find it difficult to see free-space reporting as being any real show-stopper in this area regardless of how it''s done (though like most people who contributed to that topic I think I have a preference).> >> (Not that I intend in any way to minimize the effort it might take to >> change that decision now.) > > The effort is not actually that great. All the hard problems we needed > to solve in order to implement this were basically solved when we did > the RAID-Z code. As a matter of fact, you can see it in the on-disk > specification as well. In the DVA, you''ll notice an 8-bit field labeled > "GRID". These are the bits that would describe, on a per-block basis, > what kind of redundancy we used.The only reason I can think of for establishing that per block (rather than per object) would be if you kept per-block access-rate information around so that you could distribute really hot blocks more widely. And given that such blocks would normally be in cache anyway, that only seems to make sense in a distributed environment (where you''re trying to spread the load over multiple nodes more because of interconnect bandwidth limitations than disk bandwidth limitations - though even here you could do this at the cache level rather than the on-disk level based on dynamic needs). - bill This message posted from opensolaris.org
On September 15, 2006 3:49:14 PM -0700 "can you guess?" <billtodd at metrocast.net> wrote:> (I looked at my email before checking here, so I''ll just cut-and-paste > the email response in here rather than send it. By the way, is there a > way to view just the responses that have accumulated in this forum since > I last visited - or just those I''ve never looked at before?)subscribe via email instead of reading it as a forum
>By the way, is there a way to view just the responses that have accumulated in this forum since I >last visited - or just those I''ve never looked at before?Not through the web interface itself, as far as I can tell, but there''s an RSS feed of messages that might do the trick. Unfortunately it points to the whole thread, rather than the individual messages. http://opensolaris.org/jive/rss/rssmessages.jspa?forumID=80> it seems downright silly to gate ZFS facilities > on the basis of two-decade-old network file access technology: sure, > it''s important to be able to *access* ZFS files using NFS, but does > anyone really care if NFS can''t express the full range of ZFS features -Personally, I don''t think it''s critical. After all, you can''t create a snapshot via NFS either, but we have snapshots. Concepts such as administrative ID, inherited directory characteristics, etc. have had great success in file systems such as IBM''s GPFS and Sun''s QFS, as well as on NetApp''s systems. For that matter, quotas aren''t really in NFSv3, but nobody seems to mind that UFS implements them. Anton This message posted from opensolaris.org
Wee Yeh Tan
2006-Sep-16 07:03 UTC
[zfs-discuss] Re: Re: Re: Re: Proposal: multiple copies of user data
On 9/15/06, can you guess? <billtodd at metrocast.net> wrote:
> Implementing it at the directory and file levels would be even more
> flexible: redundancy strategy would no longer be tightly tied to path
> location, but directories and files could themselves still inherit
> defaults from the filesystem and pool when appropriate (but could be
> individually handled when desirable).

Ideally so. FS (or dataset) level is sufficiently fine grained for my
use. If I take the trouble to specify copies for a directory, I really
do not mind the trouble of creating a new dataset for it at the same
time. File-level, however, is really pushing it. You might end up with
an administrative nightmare deciphering which files have how many
copies. I just do not see it being useful to my environment.

> It would be interesting to know whether that would still be your
> experience in environments that regularly scrub active data as ZFS
> does (assuming that said experience was accumulated in environments
> that don''t). The theory behind scrubbing is that all data areas will
> be hit often enough that they won''t have time to deteriorate
> (gradually) to the point where they can''t be read at all, and early
> deterioration encountered during the scrub pass (or other access) in
> which they have only begun to become difficult to read will result in
> immediate revectoring (by the disk or, if not, by the file system) to
> healthier locations.

Scrubbing exercises the disk area to prevent bit-rot. I do not think
ZFS''s scrubbing changes the failure mode of the raw devices. OTOH, I
really have no such experience to speak of *fingers crossed*. I failed
to locate the code where the relocation of files happens but assume
that copies would make this process more reliable.

> Since ZFS-style scrubbing detects even otherwise-indetectible ''silent
> corruption'' missed by the disk''s own ECC mechanisms, that
> lower-probability event is also covered (though my impression is that
> the probability of even a single such sector may be significantly
> lower than that of whole-disk failure, especially in laptop
> environments).

I do not have any data to support nor dismiss that. Matt was right
that probability of failure modes is a huge can of worms that can drag
on forever.

--
Just me,
Wire ...
> On 9/15/06, can you guess? <billtodd at metrocast.net> > wrote:... file-level, however, is really pushing> it. You might end > up with an administrative nightmare deciphering which > files have how > many copies.\I''m not sure what you mean: the level of redundancy would be a per-file attribute that could be examined, and would be normally just be defaulted to a common value. ...> > It would be interesting to know whether that would > still be your experience in environments that > regularly scrub active data as ZFS does (assuming > that said experience was accumulated in environments > that don''t). The theory behind scrubbing is that all > data areas will be hit often enough that they won''t > have time to deteriorate (gradually) to the point > where they can''t be read at all, and early > deterioration encountered during the scrub pass (or > other access) in which they have only begun to become > difficult to read will result in immediate > revectoring (by the disk or, if not, by the file > system) to healthier locations. > > Scrubbing exercises the disk area to prevent bit-rot. > I do not think > FS''s scrubbing changes the failure mode of the raw > devices.It doesn''t change the failure rate (if anything, it might accelerate it marginally due to the extra disk activity), but it *does* change, potentially radically, the frequency with which sectors containing user data become unreadable - because it allows them to be detected *before* that happens such that the data can be moved to a good sector (often by the disk itself, else by higher-level software) and the failing sector marked bad. OTOH, I> really have no such experience to speak of *fingers > crossed*. I > failed to locate the code where the relocation of > files happens but > assume that copies would make this process more > reliable.Sort of: while they don''t make any difference when you catch a failing sector while it''s still readable, they certainly help if you only catch it after it''s become unreadable (or has been ''silently'' corrupted).> > > Since ZFS-style scrubbing detects even > otherwise-indetectible ''silent corruption'' missed by > the disk''s own ECC mechanisms, that lower-probability > event is also covered (though my impression is that > the probability of even a single such sector may be > significantly lower than that of whole-disk failure, > especially in laptop environments). > > I do not any data to support nor dismiss that.Quite a few years ago Seagate still published such data, but of course I didn''t copy it down (because it was ''always available'' when I wanted it - as I said, it was quite a while ago and I was not nearly as well-acquainted with the volatility of Internet data as I would subsequently become). But to the best of my recollection their enterprise disks at that time were specced to have no worse than 1 uncorrectable error for every petabit read and no worse than 1 undetected error for every exabit read. A fairly recent paper by people who still have access to such data suggests that the frequency of uncorrectable errors in enterprise drives is still about the same, but that the frequency of undetected errors may have increased markedly (to perhaps once in every 10 petabits read) - possibly a result of ever-increasing on-disk bit densities and the more aggressive error correction required to handle them (perhaps this is part of the reason they don''t make error rates public any more...). 
They claim that SATA drives have error rates around 10x that of enterprise drives (or an undetected error rate of around once per petabit). Figure out a laptop drive''s average data rate and that gives you a mean time to encountering undetected corruption. Compare that to the drive''s in-use MTBF rating and there you go! If I haven''t dropped a decimal place or three doing this in my head, then even if laptop drives have nominal MTBFs equal to desktop SATA drives it looks as if it would take an average data rate of 60 - 70 KB/sec (24/7, year-in, year-out) for the likelihood of an undetected error to be comparable in likelihood to a whole-disk failure: that''s certainly nothing much for a fairly well-loaded server in constant (or even just 40 hour/week) use, but for a laptop?. - bill This message posted from opensolaris.org
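The closing arithmetic roughly checks out. A back-of-envelope version, assuming one undetected error per petabit (10^15 bits) read and a nominal 500,000-hour drive MTBF (the MTBF figure is an assumption added here, not a number taken from the post):

    # bytes read, on average, per undetected error, divided by the MTBF
    # in seconds, gives the sustained data rate at which the two failure
    # modes become roughly equally likely
    $ echo 'scale=0; (10^15 / 8) / (500000 * 3600)' | bc
    69444
    # about 69 KB/sec, in line with the 60-70 KB/sec figure above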
Richard Elling - PAE
2006-Sep-19 00:32 UTC
[zfs-discuss] Proposal: multiple copies of user data
[apologies for being away from my data last week]

David Dyer-Bennet wrote:
> The more I look at it the more I think that a second copy on the same
> disk doesn''t protect against very much real-world risk. Am I wrong
> here? Are partial (small) disk corruptions more common than I think?
> I don''t have a good statistical view of disk failures.

This question was asked many times in this thread. IMHO, it is the
single biggest reason we should implement ditto blocks for data.

We did a study of disk failures in an enterprise RAID array a few
years ago. One failure mode stands heads and shoulders above the
others: non-recoverable reads. A short summary:

    2,919          total errors reported
    1,926 (66.0%)  operations succeeded (eg. write failed, auto reallocated)
      961 (32.9%)  unrecovered errors (of all types)
       32 ( 1.1%)  other (eg. device not ready)
      707 (24.2%)  non-recoverable reads

In other words, non-recoverable reads represent 73.6% of the non-
recoverable failures that occur, including complete drive failures.
Boo! Did that scare you? Halloween is next month! :-) Seagate said
today that in a few years 3.5" disks will store 2.5 TBytes. Boo!

While I don''t have data on laptop disk failures, I would not be
surprised to see a similar distribution, though with a larger
mechanical damage count. My laptops run hotter inside than my other
systems and, as a rule of thumb, your disk failure rate increases by 2x
for every 15C change in temperature. Is your laptop disk hot?

The case for ditto data is clear to me. Many people are using
single-disk systems, and many more people would really like to use
single-disk systems but they really can''t.

Beyond spinning rust systems, there are other forms of non-volatile
storage which would apply here. For example, those people who suggested
that you should back up your presentation to a CD fail to note that a
speck of dust on the CD could lead you to lose one block of data. In my
CD/DVD experience, such losses are blissfully ignored by the system and
you may blame the resulting crash on the cheap hardware you bought from
your brother-in-law. Beyond CDs, I can see this as being a nice
enhancement to limited endurance devices such as flash.

While it is true that I could slice my disk up into multiple vdevs and
mirror them, I''d much rather set a policy at a finer granularity: my
files are more important than most of the other, mostly read-only and
easily reconstructed, files on my system.

When ditto blocks for metadata was introduced, I took a look at the
code and was pleasantly surprised. The code does an admirable job of
ensuring spatial diversity in the face of multiple policies, even in
the single disk case. IMHO, this is the right way to implement this
and allows you to mix policies with ease.

As a RAS guy, I''m biased to not wanting to lose data via easy-to-use
interfaces. I don''t see how this feature has any downside, but lots of
upside.
 -- richard
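For anyone wondering how the 24.2% line squares with the 73.6% claim: the 73.6% figure is non-recoverable reads as a share of the 961 unrecovered errors, not of all 2,919 reports. A quick check:

    $ echo 'scale=4; 707 / 961' | bc
    .7356
    # i.e. roughly 73.6% of the unrecovered errors were non-recoverable reads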
David Dyer-Bennet
2006-Sep-19 02:42 UTC
[zfs-discuss] Proposal: multiple copies of user data
On 9/18/06, Richard Elling - PAE <Richard.Elling at sun.com> wrote:> [appologies for being away from my data last week] > > David Dyer-Bennet wrote: > > The more I look at it the more I think that a second copy on the same > > disk doesn''t protect against very much real-world risk. Am I wrong > > here? Are partial(small) disk corruptions more common than I think? > > I don''t have a good statistical view of disk failures. > > This question was asked many times in this thread. IMHO, it is the > single biggest reason we should implement ditto blocks for data. > > We did a study of disk failures in an enterprise RAID array a few > years ago. One failure mode stands heads and shoulders above the > others: non-recoverable reads. A short summary: > > 2,919 total errors reported > 1,926 (66.0%) operations succeeded (eg. write failed, auto reallocated) > 961 (32.9%) unrecovered errors (of all types) > 32 (1.1%) other (eg. device not ready) > 707 (24.2%) non-recoverable reads > > In other words, non-recoverable reads represent 73.6% of the non- > recoverable failures that occur, including complete drive failures.I don''t see anything addressing complete drive failures vs. block failures here anywhere. Is there some way to read something about that out of this data? I''m thinking the "operations succeeded" also occurs read errors recovered by retries and such, as well as the write failure cited as an example? I guess I can conclude that the 66% for errors successfully recovered means that a lot of errors are not, in fact, entire-drive failures. So that''s good (for ditto-data). So a maximum of 34% are whole-drive failures (and in reality I''m sure far lower). Anyway, facts on actual failures in the real world are *definitely* the useful way to conduct this discussion! [snip]> While it is true that I could slice my disk up into multiple vdevs and > mirror them, I''d much rather set a policy at a finer grainularity: my > files are more important than most of the other, mostly read-only and > easily reconstructed, files on my system.I definitely like the idea of setting policy at a finer granularity; I really want it to be at the file level, even per-directory doesn''t fit reality very well in my view.> When ditto blocks for metadata was introduced, I took a look at the > code and was pleasantly suprised. The code does an admirable job of > ensuring spatial diversity in the face of multiple policies, even in > the single disk case. IMHO, this is the right way to implement this > and allows you to mix policies with ease.That''s very good to hear. -- David Dyer-Bennet, <mailto:dd-b at dd-b.net>, <http://www.dd-b.net/dd-b/> RKBA: <http://www.dd-b.net/carry/> Pics: <http://www.dd-b.net/dd-b/SnapshotAlbum/> Dragaera/Steven Brust: <http://dragaera.info/>
Richard Elling - PAE
2006-Sep-19 03:16 UTC
[zfs-discuss] Proposal: multiple copies of user data
more below... David Dyer-Bennet wrote:> On 9/18/06, Richard Elling - PAE <Richard.Elling at sun.com> wrote: >> [appologies for being away from my data last week] >> >> David Dyer-Bennet wrote: >> > The more I look at it the more I think that a second copy on the same >> > disk doesn''t protect against very much real-world risk. Am I wrong >> > here? Are partial(small) disk corruptions more common than I think? >> > I don''t have a good statistical view of disk failures. >> >> This question was asked many times in this thread. IMHO, it is the >> single biggest reason we should implement ditto blocks for data. >> >> We did a study of disk failures in an enterprise RAID array a few >> years ago. One failure mode stands heads and shoulders above the >> others: non-recoverable reads. A short summary: >> >> 2,919 total errors reported >> 1,926 (66.0%) operations succeeded (eg. write failed, auto >> reallocated) >> 961 (32.9%) unrecovered errors (of all types) >> 32 (1.1%) other (eg. device not ready) >> 707 (24.2%) non-recoverable reads >> >> In other words, non-recoverable reads represent 73.6% of the non- >> recoverable failures that occur, including complete drive failures. > > I don''t see anything addressing complete drive failures vs. block > failures here anywhere. Is there some way to read something about > that out of this data?Complete failures are a non-zero category, but there is more than one error code which would result in the recommendation to replace the drive. Their counts are included in the 961-707=254 (26.4%) of other non- recoverable errors. In some cases a non-recoverable error can be corrected by a retry, and those also fall into the 26.4% bucket. Interestingly, the operation may succeed and yet we will get an error which recommends replacing the drive. For example, if the failure prediction threshold is exceeded. You might also want to replace the drive when there are no spare defect sectors available. Life would be easier if they really did simply die.> I''m thinking the "operations succeeded" also occurs read errors > recovered by retries and such, as well as the write failure cited as > an example?Yes.> I guess I can conclude that the 66% for errors successfully recovered > means that a lot of errors are not, in fact, entire-drive failures. > So that''s good (for ditto-data). So a maximum of 34% are whole-drive > failures (and in reality I''m sure far lower).I agree.> Anyway, facts on actual failures in the real world are *definitely* > the useful way to conduct this discussion! > > [snip] > >> While it is true that I could slice my disk up into multiple vdevs and >> mirror them, I''d much rather set a policy at a finer grainularity: my >> files are more important than most of the other, mostly read-only and >> easily reconstructed, files on my system. > > I definitely like the idea of setting policy at a finer granularity; I > really want it to be at the file level, even per-directory doesn''t fit > reality very well in my view. > >> When ditto blocks for metadata was introduced, I took a look at the >> code and was pleasantly suprised. The code does an admirable job of >> ensuring spatial diversity in the face of multiple policies, even in >> the single disk case. IMHO, this is the right way to implement this >> and allows you to mix policies with ease. > > That''s very good to hear.-- richard
David Dyer-Bennet
2006-Sep-19 03:29 UTC
[zfs-discuss] Proposal: multiple copies of user data
On 9/18/06, Richard Elling - PAE <Richard.Elling at sun.com> wrote:> Interestingly, the operation may succeed and yet we will get an error > which recommends replacing the drive. For example, if the failure > prediction threshold is exceeded. You might also want to replace the > drive when there are no spare defect sectors available. Life would be > easier if they really did simply die.For one thing, people wouldn''t be interested in doing ditto-block data! So, with ditto-block data, you survive any single-block failure, and "most" double-block failures, etc. What it doesn''t lend itself to is simple computation of simple answers :-). In theory, and with an infinite budget, I''d approach this analagously to cpu architecture design based on large volumes of instruction trace data. If I had a large volume of disk operation traces with the hardware failures indicated, I could run this against the ZFS simulator and see what strategies produced the most robust single-disk results. -- David Dyer-Bennet, <mailto:dd-b at dd-b.net>, <http://www.dd-b.net/dd-b/> RKBA: <http://www.dd-b.net/carry/> Pics: <http://www.dd-b.net/dd-b/SnapshotAlbum/> Dragaera/Steven Brust: <http://dragaera.info/>
Richard Elling - PAE
2006-Sep-19 18:07 UTC
[zfs-discuss] Proposal: multiple copies of user data
[pardon the digression] David Dyer-Bennet wrote:> On 9/18/06, Richard Elling - PAE <Richard.Elling at sun.com> wrote: > >> Interestingly, the operation may succeed and yet we will get an error >> which recommends replacing the drive. For example, if the failure >> prediction threshold is exceeded. You might also want to replace the >> drive when there are no spare defect sectors available. Life would be >> easier if they really did simply die. > > For one thing, people wouldn''t be interested in doing ditto-block data! > > So, with ditto-block data, you survive any single-block failure, and > "most" double-block failures, etc. What it doesn''t lend itself to is > simple computation of simple answers :-). > > In theory, and with an infinite budget, I''d approach this analagously > to cpu architecture design based on large volumes of instruction trace > data. If I had a large volume of disk operation traces with the > hardware failures indicated, I could run this against the ZFS > simulator and see what strategies produced the most robust single-disk > results.There is a significant difference. The functionality of logic part is deterministic and discrete. The wear-out rate of a mechanical device is continuous and probabilistic. In the middle are discrete events with probabilities associated with them, but they are handled separately. In other words, we can use probability and statistics tools to analyze data loss in disk drives. This will be much faster and less expensive than running a bunch of traces. In fact, there has already been much written about disk drives, their failure modes, and factors which contribute to their failure rates. We use such data to predict the probability of events such as non-recoverable reads (which is often specified in the data sheet). -- richard
David Dyer-Bennet
2006-Sep-19 19:32 UTC
[zfs-discuss] Proposal: multiple copies of user data
On 9/19/06, Richard Elling - PAE <Richard.Elling at sun.com> wrote:> [pardon the digression] > > David Dyer-Bennet wrote: > > On 9/18/06, Richard Elling - PAE <Richard.Elling at sun.com> wrote: > > > >> Interestingly, the operation may succeed and yet we will get an error > >> which recommends replacing the drive. For example, if the failure > >> prediction threshold is exceeded. You might also want to replace the > >> drive when there are no spare defect sectors available. Life would be > >> easier if they really did simply die. > > > > For one thing, people wouldn''t be interested in doing ditto-block data! > > > > So, with ditto-block data, you survive any single-block failure, and > > "most" double-block failures, etc. What it doesn''t lend itself to is > > simple computation of simple answers :-). > > > > In theory, and with an infinite budget, I''d approach this analagously > > to cpu architecture design based on large volumes of instruction trace > > data. If I had a large volume of disk operation traces with the > > hardware failures indicated, I could run this against the ZFS > > simulator and see what strategies produced the most robust single-disk > > results. > > There is a significant difference. The functionality of logic part is > deterministic and discrete. The wear-out rate of a mechanical device > is continuous and probabilistic. In the middle are discrete events > with probabilities associated with them, but they are handled separately. > In other words, we can use probability and statistics tools to analyze > data loss in disk drives. This will be much faster and less expensive > than running a bunch of traces. In fact, there has already been much > written about disk drives, their failure modes, and factors which > contribute to their failure rates. We use such data to predict the > probability of events such as non-recoverable reads (which is often > specified in the data sheet).Oh, I know there''s a difference. It''s not as big as it looks, though, if you remember that the instruction or disk operation traces are just *representative* of the workload, not the actual workload that has to run. So, yes, disk failures are certainly non-deterministic, but the actual instruction stream run by customers isn''t the same one designed against, either. In both cases the design has to take the trace as a general guideline for types of things that will happen, rather than as a strict workload to optimize for. -- David Dyer-Bennet, <mailto:dd-b at dd-b.net>, <http://www.dd-b.net/dd-b/> RKBA: <http://www.dd-b.net/carry/> Pics: <http://www.dd-b.net/dd-b/SnapshotAlbum/> Dragaera/Steven Brust: <http://dragaera.info/>
Richard Elling - PAE wrote:
>
> This question was asked many times in this thread. IMHO, it is the
> single biggest reason we should implement ditto blocks for data.
>
> We did a study of disk failures in an enterprise RAID array a few
> years ago. One failure mode stands heads and shoulders above the
> others: non-recoverable reads. A short summary:
>
>     2,919          total errors reported
>     1,926 (66.0%)  operations succeeded (eg. write failed, auto reallocated)
>       961 (32.9%)  unrecovered errors (of all types)
>        32 ( 1.1%)  other (eg. device not ready)
>       707 (24.2%)  non-recoverable reads
>
> In other words, non-recoverable reads represent 73.6% of the non-
> recoverable failures that occur, including complete drive failures.

Does this take cascading failures into account? How often do you get an
unrecoverable read and yet are still able to perform operation on the
target media? Thats where ditto blocks could come in handy modulo the
concerns around utilities and quotas.
Richard Elling - PAE
2006-Sep-19 23:51 UTC
[zfs-discuss] Proposal: multiple copies of user data
reply below... Torrey McMahon wrote:> Richard Elling - PAE wrote: >> >> This question was asked many times in this thread. IMHO, it is the >> single biggest reason we should implement ditto blocks for data. >> >> We did a study of disk failures in an enterprise RAID array a few >> years ago. One failure mode stands heads and shoulders above the >> others: non-recoverable reads. A short summary: >> >> 2,919 total errors reported >> 1,926 (66.0%) operations succeeded (eg. write failed, auto reallocated) >> 961 (32.9%) unrecovered errors (of all types) >> 32 (1.1%) other (eg. device not ready) >> 707 (24.2%) non-recoverable reads >> >> In other words, non-recoverable reads represent 73.6% of the non- >> recoverable failures that occur, including complete drive failures. > > > Does this take cascading failures into account? How often do you get an > unrecoverable read and yet are still able to perform operation on the > target media? Thats where ditto blocks could come in handy modulo the > concerns around utilities and quotas.No event analysis is done here, though we do have the data, the task is time consuming. Non-recoverable reads may not represent permanent failures. In the case of a RAID array, the data should be reconstructed and a rewrite + verify attempted with the possibility of sparing the sector. ZFS can reconstruct the data and relocate the block. I have some (volumous) data on disk error rates as reported though kstat. I plan to attempt to get a better sense of the failure rates from that data. The disk vendors specify non-recoverable read error rates, but we think they are overly pessimistic for the first few years of life. We''d like to have a better sense of how to model this, for a variety of applications which are concerned with archival periods. -- richard
Richard Elling - PAE wrote:
>
> Non-recoverable reads may not represent permanent failures.  In the case
> of a RAID array, the data should be reconstructed and a rewrite + verify
> attempted, with the possibility of sparing the sector.  ZFS can
> reconstruct the data and relocate the block.
>

True, but if you're using a HW RAID array or some sort of protection
within a zpool, then you're already protected to a large degree.  I'm
looking for the number of cases where you get a permanent unrecoverable
read error and yet can recover because you've got a ditto block someplace.
Richard Elling - PAE
2006-Sep-20 03:20 UTC
[zfs-discuss] Proposal: multiple copies of user data
Torrey McMahon wrote:
> Richard Elling - PAE wrote:
>>
>> Non-recoverable reads may not represent permanent failures.  In the case
>> of a RAID array, the data should be reconstructed and a rewrite + verify
>> attempted, with the possibility of sparing the sector.  ZFS can
>> reconstruct the data and relocate the block.
>>
>
> True, but if you're using a HW RAID array or some sort of protection
> within a zpool, then you're already protected to a large degree.  I'm
> looking for the number of cases where you get a permanent unrecoverable
> read error and yet can recover because you've got a ditto block someplace.

Agree.  Non-recoverable reads are largely a JBOD problem.
 -- richard
Just a "me too" mail:

On 13 Sep 2006, at 08:30, Richard Elling wrote:
>> Is this use of slightly based upon disk failure modes?  That is, when
>> disks fail do they tend to get isolated areas of badness compared to
>> complete loss?  I would suggest that complete loss should include
>> someone tripping over the power cord to the external array that
>> houses the disk.
>
> The field data I have says that complete disk failures are the
> exception.

It's the same here.  In our 100-laptop population over the last 2 years,
we had 2 dead drives and 10 or so with I/O errors.

> BTW, this feature will be very welcome on my laptop!  I can't wait :-)

I, too, would love having two copies of my important data on my laptop
drive.  Laptop drives are small enough as they are; there's no point in
storing the OS, tmp and swap files twice as well.

So if ditto-data blocks aren't hard to implement, they would be welcome.
Otherwise there's still the mirror-split-your-drive approach.

Wout.
Victor Latushkin
2006-Sep-23 14:54 UTC
[zfs-discuss] Proposal: multiple copies of user data
David Dyer-Bennet wrote:
> On 9/11/06, Matthew Ahrens <Matthew.Ahrens at sun.com> wrote:
>> Here is a proposal for a new 'copies' property which would allow
>> different levels of replication for different filesystems.
>>
>> Your comments are appreciated!
>
> I've read the proposal, and followed the discussion so far.  I have to
> say that I don't see any particular need for this feature.
>
> Possibly there is a need for a different feature, in which the entire
> control of redundancy is moved away from the pool level and to the
> file or filesystem level.  I definitely see the attraction of being
> able to specify by file and directory different degrees of reliability
> needed.  However, the details of the feature actually proposed don't
> seem to satisfy the need for extra reliability at the level that
> drives people to employ redundancy; it doesn't provide a guarantee.

I think this is easy to solve and could make the feature more useful: we
need a way to specify a policy for placing the duplicate copies, e.g. if
we want a guarantee, we specify that copies must strictly be put on
different disks.  In this way we can get a level of protection close to
that of a mirror, with much greater flexibility.  For example, on a
two-disk system we might keep two copies of the boot environment and
critical data, and only one copy of data that is temporary.

This could be a first step towards implementing redundancy on a
filesystem, directory or file basis.

> I see no need for additional non-guaranteed reliability on top of the
> levels of guarantee provided by use of redundancy at the pool level.
>
> Furthermore, as others have pointed out, this feature would add a high
> degree of user-visible complexity.
>
> From what I've seen here so far, I think this is a bad idea and should
> not be added.

With a way to specify the placement policy for copies, I think this is a
good and useful idea, and it's a pity that it has been shelved for now.
Hope it will not stay on the shelf forever ;-)

Victor
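One way to picture the placement policy asked for above is a small allocator
that either insists on distinct disks or merely prefers them.  This is purely
an illustrative sketch: the function, the policy names ("strict",
"best-effort") and the device names are invented for the example and do not
correspond to ZFS's actual allocator.

    # Toy copy-placement policy, assuming a flat list of disks.
    def place_copies(disks, ncopies, policy="best-effort"):
        """Return the disk chosen for each copy of a block.

        strict      -- fail unless every copy lands on a different disk
                       (mirror-like guarantee)
        best-effort -- spread across different disks when possible, but
                       allow several copies on one disk
        """
        if policy == "strict" and ncopies > len(disks):
            raise ValueError("cannot guarantee %d copies on %d disks"
                             % (ncopies, len(disks)))
        chosen = []
        for i in range(ncopies):
            # Round-robin: distinct disks while they last, then wrap.
            chosen.append(disks[i % len(disks)])
        return chosen

    print(place_copies(["c0t0d0", "c0t1d0"], 2, policy="strict"))  # two disks
    print(place_copies(["c0t0d0"], 2))                             # both on one disk
    try:
        place_copies(["c0t0d0"], 2, policy="strict")
    except ValueError as e:
        print("strict policy refused:", e)

The "strict" mode is what would give the mirror-like guarantee; the
"best-effort" mode corresponds to the behaviour described in the proposal,
where copies are placed on different disks only when possible.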
Pawel Jakub Dawidek
2006-Sep-27 11:22 UTC
[zfs-discuss] Proposal: multiple copies of user data
On Tue, Sep 12, 2006 at 03:56:00PM -0700, Matthew Ahrens wrote:
> Matthew Ahrens wrote:
[...]
> Given the overwhelming criticism of this feature, I'm going to shelve
> it for now.

I'd really like to see this feature.

You say ZFS should change our view on filesystems; I say be consistent.
In the ZFS world we create one big pool out of all our disks and create
filesystems on top of it.  This way we don't have to care about resizing
them, etc.  But this way we also define redundancy at the pool level for
all our filesystems.

It is quite common to have data we don't really care about as well as
data we care about a lot in the same pool.  Before ZFS, I'd just create
RAID0 for the former and RAID1 for the latter, but that is not the ZFS
way, right?

My question is: how can I express my intent of defining the redundancy
level based on the importance of my data, while still following the ZFS
way, without the 'copies' feature?

Please reconsider your choice.

-- 
Pawel Jakub Dawidek                       http://www.wheel.pl
pjd at FreeBSD.org                           http://www.FreeBSD.org
FreeBSD committer                         Am I Evil? Yes, I Am!