Hey there,

I was thinking about laptops and zfs. I support a fair number of laptops, and if their drives go bad, it's almost always some sort of bit-rot: sectors that were written to a while ago can't be read any more. I haven't seen a laptop where the drive just wouldn't work at all any more.

I was considering what could be done to improve reliability from a zfs perspective, given only 1 drive.

Since it's bit-rot, once-reliable sectors can break, so just using a lot of snapshots won't help.

First off, how about splitting the drive in 2 partitions and mirroring across them? This keeps the two sets of data nicely separated in case of a head crash or bad magnetic patches. Only, it means that the drive head will have to move around a lot, making things slower. On top of that, you lose half your disk space.

Ok, so how about splitting it in 3..8 partitions and running RAID-Z across that? Now the head has to move even more, but it has fewer distinct places to travel between when writing data, and the I/O scheduler can probably arrange I/O so that the head goes back and forth in graceful, swooping arcs. Plus, you don't lose as much disk space, and you can even choose how much redundancy you want.

Hmmm, how about giving raidz the same device 8 times? I know right now this doesn't make sense, but what if raidz could be told to use only one vdev, and provide 12.5% (1/8) redundant bits? It would still split a given transaction up into stripes, but it could choose where on the disk to lay them out. For instance, if some research shows that most drive errors occur in 128KB blocks, it would just have to make sure that the stripe blocks are at least 256KB apart.

I'm really curious to read your thoughts about this...

Wout.
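For reference, the two partition-based layouts sketched above can already be expressed with today's zpool syntax; a minimal sketch, assuming the laptop disk is the hypothetical c0d0 and has been sliced into equally sized slices s3-s5 (the pool name laptank is made up too):

    # mirror across two slices of the same disk
    zpool create laptank mirror c0d0s3 c0d0s4

    # or raid-z across three slices, giving up 1/3 of the space
    # to parity instead of 1/2
    zpool create laptank raidz c0d0s3 c0d0s4 c0d0s5

The last variant (handing raidz the same device 8 times with intra-disk stripe placement) has no equivalent today and would need new layout code in raidz itself.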
Wouldn't it make more sense to plug in a firewire drive, mirror to that, and then break the mirror? It wouldn't be live all the time, but it would give you a backup of your data/system.

Regarding the 8 slices, laptop drives (and non-enterprise-class [PS]ATA) aren't built to do heavy i/o. By slicing it 8 times (with 8 writes to the same drive), you're going to be stressing the hardware much harder than otherwise. I think that would reduce the usable lifetime of your drive significantly. Striping across the same drive really seems like a false sense of security.

-----
Gregory Shaw, IT Architect                    Phone: (303) 673-8273
ITCTO Group, Sun Microsystems Inc.
1 StorageTek Drive MS 4382                    greg.shaw at sun.com (work)
Louisville, CO 80028-4382                     shaw at fmsoft.com (home)
"When Microsoft writes an application for Linux, I've Won." - Linus Torvalds
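The mirror-then-break cycle Greg suggests maps onto existing zpool commands; a rough sketch, assuming a pool named laptank on c0d0s0 and a firewire disk that shows up as c2t0d0 (all names hypothetical), and leaving aside whether the detached half is directly importable on its own:

    # turn the single-disk pool into a mirror with the external drive
    zpool attach laptank c0d0s0 c2t0d0

    # wait until the resilver completes
    zpool status laptank

    # then "break" the mirror again before unplugging the drive
    zpool detach laptank c2t0d0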
On 06 Apr 2006, at 15:45, Gregory Shaw wrote:

> Wouldn't it make more sense to plug in a firewire drive, mirror to that, and then break the mirror? It wouldn't be live all the time, but it would give you a backup of your data/system.

Well, it would certainly improve things, but then you need to automate the heck out of it with hotplugging and so on, and even then people will often not do it. For instance, to back up my personal data on my laptop, I need to plug in my iPod and run 1 command. I almost never do it.

I realize with only 1 drive you're not going to win any data-availability prizes, but it's a solution that zfs enables through its checksumming, and it's fully automated.

> Regarding the 8 slices, laptop drives (and non-enterprise-class [PS]ATA) aren't built to do heavy i/o. By slicing it 8 times (with 8 writes to the same drive), you're going to be stressing the hardware much harder than otherwise. I think that would reduce the usable lifetime of your drive significantly.

But the total amount of data written would only be 12.5% more? And with zfs compression turned on, it's less! :)

As for usable lifetime... I don't know. I agree that it probably increases mechanical stress on the drive, but I really wonder what the net result is. I was hoping someone on this mailing list knows more about drive failure modes...

So the idea is to automatically improve the reliability of laptop harddisks by leveraging zfs features and paying a price in CPU time and disk space. You bring up good points, but I don't think they invalidate this concept. No?

Cheers,

Wout.
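As an aside, the compression Wout leans on here is just a per-dataset property; a one-line sketch (pool name hypothetical):

    # enable compression and see how much it actually saves
    zfs set compression=on laptank
    zfs get compressratio laptank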
On Thu, 2006-04-06 at 11:50 +0200, Wout Mertens wrote:

> I was thinking about laptops and zfs. I support a fair number of laptops, and if their drives go bad, it's almost always some sort of bit-rot: sectors that were written to a while ago can't be read any more. I haven't seen a laptop where the drive just wouldn't work at all any more.
>
> I was considering what could be done to improve reliability from a zfs perspective, given only 1 drive.

Mirror.

> Since it's bit-rot, once-reliable sectors can break, so just using a lot of snapshots won't help.

Correct. Though the on-disk format is highly redundant in a single vdev, you are still susceptible to multiple block loss. Anecdotally, we do see disk blocks fail in clusters, and I've got some data on this lying around here somewhere...

> First off, how about splitting the drive in 2 partitions and mirroring across them? This keeps the two sets of data nicely separated in case of a head crash or bad magnetic patches. Only, it means that the drive head will have to move around a lot, making things slower. On top of that, you lose half your disk space.

Reliable, fast, or inexpensive: pick one.

> Ok, so how about splitting it in 3..8 partitions and running RAID-Z across that?

Reliable, fast, or inexpensive: pick one :-)

> Now the head has to move even more, but it has fewer distinct places to travel between when writing data, and the I/O scheduler can probably arrange I/O so that the head goes back and forth in graceful, swooping arcs. Plus, you don't lose as much disk space, and you can even choose how much redundancy you want.

I've asked this question over the past year of a number of drive reliability experts. The consensus seems to be that head movement does not significantly impact reliability. Media seems to stay at the top of the failure Pareto, and positioning faults are down in the tail. The field data I have also confirms this.

> Hmmm, how about giving raidz the same device 8 times? I know right now this doesn't make sense, but what if raidz could be told to use only one vdev, and provide 12.5% (1/8) redundant bits? It would still split a given transaction up into stripes, but it could choose where on the disk to lay them out. For instance, if some research shows that most drive errors occur in 128KB blocks, it would just have to make sure that the stripe blocks are at least 256KB apart.
>
> I'm really curious to read your thoughts about this...

I've been doing some work on this sort of analysis. There is definitely a strong case for mirroring on one disk, when you have but one disk. I don't think end-user acceptance is altogether an unsolvable problem -- the PC space already does this with snapshots (e.g. IBM's laptop recovery). Further, there are many failures that can occur on a disk before it becomes cost effective to replace it.

My gut reaction is that people have lost data for years and never known it, so they think everything is ok. ZFS will expose this sort of data loss (was undetected errors, now detected errors), so we need to be able to show a good reason for data protection, with its associated cost trade-offs.

 -- richard
On Thu, Apr 06, 2006 at 11:50:35AM +0200, Wout Mertens wrote:

> Hmmm, how about giving raidz the same device 8 times? I know right now this doesn't make sense, but what if raidz could be told to use only one vdev, and provide 12.5% (1/8) redundant bits? It would still split a given transaction up into stripes, but it could choose where on the disk to lay them out. For instance, if some research shows that most drive errors occur in 128KB blocks, it would just have to make sure that the stripe blocks are at least 256KB apart.

Unfortunately, I don't believe ZFS notices when two slices are on the same device (zfs folks, feel free to chime in if I'm wrong), so the N partitions will each have I/Os scheduled independently, causing all kinds of thrashing. Seems poor.

Cheers,
- jonathan

--
Jonathan Adams, Solaris Kernel Development
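The independent per-slice scheduling is easy to watch, since the pool treats each slice as its own vdev; a minimal sketch (pool name hypothetical):

    # per-vdev I/O statistics every 5 seconds: each slice of the same
    # spindle gets its own ops and bandwidth columns
    zpool iostat -v laptank 5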
In my experience, I've seen the following failures on laptop drives:

- Total failure due to abuse (such as dropping the laptop)
- Controller failure (flaky controller)
- Spin-up problems (stiction/age?)
- Bad block count failure. Laptop drives have a limited number of block reallocations available. Once those have been used, it is no longer possible to fix bad blocks.

When you mention bit-rot, I'm assuming you mean bad blocks. I haven't seen files degrade on disk without an underlying cause, such as bad blocks or turning off the laptop in the middle of a write operation.

Only the bad block count would be helped by doing RAID-Z. File corruption through power-off should be addressed by ZFS as a whole.

One question: if you encounter a bad block that can't be fixed and it has invalidated one of the 8 partitions, what do you do to fix it?

-----
Gregory Shaw, IT Architect, ITCTO Group, Sun Microsystems Inc.
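If you want to know how much of that reallocation budget a drive has already burned through, smartmontools can usually report it; a rough sketch, assuming smartctl is installed on the box and with a hypothetical raw device path:

    # dump the SMART attribute table; Reallocated_Sector_Ct and
    # Reallocated_Event_Count track how many spare blocks are used up
    smartctl -A /dev/rdsk/c1d0p0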
On Thu, 2006-04-06 at 14:30 -0600, Gregory Shaw wrote:

> In my experience, I've seen the following failures on laptop drives:
>
> - Total failure due to abuse (such as dropping the laptop)
> - Controller failure (flaky controller)
> - Spin-up problems (stiction/age?)
> - Bad block count failure. Laptop drives have a limited number of block reallocations available. Once those have been used, it is no longer possible to fix bad blocks.

The data from the bad block may not be recoverable. In the field data I have, this is always the #1 most common fault. The "repaired" block is zero-filled to the spare block. This is one manifestation of bit rot.

NB: some flash devices also use block sparing.

> When you mention bit-rot, I'm assuming you mean bad blocks. I haven't seen files degrade on disk without an underlying cause, such as bad blocks or turning off the laptop in the middle of a write operation.

ZFS is immune to the power-off problem. The data on disk is always consistent. Whether it was the data you wanted on disk is another question... reliable, fast, or inexpensive: pick one.

> Only the bad block count would be helped by doing RAID-Z. File corruption through power-off should be addressed by ZFS as a whole.

Bad blocks can be detected on both write and read operations. In the case of writes, ZFS could write the data somewhere else [*]. In the case of reads, ZFS knows the data is bad because it checksums, and it can recreate the data if you use some sort of redundancy. This is very different from other file systems, which assume the data is good and, frankly, don't handle the bad cases well.

[*] not a panacea, see the earlier thread on slowness while a disk is failed

> One question: if you encounter a bad block that can't be fixed and it has invalidated one of the 8 partitions, what do you do to fix it?

Someone from the ZFS team can correct me here; I'm not sure how or when ZFS gives up on the whole vdev. The VTOC is stored in the first block of a partition. If you lose that, you lose the partitions. But this will be a very different failure mode from losing enough blocks in the partition that you would want to fail the whole partition. I presume FMA will be used for such diagnosis and leveraged into ZFS rather than reinventing it (the DE) for ZFS.

 -- richard
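The read-path detection Richard describes is exactly what a scrub exercises on demand; a minimal sketch (pool name hypothetical):

    # read and verify every allocated block in the pool
    zpool scrub laptank

    # afterwards: checksum error counters per vdev, plus any files
    # with unrecoverable errors
    zpool status -v laptank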
On Thu, Apr 06, 2006 at 03:01:19PM -0700, Richard Elling wrote:

> Someone from the ZFS team can correct me here; I'm not sure how or when ZFS gives up on the whole vdev. The VTOC is stored in the first block of a partition. If you lose that, you lose the partitions. But this will be a very different failure mode from losing enough blocks in the partition that you would want to fail the whole partition. I presume FMA will be used for such diagnosis and leveraged into ZFS rather than reinventing it (the DE) for ZFS.

We do nothing - that block will be bad forever. In the next phase of ZFS/FMA, we will be able to detect persistently bad blocks and mark the entire vdev degraded (auto-initiating a replace if hot spares are configured). This is our only choice, because at that point you'll have no fault tolerance (or N-1 tolerance) for that particular block. Losing another drive would result in permanent loss of data.

We have some ideas about block reallocation, but in general it's very hard because ZFS is COW, as well as having snapshots.

- Eric

--
Eric Schrock, Solaris Kernel Development       http://blogs.sun.com/eschrock
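For context, hot spares are configured as just another vdev type (this obviously needs more than one physical disk, and the spare syntax may postdate the build under discussion; device names are hypothetical):

    # create a raid-z pool with one standby disk
    zpool create tank raidz c1t0d0 c1t1d0 c1t2d0 spare c1t3d0

    # or add a spare later, and kick off a replacement by hand
    zpool add tank spare c1t3d0
    zpool replace tank c1t1d0 c1t3d0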
Richard Elling <Richard.Elling at Sun.COM> wrote:

> On Thu, 2006-04-06 at 14:30 -0600, Gregory Shaw wrote:
> > In my experience, I've seen the following failures on laptop drives:
> >
> > - Total failure due to abuse (such as dropping the laptop)
> > - Controller failure (flaky controller)
> > - Spin-up problems (stiction/age?)
> > - Bad block count failure. Laptop drives have a limited number of block reallocations available. Once those have been used, it is no longer possible to fix bad blocks.
>
> The data from the bad block may not be recoverable. In the field data I have, this is always the #1 most common fault. The "repaired" block is zero-filled to the spare block. This is one manifestation of bit rot.
>
> NB: some flash devices also use block sparing.

The main problem with bad block reallocation is that S.M.A.R.T. only works if you read the whole disk frequently enough. If you read the blocks frequently enough, then the data is recovered by the disk, because the problem is detected before the block becomes completely unreadable. Otherwise, the content is lost.

Jörg

--
EMail: joerg at schily.isdn.cs.tu-berlin.de (home)  Jörg Schilling  D-13353 Berlin
       js at cs.tu-berlin.de                (uni)
       schilling at fokus.fraunhofer.de     (work)
Blog:  http://schily.blogspot.com/
URL:   http://cdrecord.berlios.de/old/private/ ftp://ftp.berlios.de/pub/schily
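ZFS can force that kind of regular full read itself; a minimal sketch of a root crontab entry, assuming a pool named laptank and a weekly schedule (both made up):

    # scrub every Sunday at 03:00, so latent bad blocks are hit while
    # a redundant copy (mirror, raid-z, or ditto block) can still repair them
    0 3 * * 0 /usr/sbin/zpool scrub laptank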
On Thu, Apr 06, 2006 at 11:50:35AM +0200, Wout Mertens wrote:

> I was considering what could be done to improve reliability from a zfs perspective, given only 1 drive.

...

> First off, how about splitting the drive in 2 partitions and mirroring across them?

You may be interested in the following bug, which we expect to be integrated soon:

6410698 ZFS metadata needs to be more highly replicated (ditto blocks)

This will cause us to store multiple copies of metadata on a single-drive pool. It also provides some infrastructure that will allow us to eventually store multiple copies of user data, even in a non-replicated pool.

--matt
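For what it's worth, the user-data half of this eventually surfaced in later ZFS releases as the per-dataset copies property; a minimal sketch (dataset name hypothetical):

    # keep two copies of every data block, even in a single-disk pool
    zfs set copies=2 laptank/home
    zfs get copies,used laptank/home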