Hello everyone,

Currently, btrfs has its own raid1 but no repair mechanism for bad
checksums or EIOs. While trying to implement such a repair mechanism,
several more or less general questions came up.

There are two different retry paths for data and metadata. (If you know
or don't care how btrfs handles read errors: goto questions)

The data path: btrfs_io_failed_hook is called for each failed bio (EIO
or checksum error). Currently, it does not know which mirror failed at
first, because normally btrfs_map_block is called with mirror_num=0,
leading to a path where find_live_mirror picks one of them. The error
recovery strategy is then to explicitly read available mirrors one after
the other until one succeeds. In case the very first read picked mirror
1 and failed, the retry code will most likely fail at mirror 1 as well.
It would be nice to know which mirror was picked previously and to
directly try the other.

The metadata path: there is no failure hook; instead, there is a loop in
btree_read_extent_buffer_pages, also starting off at mirror_num=0, which
again leaves the decision to find_live_mirror. If there is an error for
any page to be read, the same retry strategy is used as in the data
path. This obviously might leave you alone with unreadable data
(consider: page x is bad on mirror 1 and page x+1 is bad on mirror 2,
both belonging to the same extent; you lose). It would be nice to have a
mechanism at a lower level issuing page-sized retries. Of course,
knowing which mirror is bad before trying mirror 1 again is desirable as
well.

questions:
I have a raid1 repair solution in mind (partially coded) for btrfs that
can be implemented quite easily. However, I have some misgivings. All of
the following questions would need a "yes" for my solution to stand:

- Is it acceptable to retry reading a block immediately after the disk
said it won't work? Or in case of a successful read followed by a
checksum error? (Which is already being done right now in btrfs.)

- Is it acceptable to always write both mirrors if one is found to be
bad (also consider ssds)?

If either of the answers is "no", tracking where the initial read came
from seems inevitable. Tracking would be very easy if bios came back
with unmodified values in bi_bdev and bi_sector, which is not the case.

Thanks,
Jan
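The retry-order problem described above can be reduced to a small
stand-alone sketch. The function and parameter names below are invented
for illustration and are not actual btrfs interfaces; the point is only
that knowing which mirror served the failed read changes which copy gets
retried first.

#include <stdio.h>

/*
 * Pick the next mirror to try after a failed read.
 *
 * failed_mirror: 1-based number of the mirror the failed read came
 *                from, or 0 if unknown (the current situation, since
 *                btrfs_map_block is normally called with mirror_num=0).
 * prev_try:      mirror tried in the previous retry round, or 0 on the
 *                first retry.
 * num_mirrors:   number of copies (2 for raid1).
 *
 * Returns the next 1-based mirror number to read, or 0 when all
 * mirrors have been tried.
 */
static int next_mirror(int failed_mirror, int prev_try, int num_mirrors)
{
	int next;

	if (prev_try == 0) {
		/*
		 * First retry: if we know which mirror failed, skip it and
		 * go straight to the other copy. Without that knowledge we
		 * start at mirror 1 and may well re-read the bad copy.
		 */
		next = (failed_mirror == 1) ? 2 : 1;
	} else {
		next = prev_try + 1;
		if (next == failed_mirror)
			next++;
	}
	return (next <= num_mirrors) ? next : 0;
}

int main(void)
{
	/* Read came from mirror 1 and failed: retry mirror 2 first. */
	printf("retry mirror %d\n", next_mirror(1, 0, 2));
	/* Failed mirror unknown: we blindly start at mirror 1 again. */
	printf("retry mirror %d\n", next_mirror(0, 0, 2));
	return 0;
}

With the failed mirror unknown, the sketch starts at mirror 1 again,
which is exactly the wasted (and likely failing) round trip the tracking
idea is meant to avoid.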
On Thu, Mar 17, 2011 at 03:46:43PM +0100, Jan Schmidt wrote:
> Hello everyone,
>
> Currently, btrfs has its own raid1 but no repair mechanism for bad
> checksums or EIOs. While trying to implement such a repair mechanism,
> several more or less general questions came up.
>
> There are two different retry paths for data and metadata. (If you know
> or don't care how btrfs handles read errors: goto questions)
>
> The data path: btrfs_io_failed_hook is called for each failed bio (EIO
> or checksum error). Currently, it does not know which mirror failed at
> first, because normally btrfs_map_block is called with mirror_num=0,
> leading to a path where find_live_mirror picks one of them. The error
> recovery strategy is then to explicitly read available mirrors one after
> the other until one succeeds. In case the very first read picked mirror
> 1 and failed, the retry code will most likely fail at mirror 1 as well.
> It would be nice to know which mirror was picked previously and to
> directly try the other.
>

So why not add a new field to the btrfs_multi_bio that tells you which
mirror we're acting on, so that the endio stuff can know?  That should
be relatively simple to accomplish.

> The metadata path: there is no failure hook; instead, there is a loop in
> btree_read_extent_buffer_pages, also starting off at mirror_num=0, which
> again leaves the decision to find_live_mirror. If there is an error for
> any page to be read, the same retry strategy is used as in the data
> path. This obviously might leave you alone with unreadable data
> (consider: page x is bad on mirror 1 and page x+1 is bad on mirror 2,
> both belonging to the same extent; you lose). It would be nice to have a
> mechanism at a lower level issuing page-sized retries. Of course,
> knowing which mirror is bad before trying mirror 1 again is desirable as
> well.
>
> questions:
> I have a raid1 repair solution in mind (partially coded) for btrfs that
> can be implemented quite easily. However, I have some misgivings. All of
> the following questions would need a "yes" for my solution to stand:
>
> - Is it acceptable to retry reading a block immediately after the disk
> said it won't work? Or in case of a successful read followed by a
> checksum error? (Which is already being done right now in btrfs.)
>

You mean re-read the same block?  No, that's not ok: if the checksum
failed or the drive returned with the buffer not uptodate, then
re-reading the thing isn't likely to change anything, so it just wastes
time.

> - Is it acceptable to always write both mirrors if one is found to be
> bad (also consider ssds)?
>

Yes, if one is bad we really want to re-write the bad one so we don't
find it again.

> If either of the answers is "no", tracking where the initial read came
> from seems inevitable. Tracking would be very easy if bios came back
> with unmodified values in bi_bdev and bi_sector, which is not the case.

Huh?  The bios come back with the bi_bdev/bi_sector they were submitted
on; how are you getting something different?  Thanks,

Josef
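A rough sketch of what Josef suggests, written as stand-alone user-space
C with invented type and field names (the real btrfs_multi_bio is laid
out differently): record the mirror chosen at submit time in a per-bio
context, so the completion handler knows which copy to distrust.

#include <stdio.h>

/* Illustrative stand-in, not the real kernel structure. */
struct multi_bio_ctx {
	int mirror_num;		/* mirror the read was actually sent to */
};

/* Submit path: find_live_mirror picks a copy; remember the choice. */
static void submit_read(struct multi_bio_ctx *ctx, int chosen_mirror)
{
	ctx->mirror_num = chosen_mirror;
	/* ... build and submit the bio, passing ctx along with it ... */
}

/* Completion path: on error we now know which mirror to avoid. */
static void end_read(struct multi_bio_ctx *ctx, int err)
{
	if (err) {
		int retry = (ctx->mirror_num == 1) ? 2 : 1;	/* raid1 */
		printf("mirror %d failed, retrying mirror %d\n",
		       ctx->mirror_num, retry);
	}
}

int main(void)
{
	struct multi_bio_ctx ctx = { .mirror_num = 0 };

	submit_read(&ctx, 1);	/* find_live_mirror picked mirror 1 */
	end_read(&ctx, -5);	/* the read came back with EIO */
	return 0;
}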
Excerpts from Jan Schmidt's message of 2011-03-17 10:46:43 -0400:
> Hello everyone,
>
> Currently, btrfs has its own raid1 but no repair mechanism for bad
> checksums or EIOs. While trying to implement such a repair mechanism,
> several more or less general questions came up.
>
> There are two different retry paths for data and metadata. (If you know
> or don't care how btrfs handles read errors: goto questions)

You should talk with Ilya, who is working on replacing failed raid
drives as well.

>
> The data path: btrfs_io_failed_hook is called for each failed bio (EIO
> or checksum error). Currently, it does not know which mirror failed at
> first, because normally btrfs_map_block is called with mirror_num=0,
> leading to a path where find_live_mirror picks one of them. The error
> recovery strategy is then to explicitly read available mirrors one after
> the other until one succeeds. In case the very first read picked mirror
> 1 and failed, the retry code will most likely fail at mirror 1 as well.
> It would be nice to know which mirror was picked previously and to
> directly try the other.

Agree with Josef here, change the code to record which one was used.
The current bio submission stuff only keeps the btrfs_multi_bio struct
around when a given IO spans more than one disk.  But you can easily
change it to keep the struct around for all IOs.

>
> The metadata path: there is no failure hook; instead, there is a loop in
> btree_read_extent_buffer_pages, also starting off at mirror_num=0, which
> again leaves the decision to find_live_mirror. If there is an error for
> any page to be read, the same retry strategy is used as in the data
> path. This obviously might leave you alone with unreadable data
> (consider: page x is bad on mirror 1 and page x+1 is bad on mirror 2,
> both belonging to the same extent; you lose). It would be nice to have a
> mechanism at a lower level issuing page-sized retries. Of course,
> knowing which mirror is bad before trying mirror 1 again is desirable as
> well.

Currently the block size is always smaller than the stripe size.  But
you have a good point.

>
> questions:
> I have a raid1 repair solution in mind (partially coded) for btrfs that
> can be implemented quite easily. However, I have some misgivings. All of
> the following questions would need a "yes" for my solution to stand:
>
> - Is it acceptable to retry reading a block immediately after the disk
> said it won't work? Or in case of a successful read followed by a
> checksum error? (Which is already being done right now in btrfs.)

In the initial implementation sure, but long term it's not the best.

>
> - Is it acceptable to always write both mirrors if one is found to be
> bad (also consider ssds)?

Sorry, I'd rather not overwrite the copy we know to be good.

>
> If either of the answers is "no", tracking where the initial read came
> from seems inevitable. Tracking would be very easy if bios came back
> with unmodified values in bi_bdev and bi_sector, which is not the case.

-chris
On 03/17/2011 06:09 PM, Andrey Kuzmin wrote:
> On Thu, Mar 17, 2011 at 5:46 PM, Jan Schmidt <list.btrfs@jan-o-sch.net
> <mailto:list.btrfs@jan-o-sch.net>> wrote:
> - Is it acceptable to retry reading a block immediately after the disk
> said it won't work? Or in case of a successful read followed by a
> checksum error? (Which is already being done right now in btrfs.)
>
>
> These are two pretty different cases. When the disk firmware fails a
> read, it means it has retried a number of times but gave up (suggesting
> a media error), so an upper-layer retry would hardly make sense. A
> checksum error catches an on-disk EDC fault, so a retry is, on the
> contrary, quite reasonable.

Agreed.

> - Is it acceptable to always write both mirrors if one is found to be
> bad (also consider ssds)?
>
>
> Writing on the read path, bypassing the file-system transaction
> mechanism, doesn't seem like a good idea to me. Just imagine losing
> power while overwriting the last good copy.

Okay, sounds reasonable to me. Let's say we're bypassing the transaction
mechanism in the same rude manner, but only write the bad mirror. Does
that seem reasonable?

Thanks,
Jan
Excerpts from Jan Schmidt's message of 2011-03-17 13:37:54 -0400:
> On 03/17/2011 06:09 PM, Andrey Kuzmin wrote:
> > On Thu, Mar 17, 2011 at 5:46 PM, Jan Schmidt <list.btrfs@jan-o-sch.net
> > <mailto:list.btrfs@jan-o-sch.net>> wrote:
> > - Is it acceptable to retry reading a block immediately after the disk
> > said it won't work? Or in case of a successful read followed by a
> > checksum error? (Which is already being done right now in btrfs.)
> >
> >
> > These are two pretty different cases. When the disk firmware fails a
> > read, it means it has retried a number of times but gave up (suggesting
> > a media error), so an upper-layer retry would hardly make sense. A
> > checksum error catches an on-disk EDC fault, so a retry is, on the
> > contrary, quite reasonable.
>
> Agreed.
>
> > - Is it acceptable to always write both mirrors if one is found to be
> > bad (also consider ssds)?
> >
> >
> > Writing on the read path, bypassing the file-system transaction
> > mechanism, doesn't seem like a good idea to me. Just imagine losing
> > power while overwriting the last good copy.
>
> Okay, sounds reasonable to me. Let's say we're bypassing the transaction
> mechanism in the same rude manner, but only write the bad mirror. Does
> that seem reasonable?

The bad mirror is fair game.  Write away, as long as you're sure you're
excluding nodatacow and you don't allow that block to get reallocated
elsewhere.  You don't actually need to bypass the transaction
mechanism, just those two things.

-chris
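Chris's two conditions can be captured in a tiny guard on the repair
path. The structure and field names below are purely illustrative, not
actual btrfs code; a real implementation would have to check the inode
flags and the extent mapping under the appropriate locks.

#include <stdbool.h>
#include <stdio.h>

/* Illustrative state, not the real btrfs inode/extent flags. */
struct extent_info {
	bool nodatacow;		/* extent belongs to a nodatacow inode   */
	bool still_mapped;	/* logical block still maps to this copy */
};

/*
 * Decide whether the read path may rewrite the bad mirror in place:
 * only cow'ed data, and only while the block has not been relocated.
 */
static bool may_repair_in_place(const struct extent_info *ei)
{
	return !ei->nodatacow && ei->still_mapped;
}

int main(void)
{
	struct extent_info ei = { .nodatacow = false, .still_mapped = true };

	printf("repair allowed: %d\n", may_repair_in_place(&ei));
	return 0;
}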
On Thu, Mar 17, 2011 at 8:42 PM, Chris Mason <chris.mason@oracle.com> wrote:
> Excerpts from Jan Schmidt's message of 2011-03-17 13:37:54 -0400:
>> On 03/17/2011 06:09 PM, Andrey Kuzmin wrote:
>> > On Thu, Mar 17, 2011 at 5:46 PM, Jan Schmidt <list.btrfs@jan-o-sch.net
>> > <mailto:list.btrfs@jan-o-sch.net>> wrote:
>> > - Is it acceptable to retry reading a block immediately after the disk
>> > said it won't work? Or in case of a successful read followed by a
>> > checksum error? (Which is already being done right now in btrfs.)
>> >
>> >
>> > These are two pretty different cases. When the disk firmware fails a
>> > read, it means it has retried a number of times but gave up (suggesting
>> > a media error), so an upper-layer retry would hardly make sense. A
>> > checksum error catches an on-disk EDC fault, so a retry is, on the
>> > contrary, quite reasonable.
>>
>> Agreed.
>>
>> > - Is it acceptable to always write both mirrors if one is found to be
>> > bad (also consider ssds)?
>> >
>> >
>> > Writing on the read path, bypassing the file-system transaction
>> > mechanism, doesn't seem like a good idea to me. Just imagine losing
>> > power while overwriting the last good copy.
>>
>> Okay, sounds reasonable to me. Let's say we're bypassing the transaction
>> mechanism in the same rude manner, but only write the bad mirror. Does
>> that seem reasonable?
>
> The bad mirror is fair game.  Write away, as long as you're sure you're
> excluding nodatacow and you don't allow that block to get reallocated
> elsewhere.  You don't actually need to bypass the transaction
> mechanism, just those two things.

What happens if multiple readers (allowed by read lock) attempt an
overwrite?

Regards,
Andrey

>
> -chris
>
On 03/17/2011 06:36 PM, Chris Mason wrote:
> Excerpts from Jan Schmidt's message of 2011-03-17 10:46:43 -0400:
>> Hello everyone,
>>
>> Currently, btrfs has its own raid1 but no repair mechanism for bad
>> checksums or EIOs. While trying to implement such a repair mechanism,
>> several more or less general questions came up.
>>
>> There are two different retry paths for data and metadata. (If you know
>> or don't care how btrfs handles read errors: goto questions)
>
> You should talk with Ilya, who is working on replacing failed raid
> drives as well.

Thanks, I'll do that.

>>
>> The data path: btrfs_io_failed_hook is called for each failed bio (EIO
>> or checksum error). Currently, it does not know which mirror failed at
>> first, because normally btrfs_map_block is called with mirror_num=0,
>> leading to a path where find_live_mirror picks one of them. The error
>> recovery strategy is then to explicitly read available mirrors one after
>> the other until one succeeds. In case the very first read picked mirror
>> 1 and failed, the retry code will most likely fail at mirror 1 as well.
>> It would be nice to know which mirror was picked previously and to
>> directly try the other.
>
> Agree with Josef here, change the code to record which one was used.
> The current bio submission stuff only keeps the btrfs_multi_bio struct
> around when a given IO spans more than one disk.  But you can easily
> change it to keep the struct around for all IOs.

I was about to write the same in reply to Josef's mail :-)

>> The metadata path: there is no failure hook; instead, there is a loop in
>> btree_read_extent_buffer_pages, also starting off at mirror_num=0, which
>> again leaves the decision to find_live_mirror. If there is an error for
>> any page to be read, the same retry strategy is used as in the data
>> path. This obviously might leave you alone with unreadable data
>> (consider: page x is bad on mirror 1 and page x+1 is bad on mirror 2,
>> both belonging to the same extent; you lose). It would be nice to have a
>> mechanism at a lower level issuing page-sized retries. Of course,
>> knowing which mirror is bad before trying mirror 1 again is desirable as
>> well.
>
> Currently the block size is always smaller than the stripe size.  But
> you have a good point.
>
>> questions:
>> I have a raid1 repair solution in mind (partially coded) for btrfs that
>> can be implemented quite easily. However, I have some misgivings. All of
>> the following questions would need a "yes" for my solution to stand:
>>
>> - Is it acceptable to retry reading a block immediately after the disk
>> said it won't work? Or in case of a successful read followed by a
>> checksum error? (Which is already being done right now in btrfs.)
>
> In the initial implementation sure, but long term it's not the best.
>
>> - Is it acceptable to always write both mirrors if one is found to be
>> bad (also consider ssds)?
>
> Sorry, I'd rather not overwrite the copy we know to be good.
>
>> If either of the answers is "no", tracking where the initial read came
>> from seems inevitable. Tracking would be very easy if bios came back
>> with unmodified values in bi_bdev and bi_sector, which is not the case.

First off, I'll go for the tracking part then. I'll see whether
implementing the retry mechanism somewhere near end_bio_multi_stripe is
a good thing, or else how to fiddle the mirror_num further up so the
code calling the hooks can take care of the retries and repairs.

Thanks!
-Jan
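The metadata case raised earlier (page x bad on one mirror, page x+1 bad
on the other) shows why page-sized retries matter. A stand-alone sketch,
with mirrors modeled as plain arrays and all names invented, makes the
difference between whole-extent retries and per-page retries explicit.

#include <stdbool.h>
#include <stdio.h>

#define NUM_PAGES	4
#define NUM_MIRRORS	2

/* 'false' models a page that fails its read or checksum on that mirror. */
static const bool page_ok[NUM_MIRRORS][NUM_PAGES] = {
	{ true, false, true,  true },	/* mirror 1: page 1 is bad */
	{ true, true,  false, true },	/* mirror 2: page 2 is bad */
};

/*
 * Whole-extent retry: a mirror only helps if all its pages are good.
 * Page-sized retry: each page only needs one good copy somewhere.
 */
int main(void)
{
	bool extent_ok = false;
	bool pagewise_ok = true;

	for (int m = 0; m < NUM_MIRRORS; m++) {
		bool all_good = true;
		for (int p = 0; p < NUM_PAGES; p++)
			all_good &= page_ok[m][p];
		extent_ok |= all_good;
	}

	for (int p = 0; p < NUM_PAGES; p++) {
		bool any_good = false;
		for (int m = 0; m < NUM_MIRRORS; m++)
			any_good |= page_ok[m][p];
		pagewise_ok &= any_good;
	}

	printf("whole-extent retry succeeds: %d\n", extent_ok);		/* 0 */
	printf("page-sized retry succeeds:   %d\n", pagewise_ok);	/* 1 */
	return 0;
}

The whole-extent strategy fails here because neither mirror is clean for
every page, while the page-sized strategy succeeds because every page
has at least one good copy.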
(snipped the parts already commented on in the other mails just sent)

On 03/17/2011 06:19 PM, Josef Bacik wrote:
> On Thu, Mar 17, 2011 at 03:46:43PM +0100, Jan Schmidt wrote:
>> If either of the answers is "no", tracking where the initial read came
>> from seems inevitable. Tracking would be very easy if bios came back
>> with unmodified values in bi_bdev and bi_sector, which is not the case.
>
> Huh?  The bios come back with the bi_bdev/bi_sector they were submitted
> on; how are you getting something different?  Thanks,

That's only the case if you write to the block device directly. If I'm
not mistaken, when bios are submitted against a partition, bi_bdev and
bi_sector get remapped to block-device-absolute values.

Jan
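Assuming the remapping Jan describes (the block layer rewriting
bi_bdev/bi_sector to whole-device values when a bio is submitted against
a partition), one way around it is to stash the submit-time target in
the bio's private data instead of reading it back from the bio at
completion time. Everything below is an invented, stand-alone
illustration, not actual btrfs code.

#include <stdio.h>

typedef unsigned long long u64;

/* Submit-time context carried alongside the bio (e.g. via its private
 * pointer), so completion does not depend on bi_bdev/bi_sector. */
struct read_ctx {
	int mirror_num;		/* which copy this bio was aimed at       */
	int dev_index;		/* index into our own device/stripe table */
	u64 logical;		/* btrfs logical address of the block     */
	u64 sector;		/* sector as submitted (may be remapped)  */
};

static void end_io(const struct read_ctx *ctx, int err)
{
	if (err)
		printf("read of logical %llu failed on mirror %d (dev %d)\n",
		       ctx->logical, ctx->mirror_num, ctx->dev_index);
}

int main(void)
{
	/* Filled in at submit time, before any remapping can happen. */
	struct read_ctx ctx = {
		.mirror_num = 1, .dev_index = 0,
		.logical = 123456, .sector = 2048,
	};

	end_io(&ctx, -5);	/* simulate an EIO completion */
	return 0;
}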