On 06/27/2011 07:46 AM, NeilBrown wrote:> On Thu, 23 Jun 2011 12:53:37 +0200 Nico Schottelius > <nico-lkml-20110623@schottelius.org> wrote: > >> Good morning devs, >> >> I''m wondering whether the raid- and volume-management-builtin of btrfs is >> actually a sane idea or not. >> Currently we do have md/device-mapper support for raid >> already, btrfs lacks raid5 support and re-implements stuff that >> has already been done. >> >> I''m aware of the fact that it is very useful to know on which devices >> we are in a filesystem. But I''m wondering, whether it wouldn''t be >> smarter to generalise the information exposure through the VFS layer >> instead of replicating functionality: >> >> Physical: USB-HD SSD USB-Flash | Exposes information to >> Raid: Raid1, Raid5, Raid10, etc. | higher levels >> Crypto: Luks | >> LVM: Groups/Volumes | >> FS: xfs/jfs/reiser/ext3 v >> >> Thus a filesystem like ext3 could be aware that it is running >> on a USB HD, enable -o sync be default or have the filesystem >> to rewrite blocks when running on crypto or optimise for an SSD, ... > I would certainly agree that exposing information to higher levels is a good > idea. To some extent we do. But it isn''t always as easy as it might sound. > Choosing exactly what information to expose is the challenge. If you lack > sufficient foresight you might expose something which turns out to be > very specific to just one device, so all those upper levels which make use of > the information find they are really special-casing one specific device, > which isn''t a good idea. > > > However it doesn''t follow that RAID5 should not be implemented in BTRFS. > The levels that you have drawn are just one perspective. While that has > value, it may not be universal. > I could easily argue that the LVM layer is a mistake and that filesystems > should provide that functionality directly. > I could almost argue the same for crypto. > RAID1 can make a lot of sense to be tightly integrated with the FS. > RAID5 ... I''m less convinced, but then I have a vested interest there so that > isn''t an objective assessment. > > Part of "the way Linux works" is that s/he who writes the code gets to make > the design decisions. The BTRFS developers might create something truly > awesome, or might end up having to support a RAID feature that they > subsequently think is a bad idea. But it really is their decision to make. > > NeilBrown >One more thing to add here is that I think that we still have a chance to increase the sharing between btrfs and the MD stack if we can get those changes made. No one likes to duplicate code, but we will need a richer interface between the block and file system layer to help close that gap. Ric
On Wed, 2011-06-29 at 10:29 +0100, Ric Wheeler wrote:> On 06/27/2011 07:46 AM, NeilBrown wrote: > > On Thu, 23 Jun 2011 12:53:37 +0200 Nico Schottelius > > <nico-lkml-20110623@schottelius.org> wrote: > > > >> Good morning devs, > >> > >> I''m wondering whether the raid- and volume-management-builtin of btrfs is > >> actually a sane idea or not. > >> Currently we do have md/device-mapper support for raid > >> already, btrfs lacks raid5 support and re-implements stuff that > >> has already been done. > >> > >> I''m aware of the fact that it is very useful to know on which devices > >> we are in a filesystem. But I''m wondering, whether it wouldn''t be > >> smarter to generalise the information exposure through the VFS layer > >> instead of replicating functionality: > >> > >> Physical: USB-HD SSD USB-Flash | Exposes information to > >> Raid: Raid1, Raid5, Raid10, etc. | higher levels > >> Crypto: Luks | > >> LVM: Groups/Volumes | > >> FS: xfs/jfs/reiser/ext3 v > >> > >> Thus a filesystem like ext3 could be aware that it is running > >> on a USB HD, enable -o sync be default or have the filesystem > >> to rewrite blocks when running on crypto or optimise for an SSD, ... > > I would certainly agree that exposing information to higher levels is a good > > idea. To some extent we do. But it isn''t always as easy as it might sound. > > Choosing exactly what information to expose is the challenge. If you lack > > sufficient foresight you might expose something which turns out to be > > very specific to just one device, so all those upper levels which make use of > > the information find they are really special-casing one specific device, > > which isn''t a good idea. > > > > > > However it doesn''t follow that RAID5 should not be implemented in BTRFS. > > The levels that you have drawn are just one perspective. While that has > > value, it may not be universal. > > I could easily argue that the LVM layer is a mistake and that filesystems > > should provide that functionality directly. > > I could almost argue the same for crypto. > > RAID1 can make a lot of sense to be tightly integrated with the FS. > > RAID5 ... I''m less convinced, but then I have a vested interest there so that > > isn''t an objective assessment. > > > > Part of "the way Linux works" is that s/he who writes the code gets to make > > the design decisions. The BTRFS developers might create something truly > > awesome, or might end up having to support a RAID feature that they > > subsequently think is a bad idea. But it really is their decision to make. > > > > NeilBrown > > > > One more thing to add here is that I think that we still have a chance to > increase the sharing between btrfs and the MD stack if we can get those changes > made. No one likes to duplicate code, but we will need a richer interface > between the block and file system layer to help close that gap. > > Ric >Is there a possibility that one could have a 3 disk RAID5 array, and then add a 4th disk and then do a balance, growing the RAID5 onto 4 disks and gaining the space still with RAID5? It seems that to be consistent, BTRFS would have to do this. If this is the case, then I think that the BTRFS implementation of RAID5 would have to be quite different to the MD implementation. James.> > -- > To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html-- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wed, 29 Jun 2011 10:29:53 +0100 Ric Wheeler <rwheeler@redhat.com> wrote:> On 06/27/2011 07:46 AM, NeilBrown wrote: > > On Thu, 23 Jun 2011 12:53:37 +0200 Nico Schottelius > > <nico-lkml-20110623@schottelius.org> wrote: > > > >> Good morning devs, > >> > >> I''m wondering whether the raid- and volume-management-builtin of btrfs is > >> actually a sane idea or not. > >> Currently we do have md/device-mapper support for raid > >> already, btrfs lacks raid5 support and re-implements stuff that > >> has already been done. > >> > >> I''m aware of the fact that it is very useful to know on which devices > >> we are in a filesystem. But I''m wondering, whether it wouldn''t be > >> smarter to generalise the information exposure through the VFS layer > >> instead of replicating functionality: > >> > >> Physical: USB-HD SSD USB-Flash | Exposes information to > >> Raid: Raid1, Raid5, Raid10, etc. | higher levels > >> Crypto: Luks | > >> LVM: Groups/Volumes | > >> FS: xfs/jfs/reiser/ext3 v > >> > >> Thus a filesystem like ext3 could be aware that it is running > >> on a USB HD, enable -o sync be default or have the filesystem > >> to rewrite blocks when running on crypto or optimise for an SSD, ... > > I would certainly agree that exposing information to higher levels is a good > > idea. To some extent we do. But it isn''t always as easy as it might sound. > > Choosing exactly what information to expose is the challenge. If you lack > > sufficient foresight you might expose something which turns out to be > > very specific to just one device, so all those upper levels which make use of > > the information find they are really special-casing one specific device, > > which isn''t a good idea. > > > > > > However it doesn''t follow that RAID5 should not be implemented in BTRFS. > > The levels that you have drawn are just one perspective. While that has > > value, it may not be universal. > > I could easily argue that the LVM layer is a mistake and that filesystems > > should provide that functionality directly. > > I could almost argue the same for crypto. > > RAID1 can make a lot of sense to be tightly integrated with the FS. > > RAID5 ... I''m less convinced, but then I have a vested interest there so that > > isn''t an objective assessment. > > > > Part of "the way Linux works" is that s/he who writes the code gets to make > > the design decisions. The BTRFS developers might create something truly > > awesome, or might end up having to support a RAID feature that they > > subsequently think is a bad idea. But it really is their decision to make. > > > > NeilBrown > > > > One more thing to add here is that I think that we still have a chance to > increase the sharing between btrfs and the MD stack if we can get those changes > made. No one likes to duplicate code, but we will need a richer interface > between the block and file system layer to help close that gap. > > Ric >I''m certainly open to suggestions and collaboration. Do you have in mind any particular way to make the interface richer?? NeilBrown
On 07/14/2011 06:56 AM, NeilBrown wrote:> On Wed, 29 Jun 2011 10:29:53 +0100 Ric Wheeler<rwheeler@redhat.com> wrote: > >> On 06/27/2011 07:46 AM, NeilBrown wrote: >>> On Thu, 23 Jun 2011 12:53:37 +0200 Nico Schottelius >>> <nico-lkml-20110623@schottelius.org> wrote: >>> >>>> Good morning devs, >>>> >>>> I''m wondering whether the raid- and volume-management-builtin of btrfs is >>>> actually a sane idea or not. >>>> Currently we do have md/device-mapper support for raid >>>> already, btrfs lacks raid5 support and re-implements stuff that >>>> has already been done. >>>> >>>> I''m aware of the fact that it is very useful to know on which devices >>>> we are in a filesystem. But I''m wondering, whether it wouldn''t be >>>> smarter to generalise the information exposure through the VFS layer >>>> instead of replicating functionality: >>>> >>>> Physical: USB-HD SSD USB-Flash | Exposes information to >>>> Raid: Raid1, Raid5, Raid10, etc. | higher levels >>>> Crypto: Luks | >>>> LVM: Groups/Volumes | >>>> FS: xfs/jfs/reiser/ext3 v >>>> >>>> Thus a filesystem like ext3 could be aware that it is running >>>> on a USB HD, enable -o sync be default or have the filesystem >>>> to rewrite blocks when running on crypto or optimise for an SSD, ... >>> I would certainly agree that exposing information to higher levels is a good >>> idea. To some extent we do. But it isn''t always as easy as it might sound. >>> Choosing exactly what information to expose is the challenge. If you lack >>> sufficient foresight you might expose something which turns out to be >>> very specific to just one device, so all those upper levels which make use of >>> the information find they are really special-casing one specific device, >>> which isn''t a good idea. >>> >>> >>> However it doesn''t follow that RAID5 should not be implemented in BTRFS. >>> The levels that you have drawn are just one perspective. While that has >>> value, it may not be universal. >>> I could easily argue that the LVM layer is a mistake and that filesystems >>> should provide that functionality directly. >>> I could almost argue the same for crypto. >>> RAID1 can make a lot of sense to be tightly integrated with the FS. >>> RAID5 ... I''m less convinced, but then I have a vested interest there so that >>> isn''t an objective assessment. >>> >>> Part of "the way Linux works" is that s/he who writes the code gets to make >>> the design decisions. The BTRFS developers might create something truly >>> awesome, or might end up having to support a RAID feature that they >>> subsequently think is a bad idea. But it really is their decision to make. >>> >>> NeilBrown >>> >> One more thing to add here is that I think that we still have a chance to >> increase the sharing between btrfs and the MD stack if we can get those changes >> made. No one likes to duplicate code, but we will need a richer interface >> between the block and file system layer to help close that gap. >> >> Ric >> > I''m certainly open to suggestions and collaboration. Do you have in mind any > particular way to make the interface richer?? > > NeilBrownHi Neil, I know that Chris has a very specific set of use cases for btrfs and think that Alasdair and others have started to look at what is doable. The obvious use case is the following: If a file system uses checksumming or other data corruption detection bits, it can detect that it got bad data on a write. If that data was protected by RAID, it would like to ask the block layer to try to read from another mirror (for raid1) or try to validate/rebuild from parity. Today, I think that a retry will basically just give us back a random chance of getting data from a different mirror or the same one that we got data from on the first go. Chris, Alasdair, was that a good summary of one concern? Thanks! Ric
On Thu, 14 Jul 2011 07:02:22 +0100 Ric Wheeler <rwheeler@redhat.com> wrote:> > I''m certainly open to suggestions and collaboration. Do you have in mind any > > particular way to make the interface richer?? > > > > NeilBrown > > Hi Neil, > > I know that Chris has a very specific set of use cases for btrfs and think that > Alasdair and others have started to look at what is doable. > > The obvious use case is the following: > > If a file system uses checksumming or other data corruption detection bits, it > can detect that it got bad data on a write. If that data was protected by RAID, > it would like to ask the block layer to try to read from another mirror (for > raid1) or try to validate/rebuild from parity. > > Today, I think that a retry will basically just give us back a random chance of > getting data from a different mirror or the same one that we got data from on > the first go. > > Chris, Alasdair, was that a good summary of one concern? > > Thanks! > > RicI imagine a new field in ''struct bio'' which was normally zero but could be some small integer. It is only meaningful for read. When 0 it means "get this data way you like". When non-zero it means "get this data using method N", where the different methods are up to the device. For a mirrored RAID, method N means read from device N-1. For stripe/parity RAID, method 1 means "use other data blocks and parity blocks to reconstruct data. The default for non RAID devices is to return EINVAL for any N > 0. A remapping device (dm-linear, dm-stripe etc) would just pass the number down. I''m not sure how RAID1 over RAID5 would handle it... that might need some thought. So if btrfs reads a block and the checksum looks wrong, it reads again with a larger N. It continues incrementing N and retrying until it gets a block that it likes or it gets EINVAL. There should probably be an error code (EAGAIN?) which means "I cannot work with that number, but try the next one". It would be trivial for me to implement this for RAID1 and RAID10, and relatively easy for RAID5. I''d need to give a bit of thought to RAID6 as there are possibly multiple ways to reconstruct from different combinations of parity and data. I''m not sure if there would be much point in doing that though. It might make sense for a device to be able to report what the maximum ''N'' supported is... that might make stacked raid easier to manage... NeilBrown
On 07/14/2011 07:38 AM, NeilBrown wrote:> On Thu, 14 Jul 2011 07:02:22 +0100 Ric Wheeler<rwheeler@redhat.com> wrote: > >>> I''m certainly open to suggestions and collaboration. Do you have in mind any >>> particular way to make the interface richer?? >>> >>> NeilBrown >> Hi Neil, >> >> I know that Chris has a very specific set of use cases for btrfs and think that >> Alasdair and others have started to look at what is doable. >> >> The obvious use case is the following: >> >> If a file system uses checksumming or other data corruption detection bits, it >> can detect that it got bad data on a write. If that data was protected by RAID, >> it would like to ask the block layer to try to read from another mirror (for >> raid1) or try to validate/rebuild from parity. >> >> Today, I think that a retry will basically just give us back a random chance of >> getting data from a different mirror or the same one that we got data from on >> the first go. >> >> Chris, Alasdair, was that a good summary of one concern? >> >> Thanks! >> >> Ric > I imagine a new field in ''struct bio'' which was normally zero but could be > some small integer. It is only meaningful for read. > When 0 it means "get this data way you like". > When non-zero it means "get this data using method N", where the different > methods are up to the device. > > For a mirrored RAID, method N means read from device N-1. > For stripe/parity RAID, method 1 means "use other data blocks and parity > blocks to reconstruct data. > > The default for non RAID devices is to return EINVAL for any N> 0. > A remapping device (dm-linear, dm-stripe etc) would just pass the number > down. I''m not sure how RAID1 over RAID5 would handle it... that might need > some thought. > > So if btrfs reads a block and the checksum looks wrong, it reads again with > a larger N. It continues incrementing N and retrying until it gets a block > that it likes or it gets EINVAL. There should probably be an error code > (EAGAIN?) which means "I cannot work with that number, but try the next one". > > It would be trivial for me to implement this for RAID1 and RAID10, and > relatively easy for RAID5. > I''d need to give a bit of thought to RAID6 as there are possibly multiple > ways to reconstruct from different combinations of parity and data. I''m not > sure if there would be much point in doing that though. > > It might make sense for a device to be able to report what the maximum > ''N'' supported is... that might make stacked raid easier to manage... > > NeilBrown >I think that the above makes sense. Not sure what your "0" definition is, but I would assume that for non-raided devices (i.e., a single s-ata disk), "0" would be an indication that there is nothing more that can be tried. The data you got is all you are ever going to get :) For multiple mirrors, you might want to have a scheme where you would be able to cycle through the mirrors. You could retry, cycling through the various mirrors until you have tried and returned them all at which point you would get a "no more" error back or some such thing. Thanks! ric
On 14.07.2011 08:02, Ric Wheeler wrote:> On 07/14/2011 06:56 AM, NeilBrown wrote: >> I''m certainly open to suggestions and collaboration. Do you have in >> mind any >> particular way to make the interface richer?? > > If a file system uses checksumming or other data corruption detection > bits, it can detect that it got bad data on a write. If that data was > protected by RAID, it would like to ask the block layer to try to read > from another mirror (for raid1) or try to validate/rebuild from parity. > > Today, I think that a retry will basically just give us back a random > chance of getting data from a different mirror or the same one that we > got data from on the first go.Another case that comes to mind is the ''remove device'' operation. To accomplish this, btrfs just rewrites all the data that reside on that device to other drives. Also, scrub and my recently posted readahead patches make heavy use of the knowledge of how the raid is laid out. Readahead always uses as many drives as possible in parallel, while trying to avoid unnecessary seeks on each device. -Arne> > Chris, Alasdair, was that a good summary of one concern? > > Thanks! > > Ric >
Hi Neil, On 14.07.2011 08:38, NeilBrown wrote:> I imagine a new field in ''struct bio'' which was normally zero but could be > some small integer. It is only meaningful for read. > When 0 it means "get this data way you like". > When non-zero it means "get this data using method N", where the different > methods are up to the device. > > For a mirrored RAID, method N means read from device N-1. > For stripe/parity RAID, method 1 means "use other data blocks and parity > blocks to reconstruct data. > > The default for non RAID devices is to return EINVAL for any N > 0. > A remapping device (dm-linear, dm-stripe etc) would just pass the number > down. I''m not sure how RAID1 over RAID5 would handle it... that might need > some thought. > > So if btrfs reads a block and the checksum looks wrong, it reads again with > a larger N. It continues incrementing N and retrying until it gets a block > that it likes or it gets EINVAL. There should probably be an error code > (EAGAIN?) which means "I cannot work with that number, but try the next one".I like this idea. It comes pretty close to what btrfs is currently doing (what was the reason for this thread being started, wasn''t it?), only not within struct bio. The way you describe the extra parameter is input only. I think it would be a nice add on if we knew which "N" was used when "0" passed for the request. At least for the RAID1 case, I''d like to see that when I submit a bio with method (or whatever we call it) "0", it comes back with the method field set to the value that reflects the method the device actually used. Obviously, when passing non-zero values, the bio would have to come back with that value unmodified. That way, we''ll get more control on the retry algorithms and are free to decide not to use the failed method again, if we like. Setting "method" on the return path might be valuable not only for RAID1, but I haven''t thought that trough. -Jan -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Thu, 14 Jul 2011 11:37:41 +0200 Jan Schmidt <list.btrfs@jan-o-sch.net> wrote:> Hi Neil, > > On 14.07.2011 08:38, NeilBrown wrote: > > I imagine a new field in ''struct bio'' which was normally zero but could be > > some small integer. It is only meaningful for read. > > When 0 it means "get this data way you like". > > When non-zero it means "get this data using method N", where the different > > methods are up to the device. > > > > For a mirrored RAID, method N means read from device N-1. > > For stripe/parity RAID, method 1 means "use other data blocks and parity > > blocks to reconstruct data. > > > > The default for non RAID devices is to return EINVAL for any N > 0. > > A remapping device (dm-linear, dm-stripe etc) would just pass the number > > down. I''m not sure how RAID1 over RAID5 would handle it... that might need > > some thought. > > > > So if btrfs reads a block and the checksum looks wrong, it reads again with > > a larger N. It continues incrementing N and retrying until it gets a block > > that it likes or it gets EINVAL. There should probably be an error code > > (EAGAIN?) which means "I cannot work with that number, but try the next one". > > I like this idea. It comes pretty close to what btrfs is currently doing > (what was the reason for this thread being started, wasn''t it?), only > not within struct bio. > > The way you describe the extra parameter is input only. I think it would > be a nice add on if we knew which "N" was used when "0" passed for the > request. At least for the RAID1 case, I''d like to see that when I submit > a bio with method (or whatever we call it) "0", it comes back with the > method field set to the value that reflects the method the device > actually used. Obviously, when passing non-zero values, the bio would > have to come back with that value unmodified. > > That way, we''ll get more control on the retry algorithms and are free to > decide not to use the failed method again, if we like. > > Setting "method" on the return path might be valuable not only for > RAID1, but I haven''t thought that trough. > > -JanThat sounds like it would be reasonable... It would be important not to read too much into the number though. i.e. don''t think of it as "RAID1" but keep a much more abstract idea, else you could run into trouble. A (near) future addition to md is keeping track of ''bad blocks'' so we can fail more gracefully as devices start to fail. So a read request to a RAID1 might not be served by just one device, but might be served by one device for some parts, and another device for other parts, so as to avoid one or more ''bad blocks''. I think I could still provide a reasonably consistent mapping from ''arbitrary number'' to ''set of devices'', but it may not be what you expect. And the number ''1'' may well correspond to different devices for different bi_sector offsets. i.e. the abstraction must allow the low level implementation substantial freedom to choosing how to implement each request. But yes - I don''t see why we couldn''t report which strategy was used with the implication that using that same strategy at the same offset with the same size would be expected to produce the same result. NeilBrown
On 07/14/2011 08:38 AM, NeilBrown wrote:> On Thu, 14 Jul 2011 07:02:22 +0100 Ric Wheeler <rwheeler@redhat.com> wrote: > >>> I''m certainly open to suggestions and collaboration. Do you have in mind any >>> particular way to make the interface richer?? >>> >>> NeilBrown >> >> Hi Neil, >> >> I know that Chris has a very specific set of use cases for btrfs and think that >> Alasdair and others have started to look at what is doable. >> >> The obvious use case is the following: >> >> If a file system uses checksumming or other data corruption detection bits, it >> can detect that it got bad data on a write. If that data was protected by RAID, >> it would like to ask the block layer to try to read from another mirror (for >> raid1) or try to validate/rebuild from parity. >> >> Today, I think that a retry will basically just give us back a random chance of >> getting data from a different mirror or the same one that we got data from on >> the first go. >> >> Chris, Alasdair, was that a good summary of one concern? >> >> Thanks! >> >> Ric > > I imagine a new field in ''struct bio'' which was normally zero but could be > some small integer. It is only meaningful for read. > When 0 it means "get this data way you like". > When non-zero it means "get this data using method N", where the different > methods are up to the device.In more general terms, the filesystem should be able to require: try another read different regarding the previous ones. The term are important because we should differentiate the case of "wrong data from disk1, read from disk0" and "wrong data from disk0 read disk1". I prefer thinking the field as bitmap. Every bit represent a different way of read. So it is possible to reuse to track which "kind of read" was already used. After a 2nd read, the block layer should: a) redo the read if possible, otherwise FAIL b) pass the data to the filesystem c) if the filesystem accepts the new data, replace the wrong data with the correct one or mark the block as broken. d) inform the userspace/filesystem of the result> > For a mirrored RAID, method N means read from device N-1. > For stripe/parity RAID, method 1 means "use other data blocks and parity > blocks to reconstruct data. > > The default for non RAID devices is to return EINVAL for any N > 0. > A remapping device (dm-linear, dm-stripe etc) would just pass the number > down. I''m not sure how RAID1 over RAID5 would handle it... that might need > some thought. > > So if btrfs reads a block and the checksum looks wrong, it reads again with > a larger N. It continues incrementing N and retrying until it gets a block > that it likes or it gets EINVAL. There should probably be an error code > (EAGAIN?) which means "I cannot work with that number, but try the next one". > > It would be trivial for me to implement this for RAID1 and RAID10, and > relatively easy for RAID5. > I''d need to give a bit of thought to RAID6 as there are possibly multiple > ways to reconstruct from different combinations of parity and data. I''m not > sure if there would be much point in doing that though. > > It might make sense for a device to be able to report what the maximum > ''N'' supported is... that might make stacked raid easier to manage... > > NeilBrown > > -- > To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > . >
On Thu, Jul 14, 2011 at 04:38:36PM +1000, Neil Brown wrote:> It might make sense for a device to be able to report what the maximum > ''N'' supported is... that might make stacked raid easier to manage...I''ll just say that any solution ought to be stackable. This means understanding both that the number of data access routes may vary as you move through the stack, and that this number may depend on the offset within the device. Alasdair -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
>>>>> "Alasdair" == Alasdair G Kergon <agk@redhat.com> writes:Alasdair> On Thu, Jul 14, 2011 at 04:38:36PM +1000, Neil Brown wrote:>> It might make sense for a device to be able to report what the maximum >> ''N'' supported is... that might make stacked raid easier to manage...Alasdair> I''ll just say that any solution ought to be stackable. I''ve been mulling this over too and wondering how you''d handle this, because upper layers really can''t peak down into lower layers easily. As far as I understand things. So if you have btrfs -> luks -> raid1 -> raid6 -> nbd -> remote disks How does btrfs handle errors (or does it even see them!) from the raid6 level when a single nbd device goes away? Or taking the original example, when btrfs notices a checksum isn''t correct, how would it push down multiple levels to try and find the correct data? Alasdair> This means understanding both that the number of data access Alasdair> routes may vary as you move through the stack, and that this Alasdair> number may depend on the offset within the device. It almost seems to me that the retry needs to be done at each level on it''s own, without pushing down or up the stack. But this doesn''t handle the wrong file checksum issue. Hmm... maybe instead of just one number, we need another to count the levels down you go (or just split 16bit integer in half, bottom half being count of tries, the upper half being levels down to try that read?) It seems to defeat the purpose of layers if you can go down and find out how many layers there are underneath you.... John
On Wed, Jun 29, 2011 at 3:47 AM, A. James Lewis <james@fsck.co.uk> wrote:> Is there a possibility that one could have a 3 disk RAID5 array, and > then add a 4th disk and then do a balance, growing the RAID5 onto 4 > disks and gaining the space still with RAID5? It seems that to be > consistent, BTRFS would have to do this. > > If this is the case, then I think that the BTRFS implementation of RAID5 > would have to be quite different to the MD implementation. > > James.My understanding, gleaned from IRC several months ago, is that Btrfs would use the new drive, but not change the stripe size. Each allocation would then be written across some selection of three of the four drives. In other words, if you started with four stripes across three drives: AAA BBB CCC DDD and then added a drive and balanced, you might get something like: AAAB BBCC CDDD which would give you more space, but still use ⅓ of the space for parity. Trying to remove a drive from the original three-drive configuration would be an error, similarly to trying to remove the second to last drive in RAID 1, currently. Actually changing the stripe size would be done using the same mechanism as changing RAID levels. Again, this is just an interested but uninvolved person''s understanding based on an IRC conversation, so please salt to taste. -- Erik -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Thu, 14 Jul 2011, John Stoffel wrote:> Alasdair> I''ll just say that any solution ought to be stackable. > > I''ve been mulling this over too and wondering how you''d handle this, > because upper layers really can''t peak down into lower layers easily. > As far as I understand things. > > So if you have btrfs -> luks -> raid1 -> raid6 -> nbd -> remote disks > > How does btrfs handle errors (or does it even see them!) from the > raid6 level when a single nbd device goes away? Or taking the > original example, when btrfs notices a checksum isn''t correct, how > would it push down multiple levels to try and find the correct data? > > Alasdair> This means understanding both that the number of data access > Alasdair> routes may vary as you move through the stack, and that this > Alasdair> number may depend on the offset within the device. > > It almost seems to me that the retry needs to be done at each level on > it''s own, without pushing down or up the stack. But this doesn''t > handle the wrong file checksum issue. > > Hmm... maybe instead of just one number, we need another to count the > levels down you go (or just split 16bit integer in half, bottom half > being count of tries, the upper half being levels down to try that > read?) > > It seems to defeat the purpose of layers if you can go down and find > out how many layers there are underneath you....this is why just an arbatrary ''method number'' rather than a bitmap or something like that should be used. using your example:> So if you have btrfs -> luks -> raid1 -> raid6 -> nbd -> remote disksraid1 has at least 2 values, raid 6 has at least 2 values, the combination of the two stacked should have at least 4 values. if each layer can query the layer below it to find out how many methods it supports, it can then combine each of the methods it supports with each of the methods supported by the layer below it. this will stack to an arbatrary number of layers, only limited by how large the value is allowed to be limiting the combinational permutations of all the layers options. David Lang
On Thu, Jul 14, 2011 at 12:50 PM, John Stoffel <john@stoffel.org> wrote:>>>>>> "Alasdair" == Alasdair G Kergon <agk@redhat.com> writes: > > Alasdair> On Thu, Jul 14, 2011 at 04:38:36PM +1000, Neil Brown wrote: >>> It might make sense for a device to be able to report what the maximum >>> ''N'' supported is... that might make stacked raid easier to manage... > > Alasdair> I''ll just say that any solution ought to be stackable. > > I''ve been mulling this over too and wondering how you''d handle this, > because upper layers really can''t peak down into lower layers easily. > As far as I understand things. > > So if you have btrfs -> luks -> raid1 -> raid6 -> nbd -> remote disks > > How does btrfs handle errors (or does it even see them!) from the > raid6 level when a single nbd device goes away? Or taking the > original example, when btrfs notices a checksum isn''t correct, how > would it push down multiple levels to try and find the correct data? > > Alasdair> This means understanding both that the number of data access > Alasdair> routes may vary as you move through the stack, and that this > Alasdair> number may depend on the offset within the device. > > It almost seems to me that the retry needs to be done at each level on > it''s own, without pushing down or up the stack. But this doesn''t > handle the wrong file checksum issue. > > Hmm... maybe instead of just one number, we need another to count the > levels down you go (or just split 16bit integer in half, bottom half > being count of tries, the upper half being levels down to try that > read?) > > It seems to defeat the purpose of layers if you can go down and find > out how many layers there are underneath you.... > > JohnA random thought: What if we allow the number to wrap at each level, and, each time it wraps, increment the number passed to the next lower level. A zero would propagate down, letting each level do what it wants: luks: 0 raid1: 0 raid6: 0 nbd: 0 And higher numbers would indicate the method at each level: For a 1: luks: 1 raid1: 1 raid6: 1 nbd: 1 For a 3: luks: 1 (only one possibility, passes three down) raid1: 1 (two possibilities, so we wrap back to one and pass two down, since we wrapped once) raid6: 2 (not wrapped) nbd: 1 When the bottom-most level gets an N that it can''t handle, it would return EINVAL, which would be propagated up the stack. This would allow the same algorithm of incrementing N until we receive good data or EINVAL, and would exhaust all ways of reading the data at all levels.
Excerpts from Ric Wheeler''s message of 2011-07-14 02:57:54 -0400:> On 07/14/2011 07:38 AM, NeilBrown wrote: > > On Thu, 14 Jul 2011 07:02:22 +0100 Ric Wheeler<rwheeler@redhat.com> wrote: > > > >>> I''m certainly open to suggestions and collaboration. Do you have in mind any > >>> particular way to make the interface richer?? > >>> > >>> NeilBrown > >> Hi Neil, > >> > >> I know that Chris has a very specific set of use cases for btrfs and think that > >> Alasdair and others have started to look at what is doable. > >> > >> The obvious use case is the following: > >> > >> If a file system uses checksumming or other data corruption detection bits, it > >> can detect that it got bad data on a write. If that data was protected by RAID, > >> it would like to ask the block layer to try to read from another mirror (for > >> raid1) or try to validate/rebuild from parity. > >> > >> Today, I think that a retry will basically just give us back a random chance of > >> getting data from a different mirror or the same one that we got data from on > >> the first go. > >> > >> Chris, Alasdair, was that a good summary of one concern? > >> > >> Thanks! > >> > >> Ric > > I imagine a new field in ''struct bio'' which was normally zero but could be > > some small integer. It is only meaningful for read. > > When 0 it means "get this data way you like". > > When non-zero it means "get this data using method N", where the different > > methods are up to the device. > > > > For a mirrored RAID, method N means read from device N-1. > > For stripe/parity RAID, method 1 means "use other data blocks and parity > > blocks to reconstruct data. > > > > The default for non RAID devices is to return EINVAL for any N> 0. > > A remapping device (dm-linear, dm-stripe etc) would just pass the number > > down. I''m not sure how RAID1 over RAID5 would handle it... that might need > > some thought. > > > > So if btrfs reads a block and the checksum looks wrong, it reads again with > > a larger N. It continues incrementing N and retrying until it gets a block > > that it likes or it gets EINVAL. There should probably be an error code > > (EAGAIN?) which means "I cannot work with that number, but try the next one". > > > > It would be trivial for me to implement this for RAID1 and RAID10, and > > relatively easy for RAID5. > > I''d need to give a bit of thought to RAID6 as there are possibly multiple > > ways to reconstruct from different combinations of parity and data. I''m not > > sure if there would be much point in doing that though. > > > > It might make sense for a device to be able to report what the maximum > > ''N'' supported is... that might make stacked raid easier to manage... > > > > NeilBrown > > > > I think that the above makes sense. Not sure what your "0" definition is, but I > would assume that for non-raided devices (i.e., a single s-ata disk), "0" would > be an indication that there is nothing more that can be tried. The data you got > is all you are ever going to get :) > > For multiple mirrors, you might want to have a scheme where you would be able to > cycle through the mirrors. You could retry, cycling through the various mirrors > until you have tried and returned them all at which point you would get a "no > more" error back or some such thing.Hi everyone, The mirror number idea is basically what btrfs does today, and I think it fits in with Neil''s idea to have different policies for different blocks. Basically what btrfs does: read_block(block_num, mirror = 0) if (no io error and not csum error) horray() num_mirrors = get_num_copies(block number) for (i = 1; i < num_mirrors; i++) { read_block(block_num, mirror = i); } In a stacked configuration, the get_num_copies function can be smarter, basically adding up all the copies of the lower levels and finding a way to combine them. We could just send down a fake bio that is responsible for adding up the storage combinations into a bitmask or whatever works. We could also just keep retrying until the lower layers reported no more mirror were available. In btrfs at least, we don''t set the data blocks up to date until the crc has passed, so replacing bogus blocks is easy. Metadata is a little more complex, but that''s not really related to this topic. mirror number 0 just means "no mirror preference/pick the fastest mirror" to the btrfs block mapping code. Btrfs has the concept of different raid levels for different logical block numbers, so you get_num_copies might return one answer for block A and a different answer for block B. Either way, we could easily make use of a bio field here if it were exported out. -chris -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Thu, 14 Jul 2011, Chris Mason wrote:> Excerpts from Ric Wheeler''s message of 2011-07-14 02:57:54 -0400: >> On 07/14/2011 07:38 AM, NeilBrown wrote: >>> On Thu, 14 Jul 2011 07:02:22 +0100 Ric Wheeler<rwheeler@redhat.com> wrote: >>> >>>>> I''m certainly open to suggestions and collaboration. Do you have in mind any >>>>> particular way to make the interface richer?? >>>>> >>>>> NeilBrown >>>> Hi Neil, >>>> >>>> I know that Chris has a very specific set of use cases for btrfs and think that >>>> Alasdair and others have started to look at what is doable. >>>> >>>> The obvious use case is the following: >>>> >>>> If a file system uses checksumming or other data corruption detection bits, it >>>> can detect that it got bad data on a write. If that data was protected by RAID, >>>> it would like to ask the block layer to try to read from another mirror (for >>>> raid1) or try to validate/rebuild from parity. >>>> >>>> Today, I think that a retry will basically just give us back a random chance of >>>> getting data from a different mirror or the same one that we got data from on >>>> the first go. >>>> >>>> Chris, Alasdair, was that a good summary of one concern? >>>> >>>> Thanks! >>>> >>>> Ric >>> I imagine a new field in ''struct bio'' which was normally zero but could be >>> some small integer. It is only meaningful for read. >>> When 0 it means "get this data way you like". >>> When non-zero it means "get this data using method N", where the different >>> methods are up to the device. >>> >>> For a mirrored RAID, method N means read from device N-1. >>> For stripe/parity RAID, method 1 means "use other data blocks and parity >>> blocks to reconstruct data. >>> >>> The default for non RAID devices is to return EINVAL for any N> 0. >>> A remapping device (dm-linear, dm-stripe etc) would just pass the number >>> down. I''m not sure how RAID1 over RAID5 would handle it... that might need >>> some thought. >>> >>> So if btrfs reads a block and the checksum looks wrong, it reads again with >>> a larger N. It continues incrementing N and retrying until it gets a block >>> that it likes or it gets EINVAL. There should probably be an error code >>> (EAGAIN?) which means "I cannot work with that number, but try the next one". >>> >>> It would be trivial for me to implement this for RAID1 and RAID10, and >>> relatively easy for RAID5. >>> I''d need to give a bit of thought to RAID6 as there are possibly multiple >>> ways to reconstruct from different combinations of parity and data. I''m not >>> sure if there would be much point in doing that though. >>> >>> It might make sense for a device to be able to report what the maximum >>> ''N'' supported is... that might make stacked raid easier to manage... >>> >>> NeilBrown >>> >> >> I think that the above makes sense. Not sure what your "0" definition is, but I >> would assume that for non-raided devices (i.e., a single s-ata disk), "0" would >> be an indication that there is nothing more that can be tried. The data you got >> is all you are ever going to get :) >> >> For multiple mirrors, you might want to have a scheme where you would be able to >> cycle through the mirrors. You could retry, cycling through the various mirrors >> until you have tried and returned them all at which point you would get a "no >> more" error back or some such thing. > > Hi everyone, > > The mirror number idea is basically what btrfs does today, and I think > it fits in with Neil''s idea to have different policies for different > blocks. > > Basically what btrfs does: > > read_block(block_num, mirror = 0) > if (no io error and not csum error) > horray() > > num_mirrors = get_num_copies(block number) > for (i = 1; i < num_mirrors; i++) { > read_block(block_num, mirror = i); > } > > In a stacked configuration, the get_num_copies function can be smarter, > basically adding up all the copies of the lower levels and finding a way > to combine them. We could just send down a fake bio that is responsible > for adding up the storage combinations into a bitmask or whatever works. > > We could also just keep retrying until the lower layers reported no more > mirror were available. > > In btrfs at least, we don''t set the data blocks up to date until the crc > has passed, so replacing bogus blocks is easy. Metadata is a little > more complex, but that''s not really related to this topic. > > mirror number 0 just means "no mirror preference/pick the fastest > mirror" to the btrfs block mapping code. > > Btrfs has the concept of different raid levels for different logical > block numbers, so you get_num_copies might return one answer for block A > and a different answer for block B. > > Either way, we could easily make use of a bio field here if it were > exported out.you don''t want to just pass the value down as that will cause problems with layering (especially if the lower layer supports more values than a higher layer) I would suggest that each layer take the value it''s give, do an integer division by the number of values that layer supports, using the modulo value for that layer and pass the rest of the result down to the next layer. this works for just trying values until you get the error that tells you there are no more options. things can get very ''intersesting'' if the different options below you have different number of options (say a raid1 across a raid5 and a single disk) but I can''t think of any good way to figure this out and assemble a sane order without doing an exaustive search down to find the number of options for each layer (and since this can be different for different blocks, you would have to do this each time you invoked this option) David Lang -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Thu, 14 Jul 2011 21:58:46 -0700 (PDT) david@lang.hm wrote:> On Thu, 14 Jul 2011, Chris Mason wrote: > > > Excerpts from Ric Wheeler''s message of 2011-07-14 02:57:54 -0400: > >> On 07/14/2011 07:38 AM, NeilBrown wrote: > >>> On Thu, 14 Jul 2011 07:02:22 +0100 Ric Wheeler<rwheeler@redhat.com> wrote: > >>> > >>>>> I''m certainly open to suggestions and collaboration. Do you have in mind any > >>>>> particular way to make the interface richer?? > >>>>> > >>>>> NeilBrown > >>>> Hi Neil, > >>>> > >>>> I know that Chris has a very specific set of use cases for btrfs and think that > >>>> Alasdair and others have started to look at what is doable. > >>>> > >>>> The obvious use case is the following: > >>>> > >>>> If a file system uses checksumming or other data corruption detection bits, it > >>>> can detect that it got bad data on a write. If that data was protected by RAID, > >>>> it would like to ask the block layer to try to read from another mirror (for > >>>> raid1) or try to validate/rebuild from parity. > >>>> > >>>> Today, I think that a retry will basically just give us back a random chance of > >>>> getting data from a different mirror or the same one that we got data from on > >>>> the first go. > >>>> > >>>> Chris, Alasdair, was that a good summary of one concern? > >>>> > >>>> Thanks! > >>>> > >>>> Ric > >>> I imagine a new field in ''struct bio'' which was normally zero but could be > >>> some small integer. It is only meaningful for read. > >>> When 0 it means "get this data way you like". > >>> When non-zero it means "get this data using method N", where the different > >>> methods are up to the device. > >>> > >>> For a mirrored RAID, method N means read from device N-1. > >>> For stripe/parity RAID, method 1 means "use other data blocks and parity > >>> blocks to reconstruct data. > >>> > >>> The default for non RAID devices is to return EINVAL for any N> 0. > >>> A remapping device (dm-linear, dm-stripe etc) would just pass the number > >>> down. I''m not sure how RAID1 over RAID5 would handle it... that might need > >>> some thought. > >>> > >>> So if btrfs reads a block and the checksum looks wrong, it reads again with > >>> a larger N. It continues incrementing N and retrying until it gets a block > >>> that it likes or it gets EINVAL. There should probably be an error code > >>> (EAGAIN?) which means "I cannot work with that number, but try the next one". > >>> > >>> It would be trivial for me to implement this for RAID1 and RAID10, and > >>> relatively easy for RAID5. > >>> I''d need to give a bit of thought to RAID6 as there are possibly multiple > >>> ways to reconstruct from different combinations of parity and data. I''m not > >>> sure if there would be much point in doing that though. > >>> > >>> It might make sense for a device to be able to report what the maximum > >>> ''N'' supported is... that might make stacked raid easier to manage... > >>> > >>> NeilBrown > >>> > >> > >> I think that the above makes sense. Not sure what your "0" definition is, but I > >> would assume that for non-raided devices (i.e., a single s-ata disk), "0" would > >> be an indication that there is nothing more that can be tried. The data you got > >> is all you are ever going to get :) > >> > >> For multiple mirrors, you might want to have a scheme where you would be able to > >> cycle through the mirrors. You could retry, cycling through the various mirrors > >> until you have tried and returned them all at which point you would get a "no > >> more" error back or some such thing. > > > > Hi everyone, > > > > The mirror number idea is basically what btrfs does today, and I think > > it fits in with Neil''s idea to have different policies for different > > blocks. > > > > Basically what btrfs does: > > > > read_block(block_num, mirror = 0) > > if (no io error and not csum error) > > horray() > > > > num_mirrors = get_num_copies(block number) > > for (i = 1; i < num_mirrors; i++) { > > read_block(block_num, mirror = i); > > } > > > > In a stacked configuration, the get_num_copies function can be smarter, > > basically adding up all the copies of the lower levels and finding a way > > to combine them. We could just send down a fake bio that is responsible > > for adding up the storage combinations into a bitmask or whatever works. > > > > We could also just keep retrying until the lower layers reported no more > > mirror were available. > > > > In btrfs at least, we don''t set the data blocks up to date until the crc > > has passed, so replacing bogus blocks is easy. Metadata is a little > > more complex, but that''s not really related to this topic. > > > > mirror number 0 just means "no mirror preference/pick the fastest > > mirror" to the btrfs block mapping code. > > > > Btrfs has the concept of different raid levels for different logical > > block numbers, so you get_num_copies might return one answer for block A > > and a different answer for block B. > > > > Either way, we could easily make use of a bio field here if it were > > exported out. > > you don''t want to just pass the value down as that will cause problems > with layering (especially if the lower layer supports more values than a > higher layer) > > I would suggest that each layer take the value it''s give, do an integer > division by the number of values that layer supports, using the modulo > value for that layer and pass the rest of the result down to the next > layer.I, on the other hand, would suggest that each layer be given the freedom, and the responsibility, to do whatever it thinks is most appropriate. This would probably involved checking the lower levels to see how many strategies each has, and doing some appropriate arithmetic depending on how it combines those devices. One problem here is the assumption that the lower levels don''t change, and we know that not to be the case. However this is already a problem. It is entirely possible that the request size parameters of devices can change at any moment, even while a request is in-flight ... though we try to avoid that or work around it. The sort of dependencies that we see forming here would probably just make the problem worse. Not sure what to do about it though.... maybe just hope it isn''t a problem. NeilBrown> > this works for just trying values until you get the error that tells you > there are no more options. > > > things can get very ''intersesting'' if the different options below you have > different number of options (say a raid1 across a raid5 and a single disk) > but I can''t think of any good way to figure this out and assemble a > sane order without doing an exaustive search down to find the number of > options for each layer (and since this can be different for different > blocks, you would have to do this each time you invoked this option) > > David Lang
Excerpts from NeilBrown''s message of 2011-07-15 02:33:54 -0400:> On Thu, 14 Jul 2011 21:58:46 -0700 (PDT) david@lang.hm wrote: > > > On Thu, 14 Jul 2011, Chris Mason wrote: > > > > > Excerpts from Ric Wheeler''s message of 2011-07-14 02:57:54 -0400: > > >> On 07/14/2011 07:38 AM, NeilBrown wrote: > > >>> On Thu, 14 Jul 2011 07:02:22 +0100 Ric Wheeler<rwheeler@redhat.com> wrote: > > >>> > > >>>>> I''m certainly open to suggestions and collaboration. Do you have in mind any > > >>>>> particular way to make the interface richer?? > > >>>>> > > >>>>> NeilBrown > > >>>> Hi Neil, > > >>>> > > >>>> I know that Chris has a very specific set of use cases for btrfs and think that > > >>>> Alasdair and others have started to look at what is doable. > > >>>> > > >>>> The obvious use case is the following: > > >>>> > > >>>> If a file system uses checksumming or other data corruption detection bits, it > > >>>> can detect that it got bad data on a write. If that data was protected by RAID, > > >>>> it would like to ask the block layer to try to read from another mirror (for > > >>>> raid1) or try to validate/rebuild from parity. > > >>>> > > >>>> Today, I think that a retry will basically just give us back a random chance of > > >>>> getting data from a different mirror or the same one that we got data from on > > >>>> the first go. > > >>>> > > >>>> Chris, Alasdair, was that a good summary of one concern? > > >>>> > > >>>> Thanks! > > >>>> > > >>>> Ric > > >>> I imagine a new field in ''struct bio'' which was normally zero but could be > > >>> some small integer. It is only meaningful for read. > > >>> When 0 it means "get this data way you like". > > >>> When non-zero it means "get this data using method N", where the different > > >>> methods are up to the device. > > >>> > > >>> For a mirrored RAID, method N means read from device N-1. > > >>> For stripe/parity RAID, method 1 means "use other data blocks and parity > > >>> blocks to reconstruct data. > > >>> > > >>> The default for non RAID devices is to return EINVAL for any N> 0. > > >>> A remapping device (dm-linear, dm-stripe etc) would just pass the number > > >>> down. I''m not sure how RAID1 over RAID5 would handle it... that might need > > >>> some thought. > > >>> > > >>> So if btrfs reads a block and the checksum looks wrong, it reads again with > > >>> a larger N. It continues incrementing N and retrying until it gets a block > > >>> that it likes or it gets EINVAL. There should probably be an error code > > >>> (EAGAIN?) which means "I cannot work with that number, but try the next one". > > >>> > > >>> It would be trivial for me to implement this for RAID1 and RAID10, and > > >>> relatively easy for RAID5. > > >>> I''d need to give a bit of thought to RAID6 as there are possibly multiple > > >>> ways to reconstruct from different combinations of parity and data. I''m not > > >>> sure if there would be much point in doing that though. > > >>> > > >>> It might make sense for a device to be able to report what the maximum > > >>> ''N'' supported is... that might make stacked raid easier to manage... > > >>> > > >>> NeilBrown > > >>> > > >> > > >> I think that the above makes sense. Not sure what your "0" definition is, but I > > >> would assume that for non-raided devices (i.e., a single s-ata disk), "0" would > > >> be an indication that there is nothing more that can be tried. The data you got > > >> is all you are ever going to get :) > > >> > > >> For multiple mirrors, you might want to have a scheme where you would be able to > > >> cycle through the mirrors. You could retry, cycling through the various mirrors > > >> until you have tried and returned them all at which point you would get a "no > > >> more" error back or some such thing. > > > > > > Hi everyone, > > > > > > The mirror number idea is basically what btrfs does today, and I think > > > it fits in with Neil''s idea to have different policies for different > > > blocks. > > > > > > Basically what btrfs does: > > > > > > read_block(block_num, mirror = 0) > > > if (no io error and not csum error) > > > horray() > > > > > > num_mirrors = get_num_copies(block number) > > > for (i = 1; i < num_mirrors; i++) { > > > read_block(block_num, mirror = i); > > > } > > > > > > In a stacked configuration, the get_num_copies function can be smarter, > > > basically adding up all the copies of the lower levels and finding a way > > > to combine them. We could just send down a fake bio that is responsible > > > for adding up the storage combinations into a bitmask or whatever works. > > > > > > We could also just keep retrying until the lower layers reported no more > > > mirror were available. > > > > > > In btrfs at least, we don''t set the data blocks up to date until the crc > > > has passed, so replacing bogus blocks is easy. Metadata is a little > > > more complex, but that''s not really related to this topic. > > > > > > mirror number 0 just means "no mirror preference/pick the fastest > > > mirror" to the btrfs block mapping code. > > > > > > Btrfs has the concept of different raid levels for different logical > > > block numbers, so you get_num_copies might return one answer for block A > > > and a different answer for block B. > > > > > > Either way, we could easily make use of a bio field here if it were > > > exported out. > > > > you don''t want to just pass the value down as that will cause problems > > with layering (especially if the lower layer supports more values than a > > higher layer) > > > > I would suggest that each layer take the value it''s give, do an integer > > division by the number of values that layer supports, using the modulo > > value for that layer and pass the rest of the result down to the next > > layer. > > I, on the other hand, would suggest that each layer be given the freedom, and > the responsibility, to do whatever it thinks is most appropriate. > > This would probably involved checking the lower levels to see how many > strategies each has, and doing some appropriate arithmetic depending on how > it combines those devices. > > One problem here is the assumption that the lower levels don''t change, and we > know that not to be the case. > However this is already a problem. It is entirely possible that the request > size parameters of devices can change at any moment, even while a request is > in-flight ... though we try to avoid that or work around it. > > The sort of dependencies that we see forming here would probably just make > the problem worse. > > Not sure what to do about it though.... maybe just hope it isn''t a problem.Agreed, if we want to go beyond best effort for a stacking config, we''ll need to put some state struct in the bio that each layer can play with. That way each layer knows which mirrors have already been tried. But, maybe the whole btrfs model is backwards for a generic layer. Instead of sending down ios and testing when they come back, we could just set a verification function (or stack of them?). For metadata, btrfs compares the crc and a few other fields of the metadata block, so we can easily add a compare function pointer and a void * to pass in. The problem is the crc can take a lot of CPU, so btrfs kicks it off to threading pools so saturate all the cpus on the box. But there''s no reason we can''t make that available lower down. If we pushed the verification down, the retries could bubble up the stack instead of the other way around. -chris -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 07/15/2011 12:34 PM, Chris Mason wrote:> Excerpts from NeilBrown''s message of 2011-07-15 02:33:54 -0400: >> On Thu, 14 Jul 2011 21:58:46 -0700 (PDT) david@lang.hm wrote: >> >>> On Thu, 14 Jul 2011, Chris Mason wrote: >>> >>>> Excerpts from Ric Wheeler''s message of 2011-07-14 02:57:54 -0400: >>>>> On 07/14/2011 07:38 AM, NeilBrown wrote: >>>>>> On Thu, 14 Jul 2011 07:02:22 +0100 Ric Wheeler<rwheeler@redhat.com> wrote: >>>>>> >>>>>>>> I''m certainly open to suggestions and collaboration. Do you have in mind any >>>>>>>> particular way to make the interface richer?? >>>>>>>> >>>>>>>> NeilBrown >>>>>>> Hi Neil, >>>>>>> >>>>>>> I know that Chris has a very specific set of use cases for btrfs and think that >>>>>>> Alasdair and others have started to look at what is doable. >>>>>>> >>>>>>> The obvious use case is the following: >>>>>>> >>>>>>> If a file system uses checksumming or other data corruption detection bits, it >>>>>>> can detect that it got bad data on a write. If that data was protected by RAID, >>>>>>> it would like to ask the block layer to try to read from another mirror (for >>>>>>> raid1) or try to validate/rebuild from parity. >>>>>>> >>>>>>> Today, I think that a retry will basically just give us back a random chance of >>>>>>> getting data from a different mirror or the same one that we got data from on >>>>>>> the first go. >>>>>>> >>>>>>> Chris, Alasdair, was that a good summary of one concern? >>>>>>> >>>>>>> Thanks! >>>>>>> >>>>>>> Ric >>>>>> I imagine a new field in ''struct bio'' which was normally zero but could be >>>>>> some small integer. It is only meaningful for read. >>>>>> When 0 it means "get this data way you like". >>>>>> When non-zero it means "get this data using method N", where the different >>>>>> methods are up to the device. >>>>>> >>>>>> For a mirrored RAID, method N means read from device N-1. >>>>>> For stripe/parity RAID, method 1 means "use other data blocks and parity >>>>>> blocks to reconstruct data. >>>>>> >>>>>> The default for non RAID devices is to return EINVAL for any N> 0. >>>>>> A remapping device (dm-linear, dm-stripe etc) would just pass the number >>>>>> down. I''m not sure how RAID1 over RAID5 would handle it... that might need >>>>>> some thought. >>>>>> >>>>>> So if btrfs reads a block and the checksum looks wrong, it reads again with >>>>>> a larger N. It continues incrementing N and retrying until it gets a block >>>>>> that it likes or it gets EINVAL. There should probably be an error code >>>>>> (EAGAIN?) which means "I cannot work with that number, but try the next one". >>>>>> >>>>>> It would be trivial for me to implement this for RAID1 and RAID10, and >>>>>> relatively easy for RAID5. >>>>>> I''d need to give a bit of thought to RAID6 as there are possibly multiple >>>>>> ways to reconstruct from different combinations of parity and data. I''m not >>>>>> sure if there would be much point in doing that though. >>>>>> >>>>>> It might make sense for a device to be able to report what the maximum >>>>>> ''N'' supported is... that might make stacked raid easier to manage... >>>>>> >>>>>> NeilBrown >>>>>> >>>>> I think that the above makes sense. Not sure what your "0" definition is, but I >>>>> would assume that for non-raided devices (i.e., a single s-ata disk), "0" would >>>>> be an indication that there is nothing more that can be tried. The data you got >>>>> is all you are ever going to get :) >>>>> >>>>> For multiple mirrors, you might want to have a scheme where you would be able to >>>>> cycle through the mirrors. You could retry, cycling through the various mirrors >>>>> until you have tried and returned them all at which point you would get a "no >>>>> more" error back or some such thing. >>>> Hi everyone, >>>> >>>> The mirror number idea is basically what btrfs does today, and I think >>>> it fits in with Neil''s idea to have different policies for different >>>> blocks. >>>> >>>> Basically what btrfs does: >>>> >>>> read_block(block_num, mirror = 0) >>>> if (no io error and not csum error) >>>> horray() >>>> >>>> num_mirrors = get_num_copies(block number) >>>> for (i = 1; i< num_mirrors; i++) { >>>> read_block(block_num, mirror = i); >>>> } >>>> >>>> In a stacked configuration, the get_num_copies function can be smarter, >>>> basically adding up all the copies of the lower levels and finding a way >>>> to combine them. We could just send down a fake bio that is responsible >>>> for adding up the storage combinations into a bitmask or whatever works. >>>> >>>> We could also just keep retrying until the lower layers reported no more >>>> mirror were available. >>>> >>>> In btrfs at least, we don''t set the data blocks up to date until the crc >>>> has passed, so replacing bogus blocks is easy. Metadata is a little >>>> more complex, but that''s not really related to this topic. >>>> >>>> mirror number 0 just means "no mirror preference/pick the fastest >>>> mirror" to the btrfs block mapping code. >>>> >>>> Btrfs has the concept of different raid levels for different logical >>>> block numbers, so you get_num_copies might return one answer for block A >>>> and a different answer for block B. >>>> >>>> Either way, we could easily make use of a bio field here if it were >>>> exported out. >>> you don''t want to just pass the value down as that will cause problems >>> with layering (especially if the lower layer supports more values than a >>> higher layer) >>> >>> I would suggest that each layer take the value it''s give, do an integer >>> division by the number of values that layer supports, using the modulo >>> value for that layer and pass the rest of the result down to the next >>> layer. >> I, on the other hand, would suggest that each layer be given the freedom, and >> the responsibility, to do whatever it thinks is most appropriate. >> >> This would probably involved checking the lower levels to see how many >> strategies each has, and doing some appropriate arithmetic depending on how >> it combines those devices. >> >> One problem here is the assumption that the lower levels don''t change, and we >> know that not to be the case. >> However this is already a problem. It is entirely possible that the request >> size parameters of devices can change at any moment, even while a request is >> in-flight ... though we try to avoid that or work around it. >> >> The sort of dependencies that we see forming here would probably just make >> the problem worse. >> >> Not sure what to do about it though.... maybe just hope it isn''t a problem. > Agreed, if we want to go beyond best effort for a stacking config, we''ll > need to put some state struct in the bio that each layer can play with. > That way each layer knows which mirrors have already been tried. > > But, maybe the whole btrfs model is backwards for a generic layer. > Instead of sending down ios and testing when they come back, we could > just set a verification function (or stack of them?). > > For metadata, btrfs compares the crc and a few other fields of the > metadata block, so we can easily add a compare function pointer and a > void * to pass in. > > The problem is the crc can take a lot of CPU, so btrfs kicks it off to > threading pools so saturate all the cpus on the box. But there''s no > reason we can''t make that available lower down. > > If we pushed the verification down, the retries could bubble up the > stack instead of the other way around. > > -chrisI do like the idea of having the ability to do the verification and retries down the stack where you actually have the most context to figure out what is possible... Why would you need to bubble back up anything other than an error when all retries have failed? Ric
Excerpts from Ric Wheeler''s message of 2011-07-15 08:58:04 -0400:> On 07/15/2011 12:34 PM, Chris Mason wrote:[ triggering IO retries on failed crc or other checks ]> > > > But, maybe the whole btrfs model is backwards for a generic layer. > > Instead of sending down ios and testing when they come back, we could > > just set a verification function (or stack of them?). > > > > For metadata, btrfs compares the crc and a few other fields of the > > metadata block, so we can easily add a compare function pointer and a > > void * to pass in. > > > > The problem is the crc can take a lot of CPU, so btrfs kicks it off to > > threading pools so saturate all the cpus on the box. But there''s no > > reason we can''t make that available lower down. > > > > If we pushed the verification down, the retries could bubble up the > > stack instead of the other way around. > > > > -chris > > I do like the idea of having the ability to do the verification and retries down > the stack where you actually have the most context to figure out what is possible... > > Why would you need to bubble back up anything other than an error when all > retries have failed?By bubble up I mean that if you have multiple layers capable of doing retries, the lowest levels would retry first. Basically by the time we get an -EIO_ALREADY_RETRIED we know there''s nothing that lower level can do to help. -chris
On 07/15/2011 02:20 PM, Chris Mason wrote:> Excerpts from Ric Wheeler''s message of 2011-07-15 08:58:04 -0400: >> On 07/15/2011 12:34 PM, Chris Mason wrote: > [ triggering IO retries on failed crc or other checks ] > >>> But, maybe the whole btrfs model is backwards for a generic layer. >>> Instead of sending down ios and testing when they come back, we could >>> just set a verification function (or stack of them?). >>> >>> For metadata, btrfs compares the crc and a few other fields of the >>> metadata block, so we can easily add a compare function pointer and a >>> void * to pass in. >>> >>> The problem is the crc can take a lot of CPU, so btrfs kicks it off to >>> threading pools so saturate all the cpus on the box. But there''s no >>> reason we can''t make that available lower down. >>> >>> If we pushed the verification down, the retries could bubble up the >>> stack instead of the other way around. >>> >>> -chris >> I do like the idea of having the ability to do the verification and retries down >> the stack where you actually have the most context to figure out what is possible... >> >> Why would you need to bubble back up anything other than an error when all >> retries have failed? > By bubble up I mean that if you have multiple layers capable of doing > retries, the lowest levels would retry first. Basically by the time we > get an -EIO_ALREADY_RETRIED we know there''s nothing that lower level can > do to help. > > -chrisAbsolutely sounds like the most sane way to go to me, thanks! Ric -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Fri, Jul 15, 2011 at 8:58 AM, Ric Wheeler <rwheeler@redhat.com> wrote:> On 07/15/2011 12:34 PM, Chris Mason wrote: >> >> Excerpts from NeilBrown''s message of 2011-07-15 02:33:54 -0400: >>> >>> On Thu, 14 Jul 2011 21:58:46 -0700 (PDT) david@lang.hm wrote: >>> >>>> On Thu, 14 Jul 2011, Chris Mason wrote: >>>> >>>>> Excerpts from Ric Wheeler''s message of 2011-07-14 02:57:54 -0400: >>>>>> >>>>>> On 07/14/2011 07:38 AM, NeilBrown wrote: >>>>>>> >>>>>>> On Thu, 14 Jul 2011 07:02:22 +0100 Ric Wheeler<rwheeler@redhat.com> >>>>>>> wrote: >>>>>>> >>>>>>>>> I''m certainly open to suggestions and collaboration. Do you have >>>>>>>>> in mind any >>>>>>>>> particular way to make the interface richer?? >>>>>>>>> >>>>>>>>> NeilBrown >>>>>>>> >>>>>>>> Hi Neil, >>>>>>>> >>>>>>>> I know that Chris has a very specific set of use cases for btrfs and >>>>>>>> think that >>>>>>>> Alasdair and others have started to look at what is doable. >>>>>>>> >>>>>>>> The obvious use case is the following: >>>>>>>> >>>>>>>> If a file system uses checksumming or other data corruption >>>>>>>> detection bits, it >>>>>>>> can detect that it got bad data on a write. If that data was >>>>>>>> protected by RAID, >>>>>>>> it would like to ask the block layer to try to read from another >>>>>>>> mirror (for >>>>>>>> raid1) or try to validate/rebuild from parity. >>>>>>>> >>>>>>>> Today, I think that a retry will basically just give us back a >>>>>>>> random chance of >>>>>>>> getting data from a different mirror or the same one that we got >>>>>>>> data from on >>>>>>>> the first go. >>>>>>>> >>>>>>>> Chris, Alasdair, was that a good summary of one concern? >>>>>>>> >>>>>>>> Thanks! >>>>>>>> >>>>>>>> Ric >>>>>>> >>>>>>> I imagine a new field in ''struct bio'' which was normally zero but >>>>>>> could be >>>>>>> some small integer. It is only meaningful for read. >>>>>>> When 0 it means "get this data way you like". >>>>>>> When non-zero it means "get this data using method N", where the >>>>>>> different >>>>>>> methods are up to the device. >>>>>>> >>>>>>> For a mirrored RAID, method N means read from device N-1. >>>>>>> For stripe/parity RAID, method 1 means "use other data blocks and >>>>>>> parity >>>>>>> blocks to reconstruct data. >>>>>>> >>>>>>> The default for non RAID devices is to return EINVAL for any N> 0. >>>>>>> A remapping device (dm-linear, dm-stripe etc) would just pass the >>>>>>> number >>>>>>> down. I''m not sure how RAID1 over RAID5 would handle it... that >>>>>>> might need >>>>>>> some thought. >>>>>>> >>>>>>> So if btrfs reads a block and the checksum looks wrong, it reads >>>>>>> again with >>>>>>> a larger N. It continues incrementing N and retrying until it gets a >>>>>>> block >>>>>>> that it likes or it gets EINVAL. There should probably be an error >>>>>>> code >>>>>>> (EAGAIN?) which means "I cannot work with that number, but try the >>>>>>> next one". >>>>>>> >>>>>>> It would be trivial for me to implement this for RAID1 and RAID10, >>>>>>> and >>>>>>> relatively easy for RAID5. >>>>>>> I''d need to give a bit of thought to RAID6 as there are possibly >>>>>>> multiple >>>>>>> ways to reconstruct from different combinations of parity and data. >>>>>>> I''m not >>>>>>> sure if there would be much point in doing that though. >>>>>>> >>>>>>> It might make sense for a device to be able to report what the >>>>>>> maximum >>>>>>> ''N'' supported is... that might make stacked raid easier to manage... >>>>>>> >>>>>>> NeilBrown >>>>>>> >>>>>> I think that the above makes sense. Not sure what your "0" definition >>>>>> is, but I >>>>>> would assume that for non-raided devices (i.e., a single s-ata disk), >>>>>> "0" would >>>>>> be an indication that there is nothing more that can be tried. The >>>>>> data you got >>>>>> is all you are ever going to get :) >>>>>> >>>>>> For multiple mirrors, you might want to have a scheme where you would >>>>>> be able to >>>>>> cycle through the mirrors. You could retry, cycling through the >>>>>> various mirrors >>>>>> until you have tried and returned them all at which point you would >>>>>> get a "no >>>>>> more" error back or some such thing. >>>>> >>>>> Hi everyone, >>>>> >>>>> The mirror number idea is basically what btrfs does today, and I think >>>>> it fits in with Neil''s idea to have different policies for different >>>>> blocks. >>>>> >>>>> Basically what btrfs does: >>>>> >>>>> read_block(block_num, mirror = 0) >>>>> if (no io error and not csum error) >>>>> horray() >>>>> >>>>> num_mirrors = get_num_copies(block number) >>>>> for (i = 1; i< num_mirrors; i++) { >>>>> read_block(block_num, mirror = i); >>>>> } >>>>> >>>>> In a stacked configuration, the get_num_copies function can be smarter, >>>>> basically adding up all the copies of the lower levels and finding a >>>>> way >>>>> to combine them. We could just send down a fake bio that is >>>>> responsible >>>>> for adding up the storage combinations into a bitmask or whatever >>>>> works. >>>>> >>>>> We could also just keep retrying until the lower layers reported no >>>>> more >>>>> mirror were available. >>>>> >>>>> In btrfs at least, we don''t set the data blocks up to date until the >>>>> crc >>>>> has passed, so replacing bogus blocks is easy. Metadata is a little >>>>> more complex, but that''s not really related to this topic. >>>>> >>>>> mirror number 0 just means "no mirror preference/pick the fastest >>>>> mirror" to the btrfs block mapping code. >>>>> >>>>> Btrfs has the concept of different raid levels for different logical >>>>> block numbers, so you get_num_copies might return one answer for block >>>>> A >>>>> and a different answer for block B. >>>>> >>>>> Either way, we could easily make use of a bio field here if it were >>>>> exported out. >>>> >>>> you don''t want to just pass the value down as that will cause problems >>>> with layering (especially if the lower layer supports more values than a >>>> higher layer) >>>> >>>> I would suggest that each layer take the value it''s give, do an integer >>>> division by the number of values that layer supports, using the modulo >>>> value for that layer and pass the rest of the result down to the next >>>> layer. >>> >>> I, on the other hand, would suggest that each layer be given the freedom, >>> and >>> the responsibility, to do whatever it thinks is most appropriate. >>> >>> This would probably involved checking the lower levels to see how many >>> strategies each has, and doing some appropriate arithmetic depending on >>> how >>> it combines those devices. >>> >>> One problem here is the assumption that the lower levels don''t change, >>> and we >>> know that not to be the case. >>> However this is already a problem. It is entirely possible that the >>> request >>> size parameters of devices can change at any moment, even while a request >>> is >>> in-flight ... though we try to avoid that or work around it. >>> >>> The sort of dependencies that we see forming here would probably just >>> make >>> the problem worse. >>> >>> Not sure what to do about it though.... maybe just hope it isn''t a >>> problem. >> >> Agreed, if we want to go beyond best effort for a stacking config, we''ll >> need to put some state struct in the bio that each layer can play with. >> That way each layer knows which mirrors have already been tried. >> >> But, maybe the whole btrfs model is backwards for a generic layer. >> Instead of sending down ios and testing when they come back, we could >> just set a verification function (or stack of them?). >> >> For metadata, btrfs compares the crc and a few other fields of the >> metadata block, so we can easily add a compare function pointer and a >> void * to pass in. >> >> The problem is the crc can take a lot of CPU, so btrfs kicks it off to >> threading pools so saturate all the cpus on the box. But there''s no >> reason we can''t make that available lower down. >> >> If we pushed the verification down, the retries could bubble up the >> stack instead of the other way around. >> >> -chris > > I do like the idea of having the ability to do the verification and retries > down the stack where you actually have the most context to figure out what > is possible... > > Why would you need to bubble back up anything other than an error when all > retries have failed?Right, passing a validator is natural. We''ve done the same for validating DM thinp''s metadata (crc, blocknr, etc), btree nodes, bitmaps, etc. The dm-thinp code is very much under review but I figured I''d share. See: https://github.com/jthornber/linux-2.6/blob/4c4089de2e5a4f343d9810f76531cb25aa13f91c/drivers/md/dm-thin-metadata.c Specifically: static struct dm_block_validator sb_validator_ So we pass a dm_block_validator around when doing block IO. Each validator has a .check and a .preparse_for_write method. If the greater kernel offered a comparable mechanism we''d just switch over to using it. Mike
Excerpts from Ric Wheeler''s message of 2011-07-15 09:31:37 -0400:> On 07/15/2011 02:20 PM, Chris Mason wrote: > > Excerpts from Ric Wheeler''s message of 2011-07-15 08:58:04 -0400: > >> On 07/15/2011 12:34 PM, Chris Mason wrote: > > [ triggering IO retries on failed crc or other checks ] > > > >>> But, maybe the whole btrfs model is backwards for a generic layer. > >>> Instead of sending down ios and testing when they come back, we could > >>> just set a verification function (or stack of them?). > >>> > >>> For metadata, btrfs compares the crc and a few other fields of the > >>> metadata block, so we can easily add a compare function pointer and a > >>> void * to pass in. > >>> > >>> The problem is the crc can take a lot of CPU, so btrfs kicks it off to > >>> threading pools so saturate all the cpus on the box. But there''s no > >>> reason we can''t make that available lower down. > >>> > >>> If we pushed the verification down, the retries could bubble up the > >>> stack instead of the other way around. > >>> > >>> -chris > >> I do like the idea of having the ability to do the verification and retries down > >> the stack where you actually have the most context to figure out what is possible... > >> > >> Why would you need to bubble back up anything other than an error when all > >> retries have failed? > > By bubble up I mean that if you have multiple layers capable of doing > > retries, the lowest levels would retry first. Basically by the time we > > get an -EIO_ALREADY_RETRIED we know there''s nothing that lower level can > > do to help. > > > > -chris > > Absolutely sounds like the most sane way to go to me, thanks! >It really seemed like a good idea, but I just realized it doesn''t work well when parts of the stack transform the data. Picture dm-crypt on top of raid1. If raid1 is responsible for the crc retries, there''s no way to crc the data because it needs to be decrypted first. I think the raided dm-crypt config is much more common (and interesting) than multiple layers that can retry for other reasons (raid1 on top of raid10?) In other words, do we really want to do a lot of design work for multiple layers where each one maintains multiple copies of the data blocks? Are there configs where this really makes sense? -chris
On Fri, Jul 15, 2011 at 10:00:35AM -0400, Chris Mason wrote:> Excerpts from Ric Wheeler''s message of 2011-07-15 09:31:37 -0400: > > On 07/15/2011 02:20 PM, Chris Mason wrote: > > > Excerpts from Ric Wheeler''s message of 2011-07-15 08:58:04 -0400: > > >> On 07/15/2011 12:34 PM, Chris Mason wrote: > > > [ triggering IO retries on failed crc or other checks ] > > > > > >>> But, maybe the whole btrfs model is backwards for a generic layer. > > >>> Instead of sending down ios and testing when they come back, we could > > >>> just set a verification function (or stack of them?). > > >>> > > >>> For metadata, btrfs compares the crc and a few other fields of the > > >>> metadata block, so we can easily add a compare function pointer and a > > >>> void * to pass in. > > >>> > > >>> The problem is the crc can take a lot of CPU, so btrfs kicks it off to > > >>> threading pools so saturate all the cpus on the box. But there''s no > > >>> reason we can''t make that available lower down. > > >>> > > >>> If we pushed the verification down, the retries could bubble up the > > >>> stack instead of the other way around. > > >>> > > >>> -chris > > >> I do like the idea of having the ability to do the verification and retries down > > >> the stack where you actually have the most context to figure out what is possible... > > >> > > >> Why would you need to bubble back up anything other than an error when all > > >> retries have failed? > > > By bubble up I mean that if you have multiple layers capable of doing > > > retries, the lowest levels would retry first. Basically by the time we > > > get an -EIO_ALREADY_RETRIED we know there''s nothing that lower level can > > > do to help. > > > > > > -chris > > > > Absolutely sounds like the most sane way to go to me, thanks! > > > > It really seemed like a good idea, but I just realized it doesn''t work > well when parts of the stack transform the data. > > Picture dm-crypt on top of raid1. If raid1 is responsible for the > crc retries, there''s no way to crc the data because it needs to be > decrypted first. > > I think the raided dm-crypt config is much more common (and interesting) > than multiple layers that can retry for other reasons (raid1 on top of > raid10?)Isn''t this a case where the transformative mid-layer would replace the validation function before passing it down the stack? So btrfs hands dm-crypt a checksum function; dm-crypt then stores that function for its own purposes and hands off a new function to the DM layer below that which decrypts the data and calls the btrfs checksum function it stored earlier.> In other words, do we really want to do a lot of design work for > multiple layers where each one maintains multiple copies of the data > blocks? Are there configs where this really makes sense?Hugo. -- === Hugo Mills: hugo@... carfax.org.uk | darksatanic.net | lug.org.uk == PGP key: 515C238D from wwwkeys.eu.pgp.net or http://www.carfax.org.uk --- "What are we going to do tonight?" "The same thing we do --- every night, Pinky. Try to take over the world!"
Excerpts from Hugo Mills''s message of 2011-07-15 10:07:24 -0400:> On Fri, Jul 15, 2011 at 10:00:35AM -0400, Chris Mason wrote: > > Excerpts from Ric Wheeler''s message of 2011-07-15 09:31:37 -0400: > > > On 07/15/2011 02:20 PM, Chris Mason wrote: > > > > Excerpts from Ric Wheeler''s message of 2011-07-15 08:58:04 -0400: > > > >> On 07/15/2011 12:34 PM, Chris Mason wrote: > > > > [ triggering IO retries on failed crc or other checks ] > > > > > > > >>> But, maybe the whole btrfs model is backwards for a generic layer. > > > >>> Instead of sending down ios and testing when they come back, we could > > > >>> just set a verification function (or stack of them?). > > > >>> > > > >>> For metadata, btrfs compares the crc and a few other fields of the > > > >>> metadata block, so we can easily add a compare function pointer and a > > > >>> void * to pass in. > > > >>> > > > >>> The problem is the crc can take a lot of CPU, so btrfs kicks it off to > > > >>> threading pools so saturate all the cpus on the box. But there''s no > > > >>> reason we can''t make that available lower down. > > > >>> > > > >>> If we pushed the verification down, the retries could bubble up the > > > >>> stack instead of the other way around. > > > >>> > > > >>> -chris > > > >> I do like the idea of having the ability to do the verification and retries down > > > >> the stack where you actually have the most context to figure out what is possible... > > > >> > > > >> Why would you need to bubble back up anything other than an error when all > > > >> retries have failed? > > > > By bubble up I mean that if you have multiple layers capable of doing > > > > retries, the lowest levels would retry first. Basically by the time we > > > > get an -EIO_ALREADY_RETRIED we know there''s nothing that lower level can > > > > do to help. > > > > > > > > -chris > > > > > > Absolutely sounds like the most sane way to go to me, thanks! > > > > > > > It really seemed like a good idea, but I just realized it doesn''t work > > well when parts of the stack transform the data. > > > > Picture dm-crypt on top of raid1. If raid1 is responsible for the > > crc retries, there''s no way to crc the data because it needs to be > > decrypted first. > > > > I think the raided dm-crypt config is much more common (and interesting) > > than multiple layers that can retry for other reasons (raid1 on top of > > raid10?) > > Isn''t this a case where the transformative mid-layer would replace > the validation function before passing it down the stack? So btrfs > hands dm-crypt a checksum function; dm-crypt then stores that function > for its own purposes and hands off a new function to the DM layer > below that which decrypts the data and calls the btrfs checksum > function it stored earlier.Then we''re requiring each transformation layer to have their own crcs, and if the higher layers have a stronger crc (or other checks), there''s no path to ask the lower layers for other copies. Here''s a concrete example. In each metadata block, btrfs stores the fsid and the transid of the transaction that created it. In the case of a missed write, we''ll read a perfect block from the lower layers. Any crcs will be correct and it''ll pass through dm-crypt with flying colors. But, it won''t be the right block. Btrfs will notice this and EIO. In the current ask-for-another-mirror config we''ll go down and grab the other copy. In the stacked validation function model, dm-crypt replaces our verification functions with something that operates on the encrypted data, and it won''t be able to detect the error or kick down to the underlying raid1 for another copy. -chris
Am Freitag, den 15.07.2011, 10:24 -0400 schrieb Chris Mason:> Excerpts from Hugo Mills''s message of 2011-07-15 10:07:24 -0400: > > On Fri, Jul 15, 2011 at 10:00:35AM -0400, Chris Mason wrote: > > > Excerpts from Ric Wheeler''s message of 2011-07-15 09:31:37 -0400: > > > > On 07/15/2011 02:20 PM, Chris Mason wrote: > > > > > Excerpts from Ric Wheeler''s message of 2011-07-15 08:58:04 -0400: > > > > >> On 07/15/2011 12:34 PM, Chris Mason wrote: > > > > > [ triggering IO retries on failed crc or other checks ] > > > > > > > > > >>> But, maybe the whole btrfs model is backwards for a generic layer. > > > > >>> Instead of sending down ios and testing when they come back, we could > > > > >>> just set a verification function (or stack of them?). > > > > >>> > > > > >>> For metadata, btrfs compares the crc and a few other fields of the > > > > >>> metadata block, so we can easily add a compare function pointer and a > > > > >>> void * to pass in. > > > > >>> > > > > >>> The problem is the crc can take a lot of CPU, so btrfs kicks it off to > > > > >>> threading pools so saturate all the cpus on the box. But there''s no > > > > >>> reason we can''t make that available lower down. > > > > >>> > > > > >>> If we pushed the verification down, the retries could bubble up the > > > > >>> stack instead of the other way around. > > > > >>> > > > > >>> -chris > > > > >> I do like the idea of having the ability to do the verification and retries down > > > > >> the stack where you actually have the most context to figure out what is possible... > > > > >> > > > > >> Why would you need to bubble back up anything other than an error when all > > > > >> retries have failed? > > > > > By bubble up I mean that if you have multiple layers capable of doing > > > > > retries, the lowest levels would retry first. Basically by the time we > > > > > get an -EIO_ALREADY_RETRIED we know there''s nothing that lower level can > > > > > do to help. > > > > > > > > > > -chris > > > > > > > > Absolutely sounds like the most sane way to go to me, thanks! > > > > > > > > > > It really seemed like a good idea, but I just realized it doesn''t work > > > well when parts of the stack transform the data. > > > > > > Picture dm-crypt on top of raid1. If raid1 is responsible for the > > > crc retries, there''s no way to crc the data because it needs to be > > > decrypted first. > > > > > > I think the raided dm-crypt config is much more common (and interesting) > > > than multiple layers that can retry for other reasons (raid1 on top of > > > raid10?) > > > > Isn''t this a case where the transformative mid-layer would replace > > the validation function before passing it down the stack? So btrfs > > hands dm-crypt a checksum function; dm-crypt then stores that function > > for its own purposes and hands off a new function to the DM layer > > below that which decrypts the data and calls the btrfs checksum > > function it stored earlier. > > Then we''re requiring each transformation layer to have their own crcs, > and if the higher layers have a stronger crc (or other checks), there''s > no path to ask the lower layers for other copies. > > Here''s a concrete example. In each metadata block, btrfs stores the > fsid and the transid of the transaction that created it. In the case of > a missed write, we''ll read a perfect block from the lower layers. Any > crcs will be correct and it''ll pass through dm-crypt with flying colors. > > But, it won''t be the right block. Btrfs will notice this and EIO. In > the current ask-for-another-mirror config we''ll go down and grab the > other copy. > > In the stacked validation function model, dm-crypt replaces our > verification functions with something that operates on the encrypted > data, and it won''t be able to detect the error or kick down to the > underlying raid1 for another copy. > > -chris > -- > To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.htmlI think the point is not to replace the crc function in the dm_crypt case, but to wrap it with an decrypt function which then calls the crc function. So even if a lower mirror uses the new dm-crypt crc function, the btrfs crc function still gets called - at the end of the chain. Regards, Christian Aßfalg -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Fri, Jul 15, 2011 at 10:24:25AM -0400, Chris Mason wrote:> Excerpts from Hugo Mills''s message of 2011-07-15 10:07:24 -0400: > > On Fri, Jul 15, 2011 at 10:00:35AM -0400, Chris Mason wrote: > > > Excerpts from Ric Wheeler''s message of 2011-07-15 09:31:37 -0400: > > > > On 07/15/2011 02:20 PM, Chris Mason wrote: > > > > > Excerpts from Ric Wheeler''s message of 2011-07-15 08:58:04 -0400: > > > > >> On 07/15/2011 12:34 PM, Chris Mason wrote: > > > > > [ triggering IO retries on failed crc or other checks ] > > > > > > > > > >>> But, maybe the whole btrfs model is backwards for a generic layer. > > > > >>> Instead of sending down ios and testing when they come back, we could > > > > >>> just set a verification function (or stack of them?). > > > > >>> > > > > >>> For metadata, btrfs compares the crc and a few other fields of the > > > > >>> metadata block, so we can easily add a compare function pointer and a > > > > >>> void * to pass in. > > > > >>> > > > > >>> The problem is the crc can take a lot of CPU, so btrfs kicks it off to > > > > >>> threading pools so saturate all the cpus on the box. But there''s no > > > > >>> reason we can''t make that available lower down. > > > > >>> > > > > >>> If we pushed the verification down, the retries could bubble up the > > > > >>> stack instead of the other way around. > > > > >>> > > > > >>> -chris > > > > >> I do like the idea of having the ability to do the verification and retries down > > > > >> the stack where you actually have the most context to figure out what is possible... > > > > >> > > > > >> Why would you need to bubble back up anything other than an error when all > > > > >> retries have failed? > > > > > By bubble up I mean that if you have multiple layers capable of doing > > > > > retries, the lowest levels would retry first. Basically by the time we > > > > > get an -EIO_ALREADY_RETRIED we know there''s nothing that lower level can > > > > > do to help. > > > > > > > > > > -chris > > > > > > > > Absolutely sounds like the most sane way to go to me, thanks! > > > > > > > > > > It really seemed like a good idea, but I just realized it doesn''t work > > > well when parts of the stack transform the data. > > > > > > Picture dm-crypt on top of raid1. If raid1 is responsible for the > > > crc retries, there''s no way to crc the data because it needs to be > > > decrypted first. > > > > > > I think the raided dm-crypt config is much more common (and interesting) > > > than multiple layers that can retry for other reasons (raid1 on top of > > > raid10?) > > > > Isn''t this a case where the transformative mid-layer would replace > > the validation function before passing it down the stack? So btrfs > > hands dm-crypt a checksum function; dm-crypt then stores that function > > for its own purposes and hands off a new function to the DM layer > > below that which decrypts the data and calls the btrfs checksum > > function it stored earlier. > > Then we''re requiring each transformation layer to have their own crcs, > and if the higher layers have a stronger crc (or other checks), there''s > no path to ask the lower layers for other copies. > > Here''s a concrete example. In each metadata block, btrfs stores the > fsid and the transid of the transaction that created it. In the case of > a missed write, we''ll read a perfect block from the lower layers. Any > crcs will be correct and it''ll pass through dm-crypt with flying colors. > > But, it won''t be the right block. Btrfs will notice this and EIO. In > the current ask-for-another-mirror config we''ll go down and grab the > other copy. > > In the stacked validation function model, dm-crypt replaces our > verification functions with something that operates on the encrypted > data, and it won''t be able to detect the error or kick down to the > underlying raid1 for another copy.What I''m suggesting is easiest to describe with a functional language, or mathematical notation, but I''ll try without those anyway... A generic validation function is a two-parameter function, taking a block of data and some layer-dependent state, and returning a boolean. So, at the btrfs layer, we have something like: struct btrfs_validate_state { u32 cs; }; bool btrfs_validate(void *state, char *data) { return crc32(data) == (btrfs_validate_state*)state->cs; } When reading a specific block, we look up the checksum for that block, put it in state->cs and pass both our state and the btrfs_validate function to the lower layer. The crypto layer beneath will look something like: struct crypto_validate_state { void *old_state; bool (*old_validator)(void *state, char* data); }; bool crypto_validate(void *state, char *data) { char plaintext[4096]; decrypt(plaintext, data); return state->old_validator(state->old_state, plaintext); } Then, when a request is received from the upper layer, we can put the (state, validator) pair into a crypto_validate_state structure, call that our new state, and pass on the new (state, crypto_validate) to the layer underneath. So, where a transformative layer exists in the stack, it replaces the validation function with a helper function that does the necessary transformation, and performs the check. The code I''ve described above would clearly need to be hooked into the existing data-transformation paths so that we don''t replicate the effort of decrypting or uncompressing good data once we''ve decided that it _is_ good. Oh, and if there''s a layer with its own validation, we can put in a "firebreak" implementation of the above code, which basically drops the validator it''s passed, and replaces it completely with its own. (This would look exactly like the btrfs version at the top of my reply here). The standard case for most block-layer implementations will simply be to pass the (state, validator) pair unchanged to the lower layer, because the block being checked won''t be any different. It''s only where we have layers that change things, or definitely know better than the top layer that we need to play tricks. Hugo. -- === Hugo Mills: hugo@... carfax.org.uk | darksatanic.net | lug.org.uk == PGP key: 515C238D from wwwkeys.eu.pgp.net or http://www.carfax.org.uk --- Sometimes, when I''m alone, I Google myself. ---
Excerpts from Hugo Mills''s message of 2011-07-15 10:54:36 -0400:> On Fri, Jul 15, 2011 at 10:24:25AM -0400, Chris Mason wrote: > > Excerpts from Hugo Mills''s message of 2011-07-15 10:07:24 -0400: > > > On Fri, Jul 15, 2011 at 10:00:35AM -0400, Chris Mason wrote: > > > > Excerpts from Ric Wheeler''s message of 2011-07-15 09:31:37 -0400: > > > > > On 07/15/2011 02:20 PM, Chris Mason wrote: > > > > > > Excerpts from Ric Wheeler''s message of 2011-07-15 08:58:04 -0400: > > > > > >> On 07/15/2011 12:34 PM, Chris Mason wrote: > > > > > > [ triggering IO retries on failed crc or other checks ] > > > > > > > > > > > >>> But, maybe the whole btrfs model is backwards for a generic layer. > > > > > >>> Instead of sending down ios and testing when they come back, we could > > > > > >>> just set a verification function (or stack of them?). > > > > > >>> > > > > > >>> For metadata, btrfs compares the crc and a few other fields of the > > > > > >>> metadata block, so we can easily add a compare function pointer and a > > > > > >>> void * to pass in. > > > > > >>> > > > > > >>> The problem is the crc can take a lot of CPU, so btrfs kicks it off to > > > > > >>> threading pools so saturate all the cpus on the box. But there''s no > > > > > >>> reason we can''t make that available lower down. > > > > > >>> > > > > > >>> If we pushed the verification down, the retries could bubble up the > > > > > >>> stack instead of the other way around. > > > > > >>> > > > > > >>> -chris > > > > > >> I do like the idea of having the ability to do the verification and retries down > > > > > >> the stack where you actually have the most context to figure out what is possible... > > > > > >> > > > > > >> Why would you need to bubble back up anything other than an error when all > > > > > >> retries have failed? > > > > > > By bubble up I mean that if you have multiple layers capable of doing > > > > > > retries, the lowest levels would retry first. Basically by the time we > > > > > > get an -EIO_ALREADY_RETRIED we know there''s nothing that lower level can > > > > > > do to help. > > > > > > > > > > > > -chris > > > > > > > > > > Absolutely sounds like the most sane way to go to me, thanks! > > > > > > > > > > > > > It really seemed like a good idea, but I just realized it doesn''t work > > > > well when parts of the stack transform the data. > > > > > > > > Picture dm-crypt on top of raid1. If raid1 is responsible for the > > > > crc retries, there''s no way to crc the data because it needs to be > > > > decrypted first. > > > > > > > > I think the raided dm-crypt config is much more common (and interesting) > > > > than multiple layers that can retry for other reasons (raid1 on top of > > > > raid10?) > > > > > > Isn''t this a case where the transformative mid-layer would replace > > > the validation function before passing it down the stack? So btrfs > > > hands dm-crypt a checksum function; dm-crypt then stores that function > > > for its own purposes and hands off a new function to the DM layer > > > below that which decrypts the data and calls the btrfs checksum > > > function it stored earlier. > > > > Then we''re requiring each transformation layer to have their own crcs, > > and if the higher layers have a stronger crc (or other checks), there''s > > no path to ask the lower layers for other copies. > > > > Here''s a concrete example. In each metadata block, btrfs stores the > > fsid and the transid of the transaction that created it. In the case of > > a missed write, we''ll read a perfect block from the lower layers. Any > > crcs will be correct and it''ll pass through dm-crypt with flying colors. > > > > But, it won''t be the right block. Btrfs will notice this and EIO. In > > the current ask-for-another-mirror config we''ll go down and grab the > > other copy. > > > > In the stacked validation function model, dm-crypt replaces our > > verification functions with something that operates on the encrypted > > data, and it won''t be able to detect the error or kick down to the > > underlying raid1 for another copy. > > What I''m suggesting is easiest to describe with a functional > language, or mathematical notation, but I''ll try without those > anyway... > > A generic validation function is a two-parameter function, taking a > block of data and some layer-dependent state, and returning a boolean. > > So, at the btrfs layer, we have something like: > > struct btrfs_validate_state { > u32 cs; > }; > > bool btrfs_validate(void *state, char *data) { > return crc32(data) == (btrfs_validate_state*)state->cs; > } > > When reading a specific block, we look up the checksum for that > block, put it in state->cs and pass both our state and the > btrfs_validate function to the lower layer. > > The crypto layer beneath will look something like: > > struct crypto_validate_state { > void *old_state; > bool (*old_validator)(void *state, char* data); > }; > > bool crypto_validate(void *state, char *data) { > char plaintext[4096]; > decrypt(plaintext, data); > > return state->old_validator(state->old_state, plaintext); > } > > Then, when a request is received from the upper layer, we can put > the (state, validator) pair into a crypto_validate_state structure, > call that our new state, and pass on the new (state, crypto_validate) > to the layer underneath. > > So, where a transformative layer exists in the stack, it replaces > the validation function with a helper function that does the necessary > transformation, and performs the check. > > The code I''ve described above would clearly need to be hooked into > the existing data-transformation paths so that we don''t replicate the > effort of decrypting or uncompressing good data once we''ve decided > that it _is_ good. > > Oh, and if there''s a layer with its own validation, we can put in a > "firebreak" implementation of the above code, which basically drops > the validator it''s passed, and replaces it completely with its own. > (This would look exactly like the btrfs version at the top of my reply > here). > > The standard case for most block-layer implementations will simply > be to pass the (state, validator) pair unchanged to the lower layer, > because the block being checked won''t be any different. It''s only > where we have layers that change things, or definitely know better > than the top layer that we need to play tricks.It''s not that I disagree, but what you''re describing is the entire end_io chain in the current code. Replace validator with end_io and we have the current configuration, since everyone ends up stacking the end_io handlers now. The difference is that in the validator config, we''ll be running the validation stack at the lower layer in the IO stack. This sounds like a great idea because we can go ahead and retry at the lower layer or move up the stack and retry up higher. But, lets say we''re doing striping over raid1. Now the lower layer IO completion (and stack of validations) may have to wait for multiple stripes to end their IO. I think the complexity gets out of hand really quickly on this. -chris
On Fri, 15 Jul 2011, NeilBrown wrote:> On Thu, 14 Jul 2011 21:58:46 -0700 (PDT) david@lang.hm wrote: > >> On Thu, 14 Jul 2011, Chris Mason wrote: >> >>> Excerpts from Ric Wheeler''s message of 2011-07-14 02:57:54 -0400: >>>> On 07/14/2011 07:38 AM, NeilBrown wrote: >>>>> On Thu, 14 Jul 2011 07:02:22 +0100 Ric Wheeler<rwheeler@redhat.com> wrote: >>>>> >> >> I would suggest that each layer take the value it''s give, do an integer >> division by the number of values that layer supports, using the modulo >> value for that layer and pass the rest of the result down to the next >> layer. > > I, on the other hand, would suggest that each layer be given the freedom, and > the responsibility, to do whatever it thinks is most appropriate. > > This would probably involved checking the lower levels to see how many > strategies each has, and doing some appropriate arithmetic depending on how > it combines those devices.nothing is wrong with doing something smarter than what I was proposing, I was just intending to propose a default rule that would work (just about) everywhere. I was specifically trying to counter the throught hat the method number would/should be contant as it''s passed down the layers. The proposal had been made to have each layer do the retries for the layer below it, that would avoid this stacking problem, but at the cost of potentially doing a LOT of retries without the upper level being able to limit it. David Lang> One problem here is the assumption that the lower levels don''t change, and we > know that not to be the case. > However this is already a problem. It is entirely possible that the request > size parameters of devices can change at any moment, even while a request is > in-flight ... though we try to avoid that or work around it. > > The sort of dependencies that we see forming here would probably just make > the problem worse. > > Not sure what to do about it though.... maybe just hope it isn''t a problem. > > NeilBrown > > >> >> this works for just trying values until you get the error that tells you >> there are no more options. >> >> >> things can get very ''intersesting'' if the different options below you have >> different number of options (say a raid1 across a raid5 and a single disk) >> but I can''t think of any good way to figure this out and assemble a >> sane order without doing an exaustive search down to find the number of >> options for each layer (and since this can be different for different >> blocks, you would have to do this each time you invoked this option) >> >> David Lang > >-- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Fri, 15 Jul 2011, Chris Mason wrote:> Excerpts from Ric Wheeler''s message of 2011-07-15 08:58:04 -0400: >> On 07/15/2011 12:34 PM, Chris Mason wrote: > > By bubble up I mean that if you have multiple layers capable of doing > retries, the lowest levels would retry first. Basically by the time we > get an -EIO_ALREADY_RETRIED we know there''s nothing that lower level can > do to help.the problem with doing this is that it can end up stalling the box for significant amounts of time while all the retries happen. we already see this happening today where a disk read failure is retried multiple times by the disk, multiple times by the raid controller, and then multiple times by Linux, resulting is multi-minute stalls when you hit a disk error in some cases. having the lower layers do the retries automatically runs the risk of making this even worse. This needs to be able to be throttled by some layer that can see the entire picture (either by cutting off the retries after a number, after some time, or by spacing out the retries to allow other queries to get in and let the box do useful work in the meantime) David Lang
On 07/15/2011 05:23 PM, david@lang.hm wrote:> On Fri, 15 Jul 2011, Chris Mason wrote: > >> Excerpts from Ric Wheeler''s message of 2011-07-15 08:58:04 -0400: >>> On 07/15/2011 12:34 PM, Chris Mason wrote: >> >> By bubble up I mean that if you have multiple layers capable of doing >> retries, the lowest levels would retry first. Basically by the time we >> get an -EIO_ALREADY_RETRIED we know there''s nothing that lower level can >> do to help. > > the problem with doing this is that it can end up stalling the box for > significant amounts of time while all the retries happen. > > we already see this happening today where a disk read failure is retried > multiple times by the disk, multiple times by the raid controller, and then > multiple times by Linux, resulting is multi-minute stalls when you hit a disk > error in some cases. > > having the lower layers do the retries automatically runs the risk of making > this even worse. > > This needs to be able to be throttled by some layer that can see the entire > picture (either by cutting off the retries after a number, after some time, or > by spacing out the retries to allow other queries to get in and let the box do > useful work in the meantime) > > David Lang >That should not be an issue - we have a "fast fail" path for IO that should avoid retrying just for those reasons (i.e., for multi-path or when recovering a flaky drive). This is not a scheme for unbounded retries. If you have a 3 disk mirror in RAID1, you would read the data no more than 2 extra times and almost never more than once. That should be *much* faster than the multiple-second long timeout you see when waiting for SCSI timeout to fire, etc. Ric
On Fri, 15 Jul 2011, Ric Wheeler wrote:> On 07/15/2011 05:23 PM, david@lang.hm wrote: >> On Fri, 15 Jul 2011, Chris Mason wrote: >> >>> Excerpts from Ric Wheeler''s message of 2011-07-15 08:58:04 -0400: >>>> On 07/15/2011 12:34 PM, Chris Mason wrote: >>> >>> By bubble up I mean that if you have multiple layers capable of doing >>> retries, the lowest levels would retry first. Basically by the time we >>> get an -EIO_ALREADY_RETRIED we know there''s nothing that lower level can >>> do to help. >> >> the problem with doing this is that it can end up stalling the box for >> significant amounts of time while all the retries happen. >> >> we already see this happening today where a disk read failure is retried >> multiple times by the disk, multiple times by the raid controller, and then >> multiple times by Linux, resulting is multi-minute stalls when you hit a >> disk error in some cases. >> >> having the lower layers do the retries automatically runs the risk of >> making this even worse. >> >> This needs to be able to be throttled by some layer that can see the entire >> picture (either by cutting off the retries after a number, after some time, >> or by spacing out the retries to allow other queries to get in and let the >> box do useful work in the meantime) >> >> David Lang >> > > That should not be an issue - we have a "fast fail" path for IO that should > avoid retrying just for those reasons (i.e., for multi-path or when > recovering a flaky drive). > > This is not a scheme for unbounded retries. If you have a 3 disk mirror in > RAID1, you would read the data no more than 2 extra times and almost never > more than once. That should be *much* faster than the multiple-second long > timeout you see when waiting for SCSI timeout to fire, etc.this issue is when you stack things. what if you have a 3 piece raid 1 on top of a 4 piece raid 6? so you have 3 raid1 retries * N raid 6 retries. depending on the order that you do the retries in, and how long it takes that try to fail, this could start to take significant amounts of time. if you do a retry on the lower level first, the raid 6 could try several different ways to combine the drives to get the valid data (disks 1,2 2,3 3,4 1,3 1,4 2,4 1,2,3 1,2,4 1,3,4 2,3,4) add more disks and it gets worse fast. add more layers and you multiple the number of possible retries. my guess is that changing to a different method at the upper level is going to avoid the problem area faster then doing so at a lower level (because there is less hardware in common with the method that just gave the wrong answer) David Lang
On 07/15/2011 06:01 PM, david@lang.hm wrote:> On Fri, 15 Jul 2011, Ric Wheeler wrote: > >> On 07/15/2011 05:23 PM, david@lang.hm wrote: >>> On Fri, 15 Jul 2011, Chris Mason wrote: >>> >>>> Excerpts from Ric Wheeler''s message of 2011-07-15 08:58:04 -0400: >>>>> On 07/15/2011 12:34 PM, Chris Mason wrote: >>>> >>>> By bubble up I mean that if you have multiple layers capable of doing >>>> retries, the lowest levels would retry first. Basically by the time we >>>> get an -EIO_ALREADY_RETRIED we know there''s nothing that lower level can >>>> do to help. >>> >>> the problem with doing this is that it can end up stalling the box for >>> significant amounts of time while all the retries happen. >>> >>> we already see this happening today where a disk read failure is retried >>> multiple times by the disk, multiple times by the raid controller, and then >>> multiple times by Linux, resulting is multi-minute stalls when you hit a >>> disk error in some cases. >>> >>> having the lower layers do the retries automatically runs the risk of making >>> this even worse. >>> >>> This needs to be able to be throttled by some layer that can see the entire >>> picture (either by cutting off the retries after a number, after some time, >>> or by spacing out the retries to allow other queries to get in and let the >>> box do useful work in the meantime) >>> >>> David Lang >>> >> >> That should not be an issue - we have a "fast fail" path for IO that should >> avoid retrying just for those reasons (i.e., for multi-path or when >> recovering a flaky drive). >> >> This is not a scheme for unbounded retries. If you have a 3 disk mirror in >> RAID1, you would read the data no more than 2 extra times and almost never >> more than once. That should be *much* faster than the multiple-second long >> timeout you see when waiting for SCSI timeout to fire, etc. > > this issue is when you stack things. > > what if you have a 3 piece raid 1 on top of a 4 piece raid 6? > > so you have 3 raid1 retries * N raid 6 retries. depending on the order that > you do the retries in, and how long it takes that try to fail, this could > start to take significant amounts of time. > > if you do a retry on the lower level first, the raid 6 could try several > different ways to combine the drives to get the valid data (disks 1,2 2,3 3,4 > 1,3 1,4 2,4 1,2,3 1,2,4 1,3,4 2,3,4) add more disks and it gets worse fast. > > add more layers and you multiple the number of possible retries. > > my guess is that changing to a different method at the upper level is going to > avoid the problem area faster then doing so at a lower level (because there is > less hardware in common with the method that just gave the wrong answer) > > David LangAt some point, the question is why would you do that? Two parity drives for each 4 drive RAID-6 set and then mirror that 3 times (12 drives in total, only 2 data drives)? Better off doing a 4 way mirror :) I think that you are still missing the point. If the lowest layer can repair the data, it would return the first validated answer. The major time sync would be the IO to read the sectors from each of the 4 drives in your example, not computing the various combination of parity or validating (all done in memory). If the RAID-6 layer failed, you would do the same for each of the mirrors which would read the IO from each of their RAID-6 low levels exactly once as well (and then verify in memory). Ric -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html