"Force export" for the DMU serves a similar purpose to a feature we added for block devices in Linux in relation to exports. When failover is initiated, the OSS/MDS servers stop sending replies, and requests that are still being processed interact with the block devices in a mode where the devices discard write commands WITHOUT returning errors. This is different from merely declaring the device READONLY, in which case errors are returned. The latter is a standard feature in the Linux kernel; what we did is a patch (but it could be a device-mapper module).

The thinking behind this approach was (many years ago) that it avoids exposing the server layers to errors from the block devices (caused by writes to read-only devices), which might cause the server to panic, thereby taking out other targets inadvertently.

However, the approach is flawed. It is possible (theoretically, though not very likely) for the server to write something, believe the write has been done, read it back and get the wrong data (because it wasn't written), and still panic.

So I would like to suggest that for the DMU we do this differently and rely on a normal read-only device. The server, during recovery, will be using standard read-only devices (and similar under the DMU). If the file system or DMU returns errors because writes cannot be performed for requests that are in progress during the failover event, then these errors should be handled gracefully (without panics). Note that the errors will never reach the client, neither over the network nor through reply reconstruction, because failover was initiated before they happened.

The hacked feature retains value because it can generate an artificially large amount of rollback data, which is useful for testing the replay recovery mechanisms in Lustre. However, with DMU snapshots this can easily be simulated in a different manner.

Nikita, Alex - I think the key issue here is that the error handling in the new servers that you have written needs to be resilient enough to handle this. Can you think about it?

Ricardo - for the DMU all you need to do is make sure you can quickly turn a device read-only below the DMU and that the DMU can handle that (it is like doing 'mount -o remount,ro').
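As a rough illustration of the kind of graceful handling I have in mind, here is a minimal sketch of a server write path during recovery (the function and type names are hypothetical; this is not actual Lustre code):

    /* Hypothetical sketch: handle backend write errors during failover
     * without bringing down the whole server. */
    static int oss_commit_write(struct oss_request *req)
    {
            int rc;

            rc = backend_write(req);        /* hypothetical backend call */
            if (rc == 0)
                    return 0;

            if (rc == -EROFS || rc == -EIO) {
                    /* The device was turned read-only for failover.
                     * Fail this request quietly; no reply is sent, so
                     * the client will replay against the failover node. */
                    CERROR("write failed during failover: rc = %d\n", rc);
                    return rc;
            }

            /* Anything else is unexpected and still treated as fatal. */
            LBUG();
            return rc;
    }

Regards

Peter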
Ricardo M. Correia
2008-Apr-16 16:40 UTC
[Lustre-devel] Failover & Force export for the DMU
On Qua, 2008-04-16 at 08:37 -0700, Peter Braam wrote:
> However, the approach is flawed. It is possible (theoretically, though
> not very likely) for the server to write something, believe the write
> has been done, read it back and get the wrong data (because it wasn't
> written), and still panic.

With the DMU there is a similar problem, but its behavior is more sane and much more interesting.

If a write is discarded without the DMU's knowledge, then when the data is read back the checksum will necessarily fail, thanks to the cleverness of the ZFS design of storing the checksum in the block pointer (in the parent block, which itself has its checksum in its own parent block, and so on up to the uberblock).

So if the checksum fails, two things can happen:

- If the read is a normal read, the caller will get an ECKSUM error, propagating the error back to the DMU's consumer (this is what is used for all data reads).

- If the read is a special "must succeed" read, the behavior will depend on the "failmode" property of the pool (explained below). A "must succeed" read, as the name indicates, is a critical read which always succeeds (the caller is blocked until it does), used in situations where failure would lead to data loss. It is only used for some metadata reads.

> So I would like to suggest that for the DMU we do this differently and
> rely on a normal read-only device. The server, during recovery, will be
> using standard read-only devices (and similar under the DMU). If the
> file system or DMU returns errors because writes cannot be performed
> for requests that are in progress during the failover event, then these
> errors should be handled gracefully (without panics). Note that the
> errors will never reach the client, neither over the network nor
> through reply reconstruction, because failover was initiated before
> they happened.

I agree, but I'm not so sure we should continue to send read requests to the storage devices when we are failing over. One of the reasons the failover could be happening is a failure somewhere in the server -> storage path, and if that is happening we may experience delays of 30 or 60 seconds for the I/Os to time out, especially if we're doing synchronous I/O in the ZIO threads like we are doing now.

So I think returning EIO for reads on the backend storage might be more appropriate during a failover.

> Ricardo - for the DMU all you need to do is make sure you can quickly
> turn a device read-only below the DMU and that the DMU can handle that
> (it is like doing 'mount -o remount,ro').

Well, it's a bit more complicated than that...

If there is a fatal failure to write to the backend devices, the error will be returned to the ZIO pipeline and the DMU's behavior will again depend on the "failmode" property of the pool, which can have 3 different values:

- wait mode: I/O is blocked until the administrator corrects the problem manually. This is useful for regular ZFS pools, because the administrator has a chance to replace the device that is experiencing I/O failures and therefore prevent any data loss.

- continue mode: (quoting) "Returns EIO to any new write I/O requests" (in the transaction phase) "... but allows reads to any of the remaining healthy devices. Any write requests that have yet to be committed to disk would be blocked."

- panic mode: in userspace, we do an abort(). This would be a good solution for Lustre if we didn't have multiple ZFS pools in the same userspace server, but it's not useful at all in that case.
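For illustration, here is a very small sketch of how a DMU consumer sees the first case, a normal read that hits a bad checksum (the wrapper is hypothetical, and the exact dmu_read() signature may differ between DMU releases):

    /* Hypothetical sketch: error propagation for a normal (data) read. */
    static int osd_read_blob(objset_t *os, uint64_t object, uint64_t off,
                             uint64_t len, void *buf)
    {
            int err;

            err = dmu_read(os, object, off, len, buf);
            if (err == ECKSUM) {
                    /* The block pointer checksum did not match, most
                     * likely because the write underneath was discarded.
                     * Report it upward instead of returning stale data. */
                    return -EIO;
            }

            return err ? -err : 0;
    }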
The big problem here is that neither the "wait" mode nor the "continue" mode allows a pool with dirty data to be exported if the backend devices are returning errors from the pwrite() calls (be it EROFS, EIO or any other), due to ZFS's insistence on preserving data integrity (which I think is very well designed).

I have thought a lot about this, and my conclusion is that when force-exporting a pool we should make the DMU discard all writes to the backend storage, make reads (even "must succeed" reads) return EIO, and then go through the normal DMU export process. I believe this is the only sane way of successfully getting rid of dirty data in the DMU without any loss of transactional integrity or weird failures, but it will also require changing the DMU to gracefully handle failures in "must succeed" reads, which will not be easy...

The consequence for Lustre is that the OSS/MDS servers *must* be able to handle errors gracefully, because the DMU could return a lot of EIOs during failover.
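In outline, the force-export path I am proposing would look roughly like this (all of the helper names below are invented for illustration; none of this exists in the DMU today):

    /* Hypothetical sketch of force-export; names are invented. */
    int spa_force_export(spa_t *spa)
    {
            /* 1. Discard all further writes to the backend storage and
             *    fail every read, including "must succeed" reads, with
             *    EIO. */
            spa_enter_force_export_mode(spa);       /* hypothetical */

            /* 2. Throw away dirty data instead of trying to sync it.
             *    The on-disk state stays at the last committed txg, so
             *    transactional integrity is preserved. */
            txg_discard_dirty(spa);                 /* hypothetical */

            /* 3. Close the open objsets and run the usual export path,
             *    skipping the final on-disk label update. */
            return do_normal_export(spa);           /* hypothetical */
    }

Cheers,
Ricardo

--
Ricardo Manuel Correia
Lustre Engineering

Sun Microsystems, Inc.
Portugal
Phone +351.214134023 / x58723
Mobile +351.912590825
Email Ricardo.M.Correia at Sun.COM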
On 4/16/08 9:40 AM, "Ricardo M. Correia" <Ricardo.M.Correia at Sun.COM> wrote:

[... SNIP ...]

> I agree, but I'm not so sure we should continue to send read requests
> to the storage devices when we are failing over. One of the reasons the
> failover could be happening is a failure somewhere in the server ->
> storage path, and if that is happening we may experience delays of 30
> or 60 seconds for the I/Os to time out, especially if we're doing
> synchronous I/O in the ZIO threads like we are doing now.
>
> So I think returning EIO for reads on the backend storage might be more
> appropriate during a failover.

I think that is fine - again, the key issue is not to kill the server while it gets these errors. It may well be that the server needs a special "I'm recovering, be gentle with errors" mode to avoid reasonable panics.

>> Ricardo - for the DMU all you need to do is make sure you can quickly
>> turn a device read-only below the DMU and that the DMU can handle that
>> (it is like doing 'mount -o remount,ro').
>
> Well, it's a bit more complicated than that...
>
> If there is a fatal failure to write to the backend devices, the error
> will be returned to the ZIO pipeline and the DMU's behavior will again
> depend on the "failmode" property of the pool, which can have 3
> different values:
>
> - wait mode: I/O is blocked until the administrator corrects the
>   problem manually. This is useful for regular ZFS pools, because the
>   administrator has a chance to replace the device that is experiencing
>   I/O failures and therefore prevent any data loss.
>
> - continue mode: (quoting) "Returns EIO to any new write I/O requests"
>   (in the transaction phase) "... but allows reads to any of the
>   remaining healthy devices. Any write requests that have yet to be
>   committed to disk would be blocked."
>
> - panic mode: in userspace, we do an abort(). This would be a good
>   solution for Lustre if we didn't have multiple ZFS pools in the same
>   userspace server, but it's not useful at all in that case.

Well yes, the problem is that controlled failovers are required, for example when you fail back.

> The big problem here is that neither the "wait" mode nor the "continue"
> mode allows a pool with dirty data to be exported if the backend
> devices are returning errors from the pwrite() calls (be it EROFS, EIO
> or any other), due to ZFS's insistence on preserving data integrity
> (which I think is very well designed).

Please explain why we want to export such a pool and on which node we want to export it; in fact, what is "export" (it should be similar to unmount)? If things are failing, then, on the node that is failing, we don't need this pool anymore; we need to shut things down, in most cases for a reboot. We need the pool on the failover node.

In fact there is a very useful distinction to make. There are two failover scenarios:

1. Fail over to move services away from failures on the OSS. In this case a reboot/panic is not really harmful.

2. Fail over from a fully functioning OSS/DMU to redistribute services. In this case we need a control mechanism to turn the device read-only and clean up the DMU.

Unfortunately we cannot consider mandating that there is only one file system per OSS, because then we need an idle node to act as the failover node. We must handle the problem of shutting "one or more" down, but only in the clean case (2).

> I have thought a lot about this, and my conclusion is that when
> force-exporting a pool we should make the DMU discard all writes to the
> backend storage, make reads (even "must succeed" reads) return EIO, and
> then go through the normal DMU export process. I believe this is the
> only sane way of successfully getting rid of dirty data in the DMU
> without any loss of transactional integrity or weird failures, but it
> will also require changing the DMU to gracefully handle failures in
> "must succeed" reads, which will not be easy...

Sun already has products (a CIFS server) that can fail over on ZFS. It might be interesting to ask them whether they can handle failing over one ZFS file system while keeping others, because this is essentially the same problem as we have from a DMU perspective.

Peter
Ricardo M. Correia
2008-Apr-17 16:10 UTC
[Lustre-devel] Failover & Force export for the DMU
Hi Peter,

Please see my comments.

On Qua, 2008-04-16 at 17:18 -0700, Peter Braam wrote:
> I think that is fine - again, the key issue is not to kill the server
> while it gets these errors. It may well be that the server needs a
> special "I'm recovering, be gentle with errors" mode to avoid
> reasonable panics.

I would say any error returned by the filesystem, even in normal operation, should be handled gently :)

> Please explain why we want to export such a pool and on which node we
> want to export it; in fact, what is "export" (it should be similar to
> unmount)? If things are failing, then, on the node that is failing, we
> don't need this pool anymore; we need to shut things down, in most
> cases for a reboot. We need the pool on the failover node.

The DMU has the notion of importing and exporting a pool, which is different from mounting/unmounting a filesystem inside the pool.

Basically, an import consists of scanning and reading the labels of all the devices of a pool to find out the pool configuration. After this process, the pool transitions to the imported state, which means that the DMU knows about the pool (has the pool configuration cached) and the user can perform any operation he desires on the pool.

Usually after an import ZFS also mounts the filesystems inside the pool automatically, but this is not relevant here.

In ZFS, an export consists of unmounting any filesystems belonging to the pool, flushing dirty data, marking the pool as exported on-disk, and then removing the pool configuration from the cache. In Lustre/ZFS, strictly speaking there are no filesystems mounted so we don't do that, but of course the export would fail if Lustre has an open objset, so we need to close those first. After this, the user can only operate on the pool if he re-imports it.

So basically, what we need to do when things are failing (on the node that is failing) is to close the filesystems and export the pool. The big problem is that the DMU cannot export a pool if the devices are experiencing fatal write failures, which is why we need a force-export mechanism.

After that, we need to import the pool on the failover node and mount all the MDTs/OSTs that were stored there, do recovery, etc. (I'm sure you understand this process much better than I do :)
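To make the two halves of that sequence concrete, here is a rough sketch (all function names below are invented for illustration; they are not the real DMU or Lustre interfaces):

    /* Hypothetical sketch of the failover sequence (names invented). */

    /* Failing node: get rid of the pool even though writes are failing. */
    static void failing_node_shutdown(struct lustre_pool *lp)
    {
            close_objsets(lp);          /* close the MDT/OST objsets       */
            force_export_pool(lp);      /* discard dirty data and drop the */
                                        /* cached pool configuration       */
    }

    /* Failover node: bring the targets back and let Lustre recover. */
    static void failover_node_takeover(struct lustre_pool *lp)
    {
            import_pool(lp);            /* scan labels, read the pool config */
            mount_targets(lp);          /* start the MDTs/OSTs               */
            run_lustre_recovery(lp);    /* clients replay uncommitted ops    */
    }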
> In fact there is a very useful distinction to make. There are two
> failover scenarios:
>
> 1. Fail over to move services away from failures on the OSS. In this
>    case a reboot/panic is not really harmful.

That's why, when I heard about the need for this feature, I immediately proposed doing a panic, which wouldn't have any consequences assuming Lustre recovery does its job. But it's not useful in a "multiple pools in the same server" scenario.

> 2. Fail over from a fully functioning OSS/DMU to redistribute services.
>    In this case we need a control mechanism to turn the device
>    read-only and clean up the DMU.

Why do we need to turn the device read-only in this case? Why can't we do a clean unmount/export if the devices are fully functioning?

Andreas has told me before that with ldiskfs, doing a clean unmount could take a lot of time if there's a lot of dirty data, but I don't believe this will be true with the DMU. Even if such a problem were to arise, in the DMU it's trivial to limit the transaction group size and therefore limit the time it takes to sync a txg.

> Unfortunately we cannot consider mandating that there is only one file
> system per OSS, because then we need an idle node to act as the
> failover node. We must handle the problem of shutting "one or more"
> down, but only in the clean case (2).

In the clean case, we don't need force-export.

Force-export is only really needed if all of the following conditions are true:

1) We have more than one filesystem (MDT/OST) running in the same userspace process (note how I didn't say "same server". Also note that for Lustre 2.0, we will have a limitation of one userspace process per server).

2) The MDTs/OSTs are stored in more than one ZFS pool (note how I didn't say "more than one device". A single ZFS pool can use multiple disk devices.)

3) One or more, but not all, of the ZFS pools are suffering from fatal I/O failures.

4) We only want to fail over the MDTs/OSTs stored on the pools that are suffering I/O failures, but we still want to keep the remaining MDTs/OSTs working on the same server.

If there is a requirement to support a scenario where all of these conditions are true, then we need force-export. From my latest discussion with Andreas about this, we do need that. If not all of the conditions are true, we could either do a clean export or do a panic, depending on the situation.

At least, that is my understanding :)

Thanks,
Ricardo

--
Ricardo Manuel Correia
Lustre Engineering

Sun Microsystems, Inc.
Portugal
Phone +351.214134023 / x58723
Mobile +351.912590825
Email Ricardo.M.Correia at Sun.COM
On 4/17/08 9:10 AM, "Ricardo M. Correia" <Ricardo.M.Correia at Sun.COM> wrote:

>> In fact there is a very useful distinction to make. There are two
>> failover scenarios:
>>
>> 1. Fail over to move services away from failures on the OSS. In this
>>    case a reboot/panic is not really harmful.
>
> That's why, when I heard about the need for this feature, I immediately
> proposed doing a panic, which wouldn't have any consequences assuming
> Lustre recovery does its job. But it's not useful in a "multiple pools
> in the same server" scenario.

I don't think this is valid reasoning. If one pool is hosed, it is just as well to reboot the node. At best what you are proposing is a "nice to have" refinement, but it is not necessary for proper management of Lustre clusters. Following my proposal seems to eliminate the requirement for very complicated work.

>> 2. Fail over from a fully functioning OSS/DMU to redistribute
>>    services. In this case we need a control mechanism to turn the
>>    device read-only and clean up the DMU.
>
> Why do we need to turn the device read-only in this case? Why can't we
> do a clean unmount/export if the devices are fully functioning?
>
> Andreas has told me before that with ldiskfs, doing a clean unmount
> could take a lot of time if there's a lot of dirty data, but I don't
> believe this will be true with the DMU. Even if such a problem were to
> arise, in the DMU it's trivial to limit the transaction group size and
> therefore limit the time it takes to sync a txg.
>
>> Unfortunately we cannot consider mandating that there is only one file
>> system per OSS, because then we need an idle node to act as the
>> failover node. We must handle the problem of shutting "one or more"
>> down, but only in the clean case (2).
>
> In the clean case, we don't need force-export.
>
> Force-export is only really needed if all of the following conditions
> are true:
>
> 1) We have more than one filesystem (MDT/OST) running in the same
>    userspace process (note how I didn't say "same server". Also note
>    that for Lustre 2.0, we will have a limitation of one userspace
>    process per server).
>
> 2) The MDTs/OSTs are stored in more than one ZFS pool (note how I
>    didn't say "more than one device". A single ZFS pool can use
>    multiple disk devices.)
>
> 3) One or more, but not all, of the ZFS pools are suffering from fatal
>    I/O failures.
>
> 4) We only want to fail over the MDTs/OSTs stored on the pools that are
>    suffering I/O failures, but we still want to keep the remaining
>    MDTs/OSTs working on the same server.

Yes. But this is not a requirement, because, for example, 4) is not necessary for customer happiness.

> If there is a requirement to support a scenario where all of these
> conditions are true, then we need force-export. From my latest
> discussion with Andreas about this, we do need that.

No we do not. Andreas, please get in touch with me. I think this is a "nice to have" but not important enough.

-Peter
I forgot one other comment/question: shutdown of Lustre servers was traditionally sometimes very slow because of timeouts - however, with the Sandia "kill the export" feature, is this still true?

- peter -

On 4/17/08 9:10 AM, "Ricardo M. Correia" <Ricardo.M.Correia at Sun.COM> wrote:

[... SNIP ...]