Hi,

I've written some notes on simplified interoperation, which you can
find at:

http://arch.lustre.org/index.php?title=Simplified_Interoperation

Cheers,
Eric
Hi,

Thanks for summarizing this, comments inline.

> Description
> At the start of a controlled shutdown, the server notifies all its
> connected clients that it is shutting down and refuses connection
> requests from new (but not currently connected) clients.

Why refuse connections to new clients? Now that we are adding a
quiescent mode to the client, we can use that instead of failing new
mounts. (We could do the same thing when we receive a new connection
during recovery, too, for that matter.) I'd just hate to add another
source of mount failures when it seems we can avoid it.

> The clients prepare for shutdown by ensuring at a minimum that no
> further requests are sent to the server and they have cleaned and
> evicted all cached server state.

Clients need to notify the server when they are finished flushing state.

> The server notifies all clients when all outstanding requests have
> been committed.

There is already a mechanism in place for the server to notify the
clients of the last committed, so we don't need to add anything for
this. I'm not convinced the server needs to do anything here except
failover, but we could withhold the reply to the clients' "I'm done"
request mentioned above until the server is ready to shut down. That
reply would have the current last committed and as a side-effect would
cause the clients to flush their replay queues. The same thing will
happen when the clients reconnect, though, so I'm not sure it's worth
adding another special reply.

> The clients may then disconnect and the server can halt when all
> clients have disconnected.
>
> When the server restarts, clients reconnect, replay open files and
> proceed.

If the clients disconnect right away, then they will have no way of
knowing when they need to reconnect. They need to remain connected and
continue pinging so they will detect when the server has failed and
recover normally.

One last thing - the clients need to know when it is safe to begin
sending new requests again. Do we do this automatically after
recovery? Or is this an explicit operation done by the admin? Also,
the admin might decide to cancel the upgrade before failing the
server, so we'll need a way to resume normal operations without going
through recovery.

robert

On Oct 9, 2008, at 14:54, Eric Barton wrote:

> Hi,
>
> I've written some notes on simplified interoperation which
> you can find at...
>
> http://arch.lustre.org/index.php?title=Simplified_Interoperation
>
> Cheers,
> Eric
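Robert's point about the withheld "I'm done" reply can be illustrated
with a small sketch. It assumes, as the message describes, that a reply
carries the server's last committed transaction number and that the
client drops every queued request at or below it. The class name
ReplayQueue, its methods, and the request labels are hypothetical and
are not taken from the Lustre source.

```python
from dataclasses import dataclass, field


@dataclass
class ReplayQueue:
    """Hypothetical client-side queue of requests retained for replay."""
    # Maps transaction number -> request description.
    pending: dict = field(default_factory=dict)

    def add(self, transno, request):
        self.pending[transno] = request

    def on_reply(self, last_committed):
        """Drop every request the server reports as committed."""
        self.pending = {
            t: r for t, r in self.pending.items() if t > last_committed
        }

    def is_empty(self):
        return not self.pending


queue = ReplayQueue()
queue.add(101, "setattr")
queue.add(102, "create")
queue.add(103, "unlink")

# An ordinary reply: transaction 103 is not yet committed.
queue.on_reply(last_committed=102)
remaining = sorted(queue.pending)

# The withheld "I'm done" reply, sent once everything is committed.
queue.on_reply(last_committed=103)
drained = queue.is_empty()
```

In this sketch the queue drains as a side effect of an ordinary reply
field, which is why no new notification message is needed.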
> Thanks for summarizing this, comments inline.
>
> > Description
> > At the start of a controlled shutdown, the server notifies all its
> > connected clients that it is shutting down and refuses connection
> > requests from new (but not currently connected) clients.
>
> Why refuse connections to new clients? Now that we are adding a
> quiescent mode to the client, we can use that instead of failing new
> mounts. (We could do the same thing when we receive a new connection
> during recovery, too, for that matter.) I'd just hate to add another
> source of mount failures when it seems we can avoid it.

Just to prevent new clients connecting to it as if it wasn't there at
all - it's about to go in any case and a new server is about to start
up in its place, which these clients should shortly succeed in
connecting to.

> > The clients prepare for shutdown by ensuring at a minimum that no
> > further requests are sent to the server and they have cleaned and
> > evicted all cached server state.
>
> Clients need to notify the server when they are finished flushing state.

Yes indeed - e.g. releasing the lock used for the shutdown notification
BAST.

> > The server notifies all clients when all outstanding requests have
> > been committed.
>
> There is already a mechanism in place for the server to notify the
> clients of the last committed, so we don't need to add anything for
> this. I'm not convinced the server needs to do anything here except
> failover, but we could withhold the reply to the clients' "I'm done"
> request mentioned above until the server is ready to shut down. That
> reply would have the current last committed and as a side-effect would
> cause the clients to flush their replay queues. The same thing will
> happen when the clients reconnect, though, so I'm not sure it's worth
> adding another special reply.

I really want the replay queue to be empty when the client disconnects.

> > The clients may then disconnect and the server can halt when all
> > clients have disconnected.
> >
> > When the server restarts, clients reconnect, replay open files and
> > proceed.
>
> If the clients disconnect right away, then they will have no way of
> knowing when they need to reconnect. They need to remain connected and
> continue pinging so they will detect when the server has failed and
> recover normally.

What's the difference between remaining connected and pinging, and
disconnecting and attempting reconnection?

> One last thing - the clients need to know when it is safe to begin
> sending new requests again. Do we do this automatically after
> recovery?

Yes.

> Or is this an explicit operation done by the admin?

No.

> Also, the admin might decide to cancel the upgrade before failing the
> server, so we'll need a way to resume normal operations without going
> through recovery.

We're not actually failing the server - we're just doing an orderly
shutdown that guarantees to minimize client state and simplify recovery
on reconnection. Nothing bad happens if the server reboots with the
same version - the client just does the same minimal recovery it would
do with a version-upped server.

Cheers,
Eric
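The shutdown sequence discussed in this thread can be sketched as a toy
simulation. It follows Eric's description (new connections refused,
clients disconnect once their replay queue is empty) rather than
Robert's alternative of keeping clients connected and pinging. All
class and method names are hypothetical, the calls are synchronous for
simplicity, and nothing here reflects actual Lustre code.

```python
class Client:
    def __init__(self, name):
        self.name = name
        self.connected = False
        self.quiesced = False
        self.replay_queue = []

    def on_shutdown_notice(self, server):
        # Stop issuing new requests, then tell the server we are done.
        # Flushing cached server state is not modelled here.
        self.quiesced = True
        server.client_done(self)

    def on_done_reply(self, server):
        # The reply implies everything is committed: empty the replay
        # queue before disconnecting.
        self.replay_queue.clear()
        server.disconnect(self)


class Server:
    def __init__(self):
        self.clients = {}
        self.done = set()
        self.shutting_down = False
        self.halted = False

    def connect(self, client):
        if self.shutting_down:
            return False  # refuse clients that are not already connected
        self.clients[client.name] = client
        client.connected = True
        return True

    def begin_shutdown(self):
        self.shutting_down = True
        for client in list(self.clients.values()):
            client.on_shutdown_notice(self)

    def client_done(self, client):
        self.done.add(client.name)
        # Withhold replies until every connected client has reported in.
        if self.done == set(self.clients):
            for c in list(self.clients.values()):
                c.on_done_reply(self)

    def disconnect(self, client):
        del self.clients[client.name]
        client.connected = False
        if self.shutting_down and not self.clients:
            self.halted = True  # last client gone: safe to halt


server = Server()
a, b = Client("a"), Client("b")
server.connect(a)
server.connect(b)
a.replay_queue.append("create")

server.begin_shutdown()
late = server.connect(Client("c"))
```

Under these assumptions the server halts only after both clients have
quiesced, emptied their replay queues and disconnected, and the late
connection attempt from the third client is refused.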