Goldwyn Rodrigues
2011-Sep-09 22:22 UTC
[Ocfs2-devel] Can recovery be done in process context (as opposed to kthread)?
Hi,

I finally got back to improving the recovery procedure by offloading work to work queues. However, I would like to know if we can do away with the ocfs2rec kthread completely. The process would just mark the nodes which need recovery, offload the work to the work queues, and wait until it is all over.

The reason for doing it this way is to make the mount process killable. Currently the dlm locks are taken by the ocfs2rec kthread, and the mount waits in uninterruptible sleep while the recovery happens.

This would help High Availability software, which sends signals to the mount process if it does not complete within a timeout. This usually happens when the journal takes a long time to replay, especially on nodes that are waiting for recovery to complete rather than doing the actual recovery.

Consider also the case where one node goes down in the middle of I/O on a mounted system.

We could also keep the kthread for co-ordination.

-- 
Goldwyn
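The proposal above can be illustrated with a rough userspace analogy. This is not ocfs2 code: in the kernel the pieces would be a workqueue and a killable wait (for example `wait_for_completion_killable()`), whereas here a thread pool stands in for the work queue and a timeout stands in for the signal sent by the HA software. The names `replay_journal` and `mount_with_recovery` are hypothetical and exist only for this sketch.

```python
import threading
from concurrent.futures import ThreadPoolExecutor, wait


def replay_journal(node, steps, cancelled, recovered):
    """Hypothetical stand-in for the recovery work of one dead node."""
    for _ in range(steps):
        # Each step takes about 10 ms; stop early if the mount gave up.
        if cancelled.wait(0.01):
            return False
    recovered.append(node)
    return True


def mount_with_recovery(dead_nodes, steps, timeout, pool):
    """Queue one work item per marked node, then wait with a deadline.

    The timeout models the signal from the HA software (an assumption
    of this sketch; the kernel would use a killable wait instead).
    """
    cancelled = threading.Event()
    recovered = []
    futures = [
        pool.submit(replay_journal, node, steps, cancelled, recovered)
        for node in dead_nodes
    ]
    _, pending = wait(futures, timeout=timeout)
    if pending:
        # The waiter was interrupted. Unless the queued work is told to
        # stop, it keeps running after the mount has already failed.
        cancelled.set()
        wait(pending)
        return False, recovered
    return True, recovered
```

The point the sketch makes is in the `if pending:` branch: making the waiter killable does not by itself stop the queued work, so the interrupted path has to cancel or otherwise account for recovery that is still in flight.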
Sunil Mushran
2011-Sep-10 14:29 UTC
[Ocfs2-devel] Can recovery be done in process context (as opposed to kthread)?
On 09/09/2011 03:22 PM, Goldwyn Rodrigues wrote:
> Hi,
>
> I finally got back to improve the recovery procedure by offloading
> work to work queues. However, I would like to know if we can
> completely do away with ocfs2rec kthread. The process would just mark
> the nodes which need recovery and offload the work on the work queues
> and wait until all is over.
>
> The reason for doing it this way is to make the mount process
> killable. Currently the dlm locks are taken by ocfs2rec kthread while
> the mount waits in uninterruptible sleep while the recovery happens.
>
> This would help the High Availability software which send signals to
> mount procedure if it does not complete within timeout. This usually
> happens when journal takes a long time to replay; especially for nodes
> waiting for recovery to complete and not doing the actual recovery.
>
> Consider one node down procedure in the middle of I/O on a mounted
> system as well.
>
> We could keep the kthread with co-ordination as well.

I am not sure what that buys. The focus should be on fixing whatever got the recovery stuck in the first place. For the most part, it gets stuck for reasons unrelated to ocfs2. Our focus has been on allowing users to quickly identify the "bad" node.
Joel Becker
2011-Sep-11 08:45 UTC
[Ocfs2-devel] Can recovery be done in process context (as opposed to kthread)?
On Fri, Sep 09, 2011 at 05:22:49PM -0500, Goldwyn Rodrigues wrote:
> Hi,
>
> I finally got back to improve the recovery procedure by offloading
> work to work queues. However, I would like to know if we can
> completely do away with ocfs2rec kthread. The process would just mark
> the nodes which need recovery and offload the work on the work queues
> and wait until all is over.
>
> The reason for doing it this way is to make the mount process
> killable. Currently the dlm locks are taken by ocfs2rec kthread while
> the mount waits in uninterruptible sleep while the recovery happens.

If the mount dies, but then actually succeeds in the background... that's weird and violates the Principle of Least Surprise.

Joel

-- 
"Conservative, n. A statesman who is enamoured of existing evils, as distinguished from the Liberal, who wishes to replace them with others."
        - Ambrose Bierce, The Devil's Dictionary

http://www.jlbec.org/
jlbec at evilplan.org