Jan Kara
2009-Jan-12 22:06 UTC
[Ocfs2-devel] Bug in inode deletion code leading to stale inodes
Hello,
  I've hit a bug in OCFS2 delete code which results in inodes being left
on disk without any links to them. The workload triggering this creates
directories on one node and deletes them on another node in the cluster.
The inode is not deleted because both nodes bail out from
ocfs2_delete_inode() with:
  Skipping delete of 100405 because it is in use on other nodes

The scenario which I think is happening is as follows:

node1                                   node2
                                        rmdir("d");
                                        ocfs2_remote_dentry_delete()
ocfs2_dentry_convert_worker()
                                        finishes ocfs2_unlink()
                                        eventually enters ocfs2_delete_inode()
                                          ocfs2_inode_lock()
                                          ocfs2_query_inode_wipe() -> fail
                                          ocfs2_inode_unlock()
ocfs2_dentry_post_unlock()
  ocfs2_drop_dentry_lock()
    iput()
      ocfs2_delete_inode()
        ocfs2_inode_lock()
        ocfs2_query_inode_wipe() -> fail
        ocfs2_inode_unlock()
        clear_inode()
                                          clear_inode()

The question is how to avoid this. It seems to me that we have to really
do open_lock() and not just trylock to avoid the race. Is there any reason
why we cannot move the open_lock() before inode_lock() in
ocfs2_delete_inode()?

								Honza
--
Jan Kara <jack at suse.cz>
SUSE Labs, CR
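For context, the delete path both nodes run here looks roughly like the
sketch below. This is a simplified illustration modeled on
fs/ocfs2/inode.c, not verbatim kernel code: helper signatures are reduced
and error handling is elided, so treat the exact arguments as assumptions.

    /* Simplified sketch of ocfs2_delete_inode(), modeled on
     * fs/ocfs2/inode.c; signatures reduced, error handling elided --
     * an illustration, not the actual kernel code. */
    static void ocfs2_delete_inode_sketch(struct inode *inode)
    {
            struct buffer_head *di_bh = NULL;
            int wipe = 0;

            ocfs2_inode_lock(inode, &di_bh, 1);     /* EX cluster inode lock */

            /* ocfs2_query_inode_wipe() trylocks the open lock at EX.  Any
             * other node that has not yet reached ocfs2_clear_inode() still
             * holds its PR, so the trylock fails and we log "Skipping
             * delete ... in use on other nodes".  In the race above, each
             * node queries while the other still holds its PR, so neither
             * wipes the inode. */
            ocfs2_query_inode_wipe(inode, di_bh, &wipe);
            if (wipe)
                    ocfs2_wipe_inode(inode, di_bh); /* unlink from orphan dir, free */

            ocfs2_inode_unlock(inode, 1);
            brelse(di_bh);
            clear_inode(inode);  /* ocfs2_clear_inode() drops this node's PR here */
    }

Nothing orders one node's wipe query against the other node's
clear_inode(), which is exactly the window the trace above exploits.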
Mark Fasheh
2009-Jan-15 01:59 UTC
[Ocfs2-devel] Bug in inode deletion code leading to stale inodes
On Mon, Jan 12, 2009 at 11:06:35PM +0100, Jan Kara wrote:
> I've hit a bug in OCFS2 delete code which results in inodes being left on
> disk without any links to them. The workload triggering this creates
> directories on one node and deletes them on another node in the cluster.
> The inode is not deleted because both nodes bail out from
> ocfs2_delete_inode() with:
>   Skipping delete of 100405 because it is in use on other nodes
>
> The scenario which I think is happening is as follows:
>
> node1                                   node2
>                                         rmdir("d");
>                                         ocfs2_remote_dentry_delete()
> ocfs2_dentry_convert_worker()
>                                         finishes ocfs2_unlink()
>                                         eventually enters ocfs2_delete_inode()
>                                           ocfs2_inode_lock()
>                                           ocfs2_query_inode_wipe() -> fail
>                                           ocfs2_inode_unlock()
> ocfs2_dentry_post_unlock()
>   ocfs2_drop_dentry_lock()
>     iput()
>       ocfs2_delete_inode()
>         ocfs2_inode_lock()
>         ocfs2_query_inode_wipe() -> fail
>         ocfs2_inode_unlock()
>         clear_inode()
>                                           clear_inode()

FWIW, your analysis looks correct to me.

> The question is how to avoid this. It seems to me that we have to really
> do open_lock() and not just trylock to avoid the race. Is there any reason
> why we cannot move the open_lock() before inode_lock() in
> ocfs2_delete_inode()?

We can't do that because open_lock() will deadlock with nodes still using
the inode. The way the open lock works is that each node takes a PR
(protected read) lock on it when the inode is instantiated. That PR is
dropped in ocfs2_clear_inode(). If a node wants to delete an inode, it
will trylock at EX to see if any other nodes have an active lock.

I think we have to look at how the downconvert thread deals with these
locks. Either we push some parts to a work queue, or we figure out how to
reorder some of the operations.

The problem with pushing off to a work queue is that we run the risk of
slowing down unlink-related ops. Pushing the dentry downconvert away will
result in high unlink() latency, while pushing the final iput() away runs
the risk of bouncing the orphan dir lock (if the iput() comes after the
deleting node's query in ->delete_inode()).

Do you think there's a way we could just reverse the order - iput the
inode BEFORE downconverting the dentry lock?
	--Mark

--
Mark Fasheh
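To make the deadlock concrete, here is a sketch of the open-lock lifecycle
Mark describes, with simplified calls to the dlmglue helpers; the -EAGAIN
convention follows ocfs2_try_open_lock(), but the exact signatures and
return values should be treated as assumptions.

    /* Open-lock lifecycle, simplified sketch; the real helpers live in
     * fs/ocfs2/dlmglue.c and are called from fs/ocfs2/inode.c. */

    /* (a) Inode instantiation: every node using the inode takes PR.
     *     PR grants are mutually compatible, so this never contends. */
    ocfs2_open_lock(inode);

    /* (b) Delete path: probe for remote users without blocking. */
    if (ocfs2_try_open_lock(inode, 1) == -EAGAIN) {
            /* Some other node still holds its PR, so the inode is treated
             * as in use and the wipe is skipped.  A *blocking* EX request
             * here would instead wait for every PR holder -- nodes that
             * may keep the inode open indefinitely -- hence the deadlock
             * with nodes still using the inode. */
    }

    /* (c) Final local teardown: the PR is dropped in ocfs2_clear_inode(). */
    ocfs2_open_unlock(inode);

This is why the probe must stay a trylock, and why the discussion turns on
*where* in the teardown sequence the dentry downconvert and the final
iput() run, rather than on the lock mode itself.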