On Mon, 2008-07-21 at 00:41 +0530, Balaji Rao wrote:> Hi, > > There''s a problem in btrfs_readdir that tries to lock a root node with one > being held. This happens when NFS calls vfs_readdir function with a nfs > specific filldir function pointer. This filldir function, called with the > lock held calls btrfs_lookup, which tries to take the same lock. So, it keeps > waiting on lock_page indefinitely - a deadlock. This is not seen if the inode > is RAM in which case, lookup is not called.We''ve seen precisely this problem (nfs3 readdirplus calling ->lookup from within ->readdir and deadlocking) in a number of other file systems. I have an evil workaround hack for JFFS2 at http://lists.infradead.org/pipermail/linux-mtd/2007-March/017663.html> Why can''t we allow multiple readers to read a page ?In the general case we don''t allow that because if there''s a writer waiting, they could end up waiting for _ever_ if you keep letting a constant stream of new readers take the lock. The only reason the readdirplus code is calling ->lookup() is so that it can then generate a file handle. It doesn''t really _need_ to pull every inode into the icache as it does; it only needs the fh. Perhaps we could extend the file system''s readdir() method so it returns the filehandle directly, instead. That might be a cleaner and more efficient solution all round. Or we could add a ->lookup_fh() method which only generates the filehandle, perhaps. Although you''d still have the locking issues with the latter if it can _ever_ be called other than from readdir(). The way GFS1 (and also XFS iirc) handles it is to build up a complete list of responses to readdir() in a buffer, drop the lock, and then iterate over that buffer calling filldir(). I don''t much like that version either. -- dwmw2 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Hi, There''s a problem in btrfs_readdir that tries to lock a root node with one being held. This happens when NFS calls vfs_readdir function with a nfs specific filldir function pointer. This filldir function, called with the lock held calls btrfs_lookup, which tries to take the same lock. So, it keeps waiting on lock_page indefinitely - a deadlock. This is not seen if the inode is RAM in which case, lookup is not called. Why can''t we allow multiple readers to read a page ? Please clarify. Balaji Rao -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Sunday 20 July 2008 10:13:50 pm David Woodhouse wrote:> On Mon, 2008-07-21 at 00:41 +0530, Balaji Rao wrote: > > Hi, > > > > There''s a problem in btrfs_readdir that tries to lock a root node with > > one being held. This happens when NFS calls vfs_readdir function with a > > nfs specific filldir function pointer. This filldir function, called with > > the lock held calls btrfs_lookup, which tries to take the same lock. So, > > it keeps waiting on lock_page indefinitely - a deadlock. This is not seen > > if the inode is RAM in which case, lookup is not called. > > We''ve seen precisely this problem (nfs3 readdirplus calling ->lookup > from within ->readdir and deadlocking) in a number of other file > systems. >oh, ok!> I have an evil workaround hack for JFFS2 at > http://lists.infradead.org/pipermail/linux-mtd/2007-March/017663.html >The idea of the patch seems correct to me, that once we "own" the lock, an attempt to take it again should be a nop. Balaji -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Mon, 2008-07-21 at 01:06 +0530, Balaji Rao wrote:> The idea of the patch seems correct to me, that once we "own" the > lock, an attempt to take it again should be a nop.Well, kind of correct. There are potentially race conditions with the handling of f->readdir_process, but given that we only ever really compare it to ''current'', it''s actually going to turn out OK in practice. You won''t ever actually get a false negative (and deadlock) or a false positive (and go through lookup() without locking), as far as I can tell. But still, it''s ugly as sin and I''d much rather come up with a _proper_ fix in the VFS. I was looking at it at one point, and didn''t actually apply that patch to the JFFS2 tree. Since general btrfs work is now more of a priority for me than getting JFFS2 to be NFS-exportable was, I think I''ll pick up where I left off with that. In the meantime, we have an evil hack which at least ought to work for now. -- David Woodhouse Open Source Technology Centre David.Woodhouse@intel.com Intel Corporation -- dwmw2 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Sun, Jul 20, 2008 at 09:43:50AM -0700, David Woodhouse wrote:> The way GFS1 (and also XFS iirc) handles it is to build up a complete > list of responses to readdir() in a buffer, drop the lock, and then > iterate over that buffer calling filldir(). I don''t much like that > version either.Yes. My prefered mid-term solution would to simply lift that code from XFS (where it''s nicely isolated and all code is prefixed with hack_) to nfsd so that local users don''t aren''t penalized for this. OCFS2 also ran into this lately, so someone at Oracle might already be working on it. And ubifs will also run into it once it gets nfs serving support.. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Mon, 2008-07-21 at 04:10 -0400, Christoph Hellwig wrote:> On Sun, Jul 20, 2008 at 09:43:50AM -0700, David Woodhouse wrote: > > The way GFS1 (and also XFS iirc) handles it is to build up a complete > > list of responses to readdir() in a buffer, drop the lock, and then > > iterate over that buffer calling filldir(). I don''t much like that > > version either. > > Yes. My prefered mid-term solution would to simply lift that code from > XFS (where it''s nicely isolated and all code is prefixed with hack_) to > nfsd so that local users don''t aren''t penalized for this. >Reiserfs also uses the local buffer and lock dropping trick, and that''s what I would do for now in btrfs. But, the extent allocation tree also needs recursive locking, so I''m going to take a stab at that as soon as I get the pending patch list merged. -chris -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html