thr3ads.net - Ocfs2 devel - [Ocfs2-devel] [PATCH 00/19 v5] Fix filesystem freezing deadlocks [Apr 2012]

Jan Kara

2012-Apr-16 16:13 UTC

[Ocfs2-devel] [PATCH 00/19 v5] Fix filesystem freezing deadlocks

Hello,

  here is the fifth iteration of my patches to improve filesystem freezing.
No serious changes since last time. Mostly I rebased patches and merged this
series with series moving file_update_time() to ->page_mkwrite() to simplify
testing and merging.

Filesystem freezing is currently racy and thus we can end up with dirty data on
a frozen filesystem (see the changelog of patch 13 for a detailed race
description). This patch series aims at fixing this.

To be able to block all places where inodes get dirtied, I've moved the
filesystem file_update_time() call to the ->page_mkwrite callback (patches
01-07) and put freeze handling in mnt_want_write() / mnt_drop_write(). That,
however, required some code shuffling and changes to kern_path_create() (see
patches 09-12). I think the result is OK but opinions may differ ;). An added
advantage of this change is that all filesystems get freeze protection almost
for free - even ext2 can handle freezing well now.

Another potential contention point might be patch 19. In that patch we make
freeze_super() refuse to freeze the filesystem when there are open but unlinked
files which may be impractical in some cases. The main reason for this is the
problem with handling of file deletion from fput() called with mmap_sem held
(e.g. from munmap(2)), and then there's the fact that we cannot really force
such filesystem into a consistent state... But if people think that freezing
with open but unlinked files should happen, then I have some possible
solutions in mind (maybe as a separate patchset since this is large enough).
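
For readers unfamiliar with the term, here is a minimal userspace sketch of how
an open but unlinked file comes about. It is only an illustration: the helper
name open_unlinked_nlink() and the /tmp path template are made up here and are
not part of the series.

```c
#define _XOPEN_SOURCE 700	/* for mkstemp() under strict -std= modes */
#include <fcntl.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Create a temporary file, unlink it while keeping the descriptor open,
 * and return the link count reported by fstat(). Returns -1 on failure.
 * The path template is an arbitrary choice for this illustration.
 */
static long open_unlinked_nlink(void)
{
	char path[] = "/tmp/freeze-demo-XXXXXX";
	struct stat st;
	long nlink = -1;
	int fd = mkstemp(path);

	if (fd < 0)
		return -1;
	/* Drop the only name; the inode lives on until fd is closed. */
	if (unlink(path) == 0 && fstat(fd, &st) == 0)
		nlink = (long)st.st_nlink;
	close(fd);	/* last reference gone: the inode can be deleted now */
	return nlink;
}
```

On common local Linux filesystems the helper is expected to report a link count
of zero while the descriptor is still open. The actual deletion is deferred to
the final close, and that deferred deletion is the modification that is awkward
to handle on a frozen filesystem.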

I'm not able to hit any deadlocks, lockdep warnings, or dirty data on a frozen
filesystem despite beating it with fsstress and bash-shared-mapping while
freezing and unfreezing for several hours (using ext4 and xfs) so I'm
reasonably confident this could finally be the right solution.

Changes since v4:
  * added a couple of Acked-by's
  * added some comments & doc update
  * added patches from series "Push file_update_time() into .page_mkwrite"
    since it doesn't make much sense to keep them separate anymore
  * rebased on top of 3.4-rc2

Changes since v3:
  * added third level of freezing for fs internal purposes - hooked some
    filesystems to use it (XFS, nilfs2)
  * removed racy i_size check from filemap_mkwrite()

Changes since v2:
  * completely rewritten
  * freezing is now blocked at VFS entry points
  * two stage freezing to handle both mmapped writes and other IO

The biggest changes since v1:
  * have two counters to provide safe state transitions for SB_FREEZE_WRITE
    and SB_FREEZE_TRANS states
  * use percpu counters instead of own percpu structure
  * added documentation fixes from the old fs freezing series
  * converted XFS to use SB_FREEZE_TRANS counter instead of its private
    m_active_trans counter

								Honza

CC: Alex Elder <elder at kernel.org>
CC: Anton Altaparmakov <anton at tuxera.com>
CC: Ben Myers <bpm at sgi.com>
CC: Chris Mason <chris.mason at oracle.com>
CC: cluster-devel at redhat.com
CC: "David S. Miller" <davem at davemloft.net>
CC: fuse-devel at lists.sourceforge.net
CC: "J. Bruce Fields" <bfields at fieldses.org>
CC: Joel Becker <jlbec at evilplan.org>
CC: KONISHI Ryusuke <konishi.ryusuke at lab.ntt.co.jp>
CC: linux-btrfs at vger.kernel.org
CC: linux-ext4 at vger.kernel.org
CC: linux-nfs at vger.kernel.org
CC: linux-nilfs at vger.kernel.org
CC: linux-ntfs-dev at lists.sourceforge.net
CC: Mark Fasheh <mfasheh at suse.com>
CC: Miklos Szeredi <miklos at szeredi.hu>
CC: ocfs2-devel at oss.oracle.com
CC: OGAWA Hirofumi <hirofumi at mail.parknet.co.jp>
CC: Steven Whitehouse <swhiteho at redhat.com>
CC: "Theodore Ts'o" <tytso at mit.edu>
CC: xfs at oss.sgi.com

Jan Kara

2012-Apr-16 16:13 UTC

[Ocfs2-devel] [PATCH 09/27] fs: Push mnt_want_write() outside of i_mutex

Currently, mnt_want_write() is sometimes called with i_mutex held and sometimes
without it. This isn't really a problem because mnt_want_write() is a
non-blocking operation (it essentially has trylock semantics), but once the
function also starts to handle frozen filesystems, it will get full lock
semantics and thus proper lock ordering has to be established. So move
all mnt_want_write() calls outside of i_mutex.

One non-trivial case needing conversion is kern_path_create() /
user_path_create() which didn't include mnt_want_write() but now needs to
because it acquires i_mutex.  Because there are virtual file systems which
don't bother with freeze / remount-ro protection we actually provide both
versions of the function - one which calls mnt_want_write() and one which does
not.
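
The ordering described above can be sketched as a small userspace model. This
is a rough illustration under stated assumptions, not kernel code: all model_*
names are hypothetical, pthread primitives stand in for the kernel locks, and
the freezer does not wait for writers to drain as a real one would.

```c
#include <errno.h>
#include <pthread.h>

/* Hypothetical userspace model; no model_* name exists in the kernel. */
struct model_sb {
	pthread_mutex_t lock;	/* protects the fields below */
	pthread_cond_t thawed;	/* broadcast when frozen is cleared */
	int frozen;		/* nonzero while the filesystem is frozen */
	int readonly;		/* nonzero while mounted read-only */
	int writers;		/* callers currently holding write access */
};

struct model_dir {
	struct model_sb *sb;
	pthread_mutex_t i_mutex;	/* stands in for the directory's i_mutex */
	int entries;
};

static void model_init(struct model_sb *sb, struct model_dir *dir)
{
	pthread_mutex_init(&sb->lock, NULL);
	pthread_cond_init(&sb->thawed, NULL);
	sb->frozen = sb->readonly = sb->writers = 0;
	dir->sb = sb;
	pthread_mutex_init(&dir->i_mutex, NULL);
	dir->entries = 0;
}

/* Simplified freezer: does not wait for existing writers to finish. */
static void model_set_frozen(struct model_sb *sb, int frozen)
{
	pthread_mutex_lock(&sb->lock);
	sb->frozen = frozen;
	if (!frozen)
		pthread_cond_broadcast(&sb->thawed);
	pthread_mutex_unlock(&sb->lock);
}

/* Models mnt_want_write() with freeze handling: it may sleep. */
static int model_want_write(struct model_sb *sb)
{
	int error = 0;

	pthread_mutex_lock(&sb->lock);
	while (sb->frozen)
		pthread_cond_wait(&sb->thawed, &sb->lock);
	if (sb->readonly)
		error = -EROFS;
	else
		sb->writers++;
	pthread_mutex_unlock(&sb->lock);
	return error;
}

static void model_drop_write(struct model_sb *sb)
{
	pthread_mutex_lock(&sb->lock);
	sb->writers--;
	pthread_mutex_unlock(&sb->lock);
}

/*
 * Models do_kern_path_create(): write access is taken before i_mutex and
 * dropped after it, so the order is always write access -> i_mutex.
 */
static int model_create(struct model_dir *dir, int freeze_protect)
{
	int error;

	if (freeze_protect) {
		error = model_want_write(dir->sb);
		if (error)
			return error;
	}
	pthread_mutex_lock(&dir->i_mutex);
	dir->entries++;
	pthread_mutex_unlock(&dir->i_mutex);
	if (freeze_protect)
		model_drop_write(dir->sb);
	return 0;
}
```

Because model_want_write() can sleep while the filesystem is frozen, taking it
under i_mutex in one path and outside it in another would allow an inversion
against the freezer. Taking it first everywhere avoids that.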

CC: ocfs2-devel at oss.oracle.com
CC: Mark Fasheh <mfasheh at suse.com>
CC: Joel Becker <jlbec at evilplan.org>
CC: "David S. Miller" <davem at davemloft.net>
BugLink: https://bugs.launchpad.net/bugs/897421
Tested-by: Kamal Mostafa <kamal at canonical.com>
Tested-by: Peter M. Petrakis <peter.petrakis at canonical.com>
Tested-by: Dann Frazier <dann.frazier at canonical.com>
Tested-by: Massimo Morana <massimo.morana at canonical.com>
Signed-off-by: Jan Kara <jack at suse.cz>
---
 fs/namei.c              |  115 +++++++++++++++++++++++++++--------------------
 fs/ocfs2/refcounttree.c |   10 +---
 include/linux/namei.h   |    2 +
 net/unix/af_unix.c      |   13 ++----
 4 files changed, 74 insertions(+), 66 deletions(-)

diff --git a/fs/namei.c b/fs/namei.c
index 0062dd1..5417fa1 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -2460,7 +2460,9 @@ struct file *do_file_open_root(struct dentry *dentry, struct vfsmount *mnt,
 	return file;
 }
 
-struct dentry *kern_path_create(int dfd, const char *pathname, struct path *path, int is_dir)
+static struct dentry *do_kern_path_create(int dfd, const char *pathname,
+					  struct path *path, int is_dir,
+					  int freeze_protect)
 {
 	struct dentry *dentry = ERR_PTR(-EEXIST);
 	struct nameidata nd;
@@ -2478,6 +2480,14 @@ struct dentry *kern_path_create(int dfd, const char *pathname, struct path *path
 	nd.flags |= LOOKUP_CREATE | LOOKUP_EXCL;
 	nd.intent.open.flags = O_EXCL;
 
+	if (freeze_protect) {
+		error = mnt_want_write(nd.path.mnt);
+		if (error) {
+			dentry = ERR_PTR(error);
+			goto out;
+		}
+	}
+
 	/*
 	 * Do the final lookup.
 	 */
@@ -2506,24 +2516,49 @@ eexist:
 	dentry = ERR_PTR(-EEXIST);
 fail:
 	mutex_unlock(&nd.path.dentry->d_inode->i_mutex);
+	if (freeze_protect)
+		mnt_drop_write(nd.path.mnt);
 out:
 	path_put(&nd.path);
 	return dentry;
 }
+
+struct dentry *kern_path_create(int dfd, const char *pathname, struct path *path, int is_dir)
+{
+	return do_kern_path_create(dfd, pathname, path, is_dir, 0);
+}
 EXPORT_SYMBOL(kern_path_create);
 
+struct dentry *kern_path_create_thawed(int dfd, const char *pathname, struct path *path, int is_dir)
+{
+	return do_kern_path_create(dfd, pathname, path, is_dir, 1);
+}
+EXPORT_SYMBOL(kern_path_create_thawed);
+
 struct dentry *user_path_create(int dfd, const char __user *pathname, struct path *path, int is_dir)
 {
 	char *tmp = getname(pathname);
 	struct dentry *res;
 	if (IS_ERR(tmp))
 		return ERR_CAST(tmp);
-	res = kern_path_create(dfd, tmp, path, is_dir);
+	res = do_kern_path_create(dfd, tmp, path, is_dir, 0);
 	putname(tmp);
 	return res;
 }
 EXPORT_SYMBOL(user_path_create);
 
+struct dentry *user_path_create_thawed(int dfd, const char __user *pathname, struct path *path, int is_dir)
+{
+	char *tmp = getname(pathname);
+	struct dentry *res;
+	if (IS_ERR(tmp))
+		return ERR_CAST(tmp);
+	res = do_kern_path_create(dfd, tmp, path, is_dir, 1);
+	putname(tmp);
+	return res;
+}
+EXPORT_SYMBOL(user_path_create_thawed);
+
 int vfs_mknod(struct inode *dir, struct dentry *dentry, umode_t mode, dev_t dev)
 {
 	int error = may_create(dir, dentry);
@@ -2579,7 +2614,7 @@ SYSCALL_DEFINE4(mknodat, int, dfd, const char __user *, filename, umode_t, mode,
 	if (S_ISDIR(mode))
 		return -EPERM;
 
-	dentry = user_path_create(dfd, filename, &path, 0);
+	dentry = user_path_create_thawed(dfd, filename, &path, 0);
 	if (IS_ERR(dentry))
 		return PTR_ERR(dentry);
 
@@ -2588,12 +2623,9 @@ SYSCALL_DEFINE4(mknodat, int, dfd, const char __user *, filename, umode_t, mode,
 	error = may_mknod(mode);
 	if (error)
 		goto out_dput;
-	error = mnt_want_write(path.mnt);
-	if (error)
-		goto out_dput;
 	error = security_path_mknod(&path, dentry, mode, dev);
 	if (error)
-		goto out_drop_write;
+		goto out_dput;
 	switch (mode & S_IFMT) {
 		case 0: case S_IFREG:
 			error = vfs_create(path.dentry->d_inode,dentry,mode,NULL);
@@ -2606,11 +2638,10 @@ SYSCALL_DEFINE4(mknodat, int, dfd, const char __user *, filename, umode_t, mode,
 			error = vfs_mknod(path.dentry->d_inode,dentry,mode,0);
 			break;
 	}
-out_drop_write:
-	mnt_drop_write(path.mnt);
 out_dput:
 	dput(dentry);
 	mutex_unlock(&path.dentry->d_inode->i_mutex);
+	mnt_drop_write(path.mnt);
 	path_put(&path);
 
 	return error;
@@ -2652,24 +2683,20 @@ SYSCALL_DEFINE3(mkdirat, int, dfd, const char __user *, pathname, umode_t, mode)
 	struct path path;
 	int error;
 
-	dentry = user_path_create(dfd, pathname, &path, 1);
+	dentry = user_path_create_thawed(dfd, pathname, &path, 1);
 	if (IS_ERR(dentry))
 		return PTR_ERR(dentry);
 
 	if (!IS_POSIXACL(path.dentry->d_inode))
 		mode &= ~current_umask();
-	error = mnt_want_write(path.mnt);
-	if (error)
-		goto out_dput;
 	error = security_path_mkdir(&path, dentry, mode);
 	if (error)
-		goto out_drop_write;
+		goto out_dput;
 	error = vfs_mkdir(path.dentry->d_inode, dentry, mode);
-out_drop_write:
-	mnt_drop_write(path.mnt);
 out_dput:
 	dput(dentry);
 	mutex_unlock(&path.dentry->d_inode->i_mutex);
+	mnt_drop_write(path.mnt);
 	path_put(&path);
 	return error;
 }
@@ -2764,6 +2791,9 @@ static long do_rmdir(int dfd, const char __user *pathname)
 	}
 
 	nd.flags &= ~LOOKUP_PARENT;
+	error = mnt_want_write(nd.path.mnt);
+	if (error)
+		goto exit1;
 
 	mutex_lock_nested(&nd.path.dentry->d_inode->i_mutex, I_MUTEX_PARENT);
 	dentry = lookup_hash(&nd);
@@ -2774,19 +2804,15 @@ static long do_rmdir(int dfd, const char __user *pathname)
 		error = -ENOENT;
 		goto exit3;
 	}
-	error = mnt_want_write(nd.path.mnt);
-	if (error)
-		goto exit3;
 	error = security_path_rmdir(&nd.path, dentry);
 	if (error)
-		goto exit4;
+		goto exit3;
 	error = vfs_rmdir(nd.path.dentry->d_inode, dentry);
-exit4:
-	mnt_drop_write(nd.path.mnt);
 exit3:
 	dput(dentry);
 exit2:
 	mutex_unlock(&nd.path.dentry->d_inode->i_mutex);
+	mnt_drop_write(nd.path.mnt);
 exit1:
 	path_put(&nd.path);
 	putname(name);
@@ -2853,6 +2879,9 @@ static long do_unlinkat(int dfd, const char __user *pathname)
 		goto exit1;
 
 	nd.flags &= ~LOOKUP_PARENT;
+	error = mnt_want_write(nd.path.mnt);
+	if (error)
+		goto exit1;
 
 	mutex_lock_nested(&nd.path.dentry->d_inode->i_mutex, I_MUTEX_PARENT);
 	dentry = lookup_hash(&nd);
@@ -2865,21 +2894,17 @@ static long do_unlinkat(int dfd, const char __user *pathname)
 		if (!inode)
 			goto slashes;
 		ihold(inode);
-		error = mnt_want_write(nd.path.mnt);
-		if (error)
-			goto exit2;
 		error = security_path_unlink(&nd.path, dentry);
 		if (error)
-			goto exit3;
+			goto exit2;
 		error = vfs_unlink(nd.path.dentry->d_inode, dentry);
-exit3:
-		mnt_drop_write(nd.path.mnt);
-	exit2:
+exit2:
 		dput(dentry);
 	}
 	mutex_unlock(&nd.path.dentry->d_inode->i_mutex);
 	if (inode)
 		iput(inode);	/* truncate the inode here */
+	mnt_drop_write(nd.path.mnt);
 exit1:
 	path_put(&nd.path);
 	putname(name);
@@ -2939,23 +2964,19 @@ SYSCALL_DEFINE3(symlinkat, const char __user *, oldname,
 	if (IS_ERR(from))
 		return PTR_ERR(from);
 
-	dentry = user_path_create(newdfd, newname, &path, 0);
+	dentry = user_path_create_thawed(newdfd, newname, &path, 0);
 	error = PTR_ERR(dentry);
 	if (IS_ERR(dentry))
 		goto out_putname;
 
-	error = mnt_want_write(path.mnt);
-	if (error)
-		goto out_dput;
 	error = security_path_symlink(&path, dentry, from);
 	if (error)
-		goto out_drop_write;
+		goto out_dput;
 	error = vfs_symlink(path.dentry->d_inode, dentry, from);
-out_drop_write:
-	mnt_drop_write(path.mnt);
 out_dput:
 	dput(dentry);
 	mutex_unlock(&path.dentry->d_inode->i_mutex);
+	mnt_drop_write(path.mnt);
 	path_put(&path);
 out_putname:
 	putname(from);
@@ -3048,7 +3069,7 @@ SYSCALL_DEFINE5(linkat, int, olddfd, const char __user *, oldname,
 	if (error)
 		return error;
 
-	new_dentry = user_path_create(newdfd, newname, &new_path, 0);
+	new_dentry = user_path_create_thawed(newdfd, newname, &new_path, 0);
 	error = PTR_ERR(new_dentry);
 	if (IS_ERR(new_dentry))
 		goto out;
@@ -3056,18 +3077,14 @@ SYSCALL_DEFINE5(linkat, int, olddfd, const char __user *, oldname,
 	error = -EXDEV;
 	if (old_path.mnt != new_path.mnt)
 		goto out_dput;
-	error = mnt_want_write(new_path.mnt);
-	if (error)
-		goto out_dput;
 	error = security_path_link(old_path.dentry, &new_path, new_dentry);
 	if (error)
-		goto out_drop_write;
+		goto out_dput;
 	error = vfs_link(old_path.dentry, new_path.dentry->d_inode, new_dentry);
-out_drop_write:
-	mnt_drop_write(new_path.mnt);
 out_dput:
 	dput(new_dentry);
 	mutex_unlock(&new_path.dentry->d_inode->i_mutex);
+	mnt_drop_write(new_path.mnt);
 	path_put(&new_path);
 out:
 	path_put(&old_path);
@@ -3264,6 +3281,10 @@ SYSCALL_DEFINE4(renameat, int, olddfd, const char __user *, oldname,
 	if (newnd.last_type != LAST_NORM)
 		goto exit2;
 
+	error = mnt_want_write(oldnd.path.mnt);
+	if (error)
+		goto exit2;
+
 	oldnd.flags &= ~LOOKUP_PARENT;
 	newnd.flags &= ~LOOKUP_PARENT;
 	newnd.flags |= LOOKUP_RENAME_TARGET;
@@ -3299,23 +3320,19 @@ SYSCALL_DEFINE4(renameat, int, olddfd, const char __user *, oldname,
 	if (new_dentry == trap)
 		goto exit5;
 
-	error = mnt_want_write(oldnd.path.mnt);
-	if (error)
-		goto exit5;
 	error = security_path_rename(&oldnd.path, old_dentry,
 				     &newnd.path, new_dentry);
 	if (error)
-		goto exit6;
+		goto exit5;
 	error = vfs_rename(old_dir->d_inode, old_dentry,
 				   new_dir->d_inode, new_dentry);
-exit6:
-	mnt_drop_write(oldnd.path.mnt);
 exit5:
 	dput(new_dentry);
 exit4:
 	dput(old_dentry);
 exit3:
 	unlock_rename(new_dir, old_dir);
+	mnt_drop_write(oldnd.path.mnt);
 exit2:
 	path_put(&newnd.path);
 	putname(to);
diff --git a/fs/ocfs2/refcounttree.c b/fs/ocfs2/refcounttree.c
index cf78233..a99b8e2 100644
--- a/fs/ocfs2/refcounttree.c
+++ b/fs/ocfs2/refcounttree.c
@@ -4453,7 +4453,7 @@ int ocfs2_reflink_ioctl(struct inode *inode,
 		return error;
 	}
 
-	new_dentry = user_path_create(AT_FDCWD, newname, &new_path, 0);
+	new_dentry = user_path_create_thawed(AT_FDCWD, newname, &new_path, 0);
 	error = PTR_ERR(new_dentry);
 	if (IS_ERR(new_dentry)) {
 		mlog_errno(error);
@@ -4466,19 +4466,13 @@ int ocfs2_reflink_ioctl(struct inode *inode,
 		goto out_dput;
 	}
 
-	error = mnt_want_write(new_path.mnt);
-	if (error) {
-		mlog_errno(error);
-		goto out_dput;
-	}
-
 	error = ocfs2_vfs_reflink(old_path.dentry,
 				  new_path.dentry->d_inode,
 				  new_dentry, preserve);
-	mnt_drop_write(new_path.mnt);
 out_dput:
 	dput(new_dentry);
 	mutex_unlock(&new_path.dentry->d_inode->i_mutex);
+	mnt_drop_write(new_path.mnt);
 	path_put(&new_path);
 out:
 	path_put(&old_path);
diff --git a/include/linux/namei.h b/include/linux/namei.h
index ffc0213..432f6bb 100644
--- a/include/linux/namei.h
+++ b/include/linux/namei.h
@@ -77,7 +77,9 @@ extern int user_path_at_empty(int, const char __user *, unsigned, struct path *,
 extern int kern_path(const char *, unsigned, struct path *);
 
 extern struct dentry *kern_path_create(int, const char *, struct path *, int);
+extern struct dentry *kern_path_create_thawed(int, const char *, struct path *, int);
 extern struct dentry *user_path_create(int, const char __user *, struct path *, int);
+extern struct dentry *user_path_create_thawed(int, const char __user *, struct path *, int);
 extern int kern_path_parent(const char *, struct nameidata *);
 extern int vfs_path_lookup(struct dentry *, struct vfsmount *,
 			   const char *, unsigned int, struct path *);
diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
index d510353..c532632 100644
--- a/net/unix/af_unix.c
+++ b/net/unix/af_unix.c
@@ -865,7 +865,7 @@ static int unix_bind(struct socket *sock, struct sockaddr *uaddr, int addr_len)
 		 * Get the parent directory, calculate the hash for last
 		 * component.
 		 */
-		dentry = kern_path_create(AT_FDCWD, sun_path, &path, 0);
+		dentry = kern_path_create_thawed(AT_FDCWD, sun_path, &path, 0);
 		err = PTR_ERR(dentry);
 		if (IS_ERR(dentry))
 			goto out_mknod_parent;
@@ -875,19 +875,13 @@ static int unix_bind(struct socket *sock, struct sockaddr *uaddr, int addr_len)
 		 */
 		mode = S_IFSOCK |
 		       (SOCK_INODE(sock)->i_mode & ~current_umask());
-		err = mnt_want_write(path.mnt);
-		if (err)
-			goto out_mknod_dput;
 		err = security_path_mknod(&path, dentry, mode, 0);
 		if (err)
-			goto out_mknod_drop_write;
-		err = vfs_mknod(path.dentry->d_inode, dentry, mode, 0);
-out_mknod_drop_write:
-		mnt_drop_write(path.mnt);
-		if (err)
 			goto out_mknod_dput;
+		err = vfs_mknod(path.dentry->d_inode, dentry, mode, 0);
 		mutex_unlock(&path.dentry->d_inode->i_mutex);
 		dput(path.dentry);
+		mnt_drop_write(path.mnt);
 		path.dentry = dentry;
 
 		addr->hash = UNIX_HASH_SIZE;
@@ -924,6 +918,7 @@ out:
 out_mknod_dput:
 	dput(dentry);
 	mutex_unlock(&path.dentry->d_inode->i_mutex);
+	mnt_drop_write(path.mnt);
 	path_put(&path);
 out_mknod_parent:
 	if (err == -EEXIST)
-- 
1.7.1

Jan Kara

2012-Apr-16 16:13 UTC

[PATCH 11/27] btrfs: Push mnt_want_write() outside of i_mutex

When mnt_want_write() starts to handle freezing, it will get full lock
semantics, requiring proper lock ordering. So push the mnt_want_write() call
consistently outside of i_mutex.

CC: Chris Mason <chris.mason@oracle.com>
CC: linux-btrfs@vger.kernel.org
Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/btrfs/ioctl.c |   23 +++++++++++------------
 1 files changed, 11 insertions(+), 12 deletions(-)

diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index 18cc23d..869d913 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -192,6 +192,10 @@ static int btrfs_ioctl_setflags(struct file *file, void __user *arg)
 	if (!inode_owner_or_capable(inode))
 		return -EACCES;
 
+	ret = mnt_want_write_file(file);
+	if (ret)
+		return ret;
+
 	mutex_lock(&inode->i_mutex);
 
 	ip_oldflags = ip->flags;
@@ -206,10 +210,6 @@ static int btrfs_ioctl_setflags(struct file *file, void __user *arg)
 		}
 	}
 
-	ret = mnt_want_write_file(file);
-	if (ret)
-		goto out_unlock;
-
 	if (flags & FS_SYNC_FL)
 		ip->flags |= BTRFS_INODE_SYNC;
 	else
@@ -271,9 +271,9 @@ static int btrfs_ioctl_setflags(struct file *file, void __user *arg)
 		inode->i_flags = i_oldflags;
 	}
 
-	mnt_drop_write_file(file);
  out_unlock:
 	mutex_unlock(&inode->i_mutex);
+	mnt_drop_write_file(file);
 	return ret;
 }
 
@@ -639,6 +639,10 @@ static noinline int btrfs_mksubvol(struct path *parent,
 	struct dentry *dentry;
 	int error;
 
+	error = mnt_want_write(parent->mnt);
+	if (error)
+		return error;
+
 	mutex_lock_nested(&dir->i_mutex, I_MUTEX_PARENT);
 
 	dentry = lookup_one_len(name, parent->dentry, namelen);
@@ -650,13 +654,9 @@ static noinline int btrfs_mksubvol(struct path *parent,
 	if (dentry->d_inode)
 		goto out_dput;
 
-	error = mnt_want_write(parent->mnt);
-	if (error)
-		goto out_dput;
-
 	error = btrfs_may_create(dir, dentry);
 	if (error)
-		goto out_drop_write;
+		goto out_dput;
 
 	down_read(&BTRFS_I(dir)->root->fs_info->subvol_sem);
 
@@ -674,12 +674,11 @@ static noinline int btrfs_mksubvol(struct path *parent,
 		fsnotify_mkdir(dir, dentry);
 out_up_read:
 	up_read(&BTRFS_I(dir)->root->fs_info->subvol_sem);
-out_drop_write:
-	mnt_drop_write(parent->mnt);
 out_dput:
 	dput(dentry);
 out_unlock:
 	mutex_unlock(&dir->i_mutex);
+	mnt_drop_write(parent->mnt);
 	return error;
 }
 
-- 
1.7.1

Jan Kara

2012-Apr-16 16:13 UTC

[Ocfs2-devel] [PATCH 19/27] ocfs2: Convert to new freezing mechanism

Protect ocfs2_page_mkwrite() and ocfs2_file_aio_write() using the new
freeze protection. We also protect several ioctl entry points which
were missing the protection.
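
As a rough illustration of why writes and page faults are bracketed by separate
helpers, here is a simplified single-threaded model of staged freezing. It
assumes the write level is frozen before the page-fault level, returns -EBUSY
where the kernel helpers would sleep, and uses hypothetical fz_* names that do
not exist in the kernel.

```c
#include <errno.h>

/* Hypothetical model; the level names only echo the helpers used above. */
enum fz_level { FZ_WRITE, FZ_PAGEFAULT, FZ_LEVELS };

struct fz_sb {
	int frozen_levels;	/* levels [0, frozen_levels) are blocked */
	int active[FZ_LEVELS];	/* sections currently open per level */
};

/* Non-blocking stand-in for sb_start_write() / sb_start_pagefault(). */
static int fz_start(struct fz_sb *sb, enum fz_level level)
{
	if ((int)level < sb->frozen_levels)
		return -EBUSY;	/* the kernel helpers would sleep here */
	sb->active[level]++;
	return 0;
}

/* Stand-in for sb_end_write() / sb_end_pagefault(). */
static void fz_end(struct fz_sb *sb, enum fz_level level)
{
	sb->active[level]--;
}

/*
 * Freeze the next level. This model simply refuses while sections at
 * that level are open; a real freezer would block new entrants and then
 * wait for the open sections to finish.
 */
static int fz_freeze_next(struct fz_sb *sb)
{
	if (sb->frozen_levels == FZ_LEVELS)
		return -EINVAL;
	if (sb->active[sb->frozen_levels] > 0)
		return -EBUSY;
	sb->frozen_levels++;
	return 0;
}
```

In this model a page fault can still be serviced after writes have been
blocked, which is the point of having two stages: each stage only has to wait
for one class of in-flight operation.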

CC: Mark Fasheh <mfasheh at suse.com>
CC: Joel Becker <jlbec at evilplan.org>
CC: ocfs2-devel at oss.oracle.com
Signed-off-by: Jan Kara <jack at suse.cz>
---
 fs/ocfs2/file.c  |   11 +++++++++--
 fs/ocfs2/ioctl.c |   14 ++++++++++++--
 fs/ocfs2/mmap.c  |    2 ++
 3 files changed, 23 insertions(+), 4 deletions(-)

diff --git a/fs/ocfs2/file.c b/fs/ocfs2/file.c
index 061591a..9b1e3d4 100644
--- a/fs/ocfs2/file.c
+++ b/fs/ocfs2/file.c
@@ -1971,6 +1971,7 @@ int ocfs2_change_file_space(struct file *file, unsigned int cmd,
 {
 	struct inode *inode = file->f_path.dentry->d_inode;
 	struct ocfs2_super *osb = OCFS2_SB(inode->i_sb);
+	int ret;
 
 	if ((cmd == OCFS2_IOC_RESVSP || cmd == OCFS2_IOC_RESVSP64) &&
 	    !ocfs2_writes_unwritten_extents(osb))
@@ -1985,7 +1986,12 @@ int ocfs2_change_file_space(struct file *file, unsigned int cmd,
 	if (!(file->f_mode & FMODE_WRITE))
 		return -EBADF;
 
-	return __ocfs2_change_file_space(file, inode, file->f_pos, cmd, sr, 0);
+	ret = mnt_want_write_file(file);
+	if (ret)
+		return ret;
+	ret = __ocfs2_change_file_space(file, inode, file->f_pos, cmd, sr, 0);
+	mnt_drop_write_file(file);
+	return ret;
 }
 
 static long ocfs2_fallocate(struct file *file, int mode, loff_t offset,
@@ -2261,7 +2267,7 @@ static ssize_t ocfs2_file_aio_write(struct kiocb *iocb,
 	if (iocb->ki_left == 0)
 		return 0;
 
-	vfs_check_frozen(inode->i_sb, SB_FREEZE_WRITE);
+	sb_start_write(inode->i_sb);
 
 	appending = file->f_flags & O_APPEND ? 1 : 0;
 	direct_io = file->f_flags & O_DIRECT ? 1 : 0;
@@ -2434,6 +2440,7 @@ out_sems:
 		ocfs2_iocb_clear_sem_locked(iocb);
 
 	mutex_unlock(&inode->i_mutex);
+	sb_end_write(inode->i_sb);
 
 	if (written)
 		ret = written;
diff --git a/fs/ocfs2/ioctl.c b/fs/ocfs2/ioctl.c
index a1a1bfd..d9c352b 100644
--- a/fs/ocfs2/ioctl.c
+++ b/fs/ocfs2/ioctl.c
@@ -926,7 +926,12 @@ long ocfs2_ioctl(struct file *filp, unsigned int cmd, unsigned long arg)
 		if (get_user(new_clusters, (int __user *)arg))
 			return -EFAULT;
 
-		return ocfs2_group_extend(inode, new_clusters);
+		status = mnt_want_write_file(filp);
+		if (status)
+			return status;
+		status = ocfs2_group_extend(inode, new_clusters);
+		mnt_drop_write_file(filp);
+		return status;
 	case OCFS2_IOC_GROUP_ADD:
 	case OCFS2_IOC_GROUP_ADD64:
 		if (!capable(CAP_SYS_RESOURCE))
@@ -935,7 +940,12 @@ long ocfs2_ioctl(struct file *filp, unsigned int cmd, unsigned long arg)
 		if (copy_from_user(&input, (int __user *) arg, sizeof(input)))
 			return -EFAULT;
 
-		return ocfs2_group_add(inode, &input);
+		status = mnt_want_write_file(filp);
+		if (status)
+			return status;
+		status = ocfs2_group_add(inode, &input);
+		mnt_drop_write_file(filp);
+		return status;
 	case OCFS2_IOC_REFLINK:
 		if (copy_from_user(&args, (struct reflink_arguments *)arg,
 				   sizeof(args)))
diff --git a/fs/ocfs2/mmap.c b/fs/ocfs2/mmap.c
index 9cd4108..d150372 100644
--- a/fs/ocfs2/mmap.c
+++ b/fs/ocfs2/mmap.c
@@ -136,6 +136,7 @@ static int ocfs2_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
 	sigset_t oldset;
 	int ret;
 
+	sb_start_pagefault(inode->i_sb);
 	ocfs2_block_signals(&oldset);
 
 	/*
@@ -165,6 +166,7 @@ static int ocfs2_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
 
 out:
 	ocfs2_unblock_signals(&oldset);
+	sb_end_pagefault(inode->i_sb);
 	return ret;
 }
 
-- 
1.7.1

Jan Kara

2012-Apr-16 16:14 UTC

[PATCH 24/27] btrfs: Convert to new freezing mechanism

We convert btrfs_file_aio_write() to use the new freeze check.  We also add
proper freeze protection to btrfs_page_mkwrite(). Checks in cleaner_kthread()
and transaction_kthread() can be safely removed since btrfs_freeze() will lock
the mutexes and thus block the threads (and they shouldn't have anything to
do anyway).

CC: linux-btrfs@vger.kernel.org
CC: Chris Mason <chris.mason@oracle.com>
Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/btrfs/disk-io.c |    3 ---
 fs/btrfs/file.c    |    3 ++-
 fs/btrfs/inode.c   |    6 +++++-
 3 files changed, 7 insertions(+), 5 deletions(-)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 20196f4..555a57a 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -1527,8 +1527,6 @@ static int cleaner_kthread(void *arg)
 	struct btrfs_root *root = arg;
 
 	do {
-		vfs_check_frozen(root->fs_info->sb, SB_FREEZE_WRITE);
-
 		if (!(root->fs_info->sb->s_flags & MS_RDONLY) &&
 		    mutex_trylock(&root->fs_info->cleaner_mutex)) {
 			btrfs_run_delayed_iputs(root);
@@ -1560,7 +1558,6 @@ static int transaction_kthread(void *arg)
 	do {
 		cannot_commit = false;
 		delay = HZ * 30;
-		vfs_check_frozen(root->fs_info->sb, SB_FREEZE_WRITE);
 		mutex_lock(&root->fs_info->transaction_kthread_mutex);
 
 		spin_lock(&root->fs_info->trans_lock);
diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index d83260d..a48251e 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -1358,7 +1358,7 @@ static ssize_t btrfs_file_aio_write(struct kiocb *iocb,
 	ssize_t err = 0;
 	size_t count, ocount;
 
-	vfs_check_frozen(inode->i_sb, SB_FREEZE_WRITE);
+	sb_start_write(inode->i_sb);
 
 	mutex_lock(&inode->i_mutex);
 
@@ -1449,6 +1449,7 @@ static ssize_t btrfs_file_aio_write(struct kiocb *iocb,
 			num_written = err;
 	}
 out:
+	sb_end_write(inode->i_sb);
 	current->backing_dev_info = NULL;
 	return num_written ? num_written : err;
 }
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 115bc05..db4fc01 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -6592,6 +6592,7 @@ int btrfs_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
 	u64 page_start;
 	u64 page_end;
 
+	sb_start_pagefault(inode->i_sb);
 	ret  = btrfs_delalloc_reserve_space(inode, PAGE_CACHE_SIZE);
 	if (!ret) {
 		ret = btrfs_update_time(vma->vm_file);
@@ -6681,12 +6682,15 @@ again:
 	unlock_extent_cached(io_tree, page_start, page_end, &cached_state, GFP_NOFS);
 
 out_unlock:
-	if (!ret)
+	if (!ret) {
+		sb_end_pagefault(inode->i_sb);
 		return VM_FAULT_LOCKED;
+	}
 	unlock_page(page);
 out:
 	btrfs_delalloc_release_space(inode, PAGE_CACHE_SIZE);
 out_noreserve:
+	sb_end_pagefault(inode->i_sb);
 	return ret;
 }
 
-- 
1.7.1

Jan Kara

2012-Apr-16 16:16 UTC

[Ocfs2-devel] [PATCH 00/19 v5] Fix filesystem freezing deadlocks

The subject should have been [PATCH 00/27]... Sorry for the mistake.

								Honza

On Mon 16-04-12 18:13:38, Jan Kara wrote:
>   Hello,
>
>   here is the fifth iteration of my patches to improve filesystem freezing.
> No serious changes since last time. Mostly I rebased patches and merged
> this series with series moving file_update_time() to ->page_mkwrite() to
> simplify testing and merging.
>
> Filesystem freezing is currently racy and thus we can end up with dirty
> data on frozen filesystem (see changelog patch 13 for detailed race
> description). This patch series aims at fixing this.
>
> To be able to block all places where inodes get dirtied, I've moved
> filesystem file_update_time() call to ->page_mkwrite callback (patches
> 01-07) and put freeze handling in mnt_want_write() / mnt_drop_write().
> That however required some code shuffling and changes to
> kern_path_create() (see patches 09-12). I think the result is OK but
> opinions may differ ;). The advantage of this change also is that all
> filesystems get freeze protection almost for free - even ext2 can handle
> freezing well now.
>
> Another potential contention point might be patch 19. In that patch we
> make freeze_super() refuse to freeze the filesystem when there are open
> but unlinked files which may be impractical in some cases. The main
> reason for this is the problem with handling of file deletion from fput()
> called with mmap_sem held (e.g. from munmap(2)), and then there's the
> fact that we cannot really force such filesystem into a consistent
> state... But if people think that freezing with open but unlinked files
> should happen, then I have some possible solutions in mind (maybe as a
> separate patchset since this is large enough).
>
> I'm not able to hit any deadlocks, lockdep warnings, or dirty data on
> frozen filesystem despite beating it with fsstress and
> bash-shared-mapping while freezing and unfreezing for several hours
> (using ext4 and xfs) so I'm reasonably confident this could finally be
> the right solution.
> 
> Changes since v4:
>   * added a couple of Acked-by's
>   * added some comments & doc update
>   * added patches from series "Push file_update_time() into .page_mkwrite"
>     since it doesn't make much sense to keep them separate anymore
>   * rebased on top of 3.4-rc2
> 
> Changes since v3:
>   * added third level of freezing for fs internal purposes - hooked some
>     filesystems to use it (XFS, nilfs2)
>   * removed racy i_size check from filemap_mkwrite()
> 
> Changes since v2:
>   * completely rewritten
>   * freezing is now blocked at VFS entry points
>   * two stage freezing to handle both mmapped writes and other IO
> 
> The biggest changes since v1:
>   * have two counters to provide safe state transitions for SB_FREEZE_WRITE
>     and SB_FREEZE_TRANS states
>   * use percpu counters instead of own percpu structure
>   * added documentation fixes from the old fs freezing series
>   * converted XFS to use SB_FREEZE_TRANS counter instead of its private
>     m_active_trans counter
> 
> 								Honza
> 
> CC: Alex Elder <elder at kernel.org>
> CC: Anton Altaparmakov <anton at tuxera.com>
> CC: Ben Myers <bpm at sgi.com>
> CC: Chris Mason <chris.mason at oracle.com>
> CC: cluster-devel at redhat.com
> CC: "David S. Miller" <davem at davemloft.net>
> CC: fuse-devel at lists.sourceforge.net
> CC: "J. Bruce Fields" <bfields at fieldses.org>
> CC: Joel Becker <jlbec at evilplan.org>
> CC: KONISHI Ryusuke <konishi.ryusuke at lab.ntt.co.jp>
> CC: linux-btrfs at vger.kernel.org
> CC: linux-ext4 at vger.kernel.org
> CC: linux-nfs at vger.kernel.org
> CC: linux-nilfs at vger.kernel.org
> CC: linux-ntfs-dev at lists.sourceforge.net
> CC: Mark Fasheh <mfasheh at suse.com>
> CC: Miklos Szeredi <miklos at szeredi.hu>
> CC: ocfs2-devel at oss.oracle.com
> CC: OGAWA Hirofumi <hirofumi at mail.parknet.co.jp>
> CC: Steven Whitehouse <swhiteho at redhat.com>
> CC: "Theodore Ts'o" <tytso at mit.edu>
> CC: xfs at oss.sgi.com
-- 
Jan Kara <jack at suse.cz>
SUSE Labs, CR

Andreas Dilger

2012-Apr-16 22:02 UTC

[Ocfs2-devel] [PATCH 00/19 v5] Fix filesystem freezing deadlocks

On 2012-04-16, at 9:13 AM, Jan Kara wrote:
> Another potential contention point might be patch 19. In that patch
> we make freeze_super() refuse to freeze the filesystem when there
> are open but unlinked files which may be impractical in some cases.
> The main reason for this is the problem with handling of file deletion
> from fput() called with mmap_sem held (e.g. from munmap(2)), and
> then there's the fact that we cannot really force such filesystem
> into a consistent state... But if people think that freezing with
> open but unlinked files should happen, then I have some possible
> solutions in mind (maybe as a separate patchset since this is
> large enough).
Looking at a desktop system, I think it is very typical that there
are open-unlinked files present, so I don't know if this is really
an acceptable solution.  It isn't clear from your comments whether
this is a blanket refusal for all open-unlinked files, or only in
some particular cases...

lsof | grep deleted
nautilus  25393  adilger   19r      REG           253,0      340     253954 /home/adilger/.local/share/gvfs-metadata/home (deleted)
nautilus  25393  adilger   20r      REG           253,0    32768     253964 /home/adilger/.local/share/gvfs-metadata/home-f332a8f3.log (deleted)
gnome-ter 25623  adilger   22u      REG            0,18    17841    2717846 /tmp/vtePIRJCW (deleted)
gnome-ter 25623  adilger   23u      REG            0,18     5568    2717847 /tmp/vteDCSJCW (deleted)
gnome-ter 25623  adilger   29u      REG            0,18      480    2728484 /tmp/vte6C1TCW (deleted)

Cheers, Andreas
