thr3ads.net - Btrfs devel - [PATCH 0/8] remove i_alloc

If this information is useful, please help other people find it:
Share via:

Christoph Hellwig

2011-Jun-20 20:15 UTC

[PATCH 0/8] remove i_alloc_sem

i_alloc_sem has always been a bit of an odd "lock".  It''s the
only remaining
rw_semaphore that can be released by a different thread than the one that
locked it, and it''s use case in the core direct I/O code is more like a
counter given that the writers already have external serialization.

This series removes it in favour of a simpler counter scheme, thus getting
rid of the rw_semaphore non-owner APIs as requests by Thomas, while at the
same time shrinking the size of struct inode by 160 bytes on 64-bit systems.

The only nasty bit is that two filesystems (fat and ext4) have started
abusing the lock for their own purposes.  I''ve added a new rw_semaphore
to their filesystem-specific inode structures to keep the current behaviour,
but I suspect the maintainers might have smarted ideas to archive the same
goal.
--
To unsubscribe from this list: send the line "unsubscribe
linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Christoph Hellwig

2011-Jun-20 20:15 UTC

head link

[PATCH 1/8] far: remove i_alloc_sem abuse

Add a new rw_semaphore to protect bmap against truncate.  Previous
i_alloc_sem was abused for this, but it''s going away in this series.

Signed-off-by: Christoph Hellwig <hch@lst.de>

Index: linux-2.6/fs/fat/inode.c
==================================================================---
linux-2.6.orig/fs/fat/inode.c	2011-06-20 21:28:19.707963855 +0200
+++ linux-2.6/fs/fat/inode.c	2011-06-20 21:29:25.031293882 +0200
@@ -224,9 +224,9 @@ static sector_t _fat_bmap(struct address
 	sector_t blocknr;
 
 	/* fat_get_cluster() assumes the requested blocknr isn''t truncated.
*/
-	down_read(&mapping->host->i_alloc_sem);
+	down_read(&MSDOS_I(mapping->host)->truncate_lock);
 	blocknr = generic_block_bmap(mapping, block, fat_get_block);
-	up_read(&mapping->host->i_alloc_sem);
+	up_read(&MSDOS_I(mapping->host)->truncate_lock);
 
 	return blocknr;
 }
@@ -510,6 +510,8 @@ static struct inode *fat_alloc_inode(str
 	ei = kmem_cache_alloc(fat_inode_cachep, GFP_NOFS);
 	if (!ei)
 		return NULL;
+
+	init_rwsem(&ei->truncate_lock);
 	return &ei->vfs_inode;
 }
 
Index: linux-2.6/fs/fat/fat.h
==================================================================---
linux-2.6.orig/fs/fat/fat.h	2011-06-20 21:28:19.724630522 +0200
+++ linux-2.6/fs/fat/fat.h	2011-06-20 21:29:25.034627215 +0200
@@ -109,6 +109,7 @@ struct msdos_inode_info {
 	int i_attrs;		/* unused attribute bits */
 	loff_t i_pos;		/* on-disk position of directory entry or 0 */
 	struct hlist_node i_fat_hash;	/* hash by i_location */
+	struct rw_semaphore truncate_lock; /* protect bmap against truncate */
 	struct inode vfs_inode;
 };
 
Index: linux-2.6/fs/fat/file.c
==================================================================---
linux-2.6.orig/fs/fat/file.c	2011-06-20 21:28:19.744630521 +0200
+++ linux-2.6/fs/fat/file.c	2011-06-20 21:29:54.501292390 +0200
@@ -429,8 +429,10 @@ int fat_setattr(struct dentry *dentry, s
 	}
 
 	if (attr->ia_valid & ATTR_SIZE) {
+		down_write(&MSDOS_I(inode)->truncate_lock);
 		truncate_setsize(inode, attr->ia_size);
 		fat_truncate_blocks(inode, attr->ia_size);
+		up_write(&MSDOS_I(inode)->truncate_lock);
 	}
 
 	setattr_copy(inode, attr);

--
To unsubscribe from this list: send the line "unsubscribe linux-ext4"
in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Christoph Hellwig

2011-Jun-20 20:15 UTC

head link

[PATCH 2/8] ext4: remove i_alloc_sem abuse

Add a new rw_semaphore to protect page_mkwrite against truncate.  Previous
i_alloc_sem was abused for this, but it''s going away in this series.

Signed-off-by: Christoph Hellwig <hch@lst.de>

Index: linux-2.6/fs/ext4/inode.c
==================================================================---
linux-2.6.orig/fs/ext4/inode.c	2011-06-20 14:25:33.779248128 +0200
+++ linux-2.6/fs/ext4/inode.c	2011-06-20 14:27:53.025907745 +0200
@@ -5357,6 +5357,7 @@ int ext4_setattr(struct dentry *dentry,
 			if (attr->ia_size > sbi->s_bitmap_maxbytes)
 				return -EFBIG;
 		}
+		down_write(&(EXT4_I(inode)->truncate_lock));
 	}
 
 	if (S_ISREG(inode->i_mode) &&
@@ -5405,6 +5406,9 @@ int ext4_setattr(struct dentry *dentry,
 			ext4_truncate(inode);
 	}
 
+	if (attr->ia_valid & ATTR_SIZE)
+		up_write(&(EXT4_I(inode)->truncate_lock));
+
 	if (!rc) {
 		setattr_copy(inode, attr);
 		mark_inode_dirty(inode);
@@ -5850,10 +5854,10 @@ int ext4_page_mkwrite(struct vm_area_str
 	struct address_space *mapping = inode->i_mapping;
 
 	/*
-	 * Get i_alloc_sem to stop truncates messing with the inode. We cannot
-	 * get i_mutex because we are already holding mmap_sem.
+	 * Get truncate_lock to stop truncates messing with the inode. We
+	 * cannot get i_mutex because we are already holding mmap_sem.
 	 */
-	down_read(&inode->i_alloc_sem);
+	down_read(&(EXT4_I(inode)->truncate_lock));
 	size = i_size_read(inode);
 	if (page->mapping != mapping || size <= page_offset(page)
 	    || !PageUptodate(page)) {
@@ -5865,7 +5869,7 @@ int ext4_page_mkwrite(struct vm_area_str
 	lock_page(page);
 	wait_on_page_writeback(page);
 	if (PageMappedToDisk(page)) {
-		up_read(&inode->i_alloc_sem);
+		up_read(&(EXT4_I(inode)->truncate_lock));
 		return VM_FAULT_LOCKED;
 	}
 
@@ -5883,7 +5887,7 @@ int ext4_page_mkwrite(struct vm_area_str
 	if (page_has_buffers(page)) {
 		if (!walk_page_buffers(NULL, page_buffers(page), 0, len, NULL,
 					ext4_bh_unmapped)) {
-			up_read(&inode->i_alloc_sem);
+			up_read(&(EXT4_I(inode)->truncate_lock));
 			return VM_FAULT_LOCKED;
 		}
 	}
@@ -5912,11 +5916,11 @@ int ext4_page_mkwrite(struct vm_area_str
 	 */
 	lock_page(page);
 	wait_on_page_writeback(page);
-	up_read(&inode->i_alloc_sem);
+	up_read(&(EXT4_I(inode)->truncate_lock));
 	return VM_FAULT_LOCKED;
 out_unlock:
 	if (ret)
 		ret = VM_FAULT_SIGBUS;
-	up_read(&inode->i_alloc_sem);
+	up_read(&(EXT4_I(inode)->truncate_lock));
 	return ret;
 }
Index: linux-2.6/fs/ext4/super.c
==================================================================---
linux-2.6.orig/fs/ext4/super.c	2011-06-20 14:25:33.795914793 +0200
+++ linux-2.6/fs/ext4/super.c	2011-06-20 14:27:06.269243443 +0200
@@ -871,6 +871,7 @@ static struct inode *ext4_alloc_inode(st
 	ei->i_datasync_tid = 0;
 	atomic_set(&ei->i_ioend_count, 0);
 	atomic_set(&ei->i_aiodio_unwritten, 0);
+	init_rwsem(&ei->truncate_lock);
 
 	return &ei->vfs_inode;
 }
Index: linux-2.6/fs/ext4/ext4.h
==================================================================---
linux-2.6.orig/fs/ext4/ext4.h	2011-06-20 14:25:33.752581461 +0200
+++ linux-2.6/fs/ext4/ext4.h	2011-06-20 14:27:06.272576777 +0200
@@ -844,6 +844,9 @@ struct ext4_inode_info {
 	/* on-disk additional length */
 	__u16 i_extra_isize;
 
+	/* protect page_mkwrite against truncates */
+	struct rw_semaphore truncate_lock;
+
 #ifdef CONFIG_QUOTA
 	/* quota space reservation, managed internally by quota code */
 	qsize_t i_reserved_quota;

--
To unsubscribe from this list: send the line "unsubscribe
linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Christoph Hellwig

2011-Jun-20 20:15 UTC

head link

[PATCH 3/8] fs: simpler handling of zero sized reads in __blockdev_direct_IO

Reject zero sized reads as soon as we know our I/O length, and don''t
borther with locks or allocations that might have to be cleaned up
otherwise.

Signed-off-by: Christoph Hellwig <hch@lst.de>

Index: linux-2.6/fs/direct-io.c
==================================================================---
linux-2.6.orig/fs/direct-io.c	2011-06-11 12:10:10.205165161 +0200
+++ linux-2.6/fs/direct-io.c	2011-06-11 12:12:49.161823781 +0200
@@ -1200,6 +1200,10 @@ __blockdev_direct_IO(int rw, struct kioc
 		}
 	}
 
+	/* watch out for a 0 len io from a tricksy fs */
+	if (rw == READ && end == offset)
+		return 0;
+
 	dio = kmalloc(sizeof(*dio), GFP_KERNEL);
 	retval = -ENOMEM;
 	if (!dio)
@@ -1213,8 +1217,7 @@ __blockdev_direct_IO(int rw, struct kioc
 
 	dio->flags = flags;
 	if (dio->flags & DIO_LOCKING) {
-		/* watch out for a 0 len io from a tricksy fs */
-		if (rw == READ && end > offset) {
+		if (rw == READ) {
 			struct address_space *mapping  					iocb->ki_filp->f_mapping;
 

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs"
in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Christoph Hellwig

2011-Jun-20 20:15 UTC

head link

[PATCH 4/8] fs: kill i_alloc_sem

i_alloc_sem is a rather special rw_semaphore.  It''s the last one that
may
be released by a non-owner, and it''s write side is always mirrored by
real exclusion.  It''s intended use it to wait for all pending direct
I/O
requests to finish before starting a truncate.

Replace it with a hand-grown construct:

 - exclusion for truncates is already guaranteed by i_mutex, so it can
   simply fall way
 - the reader side is replaced by an i_dio_count member in struct inode
   that counts the number of pending direct I/O requests.  Truncate
can''t
   proceed as long as it''s non-zero
 - when i_dio_count reaches non-zero we wake up a pending truncate using
   wake_up_bit on a new bit in i_flags
 - new references to i_dio_count can''t appear while we are waiting for
   it to read zero because the direct I/O count always needs i_mutex
   (or an equivalent like XFS''s i_iolock) for starting a new operation.

This scheme is much simpler, and saves the space of a spinlock_t and a
struct list_head in struct inode (typically 160 bytes on a non-debug 64-bit
system).

Signed-off-by: Christoph Hellwig <hch@lst.de>

Index: linux-2.6/fs/direct-io.c
==================================================================---
linux-2.6.orig/fs/direct-io.c	2011-06-20 14:55:31.000000000 +0200
+++ linux-2.6/fs/direct-io.c	2011-06-20 14:55:34.602490284 +0200
@@ -136,6 +136,27 @@ struct dio {
 };
 
 /*
+ * Wait for outstanding DIO requests to finish.  Must be locked against
+ * increments of i_dio_count by i_mutex.
+ */
+void inode_dio_wait(struct inode *inode)
+{
+	might_sleep();
+	while (atomic_read(&inode->i_dio_count)) {
+		wait_on_bit(&inode->i_state, __I_DIO_WAKEUP, inode_wait,
+			    TASK_UNINTERRUPTIBLE);
+	}
+}
+EXPORT_SYMBOL_GPL(inode_dio_wait);
+
+void inode_dio_wake(struct inode *inode)
+{
+	if (atomic_dec_and_test(&inode->i_dio_count))
+		wake_up_bit(&inode->i_state, __I_DIO_WAKEUP);
+}
+EXPORT_SYMBOL_GPL(inode_dio_wake);
+
+/*
  * How many pages are in the queue?
  */
 static inline unsigned dio_pages_present(struct dio *dio)
@@ -254,9 +275,7 @@ static ssize_t dio_complete(struct dio *
 	}
 
 	if (dio->flags & DIO_LOCKING)
-		/* lockdep: non-owner release */
-		up_read_non_owner(&dio->inode->i_alloc_sem);
-
+		inode_dio_wake(dio->inode);
 	return ret;
 }
 
@@ -980,9 +999,6 @@ out:
 	return ret;
 }
 
-/*
- * Releases both i_mutex and i_alloc_sem
- */
 static ssize_t
 direct_io_worker(int rw, struct kiocb *iocb, struct inode *inode, 
 	const struct iovec *iov, loff_t offset, unsigned long nr_segs, 
@@ -1146,15 +1162,14 @@ direct_io_worker(int rw, struct kiocb *i
  *    For writes this function is called under i_mutex and returns with
  *    i_mutex held, for reads, i_mutex is not held on entry, but it is
  *    taken and dropped again before returning.
- *    For reads and writes i_alloc_sem is taken in shared mode and released
- *    on I/O completion (which may happen asynchronously after returning to
- *    the caller).
+ *    The i_dio_count counter keeps track of the number of outstanding
+ *    direct I/O requests, and truncate waits for it to reach zero.
+ *    New references to i_dio_count must only be grabbed with i_mutex
+ *    held.
  *
  *  - if the flags value does NOT contain DIO_LOCKING we don''t use any
  *    internal locking but rather rely on the filesystem to synchronize
  *    direct I/O reads/writes versus each other and truncate.
- *    For reads and writes both i_mutex and i_alloc_sem are not held on
- *    entry and are never taken.
  */
 ssize_t
 __blockdev_direct_IO(int rw, struct kiocb *iocb, struct inode *inode,
@@ -1234,10 +1249,9 @@ __blockdev_direct_IO(int rw, struct kioc
 		}
 
 		/*
-		 * Will be released at I/O completion, possibly in a
-		 * different thread.
+		 * Will be decremented at I/O completion time.
 		 */
-		down_read_non_owner(&inode->i_alloc_sem);
+		atomic_inc(&inode->i_dio_count);
 	}
 
 	/*
Index: linux-2.6/mm/filemap.c
==================================================================---
linux-2.6.orig/mm/filemap.c	2011-06-20 14:19:27.019266696 +0200
+++ linux-2.6/mm/filemap.c	2011-06-20 14:55:34.605823617 +0200
@@ -78,9 +78,6 @@
  *  ->i_mutex			(generic_file_buffered_write)
  *    ->mmap_sem		(fault_in_pages_readable->do_page_fault)
  *
- *  ->i_mutex
- *    ->i_alloc_sem             (various)
- *
  *  inode_wb_list_lock
  *    sb_lock			(fs/fs-writeback.c)
  *    ->mapping->tree_lock	(__sync_single_inode)
Index: linux-2.6/mm/rmap.c
==================================================================---
linux-2.6.orig/mm/rmap.c	2011-06-20 14:19:27.000000000 +0200
+++ linux-2.6/mm/rmap.c	2011-06-20 14:55:34.605823617 +0200
@@ -21,7 +21,6 @@
  * Lock ordering in mm:
  *
  * inode->i_mutex	(while writing or truncating, not reading or faulting)
- *   inode->i_alloc_sem (vmtruncate_range)
  *   mm->mmap_sem
  *     page->flags PG_locked (lock_page)
  *       mapping->i_mmap_mutex
Index: linux-2.6/fs/attr.c
==================================================================---
linux-2.6.orig/fs/attr.c	2011-06-20 14:19:26.000000000 +0200
+++ linux-2.6/fs/attr.c	2011-06-20 14:55:34.609156951 +0200
@@ -233,16 +233,13 @@ int notify_change(struct dentry * dentry
 		return error;
 
 	if (ia_valid & ATTR_SIZE)
-		down_write(&dentry->d_inode->i_alloc_sem);
+		inode_dio_wait(inode);
 
 	if (inode->i_op->setattr)
 		error = inode->i_op->setattr(dentry, attr);
 	else
 		error = simple_setattr(dentry, attr);
 
-	if (ia_valid & ATTR_SIZE)
-		up_write(&dentry->d_inode->i_alloc_sem);
-
 	if (!error)
 		fsnotify_change(dentry, ia_valid);
 
Index: linux-2.6/fs/ntfs/file.c
==================================================================---
linux-2.6.orig/fs/ntfs/file.c	2011-06-20 14:19:26.000000000 +0200
+++ linux-2.6/fs/ntfs/file.c	2011-06-20 14:55:34.609156951 +0200
@@ -1832,9 +1832,8 @@ static ssize_t ntfs_file_buffered_write(
 	 * fails again.
 	 */
 	if (unlikely(NInoTruncateFailed(ni))) {
-		down_write(&vi->i_alloc_sem);
+		inode_dio_wait(vi);
 		err = ntfs_truncate(vi);
-		up_write(&vi->i_alloc_sem);
 		if (err || NInoTruncateFailed(ni)) {
 			if (!err)
 				err = -EIO;
Index: linux-2.6/fs/reiserfs/xattr.c
==================================================================---
linux-2.6.orig/fs/reiserfs/xattr.c	2011-06-20 14:19:26.000000000 +0200
+++ linux-2.6/fs/reiserfs/xattr.c	2011-06-20 14:55:34.612490285 +0200
@@ -555,11 +555,10 @@ reiserfs_xattr_set_handle(struct reiserf
 
 		reiserfs_write_unlock(inode->i_sb);
 		mutex_lock_nested(&dentry->d_inode->i_mutex, I_MUTEX_XATTR);
-		down_write(&dentry->d_inode->i_alloc_sem);
+		inode_dio_wait(dentry->d_inode);
 		reiserfs_write_lock(inode->i_sb);
 
 		err = reiserfs_setattr(dentry, &newattrs);
-		up_write(&dentry->d_inode->i_alloc_sem);
 		mutex_unlock(&dentry->d_inode->i_mutex);
 	} else
 		update_ctime(inode);
Index: linux-2.6/include/linux/fs.h
==================================================================---
linux-2.6.orig/include/linux/fs.h	2011-06-20 14:19:27.000000000 +0200
+++ linux-2.6/include/linux/fs.h	2011-06-20 14:55:34.615823619 +0200
@@ -776,7 +776,7 @@ struct inode {
 	struct timespec		i_ctime;
 	blkcnt_t		i_blocks;
 	unsigned short          i_bytes;
-	struct rw_semaphore	i_alloc_sem;
+	atomic_t		i_dio_count;
 	const struct file_operations	*i_fop;	/* former ->i_op->default_file_ops
*/
 	struct file_lock	*i_flock;
 	struct address_space	*i_mapping;
@@ -1692,6 +1692,10 @@ struct super_operations {
  *			set during data writeback, and cleared with a wakeup
  *			on the bit address once it is done.
  *
+ * I_REFERENCED		Marks the inode as recently references on the LRU list.
+ *
+ * I_DIO_WAKEUP		Never set.  Only used as a key for wait_on_bit().
+ *
  * Q: What is the difference between I_WILL_FREE and I_FREEING?
  */
 #define I_DIRTY_SYNC		(1 << 0)
@@ -1705,6 +1709,8 @@ struct super_operations {
 #define __I_SYNC		7
 #define I_SYNC			(1 << __I_SYNC)
 #define I_REFERENCED		(1 << 8)
+#define __I_DIO_WAKEUP		9
+#define I_DIO_WAKEUP		(1 << I_DIO_WAKEUP)
 
 #define I_DIRTY (I_DIRTY_SYNC | I_DIRTY_DATASYNC | I_DIRTY_PAGES)
 
@@ -1815,7 +1821,6 @@ struct file_system_type {
 	struct lock_class_key i_lock_key;
 	struct lock_class_key i_mutex_key;
 	struct lock_class_key i_mutex_dir_key;
-	struct lock_class_key i_alloc_sem_key;
 };
 
 extern struct dentry *mount_ns(struct file_system_type *fs_type, int flags,
@@ -2367,6 +2372,8 @@ enum {
 };
 
 void dio_end_io(struct bio *bio, int error);
+void inode_dio_wait(struct inode *inode);
+void inode_dio_wake(struct inode *inode);
 
 ssize_t __blockdev_direct_IO(int rw, struct kiocb *iocb, struct inode *inode,
 	struct block_device *bdev, const struct iovec *iov, loff_t offset,
Index: linux-2.6/mm/memory.c
==================================================================---
linux-2.6.orig/mm/memory.c	2011-06-20 14:19:27.000000000 +0200
+++ linux-2.6/mm/memory.c	2011-06-20 14:55:34.619156952 +0200
@@ -2811,12 +2811,11 @@ int vmtruncate_range(struct inode *inode
 		return -ENOSYS;
 
 	mutex_lock(&inode->i_mutex);
-	down_write(&inode->i_alloc_sem);
+	inode_dio_wait(inode);
 	unmap_mapping_range(mapping, offset, (end - offset), 1);
 	truncate_inode_pages_range(mapping, offset, end);
 	unmap_mapping_range(mapping, offset, (end - offset), 1);
 	inode->i_op->truncate_range(inode, offset, end);
-	up_write(&inode->i_alloc_sem);
 	mutex_unlock(&inode->i_mutex);
 
 	return 0;
Index: linux-2.6/fs/inode.c
==================================================================---
linux-2.6.orig/fs/inode.c	2011-06-20 14:19:26.000000000 +0200
+++ linux-2.6/fs/inode.c	2011-06-20 14:55:34.625823618 +0200
@@ -176,8 +176,7 @@ int inode_init_always(struct super_block
 	mutex_init(&inode->i_mutex);
 	lockdep_set_class(&inode->i_mutex, &sb->s_type->i_mutex_key);
 
-	init_rwsem(&inode->i_alloc_sem);
-	lockdep_set_class(&inode->i_alloc_sem,
&sb->s_type->i_alloc_sem_key);
+	atomic_set(&inode->i_dio_count, 0);
 
 	mapping->a_ops = &empty_aops;
 	mapping->host = inode;
Index: linux-2.6/fs/ntfs/inode.c
==================================================================---
linux-2.6.orig/fs/ntfs/inode.c	2011-06-20 14:19:26.000000000 +0200
+++ linux-2.6/fs/ntfs/inode.c	2011-06-20 14:55:34.629156951 +0200
@@ -2357,12 +2357,7 @@ static const char *es = "  Leaving incon
  *
  * Returns 0 on success or -errno on error.
  *
- * Called with ->i_mutex held.  In all but one case ->i_alloc_sem is held
for
- * writing.  The only case in the kernel where ->i_alloc_sem is not held is
- * mm/filemap.c::generic_file_buffered_write() where vmtruncate() is called
- * with the current i_size as the offset.  The analogous place in NTFS is in
- * fs/ntfs/file.c::ntfs_file_buffered_write() where we call vmtruncate() again
- * without holding ->i_alloc_sem.
+ * Called with ->i_mutex held.
  */
 int ntfs_truncate(struct inode *vi)
 {
@@ -2887,8 +2882,7 @@ void ntfs_truncate_vfs(struct inode *vi)
  * We also abort all changes of user, group, and mode as we do not implement
  * the NTFS ACLs yet.
  *
- * Called with ->i_mutex held.  For the ATTR_SIZE (i.e. ->truncate) case,
also
- * called with ->i_alloc_sem held for writing.
+ * Called with ->i_mutex held.
  */
 int ntfs_setattr(struct dentry *dentry, struct iattr *attr)
 {
Index: linux-2.6/fs/ocfs2/aops.c
==================================================================---
linux-2.6.orig/fs/ocfs2/aops.c	2011-06-20 14:19:27.000000000 +0200
+++ linux-2.6/fs/ocfs2/aops.c	2011-06-20 14:55:34.629156951 +0200
@@ -551,9 +551,8 @@ bail:
 
 /*
  * ocfs2_dio_end_io is called by the dio core when a dio is finished. 
We''re
- * particularly interested in the aio/dio case.  Like the core uses
- * i_alloc_sem, we use the rw_lock DLM lock to protect io on one node from
- * truncation on another.
+ * particularly interested in the aio/dio case.  We use the rw_lock DLM lock
+ * to protect io on one node from truncation on another.
  */
 static void ocfs2_dio_end_io(struct kiocb *iocb,
 			     loff_t offset,
@@ -569,7 +568,7 @@ static void ocfs2_dio_end_io(struct kioc
 	BUG_ON(!ocfs2_iocb_is_rw_locked(iocb));
 
 	if (ocfs2_iocb_is_sem_locked(iocb)) {
-		up_read(&inode->i_alloc_sem);
+		inode_dio_wake(inode);
 		ocfs2_iocb_clear_sem_locked(iocb);
 	}
 
Index: linux-2.6/fs/ocfs2/file.c
==================================================================---
linux-2.6.orig/fs/ocfs2/file.c	2011-06-20 14:19:27.000000000 +0200
+++ linux-2.6/fs/ocfs2/file.c	2011-06-20 14:55:34.635823617 +0200
@@ -2236,9 +2236,9 @@ static ssize_t ocfs2_file_aio_write(stru
 	ocfs2_iocb_clear_sem_locked(iocb);
 
 relock:
-	/* to match setattr''s i_mutex -> i_alloc_sem -> rw_lock
ordering */
+	/* to match setattr''s i_mutex -> rw_lock ordering */
 	if (direct_io) {
-		down_read(&inode->i_alloc_sem);
+		atomic_inc(&inode->i_dio_count);
 		have_alloc_sem = 1;
 		/* communicate with ocfs2_dio_end_io */
 		ocfs2_iocb_set_sem_locked(iocb);
@@ -2290,7 +2290,7 @@ relock:
 	 */
 	if (direct_io && !can_do_direct) {
 		ocfs2_rw_unlock(inode, rw_level);
-		up_read(&inode->i_alloc_sem);
+		inode_dio_wake(inode);
 
 		have_alloc_sem = 0;
 		rw_level = -1;
@@ -2361,8 +2361,7 @@ out_dio:
 	/*
 	 * deep in g_f_a_w_n()->ocfs2_direct_IO we pass in a ocfs2_dio_end_io
 	 * function pointer which is called when o_direct io completes so that
-	 * it can unlock our rw lock.  (it''s the clustered equivalent of
-	 * i_alloc_sem; protects truncate from racing with pending ios).
+	 * it can unlock our rw lock.
 	 * Unfortunately there are error cases which call end_io and others
 	 * that don''t.  so we don''t have to unlock the rw_lock if
either an
 	 * async dio is going to do it in the future or an end_io after an
@@ -2379,7 +2378,7 @@ out:
 
 out_sems:
 	if (have_alloc_sem) {
-		up_read(&inode->i_alloc_sem);
+		inode_dio_wake(inode);
 		ocfs2_iocb_clear_sem_locked(iocb);
 	}
 
@@ -2531,8 +2530,8 @@ static ssize_t ocfs2_file_aio_read(struc
 	 * need locks to protect pending reads from racing with truncate.
 	 */
 	if (filp->f_flags & O_DIRECT) {
-		down_read(&inode->i_alloc_sem);
 		have_alloc_sem = 1;
+		atomic_inc(&inode->i_dio_count);
 		ocfs2_iocb_set_sem_locked(iocb);
 
 		ret = ocfs2_rw_lock(inode, 0);
@@ -2575,7 +2574,7 @@ static ssize_t ocfs2_file_aio_read(struc
 
 bail:
 	if (have_alloc_sem) {
-		up_read(&inode->i_alloc_sem);
+		inode_dio_wake(inode);
 		ocfs2_iocb_clear_sem_locked(iocb);
 	}
 	if (rw_level != -1)
Index: linux-2.6/mm/madvise.c
==================================================================---
linux-2.6.orig/mm/madvise.c	2011-06-20 14:19:27.000000000 +0200
+++ linux-2.6/mm/madvise.c	2011-06-20 14:55:34.635823617 +0200
@@ -218,7 +218,7 @@ static long madvise_remove(struct vm_are
 	endoff = (loff_t)(end - vma->vm_start - 1)
 			+ ((loff_t)vma->vm_pgoff << PAGE_SHIFT);
 
-	/* vmtruncate_range needs to take i_mutex and i_alloc_sem */
+	/* vmtruncate_range needs to take i_mutex */
 	up_read(&current->mm->mmap_sem);
 	error = vmtruncate_range(mapping->host, offset, endoff);
 	down_read(&current->mm->mmap_sem);

--
To unsubscribe from this list: send the line "unsubscribe linux-ext4"
in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Christoph Hellwig

2011-Jun-20 20:15 UTC

head link

[PATCH 5/8] fs: move inode_dio_wait calls into ->setattr

Let filesystems handle waiting for direct I/O requests themselves instead
of doing it beforehand.  This means filesystem-specific locks to prevent
new dio referenes from appearing can be held.  This is important to allow
generalizing i_dio_count to non-DIO_LOCKING filesystems.

Signed-off-by: Christoph Hellwig <hch@lst.de>

Index: linux-2.6/fs/ocfs2/file.c
==================================================================---
linux-2.6.orig/fs/ocfs2/file.c	2011-06-20 09:28:54.516815966 +0200
+++ linux-2.6/fs/ocfs2/file.c	2011-06-20 09:31:34.706807855 +0200
@@ -1142,6 +1142,8 @@ int ocfs2_setattr(struct dentry *dentry,
 		if (status)
 			goto bail_unlock;
 
+		inode_dio_wait(inode);
+
 		if (i_size_read(inode) > attr->ia_size) {
 			if (ocfs2_should_order_data(inode)) {
 				status = ocfs2_begin_ordered_truncate(inode,
Index: linux-2.6/fs/attr.c
==================================================================---
linux-2.6.orig/fs/attr.c	2011-06-20 09:28:54.490149300 +0200
+++ linux-2.6/fs/attr.c	2011-06-20 09:29:06.000000000 +0200
@@ -232,9 +232,6 @@ int notify_change(struct dentry * dentry
 	if (error)
 		return error;
 
-	if (ia_valid & ATTR_SIZE)
-		inode_dio_wait(inode);
-
 	if (inode->i_op->setattr)
 		error = inode->i_op->setattr(dentry, attr);
 	else
Index: linux-2.6/fs/ext2/inode.c
==================================================================---
linux-2.6.orig/fs/ext2/inode.c	2011-06-18 12:54:28.058273680 +0200
+++ linux-2.6/fs/ext2/inode.c	2011-06-20 09:29:06.500148692 +0200
@@ -1184,6 +1184,8 @@ static int ext2_setsize(struct inode *in
 	if (IS_APPEND(inode) || IS_IMMUTABLE(inode))
 		return -EPERM;
 
+	inode_dio_wait(inode);
+
 	if (mapping_is_xip(inode->i_mapping))
 		error = xip_truncate_page(inode->i_mapping, newsize);
 	else if (test_opt(inode->i_sb, NOBH))
Index: linux-2.6/fs/ext3/inode.c
==================================================================---
linux-2.6.orig/fs/ext3/inode.c	2011-06-18 12:54:28.071607014 +0200
+++ linux-2.6/fs/ext3/inode.c	2011-06-20 09:29:06.500148692 +0200
@@ -3216,6 +3216,9 @@ int ext3_setattr(struct dentry *dentry,
 		ext3_journal_stop(handle);
 	}
 
+	if (attr->ia_valid & ATTR_SIZE)
+		inode_dio_wait(inode);
+
 	if (S_ISREG(inode->i_mode) &&
 	    attr->ia_valid & ATTR_SIZE && attr->ia_size <
inode->i_size) {
 		handle_t *handle;
Index: linux-2.6/fs/ext4/inode.c
==================================================================---
linux-2.6.orig/fs/ext4/inode.c	2011-06-20 09:28:54.506815967 +0200
+++ linux-2.6/fs/ext4/inode.c	2011-06-20 09:29:06.000000000 +0200
@@ -5351,6 +5351,8 @@ int ext4_setattr(struct dentry *dentry,
 	}
 
 	if (attr->ia_valid & ATTR_SIZE) {
+		inode_dio_wait(inode);
+
 		if (!(ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS))) {
 			struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb);
 
Index: linux-2.6/fs/fat/file.c
==================================================================---
linux-2.6.orig/fs/fat/file.c	2011-06-18 12:54:28.118273678 +0200
+++ linux-2.6/fs/fat/file.c	2011-06-20 09:29:06.000000000 +0200
@@ -397,6 +397,8 @@ int fat_setattr(struct dentry *dentry, s
 	 * sequence.
 	 */
 	if (attr->ia_valid & ATTR_SIZE) {
+		inode_dio_wait(inode);
+
 		if (attr->ia_size > inode->i_size) {
 			error = fat_cont_expand(inode, attr->ia_size);
 			if (error || attr->ia_valid == ATTR_SIZE)
Index: linux-2.6/fs/gfs2/bmap.c
==================================================================---
linux-2.6.orig/fs/gfs2/bmap.c	2011-06-18 12:54:28.141607009 +0200
+++ linux-2.6/fs/gfs2/bmap.c	2011-06-20 09:29:06.510148693 +0200
@@ -1224,6 +1224,8 @@ int gfs2_setattr_size(struct inode *inod
 	if (ret)
 		return ret;
 
+	inode_dio_wait(inode);
+
 	oldsize = inode->i_size;
 	if (newsize >= oldsize)
 		return do_grow(inode, newsize);
Index: linux-2.6/fs/hfs/inode.c
==================================================================---
linux-2.6.orig/fs/hfs/inode.c	2011-06-18 12:54:28.154940342 +0200
+++ linux-2.6/fs/hfs/inode.c	2011-06-20 09:29:06.000000000 +0200
@@ -615,6 +615,8 @@ int hfs_inode_setattr(struct dentry *den
 
 	if ((attr->ia_valid & ATTR_SIZE) &&
 	    attr->ia_size != i_size_read(inode)) {
+		inode_dio_wait(inode);
+
 		error = vmtruncate(inode, attr->ia_size);
 		if (error)
 			return error;
Index: linux-2.6/fs/hfsplus/inode.c
==================================================================---
linux-2.6.orig/fs/hfsplus/inode.c	2011-06-18 12:54:28.168273676 +0200
+++ linux-2.6/fs/hfsplus/inode.c	2011-06-20 09:29:06.000000000 +0200
@@ -296,6 +296,8 @@ static int hfsplus_setattr(struct dentry
 
 	if ((attr->ia_valid & ATTR_SIZE) &&
 	    attr->ia_size != i_size_read(inode)) {
+		inode_dio_wait(inode);
+
 		error = vmtruncate(inode, attr->ia_size);
 		if (error)
 			return error;
Index: linux-2.6/fs/jfs/file.c
==================================================================---
linux-2.6.orig/fs/jfs/file.c	2011-06-18 12:54:28.191607007 +0200
+++ linux-2.6/fs/jfs/file.c	2011-06-20 09:29:06.000000000 +0200
@@ -110,6 +110,8 @@ int jfs_setattr(struct dentry *dentry, s
 
 	if ((iattr->ia_valid & ATTR_SIZE) &&
 	    iattr->ia_size != i_size_read(inode)) {
+		inode_dio_wait(inode);
+
 		rc = vmtruncate(inode, iattr->ia_size);
 		if (rc)
 			return rc;
Index: linux-2.6/fs/nilfs2/inode.c
==================================================================---
linux-2.6.orig/fs/nilfs2/inode.c	2011-06-18 12:54:28.204940339 +0200
+++ linux-2.6/fs/nilfs2/inode.c	2011-06-20 09:29:06.000000000 +0200
@@ -778,6 +778,8 @@ int nilfs_setattr(struct dentry *dentry,
 
 	if ((iattr->ia_valid & ATTR_SIZE) &&
 	    iattr->ia_size != i_size_read(inode)) {
+		inode_dio_wait(inode);
+
 		err = vmtruncate(inode, iattr->ia_size);
 		if (unlikely(err))
 			goto out_err;
Index: linux-2.6/fs/reiserfs/inode.c
==================================================================---
linux-2.6.orig/fs/reiserfs/inode.c	2011-06-18 12:54:28.218273673 +0200
+++ linux-2.6/fs/reiserfs/inode.c	2011-06-20 09:29:06.000000000 +0200
@@ -3114,6 +3114,9 @@ int reiserfs_setattr(struct dentry *dent
 			error = -EFBIG;
 			goto out;
 		}
+
+		inode_dio_wait(inode);
+
 		/* fill in hole pointers in the expanding truncate case. */
 		if (attr->ia_size > inode->i_size) {
 			error = generic_cont_expand_simple(inode, attr->ia_size);

--
To unsubscribe from this list: send the line "unsubscribe linux-ext4"
in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Christoph Hellwig

2011-Jun-20 20:15 UTC

head link

[PATCH 6/8] fs: always maintain i_dio_count

Maintain i_dio_count for all filesystems, not just those using DIO_LOCKING.
This these filesystems to also protect truncate against direct I/O requests
by using common code.  Right now the only non-DIO_LOCKING filesystem that
appears to do so is XFS, which uses an opencoded variant of the i_dio_count
scheme.

Behaviour doesn''t change for filesystems never calling inode_dio_wait,
which are all that never use DIO_LOCKING.

For ext4 behaviour changes with the dioread_nonlock option, which previous
was missing any protection between truncate and direct I/O reads.

For ocfs2 that handcrafted i_dio_count manipulations are replaced with
the common code noew available.

As a result inode_dio_wake can now be made static in direct-io.c.

Signed-off-by: Christoph Hellwig <hch@lst.de>

Index: linux-2.6/fs/direct-io.c
==================================================================---
linux-2.6.orig/fs/direct-io.c	2011-06-20 14:55:34.602490284 +0200
+++ linux-2.6/fs/direct-io.c	2011-06-20 14:57:24.575818051 +0200
@@ -149,12 +149,11 @@ void inode_dio_wait(struct inode *inode)
 }
 EXPORT_SYMBOL_GPL(inode_dio_wait);
 
-void inode_dio_wake(struct inode *inode)
+static inline void inode_dio_wake(struct inode *inode)
 {
 	if (atomic_dec_and_test(&inode->i_dio_count))
 		wake_up_bit(&inode->i_state, __I_DIO_WAKEUP);
 }
-EXPORT_SYMBOL_GPL(inode_dio_wake);
 
 /*
  * How many pages are in the queue?
@@ -274,8 +273,7 @@ static ssize_t dio_complete(struct dio *
 		aio_complete(dio->iocb, ret, 0);
 	}
 
-	if (dio->flags & DIO_LOCKING)
-		inode_dio_wake(dio->inode);
+	inode_dio_wake(dio->inode);
 	return ret;
 }
 
@@ -1162,14 +1160,16 @@ direct_io_worker(int rw, struct kiocb *i
  *    For writes this function is called under i_mutex and returns with
  *    i_mutex held, for reads, i_mutex is not held on entry, but it is
  *    taken and dropped again before returning.
- *    The i_dio_count counter keeps track of the number of outstanding
- *    direct I/O requests, and truncate waits for it to reach zero.
- *    New references to i_dio_count must only be grabbed with i_mutex
- *    held.
- *
  *  - if the flags value does NOT contain DIO_LOCKING we don''t use any
  *    internal locking but rather rely on the filesystem to synchronize
  *    direct I/O reads/writes versus each other and truncate.
+ *
+ * To help with locking against truncate we incremented the i_dio_count
+ * counter before starting direct I/O, and decrement it once we are done.
+ * Truncate can wait for it to reach zero to provide exclusion.  It is
+ * expected that filesystem provide exclusion between new direct I/O
+ * and truncates.  For DIO_LOCKING filesystems this is done by i_mutex,
+ * but other filesystems need to take care of this on their own.
  */
 ssize_t
 __blockdev_direct_IO(int rw, struct kiocb *iocb, struct inode *inode,
@@ -1247,14 +1247,14 @@ __blockdev_direct_IO(int rw, struct kioc
 				goto out;
 			}
 		}
-
-		/*
-		 * Will be decremented at I/O completion time.
-		 */
-		atomic_inc(&inode->i_dio_count);
 	}
 
 	/*
+	 * Will be decremented at I/O completion time.
+	 */
+	atomic_inc(&inode->i_dio_count);
+
+	/*
 	 * For file extending writes updating i_size before data
 	 * writeouts complete can expose uninitialized blocks. So
 	 * even for AIO, we need to wait for i/o to complete before
Index: linux-2.6/fs/ocfs2/aops.c
==================================================================---
linux-2.6.orig/fs/ocfs2/aops.c	2011-06-20 14:55:34.629156951 +0200
+++ linux-2.6/fs/ocfs2/aops.c	2011-06-20 14:56:59.259152666 +0200
@@ -567,10 +567,8 @@ static void ocfs2_dio_end_io(struct kioc
 	/* this io''s submitter should not have unlocked this before we could
*/
 	BUG_ON(!ocfs2_iocb_is_rw_locked(iocb));
 
-	if (ocfs2_iocb_is_sem_locked(iocb)) {
-		inode_dio_wake(inode);
+	if (ocfs2_iocb_is_sem_locked(iocb))
 		ocfs2_iocb_clear_sem_locked(iocb);
-	}
 
 	ocfs2_iocb_clear_rw_locked(iocb);
 
Index: linux-2.6/fs/ocfs2/file.c
==================================================================---
linux-2.6.orig/fs/ocfs2/file.c	2011-06-20 14:56:55.375819530 +0200
+++ linux-2.6/fs/ocfs2/file.c	2011-06-20 14:56:59.262485999 +0200
@@ -2240,7 +2240,6 @@ static ssize_t ocfs2_file_aio_write(stru
 relock:
 	/* to match setattr''s i_mutex -> rw_lock ordering */
 	if (direct_io) {
-		atomic_inc(&inode->i_dio_count);
 		have_alloc_sem = 1;
 		/* communicate with ocfs2_dio_end_io */
 		ocfs2_iocb_set_sem_locked(iocb);
@@ -2292,7 +2291,6 @@ relock:
 	 */
 	if (direct_io && !can_do_direct) {
 		ocfs2_rw_unlock(inode, rw_level);
-		inode_dio_wake(inode);
 
 		have_alloc_sem = 0;
 		rw_level = -1;
@@ -2379,10 +2377,8 @@ out:
 		ocfs2_rw_unlock(inode, rw_level);
 
 out_sems:
-	if (have_alloc_sem) {
-		inode_dio_wake(inode);
+	if (have_alloc_sem)
 		ocfs2_iocb_clear_sem_locked(iocb);
-	}
 
 	mutex_unlock(&inode->i_mutex);
 
@@ -2533,7 +2529,6 @@ static ssize_t ocfs2_file_aio_read(struc
 	 */
 	if (filp->f_flags & O_DIRECT) {
 		have_alloc_sem = 1;
-		atomic_inc(&inode->i_dio_count);
 		ocfs2_iocb_set_sem_locked(iocb);
 
 		ret = ocfs2_rw_lock(inode, 0);
@@ -2575,10 +2570,9 @@ static ssize_t ocfs2_file_aio_read(struc
 	}
 
 bail:
-	if (have_alloc_sem) {
-		inode_dio_wake(inode);
+	if (have_alloc_sem)
 		ocfs2_iocb_clear_sem_locked(iocb);
-	}
+
 	if (rw_level != -1)
 		ocfs2_rw_unlock(inode, rw_level);
 
Index: linux-2.6/include/linux/fs.h
==================================================================---
linux-2.6.orig/include/linux/fs.h	2011-06-20 14:57:08.582485528 +0200
+++ linux-2.6/include/linux/fs.h	2011-06-20 14:57:10.099152117 +0200
@@ -2373,7 +2373,6 @@ enum {
 
 void dio_end_io(struct bio *bio, int error);
 void inode_dio_wait(struct inode *inode);
-void inode_dio_wake(struct inode *inode);
 
 ssize_t __blockdev_direct_IO(int rw, struct kiocb *iocb, struct inode *inode,
 	struct block_device *bdev, const struct iovec *iov, loff_t offset,

--
To unsubscribe from this list: send the line "unsubscribe
linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Christoph Hellwig

2011-Jun-20 20:15 UTC

head link

[PATCH 7/8] btrfs: wait for direct I/O requests in truncate

Wait for all direct I/O requests to finish before performing a truncate.

Signed-off-by: Christoph Hellwig <hch@lst.de>

Index: linux-2.6/fs/btrfs/inode.c
==================================================================---
linux-2.6.orig/fs/btrfs/inode.c	2011-06-11 12:58:46.615017504 +0200
+++ linux-2.6/fs/btrfs/inode.c	2011-06-11 12:59:23.218348984 +0200
@@ -3550,6 +3550,8 @@ static int btrfs_setsize(struct inode *i
 	loff_t oldsize = i_size_read(inode);
 	int ret;
 
+	inode_dio_wait(inode);
+
 	if (newsize == oldsize)
 		return 0;
 

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs"
in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Christoph Hellwig

2011-Jun-20 20:15 UTC

head link

[PATCH 8/8] rw_semaphore: remove up/down_read_non_owner

Now that the last users is gone these can be removed.

Signed-off-by: Christoph Hellwig <hch@lst.de>

Index: linux-2.6/include/linux/rwsem.h
==================================================================---
linux-2.6.orig/include/linux/rwsem.h	2011-06-20 14:58:15.449148809 +0200
+++ linux-2.6/include/linux/rwsem.h	2011-06-20 14:58:30.915814691 +0200
@@ -124,19 +124,9 @@ extern void downgrade_write(struct rw_se
  */
 extern void down_read_nested(struct rw_semaphore *sem, int subclass);
 extern void down_write_nested(struct rw_semaphore *sem, int subclass);
-/*
- * Take/release a lock when not the owner will release it.
- *
- * [ This API should be avoided as much as possible - the
- *   proper abstraction for this case is completions. ]
- */
-extern void down_read_non_owner(struct rw_semaphore *sem);
-extern void up_read_non_owner(struct rw_semaphore *sem);
 #else
 # define down_read_nested(sem, subclass)		down_read(sem)
 # define down_write_nested(sem, subclass)	down_write(sem)
-# define down_read_non_owner(sem)		down_read(sem)
-# define up_read_non_owner(sem)			up_read(sem)
 #endif
 
 #endif /* _LINUX_RWSEM_H */
Index: linux-2.6/kernel/rwsem.c
==================================================================---
linux-2.6.orig/kernel/rwsem.c	2011-06-20 14:58:34.509147842 +0200
+++ linux-2.6/kernel/rwsem.c	2011-06-20 14:58:47.479147187 +0200
@@ -117,15 +117,6 @@ void down_read_nested(struct rw_semaphor
 
 EXPORT_SYMBOL(down_read_nested);
 
-void down_read_non_owner(struct rw_semaphore *sem)
-{
-	might_sleep();
-
-	__down_read(sem);
-}
-
-EXPORT_SYMBOL(down_read_non_owner);
-
 void down_write_nested(struct rw_semaphore *sem, int subclass)
 {
 	might_sleep();
@@ -136,13 +127,6 @@ void down_write_nested(struct rw_semapho
 
 EXPORT_SYMBOL(down_write_nested);
 
-void up_read_non_owner(struct rw_semaphore *sem)
-{
-	__up_read(sem);
-}
-
-EXPORT_SYMBOL(up_read_non_owner);
-
 #endif
 
 

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs"
in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Christoph Hellwig

2011-Jun-20 20:32 UTC

head link

Re: [PATCH 0/8] remove i_alloc_sem

On Mon, Jun 20, 2011 at 04:15:33PM -0400, Christoph Hellwig
wrote:> This series removes it in favour of a simpler counter scheme, thus getting
> rid of the rw_semaphore non-owner APIs as requests by Thomas, while at the
> same time shrinking the size of struct inode by 160 bytes on 64-bit
systems.
This should be 160 bits, aka 20 bytes of course, sorry.

--
To unsubscribe from this list: send the line "unsubscribe linux-ext4"
in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Joel Becker

2011-Jun-20 21:29 UTC

head link

Re: [PATCH 6/8] fs: always maintain i_dio_count

On Mon, Jun 20, 2011 at 04:15:39PM -0400, Christoph Hellwig
wrote:> Maintain i_dio_count for all filesystems, not just those using DIO_LOCKING.
> This these filesystems to also protect truncate against direct I/O requests
> by using common code.  Right now the only non-DIO_LOCKING filesystem that
> appears to do so is XFS, which uses an opencoded variant of the i_dio_count
> scheme.
> 
> Behaviour doesn''t change for filesystems never calling
inode_dio_wait,
> which are all that never use DIO_LOCKING.
> 
> For ext4 behaviour changes with the dioread_nonlock option, which previous
> was missing any protection between truncate and direct I/O reads.
> 
> For ocfs2 that handcrafted i_dio_count manipulations are replaced with
> the common code noew available.
	Oh god you''re making the world scary.  Are you guaranteeing that
all allocation changes are locked out by the time we get into
file_aio_write() and file_aio_read()?  This is not obvious to me.

Joel

-- 

"Glory is fleeting, but obscurity is forever."  
         - Napoleon Bonaparte

			http://www.jlbec.org/
			jlbec@evilplan.org
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4"
in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Joel Becker

2011-Jun-20 21:32 UTC

head link

Re: [PATCH 4/8] fs: kill i_alloc_sem

On Mon, Jun 20, 2011 at 04:15:37PM -0400, Christoph Hellwig
wrote:> i_alloc_sem is a rather special rw_semaphore.  It''s the last one
that may
> be released by a non-owner, and it''s write side is always mirrored
by
> real exclusion.  It''s intended use it to wait for all pending
direct I/O
> requests to finish before starting a truncate.
> 
> Replace it with a hand-grown construct:
> 
>  - exclusion for truncates is already guaranteed by i_mutex, so it can
>    simply fall way
>  - the reader side is replaced by an i_dio_count member in struct inode
>    that counts the number of pending direct I/O requests.  Truncate
can''t
>    proceed as long as it''s non-zero
>  - when i_dio_count reaches non-zero we wake up a pending truncate using
>    wake_up_bit on a new bit in i_flags
>  - new references to i_dio_count can''t appear while we are waiting
for
>    it to read zero because the direct I/O count always needs i_mutex
>    (or an equivalent like XFS''s i_iolock) for starting a new
operation.
> 
> This scheme is much simpler, and saves the space of a spinlock_t and a
> struct list_head in struct inode (typically 160 bytes on a non-debug 64-bit
> system).
	Are we guaranteed that all allocation changes are locked out by
i_dio_count>0?  I don''t think we are.  The ocfs2 code very strongly
assumes the state of a file''s allocation when it holds i_alloc_sem.  I
feel like we lose that here. 

Joel

-- 

"I don''t even butter my bread; I consider that cooking."
         - Katherine Cebrian

			http://www.jlbec.org/
			jlbec@evilplan.org
--
To unsubscribe from this list: send the line "unsubscribe
linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Christoph Hellwig

2011-Jun-20 22:18 UTC

head link

Re: [PATCH 4/8] fs: kill i_alloc_sem

On Mon, Jun 20, 2011 at 02:32:03PM -0700, Joel Becker
wrote:> 	Are we guaranteed that all allocation changes are locked out by
> i_dio_count>0?  I don''t think we are.  The ocfs2 code very
strongly
> assumes the state of a file''s allocation when it holds
i_alloc_sem.  I
> feel like we lose that here. 
You aren''t, neither with the old i_alloc_sem code, nor with the 1:1
replacement using i_dio_count.

Do a quick grep who gets i_alloc_sem exclusively (down_write): it''s
really just the truncate code, and it''s cut & paste duplicates in
ntfs
and reiserfs.

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs"
in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Christoph Hellwig

2011-Jun-20 22:23 UTC

head link

Re: [PATCH 6/8] fs: always maintain i_dio_count

On Mon, Jun 20, 2011 at 02:29:24PM -0700, Joel Becker
wrote:> 	Oh god you''re making the world scary.  Are you guaranteeing that
> all allocation changes are locked out by the time we get into
> file_aio_write() and file_aio_read()?  This is not obvious to me.
I have no idea how ocfs2''s internal allocator locking works, but this
patch doesn''t change it.  What this patch touches is exclusion between
truncate and pending direct I/O requests, and even there only the
implementation and not the semantics.

The old and new semantics are that you may have either

	1 ongoing truncate

OR

	n (>= 0; <= ATOMIC_T_MAX) ongoing direct I/O reads or writes

before that was enforced using the i_alloc_sem rw_semaphore, including
non-owner releases of it from AIO code, in the new code it''s done using
a combination of i_mutex which was already taken in the truncate path,
and when starting new direct I/O requests, and the new i_dio_count
counter.
> 
> Joel
> 
> -- 
> 
> "Glory is fleeting, but obscurity is forever."  
>          - Napoleon Bonaparte
> 
> 			http://www.jlbec.org/
> 			jlbec@evilplan.org---end quoted text---
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4"
in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Dave Chinner

2011-Jun-21 05:40 UTC

head link

Re: [PATCH 4/8] fs: kill i_alloc_sem

On Mon, Jun 20, 2011 at 04:15:37PM -0400, Christoph Hellwig
wrote:> i_alloc_sem is a rather special rw_semaphore.  It''s the last one
that may
> be released by a non-owner, and it''s write side is always mirrored
by
> real exclusion.  It''s intended use it to wait for all pending
direct I/O
> requests to finish before starting a truncate.
> 
> Replace it with a hand-grown construct:
> 
>  - exclusion for truncates is already guaranteed by i_mutex, so it can
>    simply fall way
>  - the reader side is replaced by an i_dio_count member in struct inode
>    that counts the number of pending direct I/O requests.  Truncate
can''t
>    proceed as long as it''s non-zero
>  - when i_dio_count reaches non-zero we wake up a pending truncate using
>    wake_up_bit on a new bit in i_flags
>  - new references to i_dio_count can''t appear while we are waiting
for
>    it to read zero because the direct I/O count always needs i_mutex
>    (or an equivalent like XFS''s i_iolock) for starting a new
operation.
> 
> This scheme is much simpler, and saves the space of a spinlock_t and a
> struct list_head in struct inode (typically 160 bytes on a non-debug 64-bit
> system).
> 
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> 
> Index: linux-2.6/fs/direct-io.c
> ==================================================================> ---
linux-2.6.orig/fs/direct-io.c	2011-06-20 14:55:31.000000000 +0200
> +++ linux-2.6/fs/direct-io.c	2011-06-20 14:55:34.602490284 +0200
> @@ -136,6 +136,27 @@ struct dio {
>  };
>  
>  /*
> + * Wait for outstanding DIO requests to finish.  Must be locked against
> + * increments of i_dio_count by i_mutex.
> + */
> +void inode_dio_wait(struct inode *inode)
> +{
> +	might_sleep();
> +	while (atomic_read(&inode->i_dio_count)) {
> +		wait_on_bit(&inode->i_state, __I_DIO_WAKEUP, inode_wait,
> +			    TASK_UNINTERRUPTIBLE);
> +	}
> +}
> +EXPORT_SYMBOL_GPL(inode_dio_wait);
> +
> +void inode_dio_wake(struct inode *inode)
> +{
> +	if (atomic_dec_and_test(&inode->i_dio_count))
> +		wake_up_bit(&inode->i_state, __I_DIO_WAKEUP);
> +}
> +EXPORT_SYMBOL_GPL(inode_dio_wake);
Modification of inode->i_state is not safe outside the
inode->i_lock.

This probably needs to be implemented similar to the
__I_NEW/__wait_on_freeing_inode() and
__I_SYNC/inode_wait_for_writeback() pattern...

Cheers,

Dave.

-- 
Dave Chinner
david@fromorbit.com
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4"
in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Christoph Hellwig

2011-Jun-21 09:35 UTC

head link

Re: [PATCH 4/8] fs: kill i_alloc_sem

On Tue, Jun 21, 2011 at 03:40:56PM +1000, Dave Chinner
wrote:> Modification of inode->i_state is not safe outside the
> inode->i_lock.
We never actually set the new bit in i_state, we just use it as a key
for the hashed lookups.  Or rather we try to, as I misunderstood how
wait_on_bit works, so currently we busywait for i_dio_count to reach
zero.  I''ll respin a version that actually works as expected.

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs"
in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

OGAWA Hirofumi

2011-Jun-21 15:57 UTC

head link

Re: [PATCH 1/8] far: remove i_alloc_sem abuse

Christoph Hellwig <hch@infradead.org> writes:
> Add a new rw_semaphore to protect bmap against truncate.  Previous
> i_alloc_sem was abused for this, but it''s going away in this
series.
In FAT case, ->i_mutex was better. But, last time I saw, shmfs was using
->i_mutex to call ->bmap. So, this was chosen instead.

I''m not checking current version yet though, if shmfs had change, we
can
use ->i_mutex.

BTW, this patch looks good to me.

Thanks.
> Signed-off-by: Christoph Hellwig <hch@lst.de>
>
> Index: linux-2.6/fs/fat/inode.c
> ==================================================================> ---
linux-2.6.orig/fs/fat/inode.c	2011-06-20 21:28:19.707963855 +0200
> +++ linux-2.6/fs/fat/inode.c	2011-06-20 21:29:25.031293882 +0200
> @@ -224,9 +224,9 @@ static sector_t _fat_bmap(struct address
>  	sector_t blocknr;
>  
>  	/* fat_get_cluster() assumes the requested blocknr isn''t
truncated. */
> -	down_read(&mapping->host->i_alloc_sem);
> +	down_read(&MSDOS_I(mapping->host)->truncate_lock);
>  	blocknr = generic_block_bmap(mapping, block, fat_get_block);
> -	up_read(&mapping->host->i_alloc_sem);
> +	up_read(&MSDOS_I(mapping->host)->truncate_lock);
>  
>  	return blocknr;
>  }
> @@ -510,6 +510,8 @@ static struct inode *fat_alloc_inode(str
>  	ei = kmem_cache_alloc(fat_inode_cachep, GFP_NOFS);
>  	if (!ei)
>  		return NULL;
> +
> +	init_rwsem(&ei->truncate_lock);
>  	return &ei->vfs_inode;
>  }
>  
> Index: linux-2.6/fs/fat/fat.h
> ==================================================================> ---
linux-2.6.orig/fs/fat/fat.h	2011-06-20 21:28:19.724630522 +0200
> +++ linux-2.6/fs/fat/fat.h	2011-06-20 21:29:25.034627215 +0200
> @@ -109,6 +109,7 @@ struct msdos_inode_info {
>  	int i_attrs;		/* unused attribute bits */
>  	loff_t i_pos;		/* on-disk position of directory entry or 0 */
>  	struct hlist_node i_fat_hash;	/* hash by i_location */
> +	struct rw_semaphore truncate_lock; /* protect bmap against truncate */
>  	struct inode vfs_inode;
>  };
>  
> Index: linux-2.6/fs/fat/file.c
> ==================================================================> ---
linux-2.6.orig/fs/fat/file.c	2011-06-20 21:28:19.744630521 +0200
> +++ linux-2.6/fs/fat/file.c	2011-06-20 21:29:54.501292390 +0200
> @@ -429,8 +429,10 @@ int fat_setattr(struct dentry *dentry, s
>  	}
>  
>  	if (attr->ia_valid & ATTR_SIZE) {
> +		down_write(&MSDOS_I(inode)->truncate_lock);
>  		truncate_setsize(inode, attr->ia_size);
>  		fat_truncate_blocks(inode, attr->ia_size);
> +		up_write(&MSDOS_I(inode)->truncate_lock);
>  	}
>  
>  	setattr_copy(inode, attr);
>
-- 
OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs"
in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

OGAWA Hirofumi

2011-Jun-21 16:09 UTC

head link

Re: [PATCH 1/8] far: remove i_alloc_sem abuse

OGAWA Hirofumi <hirofumi@mail.parknet.co.jp> writes:
> Christoph Hellwig <hch@infradead.org> writes:
>
>> Add a new rw_semaphore to protect bmap against truncate.  Previous
>> i_alloc_sem was abused for this, but it''s going away in this
series.
>
> In FAT case, ->i_mutex was better. But, last time I saw, shmfs was using
> ->i_mutex to call ->bmap. So, this was chosen instead.
>
> I''m not checking current version yet though, if shmfs had change,
we can
> use ->i_mutex.
It was swapfile. And unfortunately it doesn''t have change.
> BTW, this patch looks good to me.-- 
OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
--
To unsubscribe from this list: send the line "unsubscribe
linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Christoph Hellwig

2011-Jun-21 16:09 UTC

head link

Re: [PATCH 1/8] far: remove i_alloc_sem abuse

On Wed, Jun 22, 2011 at 12:57:43AM +0900, OGAWA Hirofumi
wrote:> Christoph Hellwig <hch@infradead.org> writes:
> 
> > Add a new rw_semaphore to protect bmap against truncate.  Previous
> > i_alloc_sem was abused for this, but it''s going away in this
series.
> 
> In FAT case, ->i_mutex was better. But, last time I saw, shmfs was using
> ->i_mutex to call ->bmap. So, this was chosen instead.
> 
> I''m not checking current version yet though, if shmfs had change,
we can
> use ->i_mutex.
I tried that first, but it gives an instant deadlock when using a
swapfile on fat.

--
To unsubscribe from this list: send the line "unsubscribe
linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Lukas Czerner

2011-Jun-21 16:34 UTC

head link

Re: [PATCH 2/8] ext4: remove i_alloc_sem abuse

On Mon, 20 Jun 2011, Christoph Hellwig wrote:
> Add a new rw_semaphore to protect page_mkwrite against truncate.  Previous
> i_alloc_sem was abused for this, but it''s going away in this
series.
> 
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> 
> Index: linux-2.6/fs/ext4/inode.c
> ==================================================================> ---
linux-2.6.orig/fs/ext4/inode.c	2011-06-20 14:25:33.779248128 +0200
> +++ linux-2.6/fs/ext4/inode.c	2011-06-20 14:27:53.025907745 +0200
> @@ -5357,6 +5357,7 @@ int ext4_setattr(struct dentry *dentry,
>  			if (attr->ia_size > sbi->s_bitmap_maxbytes)
>  				return -EFBIG;
>  		}
> +		down_write(&(EXT4_I(inode)->truncate_lock));
>  	}
>  
>  	if (S_ISREG(inode->i_mode) &&
> @@ -5405,6 +5406,9 @@ int ext4_setattr(struct dentry *dentry,
>  			ext4_truncate(inode);
>  	}
>  
> +	if (attr->ia_valid & ATTR_SIZE)
> +		up_write(&(EXT4_I(inode)->truncate_lock));
> +
Hi Christoph,

this looks like a bug fix, so we should rather do it in a separate
commit. Could you send it separately, or at least mention that in commit
description ?

Otherwise the patch looks good.

Thanks!
-Lukas
>  	if (!rc) {
>  		setattr_copy(inode, attr);
>  		mark_inode_dirty(inode);
> @@ -5850,10 +5854,10 @@ int ext4_page_mkwrite(struct vm_area_str
>  	struct address_space *mapping = inode->i_mapping;
>  
>  	/*
> -	 * Get i_alloc_sem to stop truncates messing with the inode. We cannot
> -	 * get i_mutex because we are already holding mmap_sem.
> +	 * Get truncate_lock to stop truncates messing with the inode. We
> +	 * cannot get i_mutex because we are already holding mmap_sem.
>  	 */
> -	down_read(&inode->i_alloc_sem);
> +	down_read(&(EXT4_I(inode)->truncate_lock));
>  	size = i_size_read(inode);
>  	if (page->mapping != mapping || size <= page_offset(page)
>  	    || !PageUptodate(page)) {
> @@ -5865,7 +5869,7 @@ int ext4_page_mkwrite(struct vm_area_str
>  	lock_page(page);
>  	wait_on_page_writeback(page);
>  	if (PageMappedToDisk(page)) {
> -		up_read(&inode->i_alloc_sem);
> +		up_read(&(EXT4_I(inode)->truncate_lock));
>  		return VM_FAULT_LOCKED;
>  	}
>  
> @@ -5883,7 +5887,7 @@ int ext4_page_mkwrite(struct vm_area_str
>  	if (page_has_buffers(page)) {
>  		if (!walk_page_buffers(NULL, page_buffers(page), 0, len, NULL,
>  					ext4_bh_unmapped)) {
> -			up_read(&inode->i_alloc_sem);
> +			up_read(&(EXT4_I(inode)->truncate_lock));
>  			return VM_FAULT_LOCKED;
>  		}
>  	}
> @@ -5912,11 +5916,11 @@ int ext4_page_mkwrite(struct vm_area_str
>  	 */
>  	lock_page(page);
>  	wait_on_page_writeback(page);
> -	up_read(&inode->i_alloc_sem);
> +	up_read(&(EXT4_I(inode)->truncate_lock));
>  	return VM_FAULT_LOCKED;
>  out_unlock:
>  	if (ret)
>  		ret = VM_FAULT_SIGBUS;
> -	up_read(&inode->i_alloc_sem);
> +	up_read(&(EXT4_I(inode)->truncate_lock));
>  	return ret;
>  }
> Index: linux-2.6/fs/ext4/super.c
> ==================================================================> ---
linux-2.6.orig/fs/ext4/super.c	2011-06-20 14:25:33.795914793 +0200
> +++ linux-2.6/fs/ext4/super.c	2011-06-20 14:27:06.269243443 +0200
> @@ -871,6 +871,7 @@ static struct inode *ext4_alloc_inode(st
>  	ei->i_datasync_tid = 0;
>  	atomic_set(&ei->i_ioend_count, 0);
>  	atomic_set(&ei->i_aiodio_unwritten, 0);
> +	init_rwsem(&ei->truncate_lock);
>  
>  	return &ei->vfs_inode;
>  }
> Index: linux-2.6/fs/ext4/ext4.h
> ==================================================================> ---
linux-2.6.orig/fs/ext4/ext4.h	2011-06-20 14:25:33.752581461 +0200
> +++ linux-2.6/fs/ext4/ext4.h	2011-06-20 14:27:06.272576777 +0200
> @@ -844,6 +844,9 @@ struct ext4_inode_info {
>  	/* on-disk additional length */
>  	__u16 i_extra_isize;
>  
> +	/* protect page_mkwrite against truncates */
> +	struct rw_semaphore truncate_lock;
> +
>  #ifdef CONFIG_QUOTA
>  	/* quota space reservation, managed internally by quota code */
>  	qsize_t i_reserved_quota;
> 
> --
> To unsubscribe from this list: send the line "unsubscribe
linux-ext4" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
-- 
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4"
in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Lukas Czerner

2011-Jun-21 16:48 UTC

head link

Re: [PATCH 2/8] ext4: remove i_alloc_sem abuse

On Tue, 21 Jun 2011, Lukas Czerner wrote:
> On Mon, 20 Jun 2011, Christoph Hellwig wrote:
> 
> > Add a new rw_semaphore to protect page_mkwrite against truncate. 
Previous
> > i_alloc_sem was abused for this, but it''s going away in this
series.
> > 
> > Signed-off-by: Christoph Hellwig <hch@lst.de>
> > 
> > Index: linux-2.6/fs/ext4/inode.c
> > ==================================================================>
> --- linux-2.6.orig/fs/ext4/inode.c	2011-06-20 14:25:33.779248128 +0200
> > +++ linux-2.6/fs/ext4/inode.c	2011-06-20 14:27:53.025907745 +0200
> > @@ -5357,6 +5357,7 @@ int ext4_setattr(struct dentry *dentry,
> >  			if (attr->ia_size > sbi->s_bitmap_maxbytes)
> >  				return -EFBIG;
> >  		}
> > +		down_write(&(EXT4_I(inode)->truncate_lock));
> >  	}
> >  
> >  	if (S_ISREG(inode->i_mode) &&
> > @@ -5405,6 +5406,9 @@ int ext4_setattr(struct dentry *dentry,
> >  			ext4_truncate(inode);
> >  	}
> >  
> > +	if (attr->ia_valid & ATTR_SIZE)
> > +		up_write(&(EXT4_I(inode)->truncate_lock));
> > +
> 
> Hi Christoph,
> 
> this looks like a bug fix, so we should rather do it in a separate
> commit. Could you send it separately, or at least mention that in commit
> description ?
That''s stupid note, it is not a bugfix of course. Ignore it, sorry for
the noise.
> 
> Otherwise the patch looks good.
> 
> Thanks!
> -Lukas
> 
> >  	if (!rc) {
> >  		setattr_copy(inode, attr);
> >  		mark_inode_dirty(inode);
> > @@ -5850,10 +5854,10 @@ int ext4_page_mkwrite(struct vm_area_str
> >  	struct address_space *mapping = inode->i_mapping;
> >  
> >  	/*
> > -	 * Get i_alloc_sem to stop truncates messing with the inode. We
cannot
> > -	 * get i_mutex because we are already holding mmap_sem.
> > +	 * Get truncate_lock to stop truncates messing with the inode. We
> > +	 * cannot get i_mutex because we are already holding mmap_sem.
> >  	 */
> > -	down_read(&inode->i_alloc_sem);
> > +	down_read(&(EXT4_I(inode)->truncate_lock));
> >  	size = i_size_read(inode);
> >  	if (page->mapping != mapping || size <= page_offset(page)
> >  	    || !PageUptodate(page)) {
> > @@ -5865,7 +5869,7 @@ int ext4_page_mkwrite(struct vm_area_str
> >  	lock_page(page);
> >  	wait_on_page_writeback(page);
> >  	if (PageMappedToDisk(page)) {
> > -		up_read(&inode->i_alloc_sem);
> > +		up_read(&(EXT4_I(inode)->truncate_lock));
> >  		return VM_FAULT_LOCKED;
> >  	}
> >  
> > @@ -5883,7 +5887,7 @@ int ext4_page_mkwrite(struct vm_area_str
> >  	if (page_has_buffers(page)) {
> >  		if (!walk_page_buffers(NULL, page_buffers(page), 0, len, NULL,
> >  					ext4_bh_unmapped)) {
> > -			up_read(&inode->i_alloc_sem);
> > +			up_read(&(EXT4_I(inode)->truncate_lock));
> >  			return VM_FAULT_LOCKED;
> >  		}
> >  	}
> > @@ -5912,11 +5916,11 @@ int ext4_page_mkwrite(struct vm_area_str
> >  	 */
> >  	lock_page(page);
> >  	wait_on_page_writeback(page);
> > -	up_read(&inode->i_alloc_sem);
> > +	up_read(&(EXT4_I(inode)->truncate_lock));
> >  	return VM_FAULT_LOCKED;
> >  out_unlock:
> >  	if (ret)
> >  		ret = VM_FAULT_SIGBUS;
> > -	up_read(&inode->i_alloc_sem);
> > +	up_read(&(EXT4_I(inode)->truncate_lock));
> >  	return ret;
> >  }
> > Index: linux-2.6/fs/ext4/super.c
> > ==================================================================>
> --- linux-2.6.orig/fs/ext4/super.c	2011-06-20 14:25:33.795914793 +0200
> > +++ linux-2.6/fs/ext4/super.c	2011-06-20 14:27:06.269243443 +0200
> > @@ -871,6 +871,7 @@ static struct inode *ext4_alloc_inode(st
> >  	ei->i_datasync_tid = 0;
> >  	atomic_set(&ei->i_ioend_count, 0);
> >  	atomic_set(&ei->i_aiodio_unwritten, 0);
> > +	init_rwsem(&ei->truncate_lock);
> >  
> >  	return &ei->vfs_inode;
> >  }
> > Index: linux-2.6/fs/ext4/ext4.h
> > ==================================================================>
> --- linux-2.6.orig/fs/ext4/ext4.h	2011-06-20 14:25:33.752581461 +0200
> > +++ linux-2.6/fs/ext4/ext4.h	2011-06-20 14:27:06.272576777 +0200
> > @@ -844,6 +844,9 @@ struct ext4_inode_info {
> >  	/* on-disk additional length */
> >  	__u16 i_extra_isize;
> >  
> > +	/* protect page_mkwrite against truncates */
> > +	struct rw_semaphore truncate_lock;
> > +
> >  #ifdef CONFIG_QUOTA
> >  	/* quota space reservation, managed internally by quota code */
> >  	qsize_t i_reserved_quota;
> > 
> > --
> > To unsubscribe from this list: send the line "unsubscribe
linux-ext4" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> > 
> 
> 
-- 
--
To unsubscribe from this list: send the line "unsubscribe
linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Christoph Hellwig

2011-Jun-21 17:16 UTC

head link

Re: [PATCH 2/8] ext4: remove i_alloc_sem abuse

On Tue, Jun 21, 2011 at 06:34:50PM +0200, Lukas Czerner
wrote:> this looks like a bug fix, so we should rather do it in a separate
> commit. Could you send it separately, or at least mention that in commit
> description ?
What looks like a bugfix?

--
To unsubscribe from this list: send the line "unsubscribe
linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Jan Kara

2011-Jun-21 23:54 UTC

head link

Re: [PATCH 0/8] remove i_alloc_sem

On Mon 20-06-11 16:15:33, Christoph Hellwig wrote:> i_alloc_sem has always been a bit of an odd "lock". 
It''s the only remaining
> rw_semaphore that can be released by a different thread than the one that
> locked it, and it''s use case in the core direct I/O code is more
like a
> counter given that the writers already have external serialization.
> 
> This series removes it in favour of a simpler counter scheme, thus getting
> rid of the rw_semaphore non-owner APIs as requests by Thomas, while at the
> same time shrinking the size of struct inode by 160 bytes on 64-bit
systems.
> 
> The only nasty bit is that two filesystems (fat and ext4) have started
> abusing the lock for their own purposes.  I''ve added a new
rw_semaphore  ext4 abuse should be gone when Ted merges my rewrite of
ext4_page_mkwrite()... Ted, what happened to that patch. Should I resend
it?

								Honza

-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4"
in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Christoph Hellwig

2011-Jun-22 09:39 UTC

head link

Re: [PATCH 0/8] remove i_alloc_sem

On Wed, Jun 22, 2011 at 01:54:25AM +0200, Jan Kara
wrote:>   ext4 abuse should be gone when Ted merges my rewrite of
> ext4_page_mkwrite()... Ted, what happened to that patch. Should I resend
> it?
So how should we coordinate merging the two? 

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs"
in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Ted Ts''o

2011-Jun-22 14:22 UTC

head link

Re: [PATCH 0/8] remove i_alloc_sem

On Wed, Jun 22, 2011 at 01:54:25AM +0200, Jan Kara
wrote:> ext4_page_mkwrite()... Ted, what happened to that patch. Should I resend
> it?
So assuming I fix the refcounting issue in fs/ext4/page_io.c (which I
will do not dropping the page''s refcount until after the workqueue
finishes its job), does your patch need changing, or is it a matter of
fixing the commit description?

In any case, this dragged got out, and it''s late in -rc cycle for 3.0,
so I was planning on queuing this for the next merge window.  (Sorry
for that, this was mostly due to my not having enough time to really
dive into the issues involved.)

							- Ted
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4"
in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Jan Kara

2011-Jun-22 18:13 UTC

head link

Re: [PATCH 0/8] remove i_alloc_sem

On Wed 22-06-11 10:22:35, Ted Tso wrote:> On Wed, Jun 22, 2011 at 01:54:25AM +0200, Jan Kara wrote:
> > ext4_page_mkwrite()... Ted, what happened to that patch. Should I
resend
> > it?
> 
> So assuming I fix the refcounting issue in fs/ext4/page_io.c (which I
> will do not dropping the page''s refcount until after the workqueue
> finishes its job), does your patch need changing, or is it a matter of
> fixing the commit description?  Looking into patchwork (http://patchwork.ozlabs.org/patch/97924/) I see
the discussion of two issues (page refcounting and page_mkwrite got mixed).
I see the patch just needs fixing commit description because it''s no
longer
accurate after "stable page under writeback" changes went it.
I''ll send you
a patch with updated changelog in a minute.
> In any case, this dragged got out, and it''s late in -rc cycle for
3.0,
> so I was planning on queuing this for the next merge window.  (Sorry
> for that, this was mostly due to my not having enough time to really
> dive into the issues involved.)  No problem. Just we have to somehow coordinate with Christoph... Either
he can avoid touching ext4 and merge his patch set after you merge my patch
or he can take my patch instead of his ext4 change. Since my patch touches
only ext4_page_mkwrite() there''s not a high risk of collision. The
latter
would seem easier for both of you to me but I don''t really care.

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR
--
To unsubscribe from this list: send the line "unsubscribe
linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Christoph Hellwig

2011-Jun-23 10:36 UTC

head link

Re: [PATCH 0/8] remove i_alloc_sem

On Wed, Jun 22, 2011 at 08:13:42PM +0200, Jan Kara
wrote:>   No problem. Just we have to somehow coordinate with Christoph... Either
> he can avoid touching ext4 and merge his patch set after you merge my patch
> or he can take my patch instead of his ext4 change. Since my patch touches
> only ext4_page_mkwrite() there''s not a high risk of collision. The
latter
> would seem easier for both of you to me but I don''t really care.
For now I''ll just take your repost into the next version of my series.
If you guys have any better idea for sorting out the merge issues I''m
happy to reshuffle it again.

--
To unsubscribe from this list: send the line "unsubscribe
linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Joel Becker

2011-Jul-01 02:58 UTC

head link

Re: [PATCH 4/8] fs: kill i_alloc_sem

On Mon, Jun 20, 2011 at 06:18:57PM -0400, Christoph Hellwig
wrote:> On Mon, Jun 20, 2011 at 02:32:03PM -0700, Joel Becker wrote:
> > 	Are we guaranteed that all allocation changes are locked out by
> > i_dio_count>0?  I don''t think we are.  The ocfs2 code very
strongly
> > assumes the state of a file''s allocation when it holds
i_alloc_sem.  I
> > feel like we lose that here. 
> 
> You aren''t, neither with the old i_alloc_sem code, nor with the
1:1
> replacement using i_dio_count.
> 
> Do a quick grep who gets i_alloc_sem exclusively (down_write):
it''s
> really just the truncate code, and it''s cut & paste duplicates
in ntfs
> and reiserfs.
	Sorry, I confused this with our ip_alloc_sem.  I was tired.

Joel

-- 

Life''s Little Instruction Book #24

	"Drink champagne for no reason at all."

			http://www.jlbec.org/
			jlbec@evilplan.org
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs"
in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Btrfs devel - Jun 2011 - [PATCH 0/8] remove i_alloc_sem

[PATCH 0/8] remove i_alloc_sem

[PATCH 1/8] far: remove i_alloc_sem abuse

[PATCH 2/8] ext4: remove i_alloc_sem abuse

[PATCH 3/8] fs: simpler handling of zero sized reads in __blockdev_direct_IO

[PATCH 4/8] fs: kill i_alloc_sem

[PATCH 5/8] fs: move inode_dio_wait calls into ->setattr

[PATCH 6/8] fs: always maintain i_dio_count

[PATCH 7/8] btrfs: wait for direct I/O requests in truncate

[PATCH 8/8] rw_semaphore: remove up/down_read_non_owner

Re: [PATCH 0/8] remove i_alloc_sem

Re: [PATCH 6/8] fs: always maintain i_dio_count

Re: [PATCH 4/8] fs: kill i_alloc_sem

Re: [PATCH 4/8] fs: kill i_alloc_sem

Re: [PATCH 6/8] fs: always maintain i_dio_count

Re: [PATCH 4/8] fs: kill i_alloc_sem

Re: [PATCH 4/8] fs: kill i_alloc_sem

Re: [PATCH 1/8] far: remove i_alloc_sem abuse

Re: [PATCH 1/8] far: remove i_alloc_sem abuse

Re: [PATCH 1/8] far: remove i_alloc_sem abuse

Re: [PATCH 2/8] ext4: remove i_alloc_sem abuse

Re: [PATCH 2/8] ext4: remove i_alloc_sem abuse

Re: [PATCH 2/8] ext4: remove i_alloc_sem abuse

Re: [PATCH 0/8] remove i_alloc_sem

Re: [PATCH 0/8] remove i_alloc_sem

Re: [PATCH 0/8] remove i_alloc_sem

Re: [PATCH 0/8] remove i_alloc_sem

Re: [PATCH 0/8] remove i_alloc_sem

Re: [PATCH 4/8] fs: kill i_alloc_sem