i_alloc_sem has always been a bit of an odd "lock". It''s the only remaining rw_semaphore that can be released by a different thread than the one that locked it, and it''s use case in the core direct I/O code is more like a counter given that the writers already have external serialization. This series removes it in favour of a simpler counter scheme, thus getting rid of the rw_semaphore non-owner APIs as requests by Thomas, while at the same time shrinking the size of struct inode by 160 bytes on 64-bit systems. The only nasty bit is that two filesystems (fat and ext4) have started abusing the lock for their own purposes. I''ve added a new rw_semaphore to their filesystem-specific inode structures to keep the current behaviour, but I suspect the maintainers might have smarted ideas to archive the same goal. -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Add a new rw_semaphore to protect bmap against truncate. Previous i_alloc_sem was abused for this, but it''s going away in this series. Signed-off-by: Christoph Hellwig <hch@lst.de> Index: linux-2.6/fs/fat/inode.c ==================================================================--- linux-2.6.orig/fs/fat/inode.c 2011-06-20 21:28:19.707963855 +0200 +++ linux-2.6/fs/fat/inode.c 2011-06-20 21:29:25.031293882 +0200 @@ -224,9 +224,9 @@ static sector_t _fat_bmap(struct address sector_t blocknr; /* fat_get_cluster() assumes the requested blocknr isn''t truncated. */ - down_read(&mapping->host->i_alloc_sem); + down_read(&MSDOS_I(mapping->host)->truncate_lock); blocknr = generic_block_bmap(mapping, block, fat_get_block); - up_read(&mapping->host->i_alloc_sem); + up_read(&MSDOS_I(mapping->host)->truncate_lock); return blocknr; } @@ -510,6 +510,8 @@ static struct inode *fat_alloc_inode(str ei = kmem_cache_alloc(fat_inode_cachep, GFP_NOFS); if (!ei) return NULL; + + init_rwsem(&ei->truncate_lock); return &ei->vfs_inode; } Index: linux-2.6/fs/fat/fat.h ==================================================================--- linux-2.6.orig/fs/fat/fat.h 2011-06-20 21:28:19.724630522 +0200 +++ linux-2.6/fs/fat/fat.h 2011-06-20 21:29:25.034627215 +0200 @@ -109,6 +109,7 @@ struct msdos_inode_info { int i_attrs; /* unused attribute bits */ loff_t i_pos; /* on-disk position of directory entry or 0 */ struct hlist_node i_fat_hash; /* hash by i_location */ + struct rw_semaphore truncate_lock; /* protect bmap against truncate */ struct inode vfs_inode; }; Index: linux-2.6/fs/fat/file.c ==================================================================--- linux-2.6.orig/fs/fat/file.c 2011-06-20 21:28:19.744630521 +0200 +++ linux-2.6/fs/fat/file.c 2011-06-20 21:29:54.501292390 +0200 @@ -429,8 +429,10 @@ int fat_setattr(struct dentry *dentry, s } if (attr->ia_valid & ATTR_SIZE) { + down_write(&MSDOS_I(inode)->truncate_lock); truncate_setsize(inode, attr->ia_size); fat_truncate_blocks(inode, attr->ia_size); + up_write(&MSDOS_I(inode)->truncate_lock); } setattr_copy(inode, attr); -- To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Add a new rw_semaphore to protect page_mkwrite against truncate. Previous i_alloc_sem was abused for this, but it''s going away in this series. Signed-off-by: Christoph Hellwig <hch@lst.de> Index: linux-2.6/fs/ext4/inode.c ==================================================================--- linux-2.6.orig/fs/ext4/inode.c 2011-06-20 14:25:33.779248128 +0200 +++ linux-2.6/fs/ext4/inode.c 2011-06-20 14:27:53.025907745 +0200 @@ -5357,6 +5357,7 @@ int ext4_setattr(struct dentry *dentry, if (attr->ia_size > sbi->s_bitmap_maxbytes) return -EFBIG; } + down_write(&(EXT4_I(inode)->truncate_lock)); } if (S_ISREG(inode->i_mode) && @@ -5405,6 +5406,9 @@ int ext4_setattr(struct dentry *dentry, ext4_truncate(inode); } + if (attr->ia_valid & ATTR_SIZE) + up_write(&(EXT4_I(inode)->truncate_lock)); + if (!rc) { setattr_copy(inode, attr); mark_inode_dirty(inode); @@ -5850,10 +5854,10 @@ int ext4_page_mkwrite(struct vm_area_str struct address_space *mapping = inode->i_mapping; /* - * Get i_alloc_sem to stop truncates messing with the inode. We cannot - * get i_mutex because we are already holding mmap_sem. + * Get truncate_lock to stop truncates messing with the inode. We + * cannot get i_mutex because we are already holding mmap_sem. */ - down_read(&inode->i_alloc_sem); + down_read(&(EXT4_I(inode)->truncate_lock)); size = i_size_read(inode); if (page->mapping != mapping || size <= page_offset(page) || !PageUptodate(page)) { @@ -5865,7 +5869,7 @@ int ext4_page_mkwrite(struct vm_area_str lock_page(page); wait_on_page_writeback(page); if (PageMappedToDisk(page)) { - up_read(&inode->i_alloc_sem); + up_read(&(EXT4_I(inode)->truncate_lock)); return VM_FAULT_LOCKED; } @@ -5883,7 +5887,7 @@ int ext4_page_mkwrite(struct vm_area_str if (page_has_buffers(page)) { if (!walk_page_buffers(NULL, page_buffers(page), 0, len, NULL, ext4_bh_unmapped)) { - up_read(&inode->i_alloc_sem); + up_read(&(EXT4_I(inode)->truncate_lock)); return VM_FAULT_LOCKED; } } @@ -5912,11 +5916,11 @@ int ext4_page_mkwrite(struct vm_area_str */ lock_page(page); wait_on_page_writeback(page); - up_read(&inode->i_alloc_sem); + up_read(&(EXT4_I(inode)->truncate_lock)); return VM_FAULT_LOCKED; out_unlock: if (ret) ret = VM_FAULT_SIGBUS; - up_read(&inode->i_alloc_sem); + up_read(&(EXT4_I(inode)->truncate_lock)); return ret; } Index: linux-2.6/fs/ext4/super.c ==================================================================--- linux-2.6.orig/fs/ext4/super.c 2011-06-20 14:25:33.795914793 +0200 +++ linux-2.6/fs/ext4/super.c 2011-06-20 14:27:06.269243443 +0200 @@ -871,6 +871,7 @@ static struct inode *ext4_alloc_inode(st ei->i_datasync_tid = 0; atomic_set(&ei->i_ioend_count, 0); atomic_set(&ei->i_aiodio_unwritten, 0); + init_rwsem(&ei->truncate_lock); return &ei->vfs_inode; } Index: linux-2.6/fs/ext4/ext4.h ==================================================================--- linux-2.6.orig/fs/ext4/ext4.h 2011-06-20 14:25:33.752581461 +0200 +++ linux-2.6/fs/ext4/ext4.h 2011-06-20 14:27:06.272576777 +0200 @@ -844,6 +844,9 @@ struct ext4_inode_info { /* on-disk additional length */ __u16 i_extra_isize; + /* protect page_mkwrite against truncates */ + struct rw_semaphore truncate_lock; + #ifdef CONFIG_QUOTA /* quota space reservation, managed internally by quota code */ qsize_t i_reserved_quota; -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Christoph Hellwig
2011-Jun-20 20:15 UTC
[PATCH 3/8] fs: simpler handling of zero sized reads in __blockdev_direct_IO
Reject zero sized reads as soon as we know our I/O length, and don''t borther with locks or allocations that might have to be cleaned up otherwise. Signed-off-by: Christoph Hellwig <hch@lst.de> Index: linux-2.6/fs/direct-io.c ==================================================================--- linux-2.6.orig/fs/direct-io.c 2011-06-11 12:10:10.205165161 +0200 +++ linux-2.6/fs/direct-io.c 2011-06-11 12:12:49.161823781 +0200 @@ -1200,6 +1200,10 @@ __blockdev_direct_IO(int rw, struct kioc } } + /* watch out for a 0 len io from a tricksy fs */ + if (rw == READ && end == offset) + return 0; + dio = kmalloc(sizeof(*dio), GFP_KERNEL); retval = -ENOMEM; if (!dio) @@ -1213,8 +1217,7 @@ __blockdev_direct_IO(int rw, struct kioc dio->flags = flags; if (dio->flags & DIO_LOCKING) { - /* watch out for a 0 len io from a tricksy fs */ - if (rw == READ && end > offset) { + if (rw == READ) { struct address_space *mapping iocb->ki_filp->f_mapping; -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
i_alloc_sem is a rather special rw_semaphore. It''s the last one that may be released by a non-owner, and it''s write side is always mirrored by real exclusion. It''s intended use it to wait for all pending direct I/O requests to finish before starting a truncate. Replace it with a hand-grown construct: - exclusion for truncates is already guaranteed by i_mutex, so it can simply fall way - the reader side is replaced by an i_dio_count member in struct inode that counts the number of pending direct I/O requests. Truncate can''t proceed as long as it''s non-zero - when i_dio_count reaches non-zero we wake up a pending truncate using wake_up_bit on a new bit in i_flags - new references to i_dio_count can''t appear while we are waiting for it to read zero because the direct I/O count always needs i_mutex (or an equivalent like XFS''s i_iolock) for starting a new operation. This scheme is much simpler, and saves the space of a spinlock_t and a struct list_head in struct inode (typically 160 bytes on a non-debug 64-bit system). Signed-off-by: Christoph Hellwig <hch@lst.de> Index: linux-2.6/fs/direct-io.c ==================================================================--- linux-2.6.orig/fs/direct-io.c 2011-06-20 14:55:31.000000000 +0200 +++ linux-2.6/fs/direct-io.c 2011-06-20 14:55:34.602490284 +0200 @@ -136,6 +136,27 @@ struct dio { }; /* + * Wait for outstanding DIO requests to finish. Must be locked against + * increments of i_dio_count by i_mutex. + */ +void inode_dio_wait(struct inode *inode) +{ + might_sleep(); + while (atomic_read(&inode->i_dio_count)) { + wait_on_bit(&inode->i_state, __I_DIO_WAKEUP, inode_wait, + TASK_UNINTERRUPTIBLE); + } +} +EXPORT_SYMBOL_GPL(inode_dio_wait); + +void inode_dio_wake(struct inode *inode) +{ + if (atomic_dec_and_test(&inode->i_dio_count)) + wake_up_bit(&inode->i_state, __I_DIO_WAKEUP); +} +EXPORT_SYMBOL_GPL(inode_dio_wake); + +/* * How many pages are in the queue? */ static inline unsigned dio_pages_present(struct dio *dio) @@ -254,9 +275,7 @@ static ssize_t dio_complete(struct dio * } if (dio->flags & DIO_LOCKING) - /* lockdep: non-owner release */ - up_read_non_owner(&dio->inode->i_alloc_sem); - + inode_dio_wake(dio->inode); return ret; } @@ -980,9 +999,6 @@ out: return ret; } -/* - * Releases both i_mutex and i_alloc_sem - */ static ssize_t direct_io_worker(int rw, struct kiocb *iocb, struct inode *inode, const struct iovec *iov, loff_t offset, unsigned long nr_segs, @@ -1146,15 +1162,14 @@ direct_io_worker(int rw, struct kiocb *i * For writes this function is called under i_mutex and returns with * i_mutex held, for reads, i_mutex is not held on entry, but it is * taken and dropped again before returning. - * For reads and writes i_alloc_sem is taken in shared mode and released - * on I/O completion (which may happen asynchronously after returning to - * the caller). + * The i_dio_count counter keeps track of the number of outstanding + * direct I/O requests, and truncate waits for it to reach zero. + * New references to i_dio_count must only be grabbed with i_mutex + * held. * * - if the flags value does NOT contain DIO_LOCKING we don''t use any * internal locking but rather rely on the filesystem to synchronize * direct I/O reads/writes versus each other and truncate. - * For reads and writes both i_mutex and i_alloc_sem are not held on - * entry and are never taken. */ ssize_t __blockdev_direct_IO(int rw, struct kiocb *iocb, struct inode *inode, @@ -1234,10 +1249,9 @@ __blockdev_direct_IO(int rw, struct kioc } /* - * Will be released at I/O completion, possibly in a - * different thread. + * Will be decremented at I/O completion time. */ - down_read_non_owner(&inode->i_alloc_sem); + atomic_inc(&inode->i_dio_count); } /* Index: linux-2.6/mm/filemap.c ==================================================================--- linux-2.6.orig/mm/filemap.c 2011-06-20 14:19:27.019266696 +0200 +++ linux-2.6/mm/filemap.c 2011-06-20 14:55:34.605823617 +0200 @@ -78,9 +78,6 @@ * ->i_mutex (generic_file_buffered_write) * ->mmap_sem (fault_in_pages_readable->do_page_fault) * - * ->i_mutex - * ->i_alloc_sem (various) - * * inode_wb_list_lock * sb_lock (fs/fs-writeback.c) * ->mapping->tree_lock (__sync_single_inode) Index: linux-2.6/mm/rmap.c ==================================================================--- linux-2.6.orig/mm/rmap.c 2011-06-20 14:19:27.000000000 +0200 +++ linux-2.6/mm/rmap.c 2011-06-20 14:55:34.605823617 +0200 @@ -21,7 +21,6 @@ * Lock ordering in mm: * * inode->i_mutex (while writing or truncating, not reading or faulting) - * inode->i_alloc_sem (vmtruncate_range) * mm->mmap_sem * page->flags PG_locked (lock_page) * mapping->i_mmap_mutex Index: linux-2.6/fs/attr.c ==================================================================--- linux-2.6.orig/fs/attr.c 2011-06-20 14:19:26.000000000 +0200 +++ linux-2.6/fs/attr.c 2011-06-20 14:55:34.609156951 +0200 @@ -233,16 +233,13 @@ int notify_change(struct dentry * dentry return error; if (ia_valid & ATTR_SIZE) - down_write(&dentry->d_inode->i_alloc_sem); + inode_dio_wait(inode); if (inode->i_op->setattr) error = inode->i_op->setattr(dentry, attr); else error = simple_setattr(dentry, attr); - if (ia_valid & ATTR_SIZE) - up_write(&dentry->d_inode->i_alloc_sem); - if (!error) fsnotify_change(dentry, ia_valid); Index: linux-2.6/fs/ntfs/file.c ==================================================================--- linux-2.6.orig/fs/ntfs/file.c 2011-06-20 14:19:26.000000000 +0200 +++ linux-2.6/fs/ntfs/file.c 2011-06-20 14:55:34.609156951 +0200 @@ -1832,9 +1832,8 @@ static ssize_t ntfs_file_buffered_write( * fails again. */ if (unlikely(NInoTruncateFailed(ni))) { - down_write(&vi->i_alloc_sem); + inode_dio_wait(vi); err = ntfs_truncate(vi); - up_write(&vi->i_alloc_sem); if (err || NInoTruncateFailed(ni)) { if (!err) err = -EIO; Index: linux-2.6/fs/reiserfs/xattr.c ==================================================================--- linux-2.6.orig/fs/reiserfs/xattr.c 2011-06-20 14:19:26.000000000 +0200 +++ linux-2.6/fs/reiserfs/xattr.c 2011-06-20 14:55:34.612490285 +0200 @@ -555,11 +555,10 @@ reiserfs_xattr_set_handle(struct reiserf reiserfs_write_unlock(inode->i_sb); mutex_lock_nested(&dentry->d_inode->i_mutex, I_MUTEX_XATTR); - down_write(&dentry->d_inode->i_alloc_sem); + inode_dio_wait(dentry->d_inode); reiserfs_write_lock(inode->i_sb); err = reiserfs_setattr(dentry, &newattrs); - up_write(&dentry->d_inode->i_alloc_sem); mutex_unlock(&dentry->d_inode->i_mutex); } else update_ctime(inode); Index: linux-2.6/include/linux/fs.h ==================================================================--- linux-2.6.orig/include/linux/fs.h 2011-06-20 14:19:27.000000000 +0200 +++ linux-2.6/include/linux/fs.h 2011-06-20 14:55:34.615823619 +0200 @@ -776,7 +776,7 @@ struct inode { struct timespec i_ctime; blkcnt_t i_blocks; unsigned short i_bytes; - struct rw_semaphore i_alloc_sem; + atomic_t i_dio_count; const struct file_operations *i_fop; /* former ->i_op->default_file_ops */ struct file_lock *i_flock; struct address_space *i_mapping; @@ -1692,6 +1692,10 @@ struct super_operations { * set during data writeback, and cleared with a wakeup * on the bit address once it is done. * + * I_REFERENCED Marks the inode as recently references on the LRU list. + * + * I_DIO_WAKEUP Never set. Only used as a key for wait_on_bit(). + * * Q: What is the difference between I_WILL_FREE and I_FREEING? */ #define I_DIRTY_SYNC (1 << 0) @@ -1705,6 +1709,8 @@ struct super_operations { #define __I_SYNC 7 #define I_SYNC (1 << __I_SYNC) #define I_REFERENCED (1 << 8) +#define __I_DIO_WAKEUP 9 +#define I_DIO_WAKEUP (1 << I_DIO_WAKEUP) #define I_DIRTY (I_DIRTY_SYNC | I_DIRTY_DATASYNC | I_DIRTY_PAGES) @@ -1815,7 +1821,6 @@ struct file_system_type { struct lock_class_key i_lock_key; struct lock_class_key i_mutex_key; struct lock_class_key i_mutex_dir_key; - struct lock_class_key i_alloc_sem_key; }; extern struct dentry *mount_ns(struct file_system_type *fs_type, int flags, @@ -2367,6 +2372,8 @@ enum { }; void dio_end_io(struct bio *bio, int error); +void inode_dio_wait(struct inode *inode); +void inode_dio_wake(struct inode *inode); ssize_t __blockdev_direct_IO(int rw, struct kiocb *iocb, struct inode *inode, struct block_device *bdev, const struct iovec *iov, loff_t offset, Index: linux-2.6/mm/memory.c ==================================================================--- linux-2.6.orig/mm/memory.c 2011-06-20 14:19:27.000000000 +0200 +++ linux-2.6/mm/memory.c 2011-06-20 14:55:34.619156952 +0200 @@ -2811,12 +2811,11 @@ int vmtruncate_range(struct inode *inode return -ENOSYS; mutex_lock(&inode->i_mutex); - down_write(&inode->i_alloc_sem); + inode_dio_wait(inode); unmap_mapping_range(mapping, offset, (end - offset), 1); truncate_inode_pages_range(mapping, offset, end); unmap_mapping_range(mapping, offset, (end - offset), 1); inode->i_op->truncate_range(inode, offset, end); - up_write(&inode->i_alloc_sem); mutex_unlock(&inode->i_mutex); return 0; Index: linux-2.6/fs/inode.c ==================================================================--- linux-2.6.orig/fs/inode.c 2011-06-20 14:19:26.000000000 +0200 +++ linux-2.6/fs/inode.c 2011-06-20 14:55:34.625823618 +0200 @@ -176,8 +176,7 @@ int inode_init_always(struct super_block mutex_init(&inode->i_mutex); lockdep_set_class(&inode->i_mutex, &sb->s_type->i_mutex_key); - init_rwsem(&inode->i_alloc_sem); - lockdep_set_class(&inode->i_alloc_sem, &sb->s_type->i_alloc_sem_key); + atomic_set(&inode->i_dio_count, 0); mapping->a_ops = &empty_aops; mapping->host = inode; Index: linux-2.6/fs/ntfs/inode.c ==================================================================--- linux-2.6.orig/fs/ntfs/inode.c 2011-06-20 14:19:26.000000000 +0200 +++ linux-2.6/fs/ntfs/inode.c 2011-06-20 14:55:34.629156951 +0200 @@ -2357,12 +2357,7 @@ static const char *es = " Leaving incon * * Returns 0 on success or -errno on error. * - * Called with ->i_mutex held. In all but one case ->i_alloc_sem is held for - * writing. The only case in the kernel where ->i_alloc_sem is not held is - * mm/filemap.c::generic_file_buffered_write() where vmtruncate() is called - * with the current i_size as the offset. The analogous place in NTFS is in - * fs/ntfs/file.c::ntfs_file_buffered_write() where we call vmtruncate() again - * without holding ->i_alloc_sem. + * Called with ->i_mutex held. */ int ntfs_truncate(struct inode *vi) { @@ -2887,8 +2882,7 @@ void ntfs_truncate_vfs(struct inode *vi) * We also abort all changes of user, group, and mode as we do not implement * the NTFS ACLs yet. * - * Called with ->i_mutex held. For the ATTR_SIZE (i.e. ->truncate) case, also - * called with ->i_alloc_sem held for writing. + * Called with ->i_mutex held. */ int ntfs_setattr(struct dentry *dentry, struct iattr *attr) { Index: linux-2.6/fs/ocfs2/aops.c ==================================================================--- linux-2.6.orig/fs/ocfs2/aops.c 2011-06-20 14:19:27.000000000 +0200 +++ linux-2.6/fs/ocfs2/aops.c 2011-06-20 14:55:34.629156951 +0200 @@ -551,9 +551,8 @@ bail: /* * ocfs2_dio_end_io is called by the dio core when a dio is finished. We''re - * particularly interested in the aio/dio case. Like the core uses - * i_alloc_sem, we use the rw_lock DLM lock to protect io on one node from - * truncation on another. + * particularly interested in the aio/dio case. We use the rw_lock DLM lock + * to protect io on one node from truncation on another. */ static void ocfs2_dio_end_io(struct kiocb *iocb, loff_t offset, @@ -569,7 +568,7 @@ static void ocfs2_dio_end_io(struct kioc BUG_ON(!ocfs2_iocb_is_rw_locked(iocb)); if (ocfs2_iocb_is_sem_locked(iocb)) { - up_read(&inode->i_alloc_sem); + inode_dio_wake(inode); ocfs2_iocb_clear_sem_locked(iocb); } Index: linux-2.6/fs/ocfs2/file.c ==================================================================--- linux-2.6.orig/fs/ocfs2/file.c 2011-06-20 14:19:27.000000000 +0200 +++ linux-2.6/fs/ocfs2/file.c 2011-06-20 14:55:34.635823617 +0200 @@ -2236,9 +2236,9 @@ static ssize_t ocfs2_file_aio_write(stru ocfs2_iocb_clear_sem_locked(iocb); relock: - /* to match setattr''s i_mutex -> i_alloc_sem -> rw_lock ordering */ + /* to match setattr''s i_mutex -> rw_lock ordering */ if (direct_io) { - down_read(&inode->i_alloc_sem); + atomic_inc(&inode->i_dio_count); have_alloc_sem = 1; /* communicate with ocfs2_dio_end_io */ ocfs2_iocb_set_sem_locked(iocb); @@ -2290,7 +2290,7 @@ relock: */ if (direct_io && !can_do_direct) { ocfs2_rw_unlock(inode, rw_level); - up_read(&inode->i_alloc_sem); + inode_dio_wake(inode); have_alloc_sem = 0; rw_level = -1; @@ -2361,8 +2361,7 @@ out_dio: /* * deep in g_f_a_w_n()->ocfs2_direct_IO we pass in a ocfs2_dio_end_io * function pointer which is called when o_direct io completes so that - * it can unlock our rw lock. (it''s the clustered equivalent of - * i_alloc_sem; protects truncate from racing with pending ios). + * it can unlock our rw lock. * Unfortunately there are error cases which call end_io and others * that don''t. so we don''t have to unlock the rw_lock if either an * async dio is going to do it in the future or an end_io after an @@ -2379,7 +2378,7 @@ out: out_sems: if (have_alloc_sem) { - up_read(&inode->i_alloc_sem); + inode_dio_wake(inode); ocfs2_iocb_clear_sem_locked(iocb); } @@ -2531,8 +2530,8 @@ static ssize_t ocfs2_file_aio_read(struc * need locks to protect pending reads from racing with truncate. */ if (filp->f_flags & O_DIRECT) { - down_read(&inode->i_alloc_sem); have_alloc_sem = 1; + atomic_inc(&inode->i_dio_count); ocfs2_iocb_set_sem_locked(iocb); ret = ocfs2_rw_lock(inode, 0); @@ -2575,7 +2574,7 @@ static ssize_t ocfs2_file_aio_read(struc bail: if (have_alloc_sem) { - up_read(&inode->i_alloc_sem); + inode_dio_wake(inode); ocfs2_iocb_clear_sem_locked(iocb); } if (rw_level != -1) Index: linux-2.6/mm/madvise.c ==================================================================--- linux-2.6.orig/mm/madvise.c 2011-06-20 14:19:27.000000000 +0200 +++ linux-2.6/mm/madvise.c 2011-06-20 14:55:34.635823617 +0200 @@ -218,7 +218,7 @@ static long madvise_remove(struct vm_are endoff = (loff_t)(end - vma->vm_start - 1) + ((loff_t)vma->vm_pgoff << PAGE_SHIFT); - /* vmtruncate_range needs to take i_mutex and i_alloc_sem */ + /* vmtruncate_range needs to take i_mutex */ up_read(¤t->mm->mmap_sem); error = vmtruncate_range(mapping->host, offset, endoff); down_read(¤t->mm->mmap_sem); -- To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Christoph Hellwig
2011-Jun-20 20:15 UTC
[PATCH 5/8] fs: move inode_dio_wait calls into ->setattr
Let filesystems handle waiting for direct I/O requests themselves instead of doing it beforehand. This means filesystem-specific locks to prevent new dio referenes from appearing can be held. This is important to allow generalizing i_dio_count to non-DIO_LOCKING filesystems. Signed-off-by: Christoph Hellwig <hch@lst.de> Index: linux-2.6/fs/ocfs2/file.c ==================================================================--- linux-2.6.orig/fs/ocfs2/file.c 2011-06-20 09:28:54.516815966 +0200 +++ linux-2.6/fs/ocfs2/file.c 2011-06-20 09:31:34.706807855 +0200 @@ -1142,6 +1142,8 @@ int ocfs2_setattr(struct dentry *dentry, if (status) goto bail_unlock; + inode_dio_wait(inode); + if (i_size_read(inode) > attr->ia_size) { if (ocfs2_should_order_data(inode)) { status = ocfs2_begin_ordered_truncate(inode, Index: linux-2.6/fs/attr.c ==================================================================--- linux-2.6.orig/fs/attr.c 2011-06-20 09:28:54.490149300 +0200 +++ linux-2.6/fs/attr.c 2011-06-20 09:29:06.000000000 +0200 @@ -232,9 +232,6 @@ int notify_change(struct dentry * dentry if (error) return error; - if (ia_valid & ATTR_SIZE) - inode_dio_wait(inode); - if (inode->i_op->setattr) error = inode->i_op->setattr(dentry, attr); else Index: linux-2.6/fs/ext2/inode.c ==================================================================--- linux-2.6.orig/fs/ext2/inode.c 2011-06-18 12:54:28.058273680 +0200 +++ linux-2.6/fs/ext2/inode.c 2011-06-20 09:29:06.500148692 +0200 @@ -1184,6 +1184,8 @@ static int ext2_setsize(struct inode *in if (IS_APPEND(inode) || IS_IMMUTABLE(inode)) return -EPERM; + inode_dio_wait(inode); + if (mapping_is_xip(inode->i_mapping)) error = xip_truncate_page(inode->i_mapping, newsize); else if (test_opt(inode->i_sb, NOBH)) Index: linux-2.6/fs/ext3/inode.c ==================================================================--- linux-2.6.orig/fs/ext3/inode.c 2011-06-18 12:54:28.071607014 +0200 +++ linux-2.6/fs/ext3/inode.c 2011-06-20 09:29:06.500148692 +0200 @@ -3216,6 +3216,9 @@ int ext3_setattr(struct dentry *dentry, ext3_journal_stop(handle); } + if (attr->ia_valid & ATTR_SIZE) + inode_dio_wait(inode); + if (S_ISREG(inode->i_mode) && attr->ia_valid & ATTR_SIZE && attr->ia_size < inode->i_size) { handle_t *handle; Index: linux-2.6/fs/ext4/inode.c ==================================================================--- linux-2.6.orig/fs/ext4/inode.c 2011-06-20 09:28:54.506815967 +0200 +++ linux-2.6/fs/ext4/inode.c 2011-06-20 09:29:06.000000000 +0200 @@ -5351,6 +5351,8 @@ int ext4_setattr(struct dentry *dentry, } if (attr->ia_valid & ATTR_SIZE) { + inode_dio_wait(inode); + if (!(ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS))) { struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb); Index: linux-2.6/fs/fat/file.c ==================================================================--- linux-2.6.orig/fs/fat/file.c 2011-06-18 12:54:28.118273678 +0200 +++ linux-2.6/fs/fat/file.c 2011-06-20 09:29:06.000000000 +0200 @@ -397,6 +397,8 @@ int fat_setattr(struct dentry *dentry, s * sequence. */ if (attr->ia_valid & ATTR_SIZE) { + inode_dio_wait(inode); + if (attr->ia_size > inode->i_size) { error = fat_cont_expand(inode, attr->ia_size); if (error || attr->ia_valid == ATTR_SIZE) Index: linux-2.6/fs/gfs2/bmap.c ==================================================================--- linux-2.6.orig/fs/gfs2/bmap.c 2011-06-18 12:54:28.141607009 +0200 +++ linux-2.6/fs/gfs2/bmap.c 2011-06-20 09:29:06.510148693 +0200 @@ -1224,6 +1224,8 @@ int gfs2_setattr_size(struct inode *inod if (ret) return ret; + inode_dio_wait(inode); + oldsize = inode->i_size; if (newsize >= oldsize) return do_grow(inode, newsize); Index: linux-2.6/fs/hfs/inode.c ==================================================================--- linux-2.6.orig/fs/hfs/inode.c 2011-06-18 12:54:28.154940342 +0200 +++ linux-2.6/fs/hfs/inode.c 2011-06-20 09:29:06.000000000 +0200 @@ -615,6 +615,8 @@ int hfs_inode_setattr(struct dentry *den if ((attr->ia_valid & ATTR_SIZE) && attr->ia_size != i_size_read(inode)) { + inode_dio_wait(inode); + error = vmtruncate(inode, attr->ia_size); if (error) return error; Index: linux-2.6/fs/hfsplus/inode.c ==================================================================--- linux-2.6.orig/fs/hfsplus/inode.c 2011-06-18 12:54:28.168273676 +0200 +++ linux-2.6/fs/hfsplus/inode.c 2011-06-20 09:29:06.000000000 +0200 @@ -296,6 +296,8 @@ static int hfsplus_setattr(struct dentry if ((attr->ia_valid & ATTR_SIZE) && attr->ia_size != i_size_read(inode)) { + inode_dio_wait(inode); + error = vmtruncate(inode, attr->ia_size); if (error) return error; Index: linux-2.6/fs/jfs/file.c ==================================================================--- linux-2.6.orig/fs/jfs/file.c 2011-06-18 12:54:28.191607007 +0200 +++ linux-2.6/fs/jfs/file.c 2011-06-20 09:29:06.000000000 +0200 @@ -110,6 +110,8 @@ int jfs_setattr(struct dentry *dentry, s if ((iattr->ia_valid & ATTR_SIZE) && iattr->ia_size != i_size_read(inode)) { + inode_dio_wait(inode); + rc = vmtruncate(inode, iattr->ia_size); if (rc) return rc; Index: linux-2.6/fs/nilfs2/inode.c ==================================================================--- linux-2.6.orig/fs/nilfs2/inode.c 2011-06-18 12:54:28.204940339 +0200 +++ linux-2.6/fs/nilfs2/inode.c 2011-06-20 09:29:06.000000000 +0200 @@ -778,6 +778,8 @@ int nilfs_setattr(struct dentry *dentry, if ((iattr->ia_valid & ATTR_SIZE) && iattr->ia_size != i_size_read(inode)) { + inode_dio_wait(inode); + err = vmtruncate(inode, iattr->ia_size); if (unlikely(err)) goto out_err; Index: linux-2.6/fs/reiserfs/inode.c ==================================================================--- linux-2.6.orig/fs/reiserfs/inode.c 2011-06-18 12:54:28.218273673 +0200 +++ linux-2.6/fs/reiserfs/inode.c 2011-06-20 09:29:06.000000000 +0200 @@ -3114,6 +3114,9 @@ int reiserfs_setattr(struct dentry *dent error = -EFBIG; goto out; } + + inode_dio_wait(inode); + /* fill in hole pointers in the expanding truncate case. */ if (attr->ia_size > inode->i_size) { error = generic_cont_expand_simple(inode, attr->ia_size); -- To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Maintain i_dio_count for all filesystems, not just those using DIO_LOCKING. This these filesystems to also protect truncate against direct I/O requests by using common code. Right now the only non-DIO_LOCKING filesystem that appears to do so is XFS, which uses an opencoded variant of the i_dio_count scheme. Behaviour doesn''t change for filesystems never calling inode_dio_wait, which are all that never use DIO_LOCKING. For ext4 behaviour changes with the dioread_nonlock option, which previous was missing any protection between truncate and direct I/O reads. For ocfs2 that handcrafted i_dio_count manipulations are replaced with the common code noew available. As a result inode_dio_wake can now be made static in direct-io.c. Signed-off-by: Christoph Hellwig <hch@lst.de> Index: linux-2.6/fs/direct-io.c ==================================================================--- linux-2.6.orig/fs/direct-io.c 2011-06-20 14:55:34.602490284 +0200 +++ linux-2.6/fs/direct-io.c 2011-06-20 14:57:24.575818051 +0200 @@ -149,12 +149,11 @@ void inode_dio_wait(struct inode *inode) } EXPORT_SYMBOL_GPL(inode_dio_wait); -void inode_dio_wake(struct inode *inode) +static inline void inode_dio_wake(struct inode *inode) { if (atomic_dec_and_test(&inode->i_dio_count)) wake_up_bit(&inode->i_state, __I_DIO_WAKEUP); } -EXPORT_SYMBOL_GPL(inode_dio_wake); /* * How many pages are in the queue? @@ -274,8 +273,7 @@ static ssize_t dio_complete(struct dio * aio_complete(dio->iocb, ret, 0); } - if (dio->flags & DIO_LOCKING) - inode_dio_wake(dio->inode); + inode_dio_wake(dio->inode); return ret; } @@ -1162,14 +1160,16 @@ direct_io_worker(int rw, struct kiocb *i * For writes this function is called under i_mutex and returns with * i_mutex held, for reads, i_mutex is not held on entry, but it is * taken and dropped again before returning. - * The i_dio_count counter keeps track of the number of outstanding - * direct I/O requests, and truncate waits for it to reach zero. - * New references to i_dio_count must only be grabbed with i_mutex - * held. - * * - if the flags value does NOT contain DIO_LOCKING we don''t use any * internal locking but rather rely on the filesystem to synchronize * direct I/O reads/writes versus each other and truncate. + * + * To help with locking against truncate we incremented the i_dio_count + * counter before starting direct I/O, and decrement it once we are done. + * Truncate can wait for it to reach zero to provide exclusion. It is + * expected that filesystem provide exclusion between new direct I/O + * and truncates. For DIO_LOCKING filesystems this is done by i_mutex, + * but other filesystems need to take care of this on their own. */ ssize_t __blockdev_direct_IO(int rw, struct kiocb *iocb, struct inode *inode, @@ -1247,14 +1247,14 @@ __blockdev_direct_IO(int rw, struct kioc goto out; } } - - /* - * Will be decremented at I/O completion time. - */ - atomic_inc(&inode->i_dio_count); } /* + * Will be decremented at I/O completion time. + */ + atomic_inc(&inode->i_dio_count); + + /* * For file extending writes updating i_size before data * writeouts complete can expose uninitialized blocks. So * even for AIO, we need to wait for i/o to complete before Index: linux-2.6/fs/ocfs2/aops.c ==================================================================--- linux-2.6.orig/fs/ocfs2/aops.c 2011-06-20 14:55:34.629156951 +0200 +++ linux-2.6/fs/ocfs2/aops.c 2011-06-20 14:56:59.259152666 +0200 @@ -567,10 +567,8 @@ static void ocfs2_dio_end_io(struct kioc /* this io''s submitter should not have unlocked this before we could */ BUG_ON(!ocfs2_iocb_is_rw_locked(iocb)); - if (ocfs2_iocb_is_sem_locked(iocb)) { - inode_dio_wake(inode); + if (ocfs2_iocb_is_sem_locked(iocb)) ocfs2_iocb_clear_sem_locked(iocb); - } ocfs2_iocb_clear_rw_locked(iocb); Index: linux-2.6/fs/ocfs2/file.c ==================================================================--- linux-2.6.orig/fs/ocfs2/file.c 2011-06-20 14:56:55.375819530 +0200 +++ linux-2.6/fs/ocfs2/file.c 2011-06-20 14:56:59.262485999 +0200 @@ -2240,7 +2240,6 @@ static ssize_t ocfs2_file_aio_write(stru relock: /* to match setattr''s i_mutex -> rw_lock ordering */ if (direct_io) { - atomic_inc(&inode->i_dio_count); have_alloc_sem = 1; /* communicate with ocfs2_dio_end_io */ ocfs2_iocb_set_sem_locked(iocb); @@ -2292,7 +2291,6 @@ relock: */ if (direct_io && !can_do_direct) { ocfs2_rw_unlock(inode, rw_level); - inode_dio_wake(inode); have_alloc_sem = 0; rw_level = -1; @@ -2379,10 +2377,8 @@ out: ocfs2_rw_unlock(inode, rw_level); out_sems: - if (have_alloc_sem) { - inode_dio_wake(inode); + if (have_alloc_sem) ocfs2_iocb_clear_sem_locked(iocb); - } mutex_unlock(&inode->i_mutex); @@ -2533,7 +2529,6 @@ static ssize_t ocfs2_file_aio_read(struc */ if (filp->f_flags & O_DIRECT) { have_alloc_sem = 1; - atomic_inc(&inode->i_dio_count); ocfs2_iocb_set_sem_locked(iocb); ret = ocfs2_rw_lock(inode, 0); @@ -2575,10 +2570,9 @@ static ssize_t ocfs2_file_aio_read(struc } bail: - if (have_alloc_sem) { - inode_dio_wake(inode); + if (have_alloc_sem) ocfs2_iocb_clear_sem_locked(iocb); - } + if (rw_level != -1) ocfs2_rw_unlock(inode, rw_level); Index: linux-2.6/include/linux/fs.h ==================================================================--- linux-2.6.orig/include/linux/fs.h 2011-06-20 14:57:08.582485528 +0200 +++ linux-2.6/include/linux/fs.h 2011-06-20 14:57:10.099152117 +0200 @@ -2373,7 +2373,6 @@ enum { void dio_end_io(struct bio *bio, int error); void inode_dio_wait(struct inode *inode); -void inode_dio_wake(struct inode *inode); ssize_t __blockdev_direct_IO(int rw, struct kiocb *iocb, struct inode *inode, struct block_device *bdev, const struct iovec *iov, loff_t offset, -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Christoph Hellwig
2011-Jun-20 20:15 UTC
[PATCH 7/8] btrfs: wait for direct I/O requests in truncate
Wait for all direct I/O requests to finish before performing a truncate. Signed-off-by: Christoph Hellwig <hch@lst.de> Index: linux-2.6/fs/btrfs/inode.c ==================================================================--- linux-2.6.orig/fs/btrfs/inode.c 2011-06-11 12:58:46.615017504 +0200 +++ linux-2.6/fs/btrfs/inode.c 2011-06-11 12:59:23.218348984 +0200 @@ -3550,6 +3550,8 @@ static int btrfs_setsize(struct inode *i loff_t oldsize = i_size_read(inode); int ret; + inode_dio_wait(inode); + if (newsize == oldsize) return 0; -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Christoph Hellwig
2011-Jun-20 20:15 UTC
[PATCH 8/8] rw_semaphore: remove up/down_read_non_owner
Now that the last users is gone these can be removed. Signed-off-by: Christoph Hellwig <hch@lst.de> Index: linux-2.6/include/linux/rwsem.h ==================================================================--- linux-2.6.orig/include/linux/rwsem.h 2011-06-20 14:58:15.449148809 +0200 +++ linux-2.6/include/linux/rwsem.h 2011-06-20 14:58:30.915814691 +0200 @@ -124,19 +124,9 @@ extern void downgrade_write(struct rw_se */ extern void down_read_nested(struct rw_semaphore *sem, int subclass); extern void down_write_nested(struct rw_semaphore *sem, int subclass); -/* - * Take/release a lock when not the owner will release it. - * - * [ This API should be avoided as much as possible - the - * proper abstraction for this case is completions. ] - */ -extern void down_read_non_owner(struct rw_semaphore *sem); -extern void up_read_non_owner(struct rw_semaphore *sem); #else # define down_read_nested(sem, subclass) down_read(sem) # define down_write_nested(sem, subclass) down_write(sem) -# define down_read_non_owner(sem) down_read(sem) -# define up_read_non_owner(sem) up_read(sem) #endif #endif /* _LINUX_RWSEM_H */ Index: linux-2.6/kernel/rwsem.c ==================================================================--- linux-2.6.orig/kernel/rwsem.c 2011-06-20 14:58:34.509147842 +0200 +++ linux-2.6/kernel/rwsem.c 2011-06-20 14:58:47.479147187 +0200 @@ -117,15 +117,6 @@ void down_read_nested(struct rw_semaphor EXPORT_SYMBOL(down_read_nested); -void down_read_non_owner(struct rw_semaphore *sem) -{ - might_sleep(); - - __down_read(sem); -} - -EXPORT_SYMBOL(down_read_non_owner); - void down_write_nested(struct rw_semaphore *sem, int subclass) { might_sleep(); @@ -136,13 +127,6 @@ void down_write_nested(struct rw_semapho EXPORT_SYMBOL(down_write_nested); -void up_read_non_owner(struct rw_semaphore *sem) -{ - __up_read(sem); -} - -EXPORT_SYMBOL(up_read_non_owner); - #endif -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Mon, Jun 20, 2011 at 04:15:33PM -0400, Christoph Hellwig wrote:> This series removes it in favour of a simpler counter scheme, thus getting > rid of the rw_semaphore non-owner APIs as requests by Thomas, while at the > same time shrinking the size of struct inode by 160 bytes on 64-bit systems.This should be 160 bits, aka 20 bytes of course, sorry. -- To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Mon, Jun 20, 2011 at 04:15:39PM -0400, Christoph Hellwig wrote:> Maintain i_dio_count for all filesystems, not just those using DIO_LOCKING. > This these filesystems to also protect truncate against direct I/O requests > by using common code. Right now the only non-DIO_LOCKING filesystem that > appears to do so is XFS, which uses an opencoded variant of the i_dio_count > scheme. > > Behaviour doesn''t change for filesystems never calling inode_dio_wait, > which are all that never use DIO_LOCKING. > > For ext4 behaviour changes with the dioread_nonlock option, which previous > was missing any protection between truncate and direct I/O reads. > > For ocfs2 that handcrafted i_dio_count manipulations are replaced with > the common code noew available.Oh god you''re making the world scary. Are you guaranteeing that all allocation changes are locked out by the time we get into file_aio_write() and file_aio_read()? This is not obvious to me. Joel -- "Glory is fleeting, but obscurity is forever." - Napoleon Bonaparte http://www.jlbec.org/ jlbec@evilplan.org -- To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Mon, Jun 20, 2011 at 04:15:37PM -0400, Christoph Hellwig wrote:> i_alloc_sem is a rather special rw_semaphore. It''s the last one that may > be released by a non-owner, and it''s write side is always mirrored by > real exclusion. It''s intended use it to wait for all pending direct I/O > requests to finish before starting a truncate. > > Replace it with a hand-grown construct: > > - exclusion for truncates is already guaranteed by i_mutex, so it can > simply fall way > - the reader side is replaced by an i_dio_count member in struct inode > that counts the number of pending direct I/O requests. Truncate can''t > proceed as long as it''s non-zero > - when i_dio_count reaches non-zero we wake up a pending truncate using > wake_up_bit on a new bit in i_flags > - new references to i_dio_count can''t appear while we are waiting for > it to read zero because the direct I/O count always needs i_mutex > (or an equivalent like XFS''s i_iolock) for starting a new operation. > > This scheme is much simpler, and saves the space of a spinlock_t and a > struct list_head in struct inode (typically 160 bytes on a non-debug 64-bit > system).Are we guaranteed that all allocation changes are locked out by i_dio_count>0? I don''t think we are. The ocfs2 code very strongly assumes the state of a file''s allocation when it holds i_alloc_sem. I feel like we lose that here. Joel -- "I don''t even butter my bread; I consider that cooking." - Katherine Cebrian http://www.jlbec.org/ jlbec@evilplan.org -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Mon, Jun 20, 2011 at 02:32:03PM -0700, Joel Becker wrote:> Are we guaranteed that all allocation changes are locked out by > i_dio_count>0? I don''t think we are. The ocfs2 code very strongly > assumes the state of a file''s allocation when it holds i_alloc_sem. I > feel like we lose that here.You aren''t, neither with the old i_alloc_sem code, nor with the 1:1 replacement using i_dio_count. Do a quick grep who gets i_alloc_sem exclusively (down_write): it''s really just the truncate code, and it''s cut & paste duplicates in ntfs and reiserfs. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Mon, Jun 20, 2011 at 02:29:24PM -0700, Joel Becker wrote:> Oh god you''re making the world scary. Are you guaranteeing that > all allocation changes are locked out by the time we get into > file_aio_write() and file_aio_read()? This is not obvious to me.I have no idea how ocfs2''s internal allocator locking works, but this patch doesn''t change it. What this patch touches is exclusion between truncate and pending direct I/O requests, and even there only the implementation and not the semantics. The old and new semantics are that you may have either 1 ongoing truncate OR n (>= 0; <= ATOMIC_T_MAX) ongoing direct I/O reads or writes before that was enforced using the i_alloc_sem rw_semaphore, including non-owner releases of it from AIO code, in the new code it''s done using a combination of i_mutex which was already taken in the truncate path, and when starting new direct I/O requests, and the new i_dio_count counter.> > Joel > > -- > > "Glory is fleeting, but obscurity is forever." > - Napoleon Bonaparte > > http://www.jlbec.org/ > jlbec@evilplan.org---end quoted text--- -- To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Mon, Jun 20, 2011 at 04:15:37PM -0400, Christoph Hellwig wrote:> i_alloc_sem is a rather special rw_semaphore. It''s the last one that may > be released by a non-owner, and it''s write side is always mirrored by > real exclusion. It''s intended use it to wait for all pending direct I/O > requests to finish before starting a truncate. > > Replace it with a hand-grown construct: > > - exclusion for truncates is already guaranteed by i_mutex, so it can > simply fall way > - the reader side is replaced by an i_dio_count member in struct inode > that counts the number of pending direct I/O requests. Truncate can''t > proceed as long as it''s non-zero > - when i_dio_count reaches non-zero we wake up a pending truncate using > wake_up_bit on a new bit in i_flags > - new references to i_dio_count can''t appear while we are waiting for > it to read zero because the direct I/O count always needs i_mutex > (or an equivalent like XFS''s i_iolock) for starting a new operation. > > This scheme is much simpler, and saves the space of a spinlock_t and a > struct list_head in struct inode (typically 160 bytes on a non-debug 64-bit > system). > > Signed-off-by: Christoph Hellwig <hch@lst.de> > > Index: linux-2.6/fs/direct-io.c > ==================================================================> --- linux-2.6.orig/fs/direct-io.c 2011-06-20 14:55:31.000000000 +0200 > +++ linux-2.6/fs/direct-io.c 2011-06-20 14:55:34.602490284 +0200 > @@ -136,6 +136,27 @@ struct dio { > }; > > /* > + * Wait for outstanding DIO requests to finish. Must be locked against > + * increments of i_dio_count by i_mutex. > + */ > +void inode_dio_wait(struct inode *inode) > +{ > + might_sleep(); > + while (atomic_read(&inode->i_dio_count)) { > + wait_on_bit(&inode->i_state, __I_DIO_WAKEUP, inode_wait, > + TASK_UNINTERRUPTIBLE); > + } > +} > +EXPORT_SYMBOL_GPL(inode_dio_wait); > + > +void inode_dio_wake(struct inode *inode) > +{ > + if (atomic_dec_and_test(&inode->i_dio_count)) > + wake_up_bit(&inode->i_state, __I_DIO_WAKEUP); > +} > +EXPORT_SYMBOL_GPL(inode_dio_wake);Modification of inode->i_state is not safe outside the inode->i_lock. This probably needs to be implemented similar to the __I_NEW/__wait_on_freeing_inode() and __I_SYNC/inode_wait_for_writeback() pattern... Cheers, Dave. -- Dave Chinner david@fromorbit.com -- To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Tue, Jun 21, 2011 at 03:40:56PM +1000, Dave Chinner wrote:> Modification of inode->i_state is not safe outside the > inode->i_lock.We never actually set the new bit in i_state, we just use it as a key for the hashed lookups. Or rather we try to, as I misunderstood how wait_on_bit works, so currently we busywait for i_dio_count to reach zero. I''ll respin a version that actually works as expected. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Christoph Hellwig <hch@infradead.org> writes:> Add a new rw_semaphore to protect bmap against truncate. Previous > i_alloc_sem was abused for this, but it''s going away in this series.In FAT case, ->i_mutex was better. But, last time I saw, shmfs was using ->i_mutex to call ->bmap. So, this was chosen instead. I''m not checking current version yet though, if shmfs had change, we can use ->i_mutex. BTW, this patch looks good to me. Thanks.> Signed-off-by: Christoph Hellwig <hch@lst.de> > > Index: linux-2.6/fs/fat/inode.c > ==================================================================> --- linux-2.6.orig/fs/fat/inode.c 2011-06-20 21:28:19.707963855 +0200 > +++ linux-2.6/fs/fat/inode.c 2011-06-20 21:29:25.031293882 +0200 > @@ -224,9 +224,9 @@ static sector_t _fat_bmap(struct address > sector_t blocknr; > > /* fat_get_cluster() assumes the requested blocknr isn''t truncated. */ > - down_read(&mapping->host->i_alloc_sem); > + down_read(&MSDOS_I(mapping->host)->truncate_lock); > blocknr = generic_block_bmap(mapping, block, fat_get_block); > - up_read(&mapping->host->i_alloc_sem); > + up_read(&MSDOS_I(mapping->host)->truncate_lock); > > return blocknr; > } > @@ -510,6 +510,8 @@ static struct inode *fat_alloc_inode(str > ei = kmem_cache_alloc(fat_inode_cachep, GFP_NOFS); > if (!ei) > return NULL; > + > + init_rwsem(&ei->truncate_lock); > return &ei->vfs_inode; > } > > Index: linux-2.6/fs/fat/fat.h > ==================================================================> --- linux-2.6.orig/fs/fat/fat.h 2011-06-20 21:28:19.724630522 +0200 > +++ linux-2.6/fs/fat/fat.h 2011-06-20 21:29:25.034627215 +0200 > @@ -109,6 +109,7 @@ struct msdos_inode_info { > int i_attrs; /* unused attribute bits */ > loff_t i_pos; /* on-disk position of directory entry or 0 */ > struct hlist_node i_fat_hash; /* hash by i_location */ > + struct rw_semaphore truncate_lock; /* protect bmap against truncate */ > struct inode vfs_inode; > }; > > Index: linux-2.6/fs/fat/file.c > ==================================================================> --- linux-2.6.orig/fs/fat/file.c 2011-06-20 21:28:19.744630521 +0200 > +++ linux-2.6/fs/fat/file.c 2011-06-20 21:29:54.501292390 +0200 > @@ -429,8 +429,10 @@ int fat_setattr(struct dentry *dentry, s > } > > if (attr->ia_valid & ATTR_SIZE) { > + down_write(&MSDOS_I(inode)->truncate_lock); > truncate_setsize(inode, attr->ia_size); > fat_truncate_blocks(inode, attr->ia_size); > + up_write(&MSDOS_I(inode)->truncate_lock); > } > > setattr_copy(inode, attr); >-- OGAWA Hirofumi <hirofumi@mail.parknet.co.jp> -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
OGAWA Hirofumi <hirofumi@mail.parknet.co.jp> writes:> Christoph Hellwig <hch@infradead.org> writes: > >> Add a new rw_semaphore to protect bmap against truncate. Previous >> i_alloc_sem was abused for this, but it''s going away in this series. > > In FAT case, ->i_mutex was better. But, last time I saw, shmfs was using > ->i_mutex to call ->bmap. So, this was chosen instead. > > I''m not checking current version yet though, if shmfs had change, we can > use ->i_mutex.It was swapfile. And unfortunately it doesn''t have change.> BTW, this patch looks good to me.-- OGAWA Hirofumi <hirofumi@mail.parknet.co.jp> -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wed, Jun 22, 2011 at 12:57:43AM +0900, OGAWA Hirofumi wrote:> Christoph Hellwig <hch@infradead.org> writes: > > > Add a new rw_semaphore to protect bmap against truncate. Previous > > i_alloc_sem was abused for this, but it''s going away in this series. > > In FAT case, ->i_mutex was better. But, last time I saw, shmfs was using > ->i_mutex to call ->bmap. So, this was chosen instead. > > I''m not checking current version yet though, if shmfs had change, we can > use ->i_mutex.I tried that first, but it gives an instant deadlock when using a swapfile on fat. -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Mon, 20 Jun 2011, Christoph Hellwig wrote:> Add a new rw_semaphore to protect page_mkwrite against truncate. Previous > i_alloc_sem was abused for this, but it''s going away in this series. > > Signed-off-by: Christoph Hellwig <hch@lst.de> > > Index: linux-2.6/fs/ext4/inode.c > ==================================================================> --- linux-2.6.orig/fs/ext4/inode.c 2011-06-20 14:25:33.779248128 +0200 > +++ linux-2.6/fs/ext4/inode.c 2011-06-20 14:27:53.025907745 +0200 > @@ -5357,6 +5357,7 @@ int ext4_setattr(struct dentry *dentry, > if (attr->ia_size > sbi->s_bitmap_maxbytes) > return -EFBIG; > } > + down_write(&(EXT4_I(inode)->truncate_lock)); > } > > if (S_ISREG(inode->i_mode) && > @@ -5405,6 +5406,9 @@ int ext4_setattr(struct dentry *dentry, > ext4_truncate(inode); > } > > + if (attr->ia_valid & ATTR_SIZE) > + up_write(&(EXT4_I(inode)->truncate_lock)); > +Hi Christoph, this looks like a bug fix, so we should rather do it in a separate commit. Could you send it separately, or at least mention that in commit description ? Otherwise the patch looks good. Thanks! -Lukas> if (!rc) { > setattr_copy(inode, attr); > mark_inode_dirty(inode); > @@ -5850,10 +5854,10 @@ int ext4_page_mkwrite(struct vm_area_str > struct address_space *mapping = inode->i_mapping; > > /* > - * Get i_alloc_sem to stop truncates messing with the inode. We cannot > - * get i_mutex because we are already holding mmap_sem. > + * Get truncate_lock to stop truncates messing with the inode. We > + * cannot get i_mutex because we are already holding mmap_sem. > */ > - down_read(&inode->i_alloc_sem); > + down_read(&(EXT4_I(inode)->truncate_lock)); > size = i_size_read(inode); > if (page->mapping != mapping || size <= page_offset(page) > || !PageUptodate(page)) { > @@ -5865,7 +5869,7 @@ int ext4_page_mkwrite(struct vm_area_str > lock_page(page); > wait_on_page_writeback(page); > if (PageMappedToDisk(page)) { > - up_read(&inode->i_alloc_sem); > + up_read(&(EXT4_I(inode)->truncate_lock)); > return VM_FAULT_LOCKED; > } > > @@ -5883,7 +5887,7 @@ int ext4_page_mkwrite(struct vm_area_str > if (page_has_buffers(page)) { > if (!walk_page_buffers(NULL, page_buffers(page), 0, len, NULL, > ext4_bh_unmapped)) { > - up_read(&inode->i_alloc_sem); > + up_read(&(EXT4_I(inode)->truncate_lock)); > return VM_FAULT_LOCKED; > } > } > @@ -5912,11 +5916,11 @@ int ext4_page_mkwrite(struct vm_area_str > */ > lock_page(page); > wait_on_page_writeback(page); > - up_read(&inode->i_alloc_sem); > + up_read(&(EXT4_I(inode)->truncate_lock)); > return VM_FAULT_LOCKED; > out_unlock: > if (ret) > ret = VM_FAULT_SIGBUS; > - up_read(&inode->i_alloc_sem); > + up_read(&(EXT4_I(inode)->truncate_lock)); > return ret; > } > Index: linux-2.6/fs/ext4/super.c > ==================================================================> --- linux-2.6.orig/fs/ext4/super.c 2011-06-20 14:25:33.795914793 +0200 > +++ linux-2.6/fs/ext4/super.c 2011-06-20 14:27:06.269243443 +0200 > @@ -871,6 +871,7 @@ static struct inode *ext4_alloc_inode(st > ei->i_datasync_tid = 0; > atomic_set(&ei->i_ioend_count, 0); > atomic_set(&ei->i_aiodio_unwritten, 0); > + init_rwsem(&ei->truncate_lock); > > return &ei->vfs_inode; > } > Index: linux-2.6/fs/ext4/ext4.h > ==================================================================> --- linux-2.6.orig/fs/ext4/ext4.h 2011-06-20 14:25:33.752581461 +0200 > +++ linux-2.6/fs/ext4/ext4.h 2011-06-20 14:27:06.272576777 +0200 > @@ -844,6 +844,9 @@ struct ext4_inode_info { > /* on-disk additional length */ > __u16 i_extra_isize; > > + /* protect page_mkwrite against truncates */ > + struct rw_semaphore truncate_lock; > + > #ifdef CONFIG_QUOTA > /* quota space reservation, managed internally by quota code */ > qsize_t i_reserved_quota; > > -- > To unsubscribe from this list: send the line "unsubscribe linux-ext4" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html >-- -- To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Tue, 21 Jun 2011, Lukas Czerner wrote:> On Mon, 20 Jun 2011, Christoph Hellwig wrote: > > > Add a new rw_semaphore to protect page_mkwrite against truncate. Previous > > i_alloc_sem was abused for this, but it''s going away in this series. > > > > Signed-off-by: Christoph Hellwig <hch@lst.de> > > > > Index: linux-2.6/fs/ext4/inode.c > > ==================================================================> > --- linux-2.6.orig/fs/ext4/inode.c 2011-06-20 14:25:33.779248128 +0200 > > +++ linux-2.6/fs/ext4/inode.c 2011-06-20 14:27:53.025907745 +0200 > > @@ -5357,6 +5357,7 @@ int ext4_setattr(struct dentry *dentry, > > if (attr->ia_size > sbi->s_bitmap_maxbytes) > > return -EFBIG; > > } > > + down_write(&(EXT4_I(inode)->truncate_lock)); > > } > > > > if (S_ISREG(inode->i_mode) && > > @@ -5405,6 +5406,9 @@ int ext4_setattr(struct dentry *dentry, > > ext4_truncate(inode); > > } > > > > + if (attr->ia_valid & ATTR_SIZE) > > + up_write(&(EXT4_I(inode)->truncate_lock)); > > + > > Hi Christoph, > > this looks like a bug fix, so we should rather do it in a separate > commit. Could you send it separately, or at least mention that in commit > description ?That''s stupid note, it is not a bugfix of course. Ignore it, sorry for the noise.> > Otherwise the patch looks good. > > Thanks! > -Lukas > > > if (!rc) { > > setattr_copy(inode, attr); > > mark_inode_dirty(inode); > > @@ -5850,10 +5854,10 @@ int ext4_page_mkwrite(struct vm_area_str > > struct address_space *mapping = inode->i_mapping; > > > > /* > > - * Get i_alloc_sem to stop truncates messing with the inode. We cannot > > - * get i_mutex because we are already holding mmap_sem. > > + * Get truncate_lock to stop truncates messing with the inode. We > > + * cannot get i_mutex because we are already holding mmap_sem. > > */ > > - down_read(&inode->i_alloc_sem); > > + down_read(&(EXT4_I(inode)->truncate_lock)); > > size = i_size_read(inode); > > if (page->mapping != mapping || size <= page_offset(page) > > || !PageUptodate(page)) { > > @@ -5865,7 +5869,7 @@ int ext4_page_mkwrite(struct vm_area_str > > lock_page(page); > > wait_on_page_writeback(page); > > if (PageMappedToDisk(page)) { > > - up_read(&inode->i_alloc_sem); > > + up_read(&(EXT4_I(inode)->truncate_lock)); > > return VM_FAULT_LOCKED; > > } > > > > @@ -5883,7 +5887,7 @@ int ext4_page_mkwrite(struct vm_area_str > > if (page_has_buffers(page)) { > > if (!walk_page_buffers(NULL, page_buffers(page), 0, len, NULL, > > ext4_bh_unmapped)) { > > - up_read(&inode->i_alloc_sem); > > + up_read(&(EXT4_I(inode)->truncate_lock)); > > return VM_FAULT_LOCKED; > > } > > } > > @@ -5912,11 +5916,11 @@ int ext4_page_mkwrite(struct vm_area_str > > */ > > lock_page(page); > > wait_on_page_writeback(page); > > - up_read(&inode->i_alloc_sem); > > + up_read(&(EXT4_I(inode)->truncate_lock)); > > return VM_FAULT_LOCKED; > > out_unlock: > > if (ret) > > ret = VM_FAULT_SIGBUS; > > - up_read(&inode->i_alloc_sem); > > + up_read(&(EXT4_I(inode)->truncate_lock)); > > return ret; > > } > > Index: linux-2.6/fs/ext4/super.c > > ==================================================================> > --- linux-2.6.orig/fs/ext4/super.c 2011-06-20 14:25:33.795914793 +0200 > > +++ linux-2.6/fs/ext4/super.c 2011-06-20 14:27:06.269243443 +0200 > > @@ -871,6 +871,7 @@ static struct inode *ext4_alloc_inode(st > > ei->i_datasync_tid = 0; > > atomic_set(&ei->i_ioend_count, 0); > > atomic_set(&ei->i_aiodio_unwritten, 0); > > + init_rwsem(&ei->truncate_lock); > > > > return &ei->vfs_inode; > > } > > Index: linux-2.6/fs/ext4/ext4.h > > ==================================================================> > --- linux-2.6.orig/fs/ext4/ext4.h 2011-06-20 14:25:33.752581461 +0200 > > +++ linux-2.6/fs/ext4/ext4.h 2011-06-20 14:27:06.272576777 +0200 > > @@ -844,6 +844,9 @@ struct ext4_inode_info { > > /* on-disk additional length */ > > __u16 i_extra_isize; > > > > + /* protect page_mkwrite against truncates */ > > + struct rw_semaphore truncate_lock; > > + > > #ifdef CONFIG_QUOTA > > /* quota space reservation, managed internally by quota code */ > > qsize_t i_reserved_quota; > > > > -- > > To unsubscribe from this list: send the line "unsubscribe linux-ext4" in > > the body of a message to majordomo@vger.kernel.org > > More majordomo info at http://vger.kernel.org/majordomo-info.html > > > >-- -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Tue, Jun 21, 2011 at 06:34:50PM +0200, Lukas Czerner wrote:> this looks like a bug fix, so we should rather do it in a separate > commit. Could you send it separately, or at least mention that in commit > description ?What looks like a bugfix? -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Mon 20-06-11 16:15:33, Christoph Hellwig wrote:> i_alloc_sem has always been a bit of an odd "lock". It''s the only remaining > rw_semaphore that can be released by a different thread than the one that > locked it, and it''s use case in the core direct I/O code is more like a > counter given that the writers already have external serialization. > > This series removes it in favour of a simpler counter scheme, thus getting > rid of the rw_semaphore non-owner APIs as requests by Thomas, while at the > same time shrinking the size of struct inode by 160 bytes on 64-bit systems. > > The only nasty bit is that two filesystems (fat and ext4) have started > abusing the lock for their own purposes. I''ve added a new rw_semaphoreext4 abuse should be gone when Ted merges my rewrite of ext4_page_mkwrite()... Ted, what happened to that patch. Should I resend it? Honza -- Jan Kara <jack@suse.cz> SUSE Labs, CR -- To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wed, Jun 22, 2011 at 01:54:25AM +0200, Jan Kara wrote:> ext4 abuse should be gone when Ted merges my rewrite of > ext4_page_mkwrite()... Ted, what happened to that patch. Should I resend > it?So how should we coordinate merging the two? -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wed, Jun 22, 2011 at 01:54:25AM +0200, Jan Kara wrote:> ext4_page_mkwrite()... Ted, what happened to that patch. Should I resend > it?So assuming I fix the refcounting issue in fs/ext4/page_io.c (which I will do not dropping the page''s refcount until after the workqueue finishes its job), does your patch need changing, or is it a matter of fixing the commit description? In any case, this dragged got out, and it''s late in -rc cycle for 3.0, so I was planning on queuing this for the next merge window. (Sorry for that, this was mostly due to my not having enough time to really dive into the issues involved.) - Ted -- To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wed 22-06-11 10:22:35, Ted Tso wrote:> On Wed, Jun 22, 2011 at 01:54:25AM +0200, Jan Kara wrote: > > ext4_page_mkwrite()... Ted, what happened to that patch. Should I resend > > it? > > So assuming I fix the refcounting issue in fs/ext4/page_io.c (which I > will do not dropping the page''s refcount until after the workqueue > finishes its job), does your patch need changing, or is it a matter of > fixing the commit description?Looking into patchwork (http://patchwork.ozlabs.org/patch/97924/) I see the discussion of two issues (page refcounting and page_mkwrite got mixed). I see the patch just needs fixing commit description because it''s no longer accurate after "stable page under writeback" changes went it. I''ll send you a patch with updated changelog in a minute.> In any case, this dragged got out, and it''s late in -rc cycle for 3.0, > so I was planning on queuing this for the next merge window. (Sorry > for that, this was mostly due to my not having enough time to really > dive into the issues involved.)No problem. Just we have to somehow coordinate with Christoph... Either he can avoid touching ext4 and merge his patch set after you merge my patch or he can take my patch instead of his ext4 change. Since my patch touches only ext4_page_mkwrite() there''s not a high risk of collision. The latter would seem easier for both of you to me but I don''t really care. Honza -- Jan Kara <jack@suse.cz> SUSE Labs, CR -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wed, Jun 22, 2011 at 08:13:42PM +0200, Jan Kara wrote:> No problem. Just we have to somehow coordinate with Christoph... Either > he can avoid touching ext4 and merge his patch set after you merge my patch > or he can take my patch instead of his ext4 change. Since my patch touches > only ext4_page_mkwrite() there''s not a high risk of collision. The latter > would seem easier for both of you to me but I don''t really care.For now I''ll just take your repost into the next version of my series. If you guys have any better idea for sorting out the merge issues I''m happy to reshuffle it again. -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Mon, Jun 20, 2011 at 06:18:57PM -0400, Christoph Hellwig wrote:> On Mon, Jun 20, 2011 at 02:32:03PM -0700, Joel Becker wrote: > > Are we guaranteed that all allocation changes are locked out by > > i_dio_count>0? I don''t think we are. The ocfs2 code very strongly > > assumes the state of a file''s allocation when it holds i_alloc_sem. I > > feel like we lose that here. > > You aren''t, neither with the old i_alloc_sem code, nor with the 1:1 > replacement using i_dio_count. > > Do a quick grep who gets i_alloc_sem exclusively (down_write): it''s > really just the truncate code, and it''s cut & paste duplicates in ntfs > and reiserfs.Sorry, I confused this with our ip_alloc_sem. I was tired. Joel -- Life''s Little Instruction Book #24 "Drink champagne for no reason at all." http://www.jlbec.org/ jlbec@evilplan.org -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html