i_alloc_sem has always been a bit of an odd "lock". It's the only remaining rw_semaphore that can be released by a different thread than the one that locked it, and its use case in the core direct I/O code is more like a counter, given that the writers already have external serialization.

This series removes it in favour of a simpler counter scheme, thus getting rid of the rw_semaphore non-owner APIs as requested by Thomas, while at the same time shrinking the size of struct inode by 160 bytes on 64-bit systems.

The only nasty bit is that two filesystems (fat and ext4) have started abusing the lock for their own purposes. I've added a new rw_semaphore to the fat inode structures to keep the current behaviour, and merged a patch from Jan Kara to remove the i_alloc_sem abuse from ext4.

Changes from v1:

 - update the fat patch description
 - replace my ext4 truncate_lock patch with Jan's rewrite of ext4_page_mkwrite
 - do not use wait_on_bit, but replace it with an opencoded hashed waitqueue
 - rename inode_dio_wake to inode_dio_done
 - add kerneldoc comments for inode_dio_wait and inode_dio_done
 - simplify the blockdev_direct_IO prototype
 - move the i_dio_count decrement into the ->end_io handler if present to make i_dio_count useful for filesystems delaying AIO completion
 - reorder the patch series

Patches 1 to 5 are the meat, the rest is additional tidyups in that area required for future improvements.

--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Add a new rw_semaphore to protect bmap against truncate. Previously i_alloc_sem was abused for this, but it's going away in this series.

Note that we can't simply use i_mutex, given that the swapon code calls ->bmap under it.

Signed-off-by: Christoph Hellwig <hch@lst.de>

Index: linux-2.6/fs/fat/inode.c
===================================================================
--- linux-2.6.orig/fs/fat/inode.c	2011-06-20 21:28:19.707963855 +0200
+++ linux-2.6/fs/fat/inode.c	2011-06-20 21:29:25.031293882 +0200
@@ -224,9 +224,9 @@ static sector_t _fat_bmap(struct address
 	sector_t blocknr;
 
 	/* fat_get_cluster() assumes the requested blocknr isn't truncated. */
-	down_read(&mapping->host->i_alloc_sem);
+	down_read(&MSDOS_I(mapping->host)->truncate_lock);
 	blocknr = generic_block_bmap(mapping, block, fat_get_block);
-	up_read(&mapping->host->i_alloc_sem);
+	up_read(&MSDOS_I(mapping->host)->truncate_lock);
 
 	return blocknr;
 }
@@ -510,6 +510,8 @@ static struct inode *fat_alloc_inode(str
 	ei = kmem_cache_alloc(fat_inode_cachep, GFP_NOFS);
 	if (!ei)
 		return NULL;
+
+	init_rwsem(&ei->truncate_lock);
 	return &ei->vfs_inode;
 }
Index: linux-2.6/fs/fat/fat.h
===================================================================
--- linux-2.6.orig/fs/fat/fat.h	2011-06-20 21:28:19.724630522 +0200
+++ linux-2.6/fs/fat/fat.h	2011-06-20 21:29:25.034627215 +0200
@@ -109,6 +109,7 @@ struct msdos_inode_info {
 	int i_attrs;		/* unused attribute bits */
 	loff_t i_pos;		/* on-disk position of directory entry or 0 */
 	struct hlist_node i_fat_hash;	/* hash by i_location */
+	struct rw_semaphore truncate_lock; /* protect bmap against truncate */
 	struct inode vfs_inode;
 };
Index: linux-2.6/fs/fat/file.c
===================================================================
--- linux-2.6.orig/fs/fat/file.c	2011-06-20 21:28:19.744630521 +0200
+++ linux-2.6/fs/fat/file.c	2011-06-20 21:29:54.501292390 +0200
@@ -429,8 +429,10 @@ int fat_setattr(struct dentry *dentry, s
 	}
 
 	if (attr->ia_valid & ATTR_SIZE) {
+		down_write(&MSDOS_I(inode)->truncate_lock);
 		truncate_setsize(inode, attr->ia_size);
 		fat_truncate_blocks(inode, attr->ia_size);
+		up_write(&MSDOS_I(inode)->truncate_lock);
 	}
 
 	setattr_copy(inode, attr);
Christoph Hellwig
2011-Jun-24 18:29 UTC
[PATCH 2/9] ext4: Rewrite ext4_page_mkwrite() to use generic helpers
From: Jan Kara <jack@suse.cz> Rewrite ext4_page_mkwrite() to use __block_page_mkwrite() helper. This removes the need of using i_alloc_sem to avoid races with truncate which seems to be the wrong locking order according to lock ordering documented in mm/rmap.c. Also calling ext4_da_write_begin() as used by the old code seems to be problematic because we can decide to flush delay-allocated blocks which will acquire s_umount semaphore - again creating unpleasant lock dependency if not directly a deadlock. Also add a check for frozen filesystem so that we don''t busyloop in page fault when the filesystem is frozen. Signed-off-by: Jan Kara <jack@suse.cz> --- fs/ext4/inode.c | 106 ++++++++++++++++++++++++++++-------------------------- 1 files changed, 55 insertions(+), 51 deletions(-) diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c index e3126c0..bd30976 100644 --- a/fs/ext4/inode.c +++ b/fs/ext4/inode.c @@ -5843,80 +5843,84 @@ int ext4_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf) struct page *page = vmf->page; loff_t size; unsigned long len; - int ret = -EINVAL; - void *fsdata; + int ret; struct file *file = vma->vm_file; struct inode *inode = file->f_path.dentry->d_inode; struct address_space *mapping = inode->i_mapping; + handle_t *handle; + get_block_t *get_block; + int retries = 0; /* - * Get i_alloc_sem to stop truncates messing with the inode. We cannot - * get i_mutex because we are already holding mmap_sem. + * This check is racy but catches the common case. We rely on + * __block_page_mkwrite() to do a reliable check. */ - down_read(&inode->i_alloc_sem); - size = i_size_read(inode); - if (page->mapping != mapping || size <= page_offset(page) - || !PageUptodate(page)) { - /* page got truncated from under us? */ - goto out_unlock; + vfs_check_frozen(inode->i_sb, SB_FREEZE_WRITE); + /* Delalloc case is easy... 
*/ + if (test_opt(inode->i_sb, DELALLOC) && + !ext4_should_journal_data(inode) && + !ext4_nonda_switch(inode->i_sb)) { + do { + ret = __block_page_mkwrite(vma, vmf, + ext4_da_get_block_prep); + } while (ret == -ENOSPC && + ext4_should_retry_alloc(inode->i_sb, &retries)); + goto out_ret; } - ret = 0; lock_page(page); - wait_on_page_writeback(page); - if (PageMappedToDisk(page)) { - up_read(&inode->i_alloc_sem); - return VM_FAULT_LOCKED; + size = i_size_read(inode); + /* Page got truncated from under us? */ + if (page->mapping != mapping || page_offset(page) > size) { + unlock_page(page); + ret = VM_FAULT_NOPAGE; + goto out; } if (page->index == size >> PAGE_CACHE_SHIFT) len = size & ~PAGE_CACHE_MASK; else len = PAGE_CACHE_SIZE; - /* - * return if we have all the buffers mapped. This avoid - * the need to call write_begin/write_end which does a - * journal_start/journal_stop which can block and take - * long time + * Return if we have all the buffers mapped. This avoids the need to do + * journal_start/journal_stop which can block and take a long time */ if (page_has_buffers(page)) { if (!walk_page_buffers(NULL, page_buffers(page), 0, len, NULL, ext4_bh_unmapped)) { - up_read(&inode->i_alloc_sem); - return VM_FAULT_LOCKED; + /* Wait so that we don''t change page under IO */ + wait_on_page_writeback(page); + ret = VM_FAULT_LOCKED; + goto out; } } unlock_page(page); - /* - * OK, we need to fill the hole... Do write_begin write_end - * to do block allocation/reservation.We are not holding - * inode.i__mutex here. That allow * parallel write_begin, - * write_end call. 
lock_page prevent this from happening - * on the same page though - */ - ret = mapping->a_ops->write_begin(file, mapping, page_offset(page), - len, AOP_FLAG_UNINTERRUPTIBLE, &page, &fsdata); - if (ret < 0) - goto out_unlock; - ret = mapping->a_ops->write_end(file, mapping, page_offset(page), - len, len, page, fsdata); - if (ret < 0) - goto out_unlock; - ret = 0; - - /* - * write_begin/end might have created a dirty page and someone - * could wander in and start the IO. Make sure that hasn't - * happened. - */ - lock_page(page); - wait_on_page_writeback(page); - up_read(&inode->i_alloc_sem); - return VM_FAULT_LOCKED; -out_unlock: - if (ret) + /* OK, we need to fill the hole... */ + if (ext4_should_dioread_nolock(inode)) + get_block = ext4_get_block_write; + else + get_block = ext4_get_block; +retry_alloc: + handle = ext4_journal_start(inode, ext4_writepage_trans_blocks(inode)); + if (IS_ERR(handle)) { ret = VM_FAULT_SIGBUS; - up_read(&inode->i_alloc_sem); + goto out; + } + ret = __block_page_mkwrite(vma, vmf, get_block); + if (!ret && ext4_should_journal_data(inode)) { + if (walk_page_buffers(handle, page_buffers(page), 0, + PAGE_CACHE_SIZE, NULL, do_journal_get_write_access)) { + unlock_page(page); + ret = VM_FAULT_SIGBUS; + goto out; + } + ext4_set_inode_state(inode, EXT4_STATE_JDATA); + } + ext4_journal_stop(handle); + if (ret == -ENOSPC && ext4_should_retry_alloc(inode->i_sb, &retries)) + goto retry_alloc; +out_ret: + ret = block_page_mkwrite_return(ret); +out: return ret; } -- 1.7.1
Christoph Hellwig
2011-Jun-24 18:29 UTC
[PATCH 3/9] fs: simplify handling of zero sized reads in __blockdev_direct_IO
Reject zero sized reads as soon as we know our I/O length, and don't bother with locks or allocations that might have to be cleaned up otherwise.

Signed-off-by: Christoph Hellwig <hch@lst.de>

Index: linux-2.6/fs/direct-io.c
===================================================================
--- linux-2.6.orig/fs/direct-io.c	2011-06-24 14:30:22.488402525 +0200
+++ linux-2.6/fs/direct-io.c	2011-06-24 15:13:16.711605526 +0200
@@ -1200,6 +1200,10 @@ __blockdev_direct_IO(int rw, struct kioc
 		}
 	}
 
+	/* watch out for a 0 len io from a tricksy fs */
+	if (rw == READ && end == offset)
+		return 0;
+
 	dio = kmalloc(sizeof(*dio), GFP_KERNEL);
 	retval = -ENOMEM;
 	if (!dio)
@@ -1213,8 +1217,7 @@ __blockdev_direct_IO(int rw, struct kioc
 
 	dio->flags = flags;
 	if (dio->flags & DIO_LOCKING) {
-		/* watch out for a 0 len io from a tricksy fs */
-		if (rw == READ && end > offset) {
+		if (rw == READ) {
 			struct address_space *mapping =
 					iocb->ki_filp->f_mapping;
 
i_alloc_sem is a rather special rw_semaphore. It's the last one that may be released by a non-owner, and its write side is always mirrored by real exclusion. Its intended use is to wait for all pending direct I/O requests to finish before starting a truncate.

Replace it with a hand-grown construct:

 - exclusion for truncates is already guaranteed by i_mutex, so it can simply fall away
 - the reader side is replaced by an i_dio_count member in struct inode that counts the number of pending direct I/O requests. Truncate can't proceed as long as it's non-zero
 - when i_dio_count reaches zero we wake up a pending truncate using wake_up_bit on a new bit in i_state
 - new references to i_dio_count can't appear while we are waiting for it to reach zero because incrementing the direct I/O count always needs i_mutex (or an equivalent like XFS's i_iolock) when starting a new operation.

This scheme is much simpler, and saves the space of a spinlock_t and a struct list_head in struct inode (typically 160 bytes on a non-debug 64-bit system).
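The counting scheme described above can be modelled in userspace roughly as follows. This is only a sketch under stated assumptions: struct model_inode and the model_* functions are hypothetical names, C11 atomics stand in for atomic_t, and a pthread condition variable stands in for the kernel's hashed bit waitqueue. It is not the code from this patch.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>

/* Hypothetical userspace stand-in for the relevant part of struct inode. */
struct model_inode {
	atomic_int dio_count;		/* models i_dio_count */
	pthread_mutex_t wait_lock;	/* protects waiting on wait_cond */
	pthread_cond_t wait_cond;	/* models the bit waitqueue */
};

static void model_inode_init(struct model_inode *inode)
{
	atomic_init(&inode->dio_count, 0);
	pthread_mutex_init(&inode->wait_lock, NULL);
	pthread_cond_init(&inode->wait_cond, NULL);
}

/*
 * Start of a direct I/O request. The caller is assumed to hold whatever
 * lock excludes truncate (i_mutex in the kernel), so no new reference
 * can appear while a truncate is waiting.
 */
static void model_dio_begin(struct model_inode *inode)
{
	atomic_fetch_add(&inode->dio_count, 1);
}

/* I/O completion; may run in a different thread than model_dio_begin(). */
static void model_inode_dio_done(struct model_inode *inode)
{
	/* atomic_fetch_sub returns the old value: 1 means we hit zero. */
	if (atomic_fetch_sub(&inode->dio_count, 1) == 1) {
		pthread_mutex_lock(&inode->wait_lock);
		pthread_cond_broadcast(&inode->wait_cond);
		pthread_mutex_unlock(&inode->wait_lock);
	}
}

/* Truncate side: block until every outstanding request has completed. */
static void model_inode_dio_wait(struct model_inode *inode)
{
	pthread_mutex_lock(&inode->wait_lock);
	while (atomic_load(&inode->dio_count) != 0)
		pthread_cond_wait(&inode->wait_cond, &inode->wait_lock);
	pthread_mutex_unlock(&inode->wait_lock);
}
```

Because the completion side only decrements a counter, it has no notion of lock ownership, which is exactly what the non-owner rw_semaphore calls had to work around. The waiter rechecks the counter under wait_lock, so a completion that races with the check cannot be missed.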
Signed-off-by: Christoph Hellwig <hch@lst.de> Index: linux-2.6/fs/direct-io.c ==================================================================--- linux-2.6.orig/fs/direct-io.c 2011-06-24 15:13:16.711605526 +0200 +++ linux-2.6/fs/direct-io.c 2011-06-24 15:18:33.021589512 +0200 @@ -135,6 +135,50 @@ struct dio { struct page *pages[DIO_PAGES]; /* page buffer */ }; +static void __inode_dio_wait(struct inode *inode) +{ + wait_queue_head_t *wq = bit_waitqueue(&inode->i_state, __I_DIO_WAKEUP); + DEFINE_WAIT_BIT(q, &inode->i_state, __I_DIO_WAKEUP); + + do { + prepare_to_wait(wq, &q.wait, TASK_UNINTERRUPTIBLE); + if (atomic_read(&inode->i_dio_count)) + schedule(); + } while (atomic_read(&inode->i_dio_count)); + finish_wait(wq, &q.wait); +} + +/** + * inode_dio_wait - wait for outstanding DIO requests to finish + * @inode: inode to wait for + * + * Waits for all pending direct I/O requests to finish so that we can + * proceed with a truncate or equivalent operation. + * + * Must be called under a lock that serializes taking new references + * to i_dio_count, usually by inode->i_mutex. + */ +void inode_dio_wait(struct inode *inode) +{ + if (atomic_read(&inode->i_dio_count)) + __inode_dio_wait(inode); +} +EXPORT_SYMBOL_GPL(inode_dio_wait); + +/* + * inode_dio_done - signal finish of a direct I/O requests + * @inode: inode the direct I/O happens on + * + * This is called once we''ve finished processing a direct I/O request, + * and is used to wake up callers waiting for direct I/O to be quiesced. + */ +void inode_dio_done(struct inode *inode) +{ + if (atomic_dec_and_test(&inode->i_dio_count)) + wake_up_bit(&inode->i_state, __I_DIO_WAKEUP); +} +EXPORT_SYMBOL_GPL(inode_dio_done); + /* * How many pages are in the queue? 
*/ @@ -254,9 +298,7 @@ static ssize_t dio_complete(struct dio * } if (dio->flags & DIO_LOCKING) - /* lockdep: non-owner release */ - up_read_non_owner(&dio->inode->i_alloc_sem); - + inode_dio_done(dio->inode); return ret; } @@ -980,9 +1022,6 @@ out: return ret; } -/* - * Releases both i_mutex and i_alloc_sem - */ static ssize_t direct_io_worker(int rw, struct kiocb *iocb, struct inode *inode, const struct iovec *iov, loff_t offset, unsigned long nr_segs, @@ -1146,15 +1185,14 @@ direct_io_worker(int rw, struct kiocb *i * For writes this function is called under i_mutex and returns with * i_mutex held, for reads, i_mutex is not held on entry, but it is * taken and dropped again before returning. - * For reads and writes i_alloc_sem is taken in shared mode and released - * on I/O completion (which may happen asynchronously after returning to - * the caller). + * The i_dio_count counter keeps track of the number of outstanding + * direct I/O requests, and truncate waits for it to reach zero. + * New references to i_dio_count must only be grabbed with i_mutex + * held. * * - if the flags value does NOT contain DIO_LOCKING we don''t use any * internal locking but rather rely on the filesystem to synchronize * direct I/O reads/writes versus each other and truncate. - * For reads and writes both i_mutex and i_alloc_sem are not held on - * entry and are never taken. */ ssize_t __blockdev_direct_IO(int rw, struct kiocb *iocb, struct inode *inode, @@ -1234,10 +1272,9 @@ __blockdev_direct_IO(int rw, struct kioc } /* - * Will be released at I/O completion, possibly in a - * different thread. + * Will be decremented at I/O completion time. 
*/ - down_read_non_owner(&inode->i_alloc_sem); + atomic_inc(&inode->i_dio_count); } /* Index: linux-2.6/mm/filemap.c ==================================================================--- linux-2.6.orig/mm/filemap.c 2011-06-24 15:13:16.804938855 +0200 +++ linux-2.6/mm/filemap.c 2011-06-24 15:14:56.364933813 +0200 @@ -78,9 +78,6 @@ * ->i_mutex (generic_file_buffered_write) * ->mmap_sem (fault_in_pages_readable->do_page_fault) * - * ->i_mutex - * ->i_alloc_sem (various) - * * inode_wb_list_lock * sb_lock (fs/fs-writeback.c) * ->mapping->tree_lock (__sync_single_inode) Index: linux-2.6/mm/rmap.c ==================================================================--- linux-2.6.orig/mm/rmap.c 2011-06-24 15:13:16.818272187 +0200 +++ linux-2.6/mm/rmap.c 2011-06-24 15:14:56.368267154 +0200 @@ -21,7 +21,6 @@ * Lock ordering in mm: * * inode->i_mutex (while writing or truncating, not reading or faulting) - * inode->i_alloc_sem (vmtruncate_range) * mm->mmap_sem * page->flags PG_locked (lock_page) * mapping->i_mmap_mutex Index: linux-2.6/fs/attr.c ==================================================================--- linux-2.6.orig/fs/attr.c 2011-06-24 15:13:16.721605526 +0200 +++ linux-2.6/fs/attr.c 2011-06-24 15:14:56.368267154 +0200 @@ -233,16 +233,13 @@ int notify_change(struct dentry * dentry return error; if (ia_valid & ATTR_SIZE) - down_write(&dentry->d_inode->i_alloc_sem); + inode_dio_wait(inode); if (inode->i_op->setattr) error = inode->i_op->setattr(dentry, attr); else error = simple_setattr(dentry, attr); - if (ia_valid & ATTR_SIZE) - up_write(&dentry->d_inode->i_alloc_sem); - if (!error) fsnotify_change(dentry, ia_valid); Index: linux-2.6/fs/ntfs/file.c ==================================================================--- linux-2.6.orig/fs/ntfs/file.c 2011-06-24 15:13:16.734938859 +0200 +++ linux-2.6/fs/ntfs/file.c 2011-06-24 15:14:56.371600489 +0200 @@ -1832,9 +1832,8 @@ static ssize_t ntfs_file_buffered_write( * fails again. 
*/ if (unlikely(NInoTruncateFailed(ni))) { - down_write(&vi->i_alloc_sem); + inode_dio_wait(vi); err = ntfs_truncate(vi); - up_write(&vi->i_alloc_sem); if (err || NInoTruncateFailed(ni)) { if (!err) err = -EIO; Index: linux-2.6/fs/reiserfs/xattr.c ==================================================================--- linux-2.6.orig/fs/reiserfs/xattr.c 2011-06-24 15:13:16.758272190 +0200 +++ linux-2.6/fs/reiserfs/xattr.c 2011-06-24 15:14:56.374933821 +0200 @@ -555,11 +555,10 @@ reiserfs_xattr_set_handle(struct reiserf reiserfs_write_unlock(inode->i_sb); mutex_lock_nested(&dentry->d_inode->i_mutex, I_MUTEX_XATTR); - down_write(&dentry->d_inode->i_alloc_sem); + inode_dio_wait(dentry->d_inode); reiserfs_write_lock(inode->i_sb); err = reiserfs_setattr(dentry, &newattrs); - up_write(&dentry->d_inode->i_alloc_sem); mutex_unlock(&dentry->d_inode->i_mutex); } else update_ctime(inode); Index: linux-2.6/include/linux/fs.h ==================================================================--- linux-2.6.orig/include/linux/fs.h 2011-06-24 15:13:16.858272186 +0200 +++ linux-2.6/include/linux/fs.h 2011-06-24 15:14:56.378267151 +0200 @@ -776,7 +776,7 @@ struct inode { struct timespec i_ctime; blkcnt_t i_blocks; unsigned short i_bytes; - struct rw_semaphore i_alloc_sem; + atomic_t i_dio_count; const struct file_operations *i_fop; /* former ->i_op->default_file_ops */ struct file_lock *i_flock; struct address_space *i_mapping; @@ -1692,6 +1692,10 @@ struct super_operations { * set during data writeback, and cleared with a wakeup * on the bit address once it is done. * + * I_REFERENCED Marks the inode as recently referenced on the LRU list. + * + * I_DIO_WAKEUP Never set. Only used as a key for wait_on_bit(). + * * Q: What is the difference between I_WILL_FREE and I_FREEING?
*/ #define I_DIRTY_SYNC (1 << 0) @@ -1705,6 +1709,8 @@ struct super_operations { #define __I_SYNC 7 #define I_SYNC (1 << __I_SYNC) #define I_REFERENCED (1 << 8) +#define __I_DIO_WAKEUP 9 +#define I_DIO_WAKEUP (1 << __I_DIO_WAKEUP) #define I_DIRTY (I_DIRTY_SYNC | I_DIRTY_DATASYNC | I_DIRTY_PAGES) @@ -1815,7 +1821,6 @@ struct file_system_type { struct lock_class_key i_lock_key; struct lock_class_key i_mutex_key; struct lock_class_key i_mutex_dir_key; - struct lock_class_key i_alloc_sem_key; }; extern struct dentry *mount_ns(struct file_system_type *fs_type, int flags, @@ -2367,6 +2372,8 @@ enum { }; void dio_end_io(struct bio *bio, int error); +void inode_dio_wait(struct inode *inode); +void inode_dio_done(struct inode *inode); ssize_t __blockdev_direct_IO(int rw, struct kiocb *iocb, struct inode *inode, struct block_device *bdev, const struct iovec *iov, loff_t offset, Index: linux-2.6/mm/memory.c ==================================================================--- linux-2.6.orig/mm/memory.c 2011-06-24 15:13:16.828272188 +0200 +++ linux-2.6/mm/memory.c 2011-06-24 15:14:56.381600482 +0200 @@ -2811,12 +2811,11 @@ int vmtruncate_range(struct inode *inode return -ENOSYS; mutex_lock(&inode->i_mutex); - down_write(&inode->i_alloc_sem); + inode_dio_wait(inode); unmap_mapping_range(mapping, offset, (end - offset), 1); truncate_inode_pages_range(mapping, offset, end); unmap_mapping_range(mapping, offset, (end - offset), 1); inode->i_op->truncate_range(inode, offset, end); - up_write(&inode->i_alloc_sem); mutex_unlock(&inode->i_mutex); return 0; Index: linux-2.6/fs/inode.c ==================================================================--- linux-2.6.orig/fs/inode.c 2011-06-24 15:13:16.771605525 +0200 +++ linux-2.6/fs/inode.c 2011-06-24 15:14:56.381600482 +0200 @@ -176,8 +176,7 @@ int inode_init_always(struct super_block mutex_init(&inode->i_mutex); lockdep_set_class(&inode->i_mutex, &sb->s_type->i_mutex_key); - init_rwsem(&inode->i_alloc_sem); -
lockdep_set_class(&inode->i_alloc_sem, &sb->s_type->i_alloc_sem_key); + atomic_set(&inode->i_dio_count, 0); mapping->a_ops = &empty_aops; mapping->host = inode; Index: linux-2.6/fs/ntfs/inode.c ==================================================================--- linux-2.6.orig/fs/ntfs/inode.c 2011-06-24 15:13:16.744938858 +0200 +++ linux-2.6/fs/ntfs/inode.c 2011-06-24 15:14:56.384933812 +0200 @@ -2357,12 +2357,7 @@ static const char *es = " Leaving incon * * Returns 0 on success or -errno on error. * - * Called with ->i_mutex held. In all but one case ->i_alloc_sem is held for - * writing. The only case in the kernel where ->i_alloc_sem is not held is - * mm/filemap.c::generic_file_buffered_write() where vmtruncate() is called - * with the current i_size as the offset. The analogous place in NTFS is in - * fs/ntfs/file.c::ntfs_file_buffered_write() where we call vmtruncate() again - * without holding ->i_alloc_sem. + * Called with ->i_mutex held. */ int ntfs_truncate(struct inode *vi) { @@ -2887,8 +2882,7 @@ void ntfs_truncate_vfs(struct inode *vi) * We also abort all changes of user, group, and mode as we do not implement * the NTFS ACLs yet. * - * Called with ->i_mutex held. For the ATTR_SIZE (i.e. ->truncate) case, also - * called with ->i_alloc_sem held for writing. + * Called with ->i_mutex held. */ int ntfs_setattr(struct dentry *dentry, struct iattr *attr) { Index: linux-2.6/fs/ocfs2/aops.c ==================================================================--- linux-2.6.orig/fs/ocfs2/aops.c 2011-06-24 15:13:16.781605524 +0200 +++ linux-2.6/fs/ocfs2/aops.c 2011-06-24 15:14:56.388267143 +0200 @@ -551,9 +551,8 @@ bail: /* * ocfs2_dio_end_io is called by the dio core when a dio is finished. We''re - * particularly interested in the aio/dio case. Like the core uses - * i_alloc_sem, we use the rw_lock DLM lock to protect io on one node from - * truncation on another. + * particularly interested in the aio/dio case. 
We use the rw_lock DLM lock + * to protect io on one node from truncation on another. */ static void ocfs2_dio_end_io(struct kiocb *iocb, loff_t offset, @@ -569,7 +568,7 @@ static void ocfs2_dio_end_io(struct kioc BUG_ON(!ocfs2_iocb_is_rw_locked(iocb)); if (ocfs2_iocb_is_sem_locked(iocb)) { - up_read(&inode->i_alloc_sem); + inode_dio_done(inode); ocfs2_iocb_clear_sem_locked(iocb); } Index: linux-2.6/fs/ocfs2/file.c ==================================================================--- linux-2.6.orig/fs/ocfs2/file.c 2011-06-24 15:13:16.794938856 +0200 +++ linux-2.6/fs/ocfs2/file.c 2011-06-24 15:14:56.391600477 +0200 @@ -2236,9 +2236,9 @@ static ssize_t ocfs2_file_aio_write(stru ocfs2_iocb_clear_sem_locked(iocb); relock: - /* to match setattr''s i_mutex -> i_alloc_sem -> rw_lock ordering */ + /* to match setattr''s i_mutex -> rw_lock ordering */ if (direct_io) { - down_read(&inode->i_alloc_sem); + atomic_inc(&inode->i_dio_count); have_alloc_sem = 1; /* communicate with ocfs2_dio_end_io */ ocfs2_iocb_set_sem_locked(iocb); @@ -2290,7 +2290,7 @@ relock: */ if (direct_io && !can_do_direct) { ocfs2_rw_unlock(inode, rw_level); - up_read(&inode->i_alloc_sem); + inode_dio_done(inode); have_alloc_sem = 0; rw_level = -1; @@ -2361,8 +2361,7 @@ out_dio: /* * deep in g_f_a_w_n()->ocfs2_direct_IO we pass in a ocfs2_dio_end_io * function pointer which is called when o_direct io completes so that - * it can unlock our rw lock. (it''s the clustered equivalent of - * i_alloc_sem; protects truncate from racing with pending ios). + * it can unlock our rw lock. * Unfortunately there are error cases which call end_io and others * that don''t. 
so we don't have to unlock the rw_lock if either an * async dio is going to do it in the future or an end_io after an @@ -2379,7 +2378,7 @@ out: out_sems: if (have_alloc_sem) { - up_read(&inode->i_alloc_sem); + inode_dio_done(inode); ocfs2_iocb_clear_sem_locked(iocb); } @@ -2531,8 +2530,8 @@ static ssize_t ocfs2_file_aio_read(struc * need locks to protect pending reads from racing with truncate. */ if (filp->f_flags & O_DIRECT) { - down_read(&inode->i_alloc_sem); have_alloc_sem = 1; + atomic_inc(&inode->i_dio_count); ocfs2_iocb_set_sem_locked(iocb); ret = ocfs2_rw_lock(inode, 0); @@ -2575,7 +2574,7 @@ static ssize_t ocfs2_file_aio_read(struc bail: if (have_alloc_sem) { - up_read(&inode->i_alloc_sem); + inode_dio_done(inode); ocfs2_iocb_clear_sem_locked(iocb); } if (rw_level != -1) Index: linux-2.6/mm/madvise.c ==================================================================--- linux-2.6.orig/mm/madvise.c 2011-06-24 15:13:16.841605521 +0200 +++ linux-2.6/mm/madvise.c 2011-06-24 15:14:56.394933812 +0200 @@ -218,7 +218,7 @@ static long madvise_remove(struct vm_are endoff = (loff_t)(end - vma->vm_start - 1) + ((loff_t)vma->vm_pgoff << PAGE_SHIFT); - /* vmtruncate_range needs to take i_mutex and i_alloc_sem */ + /* vmtruncate_range needs to take i_mutex */ up_read(&current->mm->mmap_sem); error = vmtruncate_range(mapping->host, offset, endoff); down_read(&current->mm->mmap_sem);
Christoph Hellwig
2011-Jun-24 18:29 UTC
[PATCH 5/9] rw_semaphore: remove up/down_read_non_owner
Now that the last user is gone these can be removed.

Signed-off-by: Christoph Hellwig <hch@lst.de>

Index: linux-2.6/include/linux/rwsem.h
===================================================================
--- linux-2.6.orig/include/linux/rwsem.h	2011-06-24 14:30:21.571735905 +0200
+++ linux-2.6/include/linux/rwsem.h	2011-06-24 15:02:14.854972359 +0200
@@ -124,19 +124,9 @@ extern void downgrade_write(struct rw_se
  */
 extern void down_read_nested(struct rw_semaphore *sem, int subclass);
 extern void down_write_nested(struct rw_semaphore *sem, int subclass);
-/*
- * Take/release a lock when not the owner will release it.
- *
- * [ This API should be avoided as much as possible - the
- *   proper abstraction for this case is completions. ]
- */
-extern void down_read_non_owner(struct rw_semaphore *sem);
-extern void up_read_non_owner(struct rw_semaphore *sem);
 #else
 # define down_read_nested(sem, subclass)	down_read(sem)
 # define down_write_nested(sem, subclass)	down_write(sem)
-# define down_read_non_owner(sem)	down_read(sem)
-# define up_read_non_owner(sem)	up_read(sem)
 #endif
 
 #endif /* _LINUX_RWSEM_H */
Index: linux-2.6/kernel/rwsem.c
===================================================================
--- linux-2.6.orig/kernel/rwsem.c	2011-06-24 14:30:21.588402571 +0200
+++ linux-2.6/kernel/rwsem.c	2011-06-24 15:02:14.854972359 +0200
@@ -117,15 +117,6 @@ void down_read_nested(struct rw_semaphor
 
 EXPORT_SYMBOL(down_read_nested);
 
-void down_read_non_owner(struct rw_semaphore *sem)
-{
-	might_sleep();
-
-	__down_read(sem);
-}
-
-EXPORT_SYMBOL(down_read_non_owner);
-
 void down_write_nested(struct rw_semaphore *sem, int subclass)
 {
 	might_sleep();
@@ -136,13 +127,6 @@ void down_write_nested(struct rw_semapho
 
 EXPORT_SYMBOL(down_write_nested);
 
-void up_read_non_owner(struct rw_semaphore *sem)
-{
-	__up_read(sem);
-}
-
-EXPORT_SYMBOL(up_read_non_owner);
-
 #endif
 
Christoph Hellwig
2011-Jun-24 18:29 UTC
[PATCH 6/9] fs: move inode_dio_wait calls into ->setattr
Let filesystems handle waiting for direct I/O requests themselves instead of doing it beforehand. This means filesystem-specific locks to prevent new dio references from appearing can be held. This is important to allow generalizing i_dio_count to non-DIO_LOCKING filesystems. Signed-off-by: Christoph Hellwig <hch@lst.de> Index: linux-2.6/fs/ocfs2/file.c ==================================================================--- linux-2.6.orig/fs/ocfs2/file.c 2011-06-20 09:28:54.516815966 +0200 +++ linux-2.6/fs/ocfs2/file.c 2011-06-20 09:31:34.706807855 +0200 @@ -1142,6 +1142,8 @@ int ocfs2_setattr(struct dentry *dentry, if (status) goto bail_unlock; + inode_dio_wait(inode); + if (i_size_read(inode) > attr->ia_size) { if (ocfs2_should_order_data(inode)) { status = ocfs2_begin_ordered_truncate(inode, Index: linux-2.6/fs/attr.c ==================================================================--- linux-2.6.orig/fs/attr.c 2011-06-20 09:28:54.490149300 +0200 +++ linux-2.6/fs/attr.c 2011-06-20 09:29:06.000000000 +0200 @@ -232,9 +232,6 @@ int notify_change(struct dentry * dentry if (error) return error; - if (ia_valid & ATTR_SIZE) - inode_dio_wait(inode); - if (inode->i_op->setattr) error = inode->i_op->setattr(dentry, attr); else Index: linux-2.6/fs/ext2/inode.c ==================================================================--- linux-2.6.orig/fs/ext2/inode.c 2011-06-18 12:54:28.058273680 +0200 +++ linux-2.6/fs/ext2/inode.c 2011-06-20 09:29:06.500148692 +0200 @@ -1184,6 +1184,8 @@ static int ext2_setsize(struct inode *in if (IS_APPEND(inode) || IS_IMMUTABLE(inode)) return -EPERM; + inode_dio_wait(inode); + if (mapping_is_xip(inode->i_mapping)) error = xip_truncate_page(inode->i_mapping, newsize); else if (test_opt(inode->i_sb, NOBH)) Index: linux-2.6/fs/ext3/inode.c ==================================================================--- linux-2.6.orig/fs/ext3/inode.c 2011-06-18 12:54:28.071607014 +0200 +++ linux-2.6/fs/ext3/inode.c 2011-06-20 09:29:06.500148692 +0200 @@ -3216,6
+3216,9 @@ int ext3_setattr(struct dentry *dentry,
 		ext3_journal_stop(handle);
 	}
 
+	if (attr->ia_valid & ATTR_SIZE)
+		inode_dio_wait(inode);
+
 	if (S_ISREG(inode->i_mode) &&
 	    attr->ia_valid & ATTR_SIZE && attr->ia_size < inode->i_size) {
 		handle_t *handle;
Index: linux-2.6/fs/ext4/inode.c
==================================================================
--- linux-2.6.orig/fs/ext4/inode.c	2011-06-20 09:28:54.506815967 +0200
+++ linux-2.6/fs/ext4/inode.c	2011-06-20 09:29:06.000000000 +0200
@@ -5351,6 +5351,8 @@ int ext4_setattr(struct dentry *dentry,
 	}
 
 	if (attr->ia_valid & ATTR_SIZE) {
+		inode_dio_wait(inode);
+
 		if (!(ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS))) {
 			struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb);
 
Index: linux-2.6/fs/fat/file.c
==================================================================
--- linux-2.6.orig/fs/fat/file.c	2011-06-18 12:54:28.118273678 +0200
+++ linux-2.6/fs/fat/file.c	2011-06-20 09:29:06.000000000 +0200
@@ -397,6 +397,8 @@ int fat_setattr(struct dentry *dentry, s
 	 * sequence.
 	 */
 	if (attr->ia_valid & ATTR_SIZE) {
+		inode_dio_wait(inode);
+
 		if (attr->ia_size > inode->i_size) {
 			error = fat_cont_expand(inode, attr->ia_size);
 			if (error || attr->ia_valid == ATTR_SIZE)
Index: linux-2.6/fs/gfs2/bmap.c
==================================================================
--- linux-2.6.orig/fs/gfs2/bmap.c	2011-06-18 12:54:28.141607009 +0200
+++ linux-2.6/fs/gfs2/bmap.c	2011-06-20 09:29:06.510148693 +0200
@@ -1224,6 +1224,8 @@ int gfs2_setattr_size(struct inode *inod
 	if (ret)
 		return ret;
 
+	inode_dio_wait(inode);
+
 	oldsize = inode->i_size;
 	if (newsize >= oldsize)
 		return do_grow(inode, newsize);
Index: linux-2.6/fs/hfs/inode.c
==================================================================
--- linux-2.6.orig/fs/hfs/inode.c	2011-06-18 12:54:28.154940342 +0200
+++ linux-2.6/fs/hfs/inode.c	2011-06-20 09:29:06.000000000 +0200
@@ -615,6 +615,8 @@ int hfs_inode_setattr(struct dentry *den
 
 	if ((attr->ia_valid & ATTR_SIZE) &&
 	    attr->ia_size != i_size_read(inode)) {
+		inode_dio_wait(inode);
+
 		error = vmtruncate(inode, attr->ia_size);
 		if (error)
 			return error;
Index: linux-2.6/fs/hfsplus/inode.c
==================================================================
--- linux-2.6.orig/fs/hfsplus/inode.c	2011-06-18 12:54:28.168273676 +0200
+++ linux-2.6/fs/hfsplus/inode.c	2011-06-20 09:29:06.000000000 +0200
@@ -296,6 +296,8 @@ static int hfsplus_setattr(struct dentry
 
 	if ((attr->ia_valid & ATTR_SIZE) &&
 	    attr->ia_size != i_size_read(inode)) {
+		inode_dio_wait(inode);
+
 		error = vmtruncate(inode, attr->ia_size);
 		if (error)
 			return error;
Index: linux-2.6/fs/jfs/file.c
==================================================================
--- linux-2.6.orig/fs/jfs/file.c	2011-06-18 12:54:28.191607007 +0200
+++ linux-2.6/fs/jfs/file.c	2011-06-20 09:29:06.000000000 +0200
@@ -110,6 +110,8 @@ int jfs_setattr(struct dentry *dentry, s
 
 	if ((iattr->ia_valid & ATTR_SIZE) &&
 	    iattr->ia_size != i_size_read(inode)) {
+		inode_dio_wait(inode);
+
 		rc = vmtruncate(inode, iattr->ia_size);
 		if (rc)
 			return rc;
Index: linux-2.6/fs/nilfs2/inode.c
==================================================================
--- linux-2.6.orig/fs/nilfs2/inode.c	2011-06-18 12:54:28.204940339 +0200
+++ linux-2.6/fs/nilfs2/inode.c	2011-06-20 09:29:06.000000000 +0200
@@ -778,6 +778,8 @@ int nilfs_setattr(struct dentry *dentry,
 
 	if ((iattr->ia_valid & ATTR_SIZE) &&
 	    iattr->ia_size != i_size_read(inode)) {
+		inode_dio_wait(inode);
+
 		err = vmtruncate(inode, iattr->ia_size);
 		if (unlikely(err))
 			goto out_err;
Index: linux-2.6/fs/reiserfs/inode.c
==================================================================
--- linux-2.6.orig/fs/reiserfs/inode.c	2011-06-18 12:54:28.218273673 +0200
+++ linux-2.6/fs/reiserfs/inode.c	2011-06-20 09:29:06.000000000 +0200
@@ -3114,6 +3114,9 @@ int reiserfs_setattr(struct dentry *dent
 			error = -EFBIG;
 			goto out;
 		}
+
+		inode_dio_wait(inode);
+
 		/* fill in hole pointers in the expanding truncate case. */
 		if (attr->ia_size > inode->i_size) {
 			error = generic_cont_expand_simple(inode, attr->ia_size);
Maintain i_dio_count for all filesystems, not just those using DIO_LOCKING.
This allows these filesystems to also protect truncate against direct I/O
requests by using common code. Right now the only non-DIO_LOCKING filesystem
that appears to do so is XFS, which uses an opencoded variant of the
i_dio_count scheme.

Behaviour doesn't change for filesystems never calling inode_dio_wait.
For ext4 behaviour changes when using the dioread_nolock option, which
previously was missing any protection between truncate and direct I/O reads.
For ocfs2 the handcrafted i_dio_count manipulations are replaced with
the common code now enabled.

Signed-off-by: Christoph Hellwig <hch@lst.de>

Index: linux-2.6/fs/direct-io.c
==================================================================
--- linux-2.6.orig/fs/direct-io.c	2011-06-24 15:18:52.000000000 +0200
+++ linux-2.6/fs/direct-io.c	2011-06-24 15:22:25.341577750 +0200
@@ -297,8 +297,7 @@ static ssize_t dio_complete(struct dio *
 		aio_complete(dio->iocb, ret, 0);
 	}
 
-	if (dio->flags & DIO_LOCKING)
-		inode_dio_done(dio->inode);
+	inode_dio_done(dio->inode);
 	return ret;
 }
 
@@ -1185,14 +1184,16 @@ direct_io_worker(int rw, struct kiocb *i
  *    For writes this function is called under i_mutex and returns with
  *    i_mutex held, for reads, i_mutex is not held on entry, but it is
  *    taken and dropped again before returning.
- *    The i_dio_count counter keeps track of the number of outstanding
- *    direct I/O requests, and truncate waits for it to reach zero.
- *    New references to i_dio_count must only be grabbed with i_mutex
- *    held.
- *
  *  - if the flags value does NOT contain DIO_LOCKING we don't use any
  *    internal locking but rather rely on the filesystem to synchronize
  *    direct I/O reads/writes versus each other and truncate.
+ *
+ * To help with locking against truncate we incremented the i_dio_count
+ * counter before starting direct I/O, and decrement it once we are done.
+ * Truncate can wait for it to reach zero to provide exclusion.  It is
+ * expected that filesystem provide exclusion between new direct I/O
+ * and truncates.  For DIO_LOCKING filesystems this is done by i_mutex,
+ * but other filesystems need to take care of this on their own.
  */
 ssize_t
 __blockdev_direct_IO(int rw, struct kiocb *iocb, struct inode *inode,
@@ -1270,14 +1271,14 @@ __blockdev_direct_IO(int rw, struct kioc
 				goto out;
 			}
 		}
-
-		/*
-		 * Will be decremented at I/O completion time.
-		 */
-		atomic_inc(&inode->i_dio_count);
 	}
 
 	/*
+	 * Will be decremented at I/O completion time.
+	 */
+	atomic_inc(&inode->i_dio_count);
+
+	/*
 	 * For file extending writes updating i_size before data
 	 * writeouts complete can expose uninitialized blocks. So
 	 * even for AIO, we need to wait for i/o to complete before
Index: linux-2.6/fs/ocfs2/aops.c
==================================================================
--- linux-2.6.orig/fs/ocfs2/aops.c	2011-06-24 15:18:52.000000000 +0200
+++ linux-2.6/fs/ocfs2/aops.c	2011-06-24 15:22:02.918245553 +0200
@@ -567,10 +567,8 @@ static void ocfs2_dio_end_io(struct kioc
 	/* this io's submitter should not have unlocked this before we could */
 	BUG_ON(!ocfs2_iocb_is_rw_locked(iocb));
 
-	if (ocfs2_iocb_is_sem_locked(iocb)) {
-		inode_dio_done(inode);
+	if (ocfs2_iocb_is_sem_locked(iocb))
 		ocfs2_iocb_clear_sem_locked(iocb);
-	}
 
 	ocfs2_iocb_clear_rw_locked(iocb);
 
Index: linux-2.6/fs/ocfs2/file.c
==================================================================
--- linux-2.6.orig/fs/ocfs2/file.c	2011-06-24 15:18:53.268255154 +0200
+++ linux-2.6/fs/ocfs2/file.c	2011-06-24 15:20:41.668249665 +0200
@@ -2240,7 +2240,6 @@ static ssize_t ocfs2_file_aio_write(stru
 relock:
 	/* to match setattr's i_mutex -> rw_lock ordering */
 	if (direct_io) {
-		atomic_inc(&inode->i_dio_count);
 		have_alloc_sem = 1;
 		/* communicate with ocfs2_dio_end_io */
 		ocfs2_iocb_set_sem_locked(iocb);
@@ -2292,7 +2291,6 @@ relock:
 	 */
 	if (direct_io && !can_do_direct) {
 		ocfs2_rw_unlock(inode, rw_level);
-		inode_dio_done(inode);
 
 		have_alloc_sem = 0;
 		rw_level = -1;
@@ -2379,10 +2377,8 @@ out:
 		ocfs2_rw_unlock(inode, rw_level);
 
 out_sems:
-	if (have_alloc_sem) {
-		inode_dio_done(inode);
+	if (have_alloc_sem)
 		ocfs2_iocb_clear_sem_locked(iocb);
-	}
 
 	mutex_unlock(&inode->i_mutex);
 
@@ -2533,7 +2529,6 @@ static ssize_t ocfs2_file_aio_read(struc
 	 */
 	if (filp->f_flags & O_DIRECT) {
 		have_alloc_sem = 1;
-		atomic_inc(&inode->i_dio_count);
 		ocfs2_iocb_set_sem_locked(iocb);
 
 		ret = ocfs2_rw_lock(inode, 0);
@@ -2575,10 +2570,9 @@ static ssize_t ocfs2_file_aio_read(struc
 	}
 
 bail:
-	if (have_alloc_sem) {
-		inode_dio_done(inode);
+	if (have_alloc_sem)
 		ocfs2_iocb_clear_sem_locked(iocb);
-	}
+
 	if (rw_level != -1)
 		ocfs2_rw_unlock(inode, rw_level);
 
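For readers who want to play with the idea outside the kernel, the counter scheme described in this patch can be modelled roughly as below. This is only a userspace sketch under stated assumptions, not the kernel code: toy_inode, toy_inode_init, toy_dio_begin, toy_dio_done and toy_truncate_may_proceed are hypothetical names, C11 atomics stand in for the kernel's atomic_t, and a real truncate sleeps in inode_dio_wait instead of polling a predicate.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for struct inode; only the counter is modelled. */
struct toy_inode {
	atomic_int dio_count;	/* outstanding direct I/O requests */
};

static void toy_inode_init(struct toy_inode *inode)
{
	atomic_init(&inode->dio_count, 0);
}

/*
 * Taken before a direct I/O request is submitted.  The caller is assumed
 * to already hold whatever excludes new requests against truncate
 * (i_mutex for DIO_LOCKING filesystems, something private otherwise).
 */
static void toy_dio_begin(struct toy_inode *inode)
{
	atomic_fetch_add(&inode->dio_count, 1);
}

/*
 * Dropped at I/O completion.  Returns true when this was the last
 * outstanding request, i.e. the point where a waiting truncate would
 * have to be woken up.
 */
static bool toy_dio_done(struct toy_inode *inode)
{
	int old = atomic_fetch_sub(&inode->dio_count, 1);

	assert(old > 0);	/* unbalanced done without begin */
	return old == 1;
}

/* The condition a truncate waits for in this model. */
static bool toy_truncate_may_proceed(struct toy_inode *inode)
{
	return atomic_load(&inode->dio_count) == 0;
}
```

With two requests in flight, the first toy_dio_done() returns false and the second returns true; only then does toy_truncate_may_proceed() hold again. Note that the counter alone does not stop new requests from starting, which is why the patch comment says filesystems must provide that exclusion themselves.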
Christoph Hellwig
2011-Jun-24 18:29 UTC
[PATCH 8/9] fs: simplify the blockdev_direct_IO prototype
Simple filesystems always pass inode->i_sb->s_bdev as the block device
argument, and never need an end_io handler. Let's simplify things for
them and for my grepping activity by dropping these arguments. The only
thing not falling into that scheme is ext4, which passes an end_io handler
without needing special flags (yet), but given how messy the direct I/O
code there is, use of __blockdev_direct_IO in one instead of two out of
three cases isn't going to make a large difference anyway.

Signed-off-by: Christoph Hellwig <hch@lst.de>

Index: linux-2.6/fs/ext2/inode.c
==================================================================
--- linux-2.6.orig/fs/ext2/inode.c	2011-06-24 15:27:33.131562166 +0200
+++ linux-2.6/fs/ext2/inode.c	2011-06-24 15:43:31.164846996 +0200
@@ -843,8 +843,8 @@ ext2_direct_IO(int rw, struct kiocb *ioc
 	struct inode *inode = mapping->host;
 	ssize_t ret;
 
-	ret = blockdev_direct_IO(rw, iocb, inode, inode->i_sb->s_bdev,
-				 iov, offset, nr_segs, ext2_get_block, NULL);
+	ret = blockdev_direct_IO(rw, iocb, inode, iov, offset, nr_segs,
+				 ext2_get_block);
 	if (ret < 0 && (rw & WRITE))
 		ext2_write_failed(mapping, offset + iov_length(iov, nr_segs));
 	return ret;
Index: linux-2.6/fs/ext3/inode.c
==================================================================
--- linux-2.6.orig/fs/ext3/inode.c	2011-06-24 15:27:33.151562165 +0200
+++ linux-2.6/fs/ext3/inode.c	2011-06-24 15:28:09.048226915 +0200
@@ -1816,9 +1816,8 @@ static ssize_t ext3_direct_IO(int rw, st
 	}
 
 retry:
-	ret = blockdev_direct_IO(rw, iocb, inode, inode->i_sb->s_bdev, iov,
-				 offset, nr_segs,
-				 ext3_get_block, NULL);
+	ret = blockdev_direct_IO(rw, iocb, inode, iov, offset, nr_segs,
+				 ext3_get_block);
 	/*
 	 * In case of error extending write may have instantiated a few
 	 * blocks outside i_size. Trim these off again.
Index: linux-2.6/fs/ext4/inode.c
==================================================================
--- linux-2.6.orig/fs/ext4/inode.c	2011-06-24 15:27:33.171562165 +0200
+++ linux-2.6/fs/ext4/inode.c	2011-06-24 15:32:40.694879881 +0200
@@ -3501,10 +3501,8 @@ retry:
 				 offset, nr_segs,
 				 ext4_get_block, NULL, NULL, 0);
 	else {
-		ret = blockdev_direct_IO(rw, iocb, inode,
-				 inode->i_sb->s_bdev, iov,
-				 offset, nr_segs,
-				 ext4_get_block, NULL);
+		ret = blockdev_direct_IO(rw, iocb, inode, iov,
+				 offset, nr_segs, ext4_get_block);
 
 		if (unlikely((rw & WRITE) && ret < 0)) {
 			loff_t isize = i_size_read(inode);
@@ -3748,11 +3746,13 @@ static ssize_t ext4_ext_direct_IO(int rw
 			EXT4_I(inode)->cur_aio_dio = iocb->private;
 		}
 
-		ret = blockdev_direct_IO(rw, iocb, inode,
+		ret = __blockdev_direct_IO(rw, iocb, inode,
 					 inode->i_sb->s_bdev, iov,
 					 offset, nr_segs,
 					 ext4_get_block_write,
-					 ext4_end_io_dio);
+					 ext4_end_io_dio,
+					 NULL,
+					 DIO_LOCKING | DIO_SKIP_HOLES);
 		if (iocb->private)
 			EXT4_I(inode)->cur_aio_dio = NULL;
 		/*
Index: linux-2.6/fs/fat/inode.c
==================================================================
--- linux-2.6.orig/fs/fat/inode.c	2011-06-24 15:27:33.188228830 +0200
+++ linux-2.6/fs/fat/inode.c	2011-06-24 15:32:48.341546189 +0200
@@ -211,8 +211,8 @@ static ssize_t fat_direct_IO(int rw, str
 	 * FAT need to use the DIO_LOCKING for avoiding the race
 	 * condition of fat_get_block() and ->truncate().
 	 */
-	ret = blockdev_direct_IO(rw, iocb, inode, inode->i_sb->s_bdev,
-				 iov, offset, nr_segs, fat_get_block, NULL);
+	ret = blockdev_direct_IO(rw, iocb, inode, iov, offset, nr_segs,
+				 fat_get_block);
 	if (ret < 0 && (rw & WRITE))
 		fat_write_failed(mapping, offset + iov_length(iov, nr_segs));
 
Index: linux-2.6/fs/hfs/inode.c
==================================================================
--- linux-2.6.orig/fs/hfs/inode.c	2011-06-24 15:27:33.228228829 +0200
+++ linux-2.6/fs/hfs/inode.c	2011-06-24 15:29:45.218222143 +0200
@@ -123,8 +123,8 @@ static ssize_t hfs_direct_IO(int rw, str
 	struct inode *inode = file->f_path.dentry->d_inode->i_mapping->host;
 	ssize_t ret;
 
-	ret = blockdev_direct_IO(rw, iocb, inode, inode->i_sb->s_bdev, iov,
-				 offset, nr_segs, hfs_get_block, NULL);
+	ret = blockdev_direct_IO(rw, iocb, inode, iov, offset, nr_segs,
+				 hfs_get_block);
 
 	/*
 	 * In case of error extending write may have instantiated a few
Index: linux-2.6/fs/hfsplus/inode.c
==================================================================
--- linux-2.6.orig/fs/hfsplus/inode.c	2011-06-24 15:27:33.244895494 +0200
+++ linux-2.6/fs/hfsplus/inode.c	2011-06-24 15:29:59.911554734 +0200
@@ -119,8 +119,8 @@ static ssize_t hfsplus_direct_IO(int rw,
 	struct inode *inode = file->f_path.dentry->d_inode->i_mapping->host;
 	ssize_t ret;
 
-	ret = blockdev_direct_IO(rw, iocb, inode, inode->i_sb->s_bdev, iov,
-				 offset, nr_segs, hfsplus_get_block, NULL);
+	ret = blockdev_direct_IO(rw, iocb, inode, iov, offset, nr_segs,
+				 hfsplus_get_block);
 
 	/*
 	 * In case of error extending write may have instantiated a few
Index: linux-2.6/fs/jfs/inode.c
==================================================================
--- linux-2.6.orig/fs/jfs/inode.c	2011-06-24 15:27:33.264895492 +0200
+++ linux-2.6/fs/jfs/inode.c	2011-06-24 15:30:11.701554144 +0200
@@ -329,8 +329,8 @@ static ssize_t jfs_direct_IO(int rw, str
 	struct inode *inode = file->f_mapping->host;
 	ssize_t ret;
 
-	ret = blockdev_direct_IO(rw, iocb, inode, inode->i_sb->s_bdev, iov,
-				 offset, nr_segs, jfs_get_block, NULL);
+	ret = blockdev_direct_IO(rw, iocb, inode, iov, offset, nr_segs,
+				 jfs_get_block);
 
 	/*
 	 * In case of error extending write may have instantiated a few
Index: linux-2.6/fs/nilfs2/inode.c
==================================================================
--- linux-2.6.orig/fs/nilfs2/inode.c	2011-06-24 15:27:33.284895493 +0200
+++ linux-2.6/fs/nilfs2/inode.c	2011-06-24 15:30:24.968220135 +0200
@@ -259,8 +259,8 @@ nilfs_direct_IO(int rw, struct kiocb *io
 		return 0;
 
 	/* Needs synchronization with the cleaner */
-	size = blockdev_direct_IO(rw, iocb, inode, inode->i_sb->s_bdev, iov,
-				  offset, nr_segs, nilfs_get_block, NULL);
+	size = blockdev_direct_IO(rw, iocb, inode, iov, offset, nr_segs,
+				  nilfs_get_block);
 
 	/*
 	 * In case of error extending write may have instantiated a few
Index: linux-2.6/fs/reiserfs/inode.c
==================================================================
--- linux-2.6.orig/fs/reiserfs/inode.c	2011-06-24 15:27:33.324895489 +0200
+++ linux-2.6/fs/reiserfs/inode.c	2011-06-24 15:30:38.311552796 +0200
@@ -3068,9 +3068,8 @@ static ssize_t reiserfs_direct_IO(int rw
 	struct inode *inode = file->f_mapping->host;
 	ssize_t ret;
 
-	ret = blockdev_direct_IO(rw, iocb, inode, inode->i_sb->s_bdev, iov,
-				 offset, nr_segs,
-				 reiserfs_get_blocks_direct_io, NULL);
+	ret = blockdev_direct_IO(rw, iocb, inode, iov, offset, nr_segs,
+				 reiserfs_get_blocks_direct_io);
 
 	/*
 	 * In case of error extending write may have instantiated a few
Index: linux-2.6/include/linux/fs.h
==================================================================
--- linux-2.6.orig/include/linux/fs.h	2011-06-24 15:27:33.361562155 +0200
+++ linux-2.6/include/linux/fs.h	2011-06-24 15:46:57.914836526 +0200
@@ -2381,12 +2381,11 @@ ssize_t __blockdev_direct_IO(int rw, str
 	dio_submit_t submit_io, int flags);
 
 static inline ssize_t blockdev_direct_IO(int rw, struct kiocb *iocb,
-	struct inode *inode, struct block_device *bdev, const struct iovec *iov,
-	loff_t offset, unsigned long nr_segs, get_block_t get_block,
-	dio_iodone_t end_io)
+		struct inode *inode, const struct iovec *iov, loff_t offset,
+		unsigned long nr_segs, get_block_t get_block)
 {
-	return __blockdev_direct_IO(rw, iocb, inode, bdev, iov, offset,
-				    nr_segs, get_block, end_io, NULL,
+	return __blockdev_direct_IO(rw, iocb, inode, inode->i_sb->s_bdev, iov,
+				    offset, nr_segs, get_block, NULL, NULL,
 				    DIO_LOCKING | DIO_SKIP_HOLES);
 }
 #endif
Christoph Hellwig
2011-Jun-24 18:29 UTC
[PATCH 9/9] fs: move inode_dio_done to the end_io handler
For filesystems that delay their end_io processing we should keep our
i_dio_count until the processing is done. Enable this by moving
the inode_dio_done call to the end_io handler if one exists.

Note that the actual move to the workqueue for ext4 and XFS is not done in
this patch yet, but left to the filesystem maintainers. At least for XFS
it's not needed yet either as XFS has an internal equivalent to i_dio_count.

Signed-off-by: Christoph Hellwig <hch@lst.de>

Index: linux-2.6/fs/direct-io.c
==================================================================
--- linux-2.6.orig/fs/direct-io.c	2011-06-24 15:27:14.124896461 +0200
+++ linux-2.6/fs/direct-io.c	2011-06-24 15:47:03.358169584 +0200
@@ -293,11 +293,12 @@ static ssize_t dio_complete(struct dio *
 	if (dio->end_io && dio->result) {
 		dio->end_io(dio->iocb, offset, transferred,
 			    dio->map_bh.b_private, ret, is_async);
-	} else if (is_async) {
-		aio_complete(dio->iocb, ret, 0);
+	} else {
+		if (is_async)
+			aio_complete(dio->iocb, ret, 0);
+		inode_dio_done(dio->inode);
 	}
 
-	inode_dio_done(dio->inode);
 	return ret;
 }
 
Index: linux-2.6/fs/ext4/inode.c
==================================================================
--- linux-2.6.orig/fs/ext4/inode.c	2011-06-24 15:47:13.111502423 +0200
+++ linux-2.6/fs/ext4/inode.c	2011-06-24 15:50:13.471493302 +0200
@@ -3573,6 +3573,7 @@ static void ext4_end_io_dio(struct kiocb
 			    ssize_t size, void *private, int ret,
 			    bool is_async)
 {
+	struct inode *inode = iocb->ki_filp->f_path.dentry->d_inode;
 	ext4_io_end_t *io_end = iocb->private;
 	struct workqueue_struct *wq;
 	unsigned long flags;
@@ -3594,6 +3595,7 @@ static void ext4_end_io_dio(struct kiocb
 out:
 		if (is_async)
 			aio_complete(iocb, ret, 0);
+		inode_dio_done(inode);
 		return;
 	}
 
@@ -3614,6 +3616,9 @@ out:
 	/* queue the work to convert unwritten extents to written */
 	queue_work(wq, &io_end->work);
 	iocb->private = NULL;
+
+	/* XXX: probably should move into the real I/O completion handler */
+	inode_dio_done(inode);
 }
 
 static void ext4_end_io_buffer_write(struct buffer_head *bh, int uptodate)
Index: linux-2.6/fs/ocfs2/aops.c
==================================================================
--- linux-2.6.orig/fs/ocfs2/aops.c	2011-06-24 15:49:26.731495659 +0200
+++ linux-2.6/fs/ocfs2/aops.c	2011-06-24 15:49:48.324827901 +0200
@@ -577,6 +577,7 @@ static void ocfs2_dio_end_io(struct kioc
 
 	if (is_async)
 		aio_complete(iocb, ret, 0);
+	inode_dio_done(inode);
 }
 
 /*
Index: linux-2.6/fs/xfs/linux-2.6/xfs_aops.c
==================================================================
--- linux-2.6.orig/fs/xfs/linux-2.6/xfs_aops.c	2011-06-24 15:48:25.581498754 +0200
+++ linux-2.6/fs/xfs/linux-2.6/xfs_aops.c	2011-06-24 15:51:00.874824252 +0200
@@ -1339,6 +1339,9 @@ xfs_end_io_direct_write(
 	} else {
 		xfs_finish_ioend_sync(ioend);
 	}
+
+	/* XXX: probably should move into the real I/O completion handler */
+	inode_dio_done(ioend->io_inode);
 }
 
 STATIC ssize_t
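The ownership rule from the dio_complete hunk (the end_io handler drops the count when it is called, the core drops it otherwise) can be tried out in a small userspace model. This is a sketch under assumptions, not kernel code: toy_dio, toy_complete, toy_end_io_now, toy_end_io_deferred and toy_run_deferred are hypothetical names, and the deferred handler only illustrates what a filesystem could do later; the patch above still calls inode_dio_done directly from the ext4 and XFS handlers.

```c
#include <assert.h>
#include <stddef.h>

struct toy_dio;
typedef void (*toy_end_io_t)(struct toy_dio *dio);

/* Hypothetical stand-in for struct dio; only what the rule needs. */
struct toy_dio {
	int *dio_count;		/* stands in for inode->i_dio_count */
	long result;		/* bytes transferred, 0 if nothing was done */
	toy_end_io_t end_io;	/* optional filesystem completion handler */
	int deferred;		/* set while completion work is postponed */
};

static void toy_dio_done(struct toy_dio *dio)
{
	assert(*dio->dio_count > 0);
	(*dio->dio_count)--;
}

/* A handler that finishes immediately and drops the count itself. */
static void toy_end_io_now(struct toy_dio *dio)
{
	toy_dio_done(dio);
}

/* A handler that postpones its work, keeping the count elevated. */
static void toy_end_io_deferred(struct toy_dio *dio)
{
	dio->deferred = 1;
}

/* Models the postponed work (e.g. a workqueue item) running later. */
static void toy_run_deferred(struct toy_dio *dio)
{
	if (dio->deferred) {
		dio->deferred = 0;
		toy_dio_done(dio);
	}
}

/* Mirrors the branch structure of the patched dio_complete(). */
static void toy_complete(struct toy_dio *dio)
{
	if (dio->end_io && dio->result)
		dio->end_io(dio);	/* handler now owns the decrement */
	else
		toy_dio_done(dio);	/* no handler called: core drops it */
}
```

In this model a request completed through toy_end_io_deferred leaves the count untouched until toy_run_deferred runs, which is the property that lets a truncate wait for delayed completion work as well as for the I/O itself.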
> This scheme is much simpler, and saves the space of a spinlock_t and a
> struct list_head in struct inode (typically 160 bytes on a non-debug 64-bit
> system).

And I still haven't fixed that typo, damn. Updated in local version now to make sure it won't be missed next time.