Josef Bacik
2010-May-21 17:03 UTC
[PATCH 1/6] fs: allow short direct-io reads to be completed via buffered IO
This is similar to what already happens in the write case. If we have a short read while doing O_DIRECT, instead of just returning, fall through and try to read the rest via buffered IO. BTRFS needs this because if we encounter a compressed or inline extent during DIO, we need to fall back to buffered IO. If the extent is compressed we need to read the entire thing into memory and decompress it into the user's pages. I have tested this with fsx and everything works great. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com> --- V1->V2: Check to see if our current ppos is >= i_size after a short DIO read, just in case it was actually a short read and we need to just return. mm/filemap.c | 36 +++++++++++++++++++++++++++++++----- 1 files changed, 31 insertions(+), 5 deletions(-) diff --git a/mm/filemap.c b/mm/filemap.c index 140ebda..829ac9c 100644 --- a/mm/filemap.c +++ b/mm/filemap.c @@ -1263,7 +1263,7 @@ generic_file_aio_read(struct kiocb *iocb, const struct iovec *iov, { struct file *filp = iocb->ki_filp; ssize_t retval; - unsigned long seg; + unsigned long seg = 0; size_t count; loff_t *ppos = &iocb->ki_pos; @@ -1290,21 +1290,47 @@ generic_file_aio_read(struct kiocb *iocb, const struct iovec *iov, retval = mapping->a_ops->direct_IO(READ, iocb, iov, pos, nr_segs); } - if (retval > 0) + if (retval > 0) { *ppos = pos + retval; - if (retval) { + count -= retval; + } + + /* + * Btrfs can have a short DIO read if we encounter + * compressed extents, so if there was an error, or if + * we've already read everything we wanted to, or if + * there was a short read because we hit EOF, go ahead + * and return. Otherwise fall through to buffered IO for + * the rest of the read. + */ + if (retval < 0 || !count || *ppos >= size) { file_accessed(filp); goto out; } } } + count = retval; for (seg = 0; seg < nr_segs; seg++) { read_descriptor_t desc; + loff_t offset = 0; + + /* + * If we did a short DIO read we need to skip the section of the + * iov that we've already read data into. 
+ */ + if (count) { + if (count > iov[seg].iov_len) { + count -= iov[seg].iov_len; + continue; + } + offset = count; + count = 0; + } desc.written = 0; - desc.arg.buf = iov[seg].iov_base; - desc.count = iov[seg].iov_len; + desc.arg.buf = iov[seg].iov_base + offset; + desc.count = iov[seg].iov_len - offset; if (desc.count == 0) continue; desc.error = 0; -- 1.6.6.1
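The skip loop the patch adds to generic_file_aio_read() can be modeled in plain userspace C. This is a minimal sketch (not kernel code; names like dio_resume_point are illustrative): given the number of bytes a short O_DIRECT read already filled, it finds the iovec segment and intra-segment offset where the buffered fallback resumes.

```c
#include <assert.h>
#include <stddef.h>

/* Where buffered IO should pick up after a short direct-IO read. */
struct seg_pos {
    unsigned long seg;   /* index of the segment to resume in */
    size_t offset;       /* byte offset within that segment */
};

/* Mirror of the patch's loop: segments wholly consumed by the DIO read
 * are skipped; the first partially filled segment yields the offset. */
static struct seg_pos dio_resume_point(const size_t *iov_len,
                                       unsigned long nr_segs, size_t count)
{
    struct seg_pos pos = { 0, 0 };

    for (pos.seg = 0; pos.seg < nr_segs; pos.seg++) {
        if (count > iov_len[pos.seg]) {
            /* this segment was completely filled by the DIO read */
            count -= iov_len[pos.seg];
            continue;
        }
        /* partially (or exactly) filled: resume "count" bytes in */
        pos.offset = count;
        break;
    }
    return pos;
}
```

Like the kernel loop, an exactly filled segment is reported with offset == iov_len; the caller then sees a zero-length descriptor and continues to the next segment.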
Josef Bacik
2010-May-21 17:03 UTC
[PATCH 2/6] direct-io: add a hook for the fs to provide its own submit_bio function
Because BTRFS can do RAID and such, we need our own submit hook so we can set up the bios in the correct fashion, and handle checksum errors properly. So there are a few changes here 1) The submit_io hook. This is straightforward, just call this instead of submit_bio. 2) Allow the fs to return -ENOTBLK for reads. Usually this has only worked for writes, since writes can fall back onto buffered IO. But BTRFS needs the option of falling back on buffered IO if it encounters a compressed extent, since we need to read the entire extent in and decompress it. So if we get -ENOTBLK back from get_block we'll return and fall back on buffered just like the write case. I've tested these changes with fsx and everything seems to work. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com> --- V1->V2: -Changed dio_end_io to EXPORT_SYMBOL_GPL -Removed the own_submit blockdev dio helper -Removed the boundary change V2->V3: -Keep track of the logical offset in the current bio V3->V4: -Kerneldoc dio_end_io -switch if (!dio->submit_io) statement fs/direct-io.c | 42 +++++++++++++++++++++++++++++++++++++----- include/linux/fs.h | 11 ++++++++--- 2 files changed, 45 insertions(+), 8 deletions(-) diff --git a/fs/direct-io.c b/fs/direct-io.c index e82adc2..5949947 100644 --- a/fs/direct-io.c +++ b/fs/direct-io.c @@ -82,6 +82,8 @@ struct dio { int reap_counter; /* rate limit reaping */ get_block_t *get_block; /* block mapping function */ dio_iodone_t *end_io; /* IO completion function */ + dio_submit_t *submit_io; /* IO submission function */ + loff_t logical_offset_in_bio; /* current first logical block in bio */ sector_t final_block_in_bio; /* current final block in bio + 1 */ sector_t next_block_for_io; /* next block to be put under IO, in dio_blocks units */ @@ -96,6 +98,7 @@ struct dio { unsigned cur_page_offset; /* Offset into it, in bytes */ unsigned cur_page_len; /* Nr of bytes at cur_page_offset */ sector_t cur_page_block; /* Where it starts */ + loff_t cur_page_fs_offset; /* 
Offset in file */ /* BIO completion state */ spinlock_t bio_lock; /* protects BIO fields below */ @@ -300,6 +303,26 @@ static void dio_bio_end_io(struct bio *bio, int error) spin_unlock_irqrestore(&dio->bio_lock, flags); } +/** + * dio_end_io - handle the end io action for the given bio + * @bio: The direct io bio that's being completed + * @error: Error if there was one + * + * This is meant to be called by any filesystem that uses its own dio_submit_t + * so that the DIO specific endio actions are dealt with after the filesystem + * has done its completion work. + */ +void dio_end_io(struct bio *bio, int error) +{ + struct dio *dio = bio->bi_private; + + if (dio->is_async) + dio_bio_end_aio(bio, error); + else + dio_bio_end_io(bio, error); +} +EXPORT_SYMBOL_GPL(dio_end_io); + static int dio_bio_alloc(struct dio *dio, struct block_device *bdev, sector_t first_sector, int nr_vecs) @@ -316,6 +339,7 @@ dio_bio_alloc(struct dio *dio, struct block_device *bdev, bio->bi_end_io = dio_bio_end_io; dio->bio = bio; + dio->logical_offset_in_bio = dio->cur_page_fs_offset; return 0; } @@ -340,10 +364,15 @@ static void dio_bio_submit(struct dio *dio) if (dio->is_async && dio->rw == READ) bio_set_pages_dirty(bio); - submit_bio(dio->rw, bio); + if (dio->submit_io) + dio->submit_io(dio->rw, bio, dio->inode, + dio->logical_offset_in_bio); + else + submit_bio(dio->rw, bio); dio->bio = NULL; dio->boundary = 0; + dio->logical_offset_in_bio = 0; } /* @@ -701,6 +730,7 @@ submit_page_section(struct dio *dio, struct page *page, dio->cur_page_offset = offset; dio->cur_page_len = len; dio->cur_page_block = blocknr; + dio->cur_page_fs_offset = dio->block_in_file << dio->blkbits; out: return ret; } @@ -935,7 +965,7 @@ static ssize_t direct_io_worker(int rw, struct kiocb *iocb, struct inode *inode, const struct iovec *iov, loff_t offset, unsigned long nr_segs, unsigned blkbits, get_block_t get_block, dio_iodone_t end_io, - struct dio *dio) + dio_submit_t submit_io, struct dio *dio) { 
unsigned long user_addr; unsigned long flags; @@ -952,6 +982,7 @@ direct_io_worker(int rw, struct kiocb *iocb, struct inode *inode, dio->get_block = get_block; dio->end_io = end_io; + dio->submit_io = submit_io; dio->final_block_in_bio = -1; dio->next_block_for_io = -1; @@ -1008,7 +1039,7 @@ direct_io_worker(int rw, struct kiocb *iocb, struct inode *inode, } } /* end iovec loop */ - if (ret == -ENOTBLK && (rw & WRITE)) { + if (ret == -ENOTBLK) { /* * The remaining part of the request will be * be handled by buffered I/O when we return @@ -1110,7 +1141,7 @@ ssize_t __blockdev_direct_IO(int rw, struct kiocb *iocb, struct inode *inode, struct block_device *bdev, const struct iovec *iov, loff_t offset, unsigned long nr_segs, get_block_t get_block, dio_iodone_t end_io, - int flags) + dio_submit_t submit_io, int flags) { int seg; size_t size; @@ -1197,7 +1228,8 @@ __blockdev_direct_IO(int rw, struct kiocb *iocb, struct inode *inode, (end > i_size_read(inode))); retval = direct_io_worker(rw, iocb, inode, iov, offset, - nr_segs, blkbits, get_block, end_io, dio); + nr_segs, blkbits, get_block, end_io, + submit_io, dio); /* * In case of error extending write may have instantiated a few diff --git a/include/linux/fs.h b/include/linux/fs.h index 4079ef9..f5dc079 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -2250,10 +2250,15 @@ static inline int xip_truncate_page(struct address_space *mapping, loff_t from) #endif #ifdef CONFIG_BLOCK +struct bio; +typedef void (dio_submit_t)(int rw, struct bio *bio, struct inode *inode, + loff_t file_offset); +void dio_end_io(struct bio *bio, int error); + ssize_t __blockdev_direct_IO(int rw, struct kiocb *iocb, struct inode *inode, struct block_device *bdev, const struct iovec *iov, loff_t offset, unsigned long nr_segs, get_block_t get_block, dio_iodone_t end_io, - int lock_type); + dio_submit_t submit_io, int lock_type); enum { /* need locking between buffered and direct access */ @@ -2269,7 +2274,7 @@ static inline ssize_t 
blockdev_direct_IO(int rw, struct kiocb *iocb, dio_iodone_t end_io) { return __blockdev_direct_IO(rw, iocb, inode, bdev, iov, offset, - nr_segs, get_block, end_io, + nr_segs, get_block, end_io, NULL, DIO_LOCKING | DIO_SKIP_HOLES); } @@ -2279,7 +2284,7 @@ static inline ssize_t blockdev_direct_IO_no_locking(int rw, struct kiocb *iocb, dio_iodone_t end_io) { return __blockdev_direct_IO(rw, iocb, inode, bdev, iov, offset, - nr_segs, get_block, end_io, 0); + nr_segs, get_block, end_io, NULL, 0); } #endif -- 1.6.6.1
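The dispatch this patch adds to dio_bio_submit() can be sketched as a tiny userspace model (plain C, not kernel code; mini_dio and the hook names are illustrative): if the filesystem registered a submit_io hook, it is called with the bio's starting logical file offset; otherwise the default submission path runs, and in either case the per-bio logical offset is reset afterwards.

```c
#include <assert.h>
#include <stddef.h>

/* Hook signature loosely modeled on dio_submit_t. */
typedef void (*submit_hook_t)(int rw, long long logical_offset, void *ctx);

struct mini_dio {
    submit_hook_t submit_io;           /* NULL -> default submit path */
    long long logical_offset_in_bio;   /* first logical byte in current bio */
    int default_submits;               /* counts default-path submissions */
};

/* Recording hook used by the example below. */
static long long last_submitted_offset = -1;
static void recording_hook(int rw, long long off, void *ctx)
{
    (void)rw; (void)ctx;
    last_submitted_offset = off;
}

static void mini_dio_submit(struct mini_dio *dio, int rw, void *ctx)
{
    if (dio->submit_io)
        dio->submit_io(rw, dio->logical_offset_in_bio, ctx);
    else
        dio->default_submits++;        /* stands in for submit_bio() */

    /* as in the patch, the per-bio logical offset is reset after submit */
    dio->logical_offset_in_bio = 0;
}
```

The key point the patch makes, mirrored here, is that the logical offset must be captured per bio at allocation time, because the filesystem's submit hook needs it for checksum lookup and extent unlocking.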
Josef Bacik
2010-May-21 17:03 UTC
[PATCH 3/6] direct-io: do not merge logically non-contiguous requests
Btrfs cannot handle having logically non-contiguous requests submitted. For example if you have Logical: [0-4095][HOLE][8192-12287] Physical: [0-4095] [4096-8191] Normally the DIO code would put these into the same BIO. The problem is we need to know exactly what offset is associated with what BIO so we can do our checksumming and unlocking properly, so putting them in the same BIO doesn't work. So add another check where we submit the current BIO if the physical blocks are not contiguous OR the logical blocks are not contiguous. Signed-off-by: Josef Bacik <josef@redhat.com> --- V1->V2 -Be more verbose in the in-code comment fs/direct-io.c | 20 ++++++++++++++++++-- 1 files changed, 18 insertions(+), 2 deletions(-) diff --git a/fs/direct-io.c b/fs/direct-io.c index 5949947..da111aa 100644 --- a/fs/direct-io.c +++ b/fs/direct-io.c @@ -632,10 +632,26 @@ static int dio_send_cur_page(struct dio *dio) int ret = 0; if (dio->bio) { + loff_t cur_offset = dio->block_in_file << dio->blkbits; + loff_t bio_next_offset = dio->logical_offset_in_bio + + dio->bio->bi_size; + /* - * See whether this new request is contiguous with the old + * See whether this new request is contiguous with the old. + * + * Btrfs cannot handle having logically non-contiguous requests + * submitted. For example if you have + * + * Logical: [0-4095][HOLE][8192-12287] + * Physical: [0-4095] [4096-8191] + * + * We cannot submit those pages together as one BIO. So if our + * current logical offset in the file does not equal what would + * be the next logical offset in the bio, submit the bio we + * have. */ - if (dio->final_block_in_bio != dio->cur_page_block) + if (dio->final_block_in_bio != dio->cur_page_block || + cur_offset != bio_next_offset) dio_bio_submit(dio); /* * Submit now if the underlying fs is about to perform a -- 1.6.6.1
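The two-sided contiguity test the patch adds can be captured in a small standalone C sketch (userspace illustration, not the kernel code itself): a page may only join the in-flight bio when it is both physically contiguous (final_block_in_bio == cur_page_block) and logically contiguous (the file offset of the new page equals the offset at which the current bio would end).

```c
#include <assert.h>

/* Returns nonzero when the in-flight bio must be submitted before the
 * current page can be added, mirroring the check in dio_send_cur_page().
 * Blocks are in 512-byte sectors, offsets in bytes, for illustration. */
static int must_submit_bio(long long final_block_in_bio,
                           long long cur_page_block,
                           long long logical_offset_in_bio,
                           unsigned int bio_size,
                           long long cur_file_offset)
{
    /* file offset just past the end of the data already in the bio */
    long long bio_next_offset = logical_offset_in_bio + bio_size;

    return final_block_in_bio != cur_page_block ||   /* physical gap */
           cur_file_offset != bio_next_offset;       /* logical gap  */
}
```

The changelog's hole example is exactly the case the old check missed: the physical blocks line up, but the logical offsets jump from 4096 to 8192, so the bio must be submitted.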
Christoph said he'd rather everybody use __blockdev_direct_IO directly instead of having a bunch of random helper functions, so that's what this patch does. It's a basic change; I've tested it with xfstests on ext4 and xfs. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> --- fs/block_dev.c | 5 +++-- fs/ext4/inode.c | 8 ++++---- fs/gfs2/aops.c | 6 +++--- fs/ocfs2/aops.c | 8 +++----- fs/xfs/linux-2.6/xfs_aops.c | 7 +++---- include/linux/fs.h | 9 --------- 6 files changed, 16 insertions(+), 27 deletions(-) diff --git a/fs/block_dev.c b/fs/block_dev.c index 6dcee88..0f42cbc 100644 --- a/fs/block_dev.c +++ b/fs/block_dev.c @@ -172,8 +172,9 @@ blkdev_direct_IO(int rw, struct kiocb *iocb, const struct iovec *iov, struct file *file = iocb->ki_filp; struct inode *inode = file->f_mapping->host; - return blockdev_direct_IO_no_locking(rw, iocb, inode, I_BDEV(inode), - iov, offset, nr_segs, blkdev_get_blocks, NULL); + return __blockdev_direct_IO(rw, iocb, inode, I_BDEV(inode), iov, + offset, nr_segs, blkdev_get_blocks, NULL, + NULL, 0); } int __sync_blockdev(struct block_device *bdev, int wait) diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c index 81d6054..8f37762 100644 --- a/fs/ext4/inode.c +++ b/fs/ext4/inode.c @@ -3494,10 +3494,10 @@ static ssize_t ext4_ind_direct_IO(int rw, struct kiocb *iocb, retry: if (rw == READ && ext4_should_dioread_nolock(inode)) - ret = blockdev_direct_IO_no_locking(rw, iocb, inode, - inode->i_sb->s_bdev, iov, - offset, nr_segs, - ext4_get_block, NULL); + ret = __blockdev_direct_IO(rw, iocb, inode, + inode->i_sb->s_bdev, iov, offset, + nr_segs, ext4_get_block, NULL, NULL, + 0); else ret = blockdev_direct_IO(rw, iocb, inode, inode->i_sb->s_bdev, iov, diff --git a/fs/gfs2/aops.c b/fs/gfs2/aops.c index 0c1d0b8..45b23b0 100644 --- a/fs/gfs2/aops.c +++ b/fs/gfs2/aops.c @@ -1039,9 +1039,9 @@ static ssize_t gfs2_direct_IO(int rw, struct kiocb *iocb, if (rv != 1) goto out; /* dio not valid, fall back to 
buffered i/o */ - rv = blockdev_direct_IO_no_locking(rw, iocb, inode, inode->i_sb->s_bdev, - iov, offset, nr_segs, - gfs2_get_block_direct, NULL); + rv = __blockdev_direct_IO(rw, iocb, inode, inode->i_sb->s_bdev, + iov, offset, nr_segs, gfs2_get_block_direct, + NULL, NULL, 0); out: gfs2_glock_dq_m(1, &gh); gfs2_holder_uninit(&gh); diff --git a/fs/ocfs2/aops.c b/fs/ocfs2/aops.c index 21441dd..f2e53a9 100644 --- a/fs/ocfs2/aops.c +++ b/fs/ocfs2/aops.c @@ -669,11 +669,9 @@ static ssize_t ocfs2_direct_IO(int rw, if (i_size_read(inode) <= offset) return 0; - ret = blockdev_direct_IO_no_locking(rw, iocb, inode, - inode->i_sb->s_bdev, iov, offset, - nr_segs, - ocfs2_direct_IO_get_blocks, - ocfs2_dio_end_io); + ret = __blockdev_direct_IO(rw, iocb, inode, inode->i_sb->s_bdev, iov, + offset, nr_segs, ocfs2_direct_IO_get_blocks, + ocfs2_dio_end_io, NULL, 0); mlog_exit(ret); return ret; diff --git a/fs/xfs/linux-2.6/xfs_aops.c b/fs/xfs/linux-2.6/xfs_aops.c index 0f8b996..fcc14d8 100644 --- a/fs/xfs/linux-2.6/xfs_aops.c +++ b/fs/xfs/linux-2.6/xfs_aops.c @@ -1617,10 +1617,9 @@ xfs_vm_direct_IO( iocb->private = xfs_alloc_ioend(inode, rw == WRITE ? 
IOMAP_UNWRITTEN : IOMAP_READ); - ret = blockdev_direct_IO_no_locking(rw, iocb, inode, bdev, iov, - offset, nr_segs, - xfs_get_blocks_direct, - xfs_end_io_direct); + ret = __blockdev_direct_IO(rw, iocb, inode, bdev, iov, offset, nr_segs, + xfs_get_blocks_direct, xfs_end_io_direct, + NULL, 0); if (unlikely(ret != -EIOCBQUEUED && iocb->private)) xfs_destroy_ioend(iocb->private); diff --git a/include/linux/fs.h b/include/linux/fs.h index f5dc079..5b7a8bd 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -2277,15 +2277,6 @@ static inline ssize_t blockdev_direct_IO(int rw, struct kiocb *iocb, nr_segs, get_block, end_io, NULL, DIO_LOCKING | DIO_SKIP_HOLES); } - -static inline ssize_t blockdev_direct_IO_no_locking(int rw, struct kiocb *iocb, - struct inode *inode, struct block_device *bdev, const struct iovec *iov, - loff_t offset, unsigned long nr_segs, get_block_t get_block, - dio_iodone_t end_io) -{ - return __blockdev_direct_IO(rw, iocb, inode, bdev, iov, offset, - nr_segs, get_block, end_io, NULL, 0); -} #endif extern const struct file_operations generic_ro_fops; -- 1.6.6.1
This provides basic DIO support for reading and writing. It does not do the work to recover from mismatching checksums, that will come later. A few design changes have been made from Jim's code (sorry Jim!) 1) Use the generic direct-io code. Jim originally re-wrote all the generic DIO code in order to account for all of BTRFS's oddities, but thanks to that work it seems like the best bet is to just ignore compression and such and just opt to fall back on buffered IO. 2) Fall back on buffered IO for compressed or inline extents. Jim's code did its own buffering to make dio with compressed extents work. Now we just fall back onto normal buffered IO. 3) Use ordered extents for the writes so that all of the lock_extent() lookup_ordered() type checks continue to work. 4) Do the lock_extent() lookup_ordered() loop in readpage so we don't race with DIO writes. I've tested this with fsx and everything works great. This patch depends on my dio and filemap.c patches to work. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com> --- fs/btrfs/ctree.h | 2 + fs/btrfs/file-item.c | 25 ++- fs/btrfs/file.c | 69 +++++++- fs/btrfs/inode.c | 486 ++++++++++++++++++++++++++++++++++++++++++++--- fs/btrfs/ordered-data.c | 75 +++++++- fs/btrfs/ordered-data.h | 9 +- 6 files changed, 630 insertions(+), 36 deletions(-) diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h index 746a724..36994e8 100644 --- a/fs/btrfs/ctree.h +++ b/fs/btrfs/ctree.h @@ -2257,6 +2257,8 @@ int btrfs_del_csums(struct btrfs_trans_handle *trans, struct btrfs_root *root, u64 bytenr, u64 len); int btrfs_lookup_bio_sums(struct btrfs_root *root, struct inode *inode, struct bio *bio, u32 *dst); +int btrfs_lookup_bio_sums_dio(struct btrfs_root *root, struct inode *inode, + struct bio *bio, u64 logical_offset, u32 *dst); int btrfs_insert_file_extent(struct btrfs_trans_handle *trans, struct btrfs_root *root, u64 objectid, u64 pos, diff --git a/fs/btrfs/file-item.c b/fs/btrfs/file-item.c index 54a2550..34ea718 100644 --- 
a/fs/btrfs/file-item.c +++ b/fs/btrfs/file-item.c @@ -149,13 +149,14 @@ int btrfs_lookup_file_extent(struct btrfs_trans_handle *trans, } -int btrfs_lookup_bio_sums(struct btrfs_root *root, struct inode *inode, - struct bio *bio, u32 *dst) +static int __btrfs_lookup_bio_sums(struct btrfs_root *root, + struct inode *inode, struct bio *bio, + u64 logical_offset, u32 *dst, int dio) { u32 sum; struct bio_vec *bvec = bio->bi_io_vec; int bio_index = 0; - u64 offset; + u64 offset = 0; u64 item_start_offset = 0; u64 item_last_offset = 0; u64 disk_bytenr; @@ -174,8 +175,11 @@ int btrfs_lookup_bio_sums(struct btrfs_root *root, struct inode *inode, WARN_ON(bio->bi_vcnt <= 0); disk_bytenr = (u64)bio->bi_sector << 9; + if (dio) + offset = logical_offset; while (bio_index < bio->bi_vcnt) { - offset = page_offset(bvec->bv_page) + bvec->bv_offset; + if (!dio) + offset = page_offset(bvec->bv_page) + bvec->bv_offset; ret = btrfs_find_ordered_sum(inode, offset, disk_bytenr, &sum); if (ret == 0) goto found; @@ -238,6 +242,7 @@ found: else set_state_private(io_tree, offset, sum); disk_bytenr += bvec->bv_len; + offset += bvec->bv_len; bio_index++; bvec++; } @@ -245,6 +250,18 @@ found: return 0; } +int btrfs_lookup_bio_sums(struct btrfs_root *root, struct inode *inode, + struct bio *bio, u32 *dst) +{ + return __btrfs_lookup_bio_sums(root, inode, bio, 0, dst, 0); +} + +int btrfs_lookup_bio_sums_dio(struct btrfs_root *root, struct inode *inode, + struct bio *bio, u64 offset, u32 *dst) +{ + return __btrfs_lookup_bio_sums(root, inode, bio, offset, dst, 1); +} + int btrfs_lookup_csums_range(struct btrfs_root *root, u64 start, u64 end, struct list_head *list) { diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c index 29ff749..dace07b 100644 --- a/fs/btrfs/file.c +++ b/fs/btrfs/file.c @@ -823,6 +823,47 @@ again: return 0; } +/* Copied from read-write.c */ +static void wait_on_retry_sync_kiocb(struct kiocb *iocb) +{ + set_current_state(TASK_UNINTERRUPTIBLE); + if (!kiocbIsKicked(iocb)) + schedule(); 
+ else + kiocbClearKicked(iocb); + __set_current_state(TASK_RUNNING); +} + +/* + * Just a copy of what do_sync_write does. + */ +static ssize_t __btrfs_direct_write(struct file *file, const char __user *buf, + size_t count, loff_t pos, loff_t *ppos) +{ + struct iovec iov = { .iov_base = (void __user *)buf, .iov_len = count }; + unsigned long nr_segs = 1; + struct kiocb kiocb; + ssize_t ret; + + init_sync_kiocb(&kiocb, file); + kiocb.ki_pos = pos; + kiocb.ki_left = count; + kiocb.ki_nbytes = count; + + while (1) { + ret = generic_file_direct_write(&kiocb, &iov, &nr_segs, pos, + ppos, count, count); + if (ret != -EIOCBRETRY) + break; + wait_on_retry_sync_kiocb(&kiocb); + } + + if (ret == -EIOCBQUEUED) + ret = wait_on_sync_kiocb(&kiocb); + *ppos = kiocb.ki_pos; + return ret; +} + static ssize_t btrfs_file_write(struct file *file, const char __user *buf, size_t count, loff_t *ppos) { @@ -839,12 +880,11 @@ static ssize_t btrfs_file_write(struct file *file, const char __user *buf, unsigned long first_index; unsigned long last_index; int will_write; + int buffered = 0; will_write = ((file->f_flags & O_DSYNC) || IS_SYNC(inode) || (file->f_flags & O_DIRECT)); - nrptrs = min((count + PAGE_CACHE_SIZE - 1) / PAGE_CACHE_SIZE, - PAGE_CACHE_SIZE / (sizeof(struct page *))); pinned[0] = NULL; pinned[1] = NULL; @@ -875,13 +915,34 @@ static ssize_t btrfs_file_write(struct file *file, const char __user *buf, goto out; file_update_time(file); + BTRFS_I(inode)->sequence++; + + if (unlikely(file->f_flags & O_DIRECT)) { + num_written = __btrfs_direct_write(file, buf, count, pos, + ppos); + pos += num_written; + count -= num_written; + + /* We''ve written everything we wanted to, exit */ + if (num_written < 0 || !count) + goto out; + /* + * We are going to do buffered for the rest of the range, so we + * need to make sure to invalidate the buffered pages when we''re + * done. 
+ */ + buffered = 1; + buf += num_written; + } + + nrptrs = min((count + PAGE_CACHE_SIZE - 1) / PAGE_CACHE_SIZE, + PAGE_CACHE_SIZE / (sizeof(struct page *))); pages = kmalloc(nrptrs * sizeof(struct page *), GFP_KERNEL); /* generic_write_checks can change our pos */ start_pos = pos; - BTRFS_I(inode)->sequence++; first_index = pos >> PAGE_CACHE_SHIFT; last_index = (pos + count) >> PAGE_CACHE_SHIFT; @@ -1023,7 +1084,7 @@ out_nolock: btrfs_end_transaction(trans, root); } } - if (file->f_flags & O_DIRECT) { + if (file->f_flags & O_DIRECT && buffered) { invalidate_mapping_pages(inode->i_mapping, start_pos >> PAGE_CACHE_SHIFT, (start_pos + num_written - 1) >> PAGE_CACHE_SHIFT); diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c index 2bfdc64..89a9455 100644 --- a/fs/btrfs/inode.c +++ b/fs/btrfs/inode.c @@ -697,6 +697,38 @@ retry: return 0; } +static u64 get_extent_allocation_hint(struct inode *inode, u64 start, + u64 num_bytes) +{ + struct extent_map_tree *em_tree = &BTRFS_I(inode)->extent_tree; + struct extent_map *em; + u64 alloc_hint = 0; + + read_lock(&em_tree->lock); + em = search_extent_mapping(em_tree, start, num_bytes); + if (em) { + /* + * if block start isn''t an actual block number then find the + * first block in this inode and use that as a hint. If that + * block is also bogus then just don''t worry about it. + */ + if (em->block_start >= EXTENT_MAP_LAST_BYTE) { + free_extent_map(em); + em = search_extent_mapping(em_tree, 0, 0); + if (em && em->block_start < EXTENT_MAP_LAST_BYTE) + alloc_hint = em->block_start; + if (em) + free_extent_map(em); + } else { + alloc_hint = em->block_start; + free_extent_map(em); + } + } + read_unlock(&em_tree->lock); + + return alloc_hint; +} + /* * when extent_io.c finds a delayed allocation range in the file, * the call backs end up in this code. 
The basic idea is to @@ -769,29 +801,7 @@ static noinline int cow_file_range(struct inode *inode, BUG_ON(disk_num_bytes > btrfs_super_total_bytes(&root->fs_info->super_copy)); - - read_lock(&BTRFS_I(inode)->extent_tree.lock); - em = search_extent_mapping(&BTRFS_I(inode)->extent_tree, - start, num_bytes); - if (em) { - /* - * if block start isn''t an actual block number then find the - * first block in this inode and use that as a hint. If that - * block is also bogus then just don''t worry about it. - */ - if (em->block_start >= EXTENT_MAP_LAST_BYTE) { - free_extent_map(em); - em = search_extent_mapping(em_tree, 0, 0); - if (em && em->block_start < EXTENT_MAP_LAST_BYTE) - alloc_hint = em->block_start; - if (em) - free_extent_map(em); - } else { - alloc_hint = em->block_start; - free_extent_map(em); - } - } - read_unlock(&BTRFS_I(inode)->extent_tree.lock); + alloc_hint = get_extent_allocation_hint(inode, start, num_bytes); btrfs_drop_extent_cache(inode, start, start + num_bytes - 1, 0); while (disk_num_bytes > 0) { @@ -4875,11 +4885,439 @@ out: return em; } +static struct extent_map *btrfs_new_extent_direct(struct inode *inode, + u64 start, u64 len) +{ + struct btrfs_root *root = BTRFS_I(inode)->root; + struct btrfs_trans_handle *trans; + struct extent_map *em; + struct extent_map_tree *em_tree = &BTRFS_I(inode)->extent_tree; + struct btrfs_key ins; + u64 alloc_hint; + int ret; + + btrfs_drop_extent_cache(inode, start, start + len - 1, 0); + + trans = btrfs_start_transaction(root, 1); + if (!trans) + return ERR_PTR(-ENOMEM); + + alloc_hint = get_extent_allocation_hint(inode, start, len); + ret = btrfs_reserve_extent(trans, root, len, root->sectorsize, 0, + alloc_hint, (u64)-1, &ins, 1); + if (ret) { + em = ERR_PTR(ret); + goto out; + } + + em = alloc_extent_map(GFP_NOFS); + if (!em) { + em = ERR_PTR(-ENOMEM); + goto out; + } + + em->start = start; + em->orig_start = em->start; + em->len = ins.offset; + + em->block_start = ins.objectid; + em->block_len = ins.offset; 
+ em->bdev = root->fs_info->fs_devices->latest_bdev; + set_bit(EXTENT_FLAG_PINNED, &em->flags); + + while (1) { + write_lock(&em_tree->lock); + ret = add_extent_mapping(em_tree, em); + write_unlock(&em_tree->lock); + if (ret != -EEXIST) + break; + btrfs_drop_extent_cache(inode, start, start + em->len - 1, 0); + } + + ret = btrfs_add_ordered_extent_dio(inode, start, ins.objectid, + ins.offset, ins.offset, 0); + if (ret) { + btrfs_free_reserved_extent(root, ins.objectid, ins.offset); + em = ERR_PTR(ret); + } +out: + btrfs_end_transaction(trans, root); + return em; +} + +static int btrfs_get_blocks_direct(struct inode *inode, sector_t iblock, + struct buffer_head *bh_result, int create) +{ + struct extent_map *em; + struct btrfs_root *root = BTRFS_I(inode)->root; + u64 start = iblock << inode->i_blkbits; + u64 len = bh_result->b_size; + + em = btrfs_get_extent(inode, NULL, 0, start, len, 0); + if (IS_ERR(em)) + return PTR_ERR(em); + + /* + * Ok for INLINE and COMPRESSED extents we need to fallback on buffered + * io. INLINE is special, and we could probably kludge it in here, but + * it''s still buffered so for safety lets just fall back to the generic + * buffered path. + * + * For COMPRESSED we _have_ to read the entire extent in so we can + * decompress it, so there will be buffering required no matter what we + * do, so go ahead and fallback to buffered. + * + * We return -ENOTBLK because thats what makes DIO go ahead and go back + * to buffered IO. Don''t blame me, this is the price we pay for using + * the generic code. 
+ */ + if (test_bit(EXTENT_FLAG_COMPRESSED, &em->flags) || + em->block_start == EXTENT_MAP_INLINE) { + free_extent_map(em); + return -ENOTBLK; + } + + /* Just a good old fashioned hole, return */ + if (!create && (em->block_start == EXTENT_MAP_HOLE || + test_bit(EXTENT_FLAG_PREALLOC, &em->flags))) { + free_extent_map(em); + /* DIO will do one hole at a time, so just unlock a sector */ + unlock_extent(&BTRFS_I(inode)->io_tree, start, + start + root->sectorsize - 1, GFP_NOFS); + return 0; + } + + /* + * We don''t allocate a new extent in the following cases + * + * 1) The inode is marked as NODATACOW. In this case we''ll just use the + * existing extent. + * 2) The extent is marked as PREALLOC. We''re good to go here and can + * just use the extent. + * + */ + if (!create) + goto map; + + if (test_bit(EXTENT_FLAG_PREALLOC, &em->flags) || + ((BTRFS_I(inode)->flags & BTRFS_INODE_NODATACOW) && + em->block_start != EXTENT_MAP_HOLE)) { + u64 block_start; + int type; + int ret; + + if (test_bit(EXTENT_FLAG_PREALLOC, &em->flags)) + type = BTRFS_ORDERED_PREALLOC; + else + type = BTRFS_ORDERED_NOCOW; + len = min(len, em->block_len - (start - em->start)); + block_start = em->block_start + (start - em->start); + ret = btrfs_add_ordered_extent_dio(inode, start, + start, len, len, type); + if (ret) { + free_extent_map(em); + return ret; + } + } else { + free_extent_map(em); + em = btrfs_new_extent_direct(inode, start, len); + if (IS_ERR(em)) + return PTR_ERR(em); + len = min(len, em->block_len); + } + /* We always just use one extent worth of information */ + spin_lock(&BTRFS_I(inode)->accounting_lock); + BTRFS_I(inode)->outstanding_extents++; + spin_unlock(&BTRFS_I(inode)->accounting_lock); + unlock_extent(&BTRFS_I(inode)->io_tree, start, start + len - 1, + GFP_NOFS); +map: + bh_result->b_blocknr = (em->block_start + (start - em->start)) >> + inode->i_blkbits; + bh_result->b_size = em->len - (start - em->start); + bh_result->b_bdev = em->bdev; + set_buffer_mapped(bh_result); + 
if (create && !test_bit(EXTENT_FLAG_PREALLOC, &em->flags)) + set_buffer_new(bh_result); + + free_extent_map(em); + + return 0; +} + +struct btrfs_dio_private { + struct inode *inode; + u64 logical_offset; + u64 disk_bytenr; + u64 bytes; + u32 *csums; + void *private; +}; + +static void btrfs_endio_direct_read(struct bio *bio, int err) +{ + struct bio_vec *bvec_end = bio->bi_io_vec + bio->bi_vcnt - 1; + struct bio_vec *bvec = bio->bi_io_vec; + struct btrfs_dio_private *dip = bio->bi_private; + struct inode *inode = dip->inode; + struct btrfs_root *root = BTRFS_I(inode)->root; + u64 start; + u32 *private = dip->csums; + + start = dip->logical_offset; + do { + if (!(BTRFS_I(inode)->flags & BTRFS_INODE_NODATASUM)) { + struct page *page = bvec->bv_page; + char *kaddr; + u32 csum = ~(u32)0; + unsigned long flags; + + local_irq_save(flags); + kaddr = kmap_atomic(page, KM_IRQ0); + csum = btrfs_csum_data(root, kaddr + bvec->bv_offset, + csum, bvec->bv_len); + btrfs_csum_final(csum, (char *)&csum); + kunmap_atomic(kaddr, KM_IRQ0); + local_irq_restore(flags); + + flush_dcache_page(bvec->bv_page); + if (csum != *private) { + printk(KERN_ERR "btrfs csum failed ino %lu off" + " %llu csum %u private %u\n", + inode->i_ino, (unsigned long long)start, + csum, *private); + err = -EIO; + } + } + + start += bvec->bv_len; + private++; + bvec++; + } while (bvec <= bvec_end); + + unlock_extent(&BTRFS_I(inode)->io_tree, dip->logical_offset, + dip->logical_offset + dip->bytes - 1, GFP_NOFS); + bio->bi_private = dip->private; + + kfree(dip->csums); + kfree(dip); + dio_end_io(bio, err); +} + +static void btrfs_endio_direct_write(struct bio *bio, int err) +{ + struct btrfs_dio_private *dip = bio->bi_private; + struct inode *inode = dip->inode; + struct btrfs_root *root = BTRFS_I(inode)->root; + struct btrfs_trans_handle *trans; + struct btrfs_ordered_extent *ordered = NULL; + struct extent_state *cached_state = NULL; + int ret; + + if (err) + goto out_done; + + ret = 
btrfs_dec_test_ordered_pending(inode, &ordered, + dip->logical_offset, dip->bytes); + if (!ret) + goto out_done; + BUG_ON(!ordered); + + trans = btrfs_join_transaction(root, 1); + if (!trans) { + err = -ENOMEM; + goto out; + } + + if (test_bit(BTRFS_ORDERED_NOCOW, &ordered->flags)) { + ret = btrfs_ordered_update_i_size(inode, 0, ordered); + if (!ret) + ret = btrfs_update_inode(trans, root, inode); + err = ret; + goto out; + } + + lock_extent_bits(&BTRFS_I(inode)->io_tree, ordered->file_offset, + ordered->file_offset + ordered->len - 1, 0, + &cached_state, GFP_NOFS); + + if (test_bit(BTRFS_ORDERED_PREALLOC, &ordered->flags)) { + ret = btrfs_mark_extent_written(trans, inode, + ordered->file_offset, + ordered->file_offset + + ordered->len); + if (ret) { + err = ret; + goto out_unlock; + } + } else { + ret = insert_reserved_file_extent(trans, inode, + ordered->file_offset, + ordered->start, + ordered->disk_len, + ordered->len, + ordered->len, + 0, 0, 0, + BTRFS_FILE_EXTENT_REG); + unpin_extent_cache(&BTRFS_I(inode)->extent_tree, + ordered->file_offset, ordered->len); + if (ret) { + err = ret; + WARN_ON(1); + goto out_unlock; + } + } + + add_pending_csums(trans, inode, ordered->file_offset, &ordered->list); + btrfs_ordered_update_i_size(inode, 0, ordered); + btrfs_update_inode(trans, root, inode); +out_unlock: + unlock_extent_cached(&BTRFS_I(inode)->io_tree, ordered->file_offset, + ordered->file_offset + ordered->len - 1, + &cached_state, GFP_NOFS); +out: + btrfs_end_transaction(trans, root); + btrfs_put_ordered_extent(ordered); + btrfs_put_ordered_extent(ordered); +out_done: + bio->bi_private = dip->private; + + kfree(dip->csums); + kfree(dip); + dio_end_io(bio, err); +} + +static void btrfs_submit_direct(int rw, struct bio *bio, struct inode *inode, + loff_t file_offset) +{ + struct btrfs_root *root = BTRFS_I(inode)->root; + struct btrfs_dio_private *dip; + struct bio_vec *bvec = bio->bi_io_vec; + u64 start; + int skip_sum; + int write = rw & (1 << BIO_RW); + int ret 
= 0; + + skip_sum = BTRFS_I(inode)->flags & BTRFS_INODE_NODATASUM; + + dip = kmalloc(sizeof(*dip), GFP_NOFS); + if (!dip) { + ret = -ENOMEM; + goto free_ordered; + } + dip->csums = NULL; + + if (!skip_sum) { + dip->csums = kmalloc(sizeof(u32) * bio->bi_vcnt, GFP_NOFS); + if (!dip->csums) { + ret = -ENOMEM; + goto free_ordered; + } + } + + dip->private = bio->bi_private; + dip->inode = inode; + dip->logical_offset = file_offset; + + start = dip->logical_offset; + dip->bytes = 0; + do { + dip->bytes += bvec->bv_len; + bvec++; + } while (bvec <= (bio->bi_io_vec + bio->bi_vcnt - 1)); + + dip->disk_bytenr = bio->bi_sector << 9; + bio->bi_private = dip; + + if (write) + bio->bi_end_io = btrfs_endio_direct_write; + else + bio->bi_end_io = btrfs_endio_direct_read; + + ret = btrfs_bio_wq_end_io(root->fs_info, bio, 0); + if (ret) + goto out_err; + + if (write && !skip_sum) + btrfs_csum_one_bio(root, inode, bio, dip->logical_offset, 1); + else if (!skip_sum) + btrfs_lookup_bio_sums_dio(root, inode, bio, + dip->logical_offset, dip->csums); + + ret = btrfs_map_bio(root, rw, bio, 0, 0); + if (ret) + goto out_err; + return; +out_err: + kfree(dip->csums); + kfree(dip); +free_ordered: + /* + * If this is a write, we need to clean up the reserved space and kill + * the ordered extent. 
+ */ + if (write) { + struct btrfs_ordered_extent *ordered; + ordered = btrfs_lookup_ordered_extent(inode, + dip->logical_offset); + if (!test_bit(BTRFS_ORDERED_PREALLOC, &ordered->flags) && + !test_bit(BTRFS_ORDERED_NOCOW, &ordered->flags)) + btrfs_free_reserved_extent(root, ordered->start, + ordered->disk_len); + btrfs_put_ordered_extent(ordered); + btrfs_put_ordered_extent(ordered); + } + bio_endio(bio, ret); +} + static ssize_t btrfs_direct_IO(int rw, struct kiocb *iocb, const struct iovec *iov, loff_t offset, unsigned long nr_segs) { - return -EINVAL; + struct file *file = iocb->ki_filp; + struct inode *inode = file->f_mapping->host; + struct btrfs_ordered_extent *ordered; + u64 lockstart, lockend; + ssize_t ret; + + lockstart = offset; + lockend = offset + iov_length(iov, nr_segs) - 1; + while (1) { + lock_extent(&BTRFS_I(inode)->io_tree, lockstart, lockend, + GFP_NOFS); + /* + * We''re concerned with the entire range that we''re going to be + * doing DIO to, so we need to make sure theres no ordered + * extents in this range. + */ + ordered = btrfs_lookup_ordered_range(inode, lockstart, + lockend - lockstart + 1); + if (!ordered) + break; + unlock_extent(&BTRFS_I(inode)->io_tree, lockstart, lockend, + GFP_NOFS); + btrfs_start_ordered_extent(inode, ordered, 1); + btrfs_put_ordered_extent(ordered); + cond_resched(); + } + + ret = __blockdev_direct_IO(rw, iocb, inode, NULL, iov, offset, nr_segs, + btrfs_get_blocks_direct, NULL, + btrfs_submit_direct, 0); + + if (ret < 0 && ret != -EIOCBQUEUED) { + unlock_extent(&BTRFS_I(inode)->io_tree, offset, + offset + iov_length(iov, nr_segs) - 1, GFP_NOFS); + } else if (ret >= 0 && ret < iov_length(iov, nr_segs)) { + /* + * We''re falling back to buffered, unlock the section we didn''t + * do IO on. 
+ */ + unlock_extent(&BTRFS_I(inode)->io_tree, offset + ret, + offset + iov_length(iov, nr_segs) - 1, GFP_NOFS); + } + + return ret; } static int btrfs_fiemap(struct inode *inode, struct fiemap_extent_info *fieinfo, diff --git a/fs/btrfs/ordered-data.c b/fs/btrfs/ordered-data.c index a127c0e..7a06e97 100644 --- a/fs/btrfs/ordered-data.c +++ b/fs/btrfs/ordered-data.c @@ -124,6 +124,15 @@ static int offset_in_entry(struct btrfs_ordered_extent *entry, u64 file_offset) return 1; } +static int range_overlaps(struct btrfs_ordered_extent *entry, u64 file_offset, + u64 len) +{ + if (file_offset + len <= entry->file_offset || + entry->file_offset + entry->len <= file_offset) + return 0; + return 1; +} + /* * look find the first ordered struct that has this offset, otherwise * the first one less than this offset @@ -161,8 +170,9 @@ static inline struct rb_node *tree_search(struct btrfs_ordered_inode_tree *tree, * The tree is given a single reference on the ordered extent that was * inserted. */ -int btrfs_add_ordered_extent(struct inode *inode, u64 file_offset, - u64 start, u64 len, u64 disk_len, int type) +static int __btrfs_add_ordered_extent(struct inode *inode, u64 file_offset, + u64 start, u64 len, u64 disk_len, + int type, int dio) { struct btrfs_ordered_inode_tree *tree; struct rb_node *node; @@ -182,6 +192,9 @@ int btrfs_add_ordered_extent(struct inode *inode, u64 file_offset, if (type != BTRFS_ORDERED_IO_DONE && type != BTRFS_ORDERED_COMPLETE) set_bit(type, &entry->flags); + if (dio) + set_bit(BTRFS_ORDERED_DIRECT, &entry->flags); + /* one ref for the tree */ atomic_set(&entry->refs, 1); init_waitqueue_head(&entry->wait); @@ -203,6 +216,20 @@ int btrfs_add_ordered_extent(struct inode *inode, u64 file_offset, return 0; } +int btrfs_add_ordered_extent(struct inode *inode, u64 file_offset, + u64 start, u64 len, u64 disk_len, int type) +{ + return __btrfs_add_ordered_extent(inode, file_offset, start, len, + disk_len, type, 0); +} + +int 
btrfs_add_ordered_extent_dio(struct inode *inode, u64 file_offset, + u64 start, u64 len, u64 disk_len, int type) +{ + return __btrfs_add_ordered_extent(inode, file_offset, start, len, + disk_len, type, 1); +} + /* * Add a struct btrfs_ordered_sum into the list of checksums to be inserted * when an ordered extent is finished. If the list covers more than one @@ -491,7 +518,8 @@ void btrfs_start_ordered_extent(struct inode *inode, * start IO on any dirty ones so the wait doesn''t stall waiting * for pdflush to find them */ - filemap_fdatawrite_range(inode->i_mapping, start, end); + if (!test_bit(BTRFS_ORDERED_DIRECT, &entry->flags)) + filemap_fdatawrite_range(inode->i_mapping, start, end); if (wait) { wait_event(entry->wait, test_bit(BTRFS_ORDERED_COMPLETE, &entry->flags)); @@ -588,6 +616,47 @@ out: return entry; } +/* Since the DIO code tries to lock a wide area we need to look for any ordered + * extents that exist in the range, rather than just the start of the range. + */ +struct btrfs_ordered_extent *btrfs_lookup_ordered_range(struct inode *inode, + u64 file_offset, + u64 len) +{ + struct btrfs_ordered_inode_tree *tree; + struct rb_node *node; + struct btrfs_ordered_extent *entry = NULL; + + tree = &BTRFS_I(inode)->ordered_tree; + spin_lock(&tree->lock); + node = tree_search(tree, file_offset); + if (!node) { + node = tree_search(tree, file_offset + len); + if (!node) + goto out; + } + + while (1) { + entry = rb_entry(node, struct btrfs_ordered_extent, rb_node); + if (range_overlaps(entry, file_offset, len)) + break; + + if (entry->file_offset >= file_offset + len) { + entry = NULL; + break; + } + entry = NULL; + node = rb_next(node); + if (!node) + break; + } +out: + if (entry) + atomic_inc(&entry->refs); + spin_unlock(&tree->lock); + return entry; +} + /* * lookup and return any extent before ''file_offset''. 
NULL is returned * if none is found diff --git a/fs/btrfs/ordered-data.h b/fs/btrfs/ordered-data.h index c82f76a..8ac3654 100644 --- a/fs/btrfs/ordered-data.h +++ b/fs/btrfs/ordered-data.h @@ -72,6 +72,8 @@ struct btrfs_ordered_sum { #define BTRFS_ORDERED_PREALLOC 4 /* set when writing to prealloced extent */ +#define BTRFS_ORDERED_DIRECT 5 /* set when we''re doing DIO with this extent */ + struct btrfs_ordered_extent { /* logical offset in the file */ u64 file_offset; @@ -140,7 +142,9 @@ int btrfs_dec_test_ordered_pending(struct inode *inode, struct btrfs_ordered_extent **cached, u64 file_offset, u64 io_size); int btrfs_add_ordered_extent(struct inode *inode, u64 file_offset, - u64 start, u64 len, u64 disk_len, int tyep); + u64 start, u64 len, u64 disk_len, int type); +int btrfs_add_ordered_extent_dio(struct inode *inode, u64 file_offset, + u64 start, u64 len, u64 disk_len, int type); int btrfs_add_ordered_sum(struct inode *inode, struct btrfs_ordered_extent *entry, struct btrfs_ordered_sum *sum); @@ -151,6 +155,9 @@ void btrfs_start_ordered_extent(struct inode *inode, int btrfs_wait_ordered_range(struct inode *inode, u64 start, u64 len); struct btrfs_ordered_extent * btrfs_lookup_first_ordered_extent(struct inode * inode, u64 file_offset); +struct btrfs_ordered_extent *btrfs_lookup_ordered_range(struct inode *inode, + u64 file_offset, + u64 len); int btrfs_ordered_update_i_size(struct inode *inode, u64 offset, struct btrfs_ordered_extent *ordered); int btrfs_find_ordered_sum(struct inode *inode, u64 offset, u64 disk_bytenr, u32 *sum); -- 1.6.6.1
In order for AIO to work, we need to implement aio_write. This patch converts our btrfs_file_write to btrfs_aio_write. I''ve tested this with xfstests and nothing broke, and the AIO stuff magically started working. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com> --- fs/btrfs/extent_io.c | 11 +++- fs/btrfs/file.c | 152 +++++++++++++++++++++++--------------------------- 2 files changed, 80 insertions(+), 83 deletions(-) diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c index d2d0368..c407f1c 100644 --- a/fs/btrfs/extent_io.c +++ b/fs/btrfs/extent_io.c @@ -2020,6 +2020,7 @@ static int __extent_read_full_page(struct extent_io_tree *tree, sector_t sector; struct extent_map *em; struct block_device *bdev; + struct btrfs_ordered_extent *ordered; int ret; int nr = 0; size_t page_offset = 0; @@ -2031,7 +2032,15 @@ static int __extent_read_full_page(struct extent_io_tree *tree, set_page_extent_mapped(page); end = page_end; - lock_extent(tree, start, end, GFP_NOFS); + while (1) { + lock_extent(tree, start, end, GFP_NOFS); + ordered = btrfs_lookup_ordered_extent(inode, start); + if (!ordered) + break; + unlock_extent(tree, start, end, GFP_NOFS); + btrfs_start_ordered_extent(inode, ordered, 1); + btrfs_put_ordered_extent(ordered); + } if (page->index == last_byte >> PAGE_CACHE_SHIFT) { char *userpage; diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c index dace07b..ce35431 100644 --- a/fs/btrfs/file.c +++ b/fs/btrfs/file.c @@ -46,32 +46,42 @@ static noinline int btrfs_copy_from_user(loff_t pos, int num_pages, int write_bytes, struct page **prepared_pages, - const char __user *buf) + struct iov_iter *i) { - long page_fault = 0; - int i; + size_t copied; + int pg = 0; int offset = pos & (PAGE_CACHE_SIZE - 1); - for (i = 0; i < num_pages && write_bytes > 0; i++, offset = 0) { + while (write_bytes > 0) { size_t count = min_t(size_t, PAGE_CACHE_SIZE - offset, write_bytes); - struct page *page = prepared_pages[i]; - fault_in_pages_readable(buf, count); + struct page *page = 
prepared_pages[pg]; +again: + if (unlikely(iov_iter_fault_in_readable(i, count))) + return -EFAULT; /* Copy data from userspace to the current page */ - kmap(page); - page_fault = __copy_from_user(page_address(page) + offset, - buf, count); + copied = iov_iter_copy_from_user(page, i, offset, count); + /* Flush processor''s dcache for this page */ flush_dcache_page(page); - kunmap(page); - buf += count; - write_bytes -= count; + iov_iter_advance(i, copied); + write_bytes -= copied; - if (page_fault) - break; + if (unlikely(copied == 0)) { + count = min_t(size_t, PAGE_CACHE_SIZE - offset, + iov_iter_single_seg_count(i)); + goto again; + } + + if (unlikely(copied < PAGE_CACHE_SIZE - offset)) { + offset += copied; + } else { + pg++; + offset = 0; + } } - return page_fault ? -EFAULT : 0; + return 0; } /* @@ -823,60 +833,24 @@ again: return 0; } -/* Copied from read-write.c */ -static void wait_on_retry_sync_kiocb(struct kiocb *iocb) -{ - set_current_state(TASK_UNINTERRUPTIBLE); - if (!kiocbIsKicked(iocb)) - schedule(); - else - kiocbClearKicked(iocb); - __set_current_state(TASK_RUNNING); -} - -/* - * Just a copy of what do_sync_write does. 
- */ -static ssize_t __btrfs_direct_write(struct file *file, const char __user *buf, - size_t count, loff_t pos, loff_t *ppos) -{ - struct iovec iov = { .iov_base = (void __user *)buf, .iov_len = count }; - unsigned long nr_segs = 1; - struct kiocb kiocb; - ssize_t ret; - - init_sync_kiocb(&kiocb, file); - kiocb.ki_pos = pos; - kiocb.ki_left = count; - kiocb.ki_nbytes = count; - - while (1) { - ret = generic_file_direct_write(&kiocb, &iov, &nr_segs, pos, - ppos, count, count); - if (ret != -EIOCBRETRY) - break; - wait_on_retry_sync_kiocb(&kiocb); - } - - if (ret == -EIOCBQUEUED) - ret = wait_on_sync_kiocb(&kiocb); - *ppos = kiocb.ki_pos; - return ret; -} - -static ssize_t btrfs_file_write(struct file *file, const char __user *buf, - size_t count, loff_t *ppos) +static ssize_t btrfs_file_aio_write(struct kiocb *iocb, + const struct iovec *iov, + unsigned long nr_segs, loff_t pos) { - loff_t pos; + struct file *file = iocb->ki_filp; + struct inode *inode = fdentry(file)->d_inode; + struct btrfs_root *root = BTRFS_I(inode)->root; + struct page *pinned[2]; + struct page **pages = NULL; + struct iov_iter i; + loff_t *ppos = &iocb->ki_pos; loff_t start_pos; ssize_t num_written = 0; ssize_t err = 0; + size_t count; + size_t ocount; int ret = 0; - struct inode *inode = fdentry(file)->d_inode; - struct btrfs_root *root = BTRFS_I(inode)->root; - struct page **pages = NULL; int nrptrs; - struct page *pinned[2]; unsigned long first_index; unsigned long last_index; int will_write; @@ -888,7 +862,6 @@ static ssize_t btrfs_file_write(struct file *file, const char __user *buf, pinned[0] = NULL; pinned[1] = NULL; - pos = *ppos; start_pos = pos; vfs_check_frozen(inode->i_sb, SB_FREEZE_WRITE); @@ -902,6 +875,11 @@ static ssize_t btrfs_file_write(struct file *file, const char __user *buf, mutex_lock(&inode->i_mutex); + err = generic_segment_checks(iov, &nr_segs, &ocount, VERIFY_READ); + if (err) + goto out; + count = ocount; + current->backing_dev_info = 
inode->i_mapping->backing_dev_info; err = generic_write_checks(file, &pos, &count, S_ISBLK(inode->i_mode)); if (err) @@ -918,14 +896,24 @@ static ssize_t btrfs_file_write(struct file *file, const char __user *buf, BTRFS_I(inode)->sequence++; if (unlikely(file->f_flags & O_DIRECT)) { - num_written = __btrfs_direct_write(file, buf, count, pos, - ppos); - pos += num_written; - count -= num_written; + ret = btrfs_check_data_free_space(root, inode, count); + if (ret) + goto out; - /* We''ve written everything we wanted to, exit */ - if (num_written < 0 || !count) + num_written = generic_file_direct_write(iocb, iov, &nr_segs, + pos, ppos, count, + ocount); + + /* All reservations for DIO are done internally */ + btrfs_free_reserved_data_space(root, inode, count); + + if (num_written < 0) { + ret = num_written; + num_written = 0; + goto out; + } else if (num_written == count) { goto out; + } /* * We are going to do buffered for the rest of the range, so we @@ -933,18 +921,20 @@ static ssize_t btrfs_file_write(struct file *file, const char __user *buf, * done. 
*/ buffered = 1; - buf += num_written; + pos += num_written; } - nrptrs = min((count + PAGE_CACHE_SIZE - 1) / PAGE_CACHE_SIZE, - PAGE_CACHE_SIZE / (sizeof(struct page *))); + iov_iter_init(&i, iov, nr_segs, count, num_written); + nrptrs = min((iov_iter_count(&i) + PAGE_CACHE_SIZE - 1) / + PAGE_CACHE_SIZE, PAGE_CACHE_SIZE / + (sizeof(struct page *))); pages = kmalloc(nrptrs * sizeof(struct page *), GFP_KERNEL); /* generic_write_checks can change our pos */ start_pos = pos; first_index = pos >> PAGE_CACHE_SHIFT; - last_index = (pos + count) >> PAGE_CACHE_SHIFT; + last_index = (pos + iov_iter_count(&i)) >> PAGE_CACHE_SHIFT; /* * there are lots of better ways to do this, but this code @@ -961,7 +951,7 @@ static ssize_t btrfs_file_write(struct file *file, const char __user *buf, unlock_page(pinned[0]); } } - if ((pos + count) & (PAGE_CACHE_SIZE - 1)) { + if ((pos + iov_iter_count(&i)) & (PAGE_CACHE_SIZE - 1)) { pinned[1] = grab_cache_page(inode->i_mapping, last_index); if (!PageUptodate(pinned[1])) { ret = btrfs_readpage(NULL, pinned[1]); @@ -972,10 +962,10 @@ static ssize_t btrfs_file_write(struct file *file, const char __user *buf, } } - while (count > 0) { + while (iov_iter_count(&i) > 0) { size_t offset = pos & (PAGE_CACHE_SIZE - 1); - size_t write_bytes = min(count, nrptrs * - (size_t)PAGE_CACHE_SIZE - + size_t write_bytes = min(iov_iter_count(&i), + nrptrs * (size_t)PAGE_CACHE_SIZE - offset); size_t num_pages = (write_bytes + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT; @@ -997,7 +987,7 @@ static ssize_t btrfs_file_write(struct file *file, const char __user *buf, } ret = btrfs_copy_from_user(pos, num_pages, - write_bytes, pages, buf); + write_bytes, pages, &i); if (ret) { btrfs_free_reserved_data_space(root, inode, write_bytes); @@ -1026,8 +1016,6 @@ static ssize_t btrfs_file_write(struct file *file, const char __user *buf, btrfs_throttle(root); } - buf += write_bytes; - count -= write_bytes; pos += write_bytes; num_written += write_bytes; @@ -1222,7 +1210,7 @@ const 
struct file_operations btrfs_file_operations = { .read = do_sync_read, .aio_read = generic_file_aio_read, .splice_read = generic_file_splice_read, - .write = btrfs_file_write, + .aio_write = btrfs_file_aio_write, .mmap = btrfs_file_mmap, .open = generic_file_open, .release = btrfs_release_file, -- 1.6.6.1 -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Mike Fedyk
2010-May-22 01:47 UTC
Re: [PATCH 3/6] direct-io: do not merge logically non-contiguous requests
On Fri, May 21, 2010 at 10:03 AM, Josef Bacik <josef@redhat.com> wrote:
> Btrfs cannot handle having logically non-contiguous requests submitted.  For
> example, if you have
>
> Logical:  [0-4095][HOLE][8192-12287]
> Physical: [0-4095]      [4096-8191]
>
> Normally the DIO code would put these into the same BIOs.  The problem is we
> need to know exactly what offset is associated with what BIO so we can do our
> checksumming and unlocking properly, so putting them in the same BIO doesn't
> work.  So add another check where we submit the current BIO if the physical
> blocks are not contiguous OR the logical blocks are not contiguous.
>
> Signed-off-by: Josef Bacik <josef@redhat.com>
> ---
>
> V1->V2
> -Be more verbose in the in-code comment
>
>  fs/direct-io.c |   20 ++++++++++++++++++--
>  1 files changed, 18 insertions(+), 2 deletions(-)
>

Btrfs has been pretty much self-contained (working well compiled
against 2.6.32, for example). Is there a way that this wouldn't just
start silently breaking for people compiling the latest btrfs with
dkms against older kernels?
Josef Bacik
2010-May-22 14:02 UTC
Re: [PATCH 3/6] direct-io: do not merge logically non-contiguous requests
On Fri, May 21, 2010 at 06:47:36PM -0700, Mike Fedyk wrote:
> On Fri, May 21, 2010 at 10:03 AM, Josef Bacik <josef@redhat.com> wrote:
[snip]
> Btrfs has been pretty much self-contained (working well compiled
> against 2.6.32, for example). Is there a way that this wouldn't just
> start silently breaking for people compiling the latest btrfs with
> dkms against older kernels?

Nope, you have to have these generic patches for DIO to work, so building
btrfs like this will stop working with earlier kernels. Thanks,

Josef
at 2010-5-22 1:03, Josef Bacik wrote:
> In order for AIO to work, we need to implement aio_write.  This patch converts
> our btrfs_file_write to btrfs_aio_write.  I've tested this with xfstests and
> nothing broke, and the AIO stuff magically started working.  Thanks,

But xfstests's case 198 (source: src/aio-dio-regress/aiodio_sparse2.c) still fails;
the following messages were output:
--------------------
AIO write offset 0 expected 65536 got -22
AIO write offset 5242880 expected 65536 got -22
AIO write offset 10485760 expected 65536 got -22
AIO write offset 15728640 expected 65536 got -22
AIO write offset 20971520 expected 65536 got -22
AIO write offset 31457280 expected 65536 got -22
AIO write offset 36700160 expected 65536 got -22
AIO write offset 41943040 expected 65536 got -22
AIO write offset 47185920 expected 65536 got -22
AIO write offset 52428800 expected 65536 got -22
AIO write offset 57671680 expected 65536 got -22
AIO write offset 62914560 expected 65536 got -22
AIO write offset 73400320 expected 65536 got -22
AIO write offset 78643200 expected 65536 got -22
non one buffer at buf[0] => 0x00,00,00,00
non-one read at offset 0
*** WARNING *** /tmp/aaaa has not been unlinked; if you don't rm it
manually first, it may influence the next run
--------------------

generic_file_direct_write() (in btrfs_file_aio_write(), fs/btrfs/file.c)
returned -22; maybe that's useful for your analysis.
Thanks.

[snip]

--
Shi Weihua
On Sun, May 23, 2010 at 04:31:52PM +0800, Shi Weihua wrote:
> at 2010-5-22 1:03, Josef Bacik wrote:
> > In order for AIO to work, we need to implement aio_write.  This patch converts
> > our btrfs_file_write to btrfs_aio_write.  I've tested this with xfstests and
> > nothing broke, and the AIO stuff magically started working.  Thanks,
>
> But xfstests's case 198 (source: src/aio-dio-regress/aiodio_sparse2.c) still fails;
> the following messages were output:
> --------------------
> AIO write offset 0 expected 65536 got -22
[snip]
> --------------------
>
> generic_file_direct_write() (in btrfs_file_aio_write(), fs/btrfs/file.c)
> returned -22; maybe that's useful for your analysis.

Yes, change that testcase to run with -a 4096 and it will run fine.  Because
btrfs doesn't pass a bdev in to __blockdev_direct_IO, it doesn't do 512-byte
aligned IO, just blocksize-aligned IO.  I will fix that at some later point,
but it's a little tricky since we have to figure out which bdev has the largest
alignment (in case we have a 4k-sector device and a 512-byte-sector device in
the same volume).  Thanks,

Josef
On 05/22/2010 01:03 AM, Josef Bacik wrote:
> In order for AIO to work, we need to implement aio_write.  This patch converts
> our btrfs_file_write to btrfs_aio_write.  I've tested this with xfstests and
> nothing broke, and the AIO stuff magically started working.  Thanks,
>
> Signed-off-by: Josef Bacik <josef@redhat.com>

Hi, Josef,

I've tested your patch (May 22) with my tools, and one case triggered a bug
which made the writev operation hang; more information follows.

- Steps to trigger it:
# mount /dev/sda8 /home/btrfsdisk -o nodatacow
# gcc direct-io.c -o direct-io
# ./direct-io O_DIRECT writev /home/btrfsdisk/testrw 4M

then on another tty, after "dmesg"...

[snip]
device fsid f44b0879c75c0e99-1d4b28f2d5c503ae devid 1 transid 11177 /dev/sda8
btrfs: setting nodatacow
INFO: task direct-io:1399 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
direct-io     D 0000000000000003     0  1399   1341 0x00000000
 ffff880137c379c8 0000000000000082 ffff880137c379d8 ffffffff00000000
 ffff880137c37fd8 ffff880139730000 0000000000015440 ffff880137c37fd8
 0000000000015440 0000000000015440 0000000000015440 0000000000015440
Call Trace:
 [<ffffffffa0119d4a>] wait_extent_bit+0xe3/0x163 [btrfs]
 [<ffffffff8106651f>] ? autoremove_wake_function+0x0/0x39
 [<ffffffffa0119e47>] lock_extent_bits+0x7d/0xa8 [btrfs]
 [<ffffffffa0119e88>] lock_extent+0x16/0x18 [btrfs]
 [<ffffffffa01025ce>] btrfs_direct_IO+0x8e/0x1be [btrfs]
 [<ffffffff810c7301>] generic_file_direct_write+0xed/0x16d
 [<ffffffffa010bb91>] btrfs_file_aio_write+0x2af/0x8d2 [btrfs]
 [<ffffffff81100eae>] ? try_get_mem_cgroup_from_mm+0x39/0x49
 [<ffffffffa010b8e2>] ? btrfs_file_aio_write+0x0/0x8d2 [btrfs]
 [<ffffffff811063ed>] do_sync_readv_writev+0xc1/0x100
 [<ffffffff81106120>] ? might_fault+0x21/0x23
 [<ffffffff81106151>] ? copy_from_user+0x2f/0x31
 [<ffffffff811c90ab>] ? security_file_permission+0x16/0x18
 [<ffffffff81107145>] do_readv_writev+0xa7/0x127
 [<ffffffff81107208>] vfs_writev+0x43/0x4e
 [<ffffffff811072f8>] sys_writev+0x4a/0x93
 [<ffffffff81009c32>] system_call_fastpath+0x16/0x1b

So, can you figure out if anything in your patch leads to the bug?
On 05/27/2010 11:06 AM, liubo wrote:
> On 05/22/2010 01:03 AM, Josef Bacik wrote:
>> In order for AIO to work, we need to implement aio_write.  This patch converts
>> our btrfs_file_write to btrfs_aio_write.  I've tested this with xfstests and
>> nothing broke, and the AIO stuff magically started working.  Thanks,
>>
>> Signed-off-by: Josef Bacik <josef@redhat.com>
>
> Hi, Josef,
>
> I've tested your patch (May 22) with my tools, and one case triggered a bug
> which made the writev operation hang; more information follows.
>
> - Steps to trigger it:
> # mount /dev/sda8 /home/btrfsdisk -o nodatacow
> # gcc direct-io.c -o direct-io
> # ./direct-io O_DIRECT writev /home/btrfsdisk/testrw 4M
>
[snip]
>
> So, can you figure out if anything in your patch leads to the bug?

Sorry, I forgot the attachment... Plz get it.

Thanks,

- Liubo
On Thu, May 27, 2010 at 11:06:54AM +0800, liubo wrote:
> On 05/22/2010 01:03 AM, Josef Bacik wrote:
> > In order for AIO to work, we need to implement aio_write.  This patch converts
> > our btrfs_file_write to btrfs_aio_write.  I've tested this with xfstests and
> > nothing broke, and the AIO stuff magically started working.  Thanks,
> >
> > Signed-off-by: Josef Bacik <josef@redhat.com>
>
> Hi, Josef,
>
> I've tested your patch (May 22) with my tools, and one case triggered a bug
> which made the writev operation hang; more information follows.
>
> - Steps to trigger it:
> # mount /dev/sda8 /home/btrfsdisk -o nodatacow
> # gcc direct-io.c -o direct-io
> # ./direct-io O_DIRECT writev /home/btrfsdisk/testrw 4M

Thanks for sending along this test program and bug report.  We've fixed
a few bugs in the O_DIRECT patches, and this is working now.

The merged result is in the for-linus branch of the btrfs unstable tree.

-chris
On 05/27/2010 08:59 PM, Chris Mason wrote:
> Thanks for sending along this test program and bug report.  We've fixed
> a few bugs in the O_DIRECT patches, and this is working now.
>
> The merged result is in the for-linus branch of the btrfs unstable tree.
>
> -chris

Oh, thanks a lot. I've seen your great effort.

- liubo
On Sat, May 22, 2010 at 3:03 AM, Josef Bacik <josef@redhat.com> wrote:
> +	while (1) {
> +		lock_extent(tree, start, end, GFP_NOFS);
> +		ordered = btrfs_lookup_ordered_extent(inode, start);
> +		if (!ordered)
> +			break;
> +		unlock_extent(tree, start, end, GFP_NOFS);

Is it ok not to unlock_extent if !ordered?
I don't know if you fixed this in a later version but it stuck out to me :)

--
Dmitri Nikulin

Centre for Synchrotron Science
Monash University
Victoria 3800, Australia
On Sun, May 30, 2010 at 07:33:33PM +1000, Dmitri Nikulin wrote:
> On Sat, May 22, 2010 at 3:03 AM, Josef Bacik <josef@redhat.com> wrote:
> > +	while (1) {
> > +		lock_extent(tree, start, end, GFP_NOFS);
> > +		ordered = btrfs_lookup_ordered_extent(inode, start);
> > +		if (!ordered)
> > +			break;
> > +		unlock_extent(tree, start, end, GFP_NOFS);
>
> Is it ok not to unlock_extent if !ordered?
> I don't know if you fixed this in a later version but it stuck out to me :)

The construct is confusing.  Ordered extents track things that we have
allocated on disk and need to write.  New ones can't be created while we
have the extent range locked.  But we can't force old ones to disk with
the lock held.

So, we lock then lookup, and if we find nothing we can safely do our
operation because no io is in progress.  We unlock a little later on, or
at endio time.

If we find an ordered extent we drop the lock and wait for that IO to
finish, then loop again.

-chris
On Tue, Jun 1, 2010 at 11:19 PM, Chris Mason <chris.mason@oracle.com> wrote:
>> Is it ok not to unlock_extent if !ordered?
>> I don't know if you fixed this in a later version but it stuck out to me :)
>
> The construct is confusing.  Ordered extents track things that we have
> allocated on disk and need to write.  New ones can't be created while we
> have the extent range locked.  But we can't force old ones to disk with
> the lock held.
>
> So, we lock then lookup, and if we find nothing we can safely do our
> operation because no io is in progress.  We unlock a little later on, or
> at endio time.
>
> If we find an ordered extent we drop the lock and wait for that IO to
> finish, then loop again.

Ok, that's fair enough. Maybe it's worth commenting; I'm sure I'm not
the only one surprised.

Thanks,

--
Dmitri Nikulin

Centre for Synchrotron Science
Monash University
Victoria 3800, Australia