Hisashi Hifumi
2009-Mar-31 05:18 UTC
[RFC] [PATCH] Btrfs: improve fsync/osync write performance
Hi Chris.

I noticed that the performance of fsync() and write() with the O_SYNC flag
on Btrfs is very slow compared to ext3/4. I used blktrace to investigate
the cause. One cause is that the unplug is done by kblockd even when the
I/O is issued through fsync() or write() with O_SYNC. kblockd's unplug
timeout is 3 msec, so unplugging via kblockd can hurt I/O response time.
To improve fsync/osync write performance, the unplug should be sped up
here.

Btrfs's write I/O is issued via a kernel thread, not via the user
application context that calls fsync(). While waiting for page writeback,
wait_on_page_writeback() sometimes cannot unplug the I/O on Btrfs, because
submit_bio is not called from the user application context; when submit_bio
is called from a kernel thread, wait_on_page_writeback() just sleeps in
io_schedule().

In the following patch I introduce btrfs_wait_on_page_writeback(), a
replacement for wait_on_page_writeback() for Btrfs. It unplugs once every
tick while waiting for page writeback.

I did a performance test using sysbench:

# sysbench --num-threads=4 --max-requests=10000 --test=fileio --file-num=1
  --file-block-size=4K --file-total-size=128M --file-test-mode=rndwr
  --file-fsync-freq=5 run

The result was:

-2.6.29

Test execution summary:
    total time:                          628.1047s
    total number of events:              10000
    total time taken by event execution: 413.0834
    per-request statistics:
         min:                            0.0000s
         avg:                            0.0413s
         max:                            1.9075s
         approx. 95 percentile:          0.3712s

Threads fairness:
    events (avg/stddev):           2500.0000/29.21
    execution time (avg/stddev):   103.2708/4.04

-2.6.29-patched

Test execution summary:
    total time:                          579.8049s
    total number of events:              10004
    total time taken by event execution: 355.3098
    per-request statistics:
         min:                            0.0000s
         avg:                            0.0355s
         max:                            1.7670s
         approx. 95 percentile:          0.3154s

Threads fairness:
    events (avg/stddev):           2501.0000/8.03
    execution time (avg/stddev):   88.8274/1.94

This patch has a measurable effect on performance.
I think there are other reasons, which should also be fixed, why fsync()
and write() with O_SYNC are slow on Btrfs.

Thanks.

Signed-off-by: Hisashi Hifumi <hifumi.hisashi@oss.ntt.co.jp>

diff -Nrup linux-2.6.29.org/fs/btrfs/ctree.h linux-2.6.29.btrfs/fs/btrfs/ctree.h
--- linux-2.6.29.org/fs/btrfs/ctree.h	2009-03-24 08:12:14.000000000 +0900
+++ linux-2.6.29.btrfs/fs/btrfs/ctree.h	2009-03-24 16:48:36.000000000 +0900
@@ -1703,6 +1703,14 @@ static inline struct dentry *fdentry(str
 	return file->f_path.dentry;
 }
 
+extern void btrfs_wait_on_page_bit(struct page *page);
+
+static inline void btrfs_wait_on_page_writeback(struct page *page)
+{
+	if (PageWriteback(page))
+		btrfs_wait_on_page_bit(page);
+}
+
 /* extent-tree.c */
 int btrfs_lookup_extent(struct btrfs_root *root, u64 start, u64 len);
 int btrfs_lookup_extent_ref(struct btrfs_trans_handle *trans,
diff -Nrup linux-2.6.29.org/fs/btrfs/extent-tree.c linux-2.6.29.btrfs/fs/btrfs/extent-tree.c
--- linux-2.6.29.org/fs/btrfs/extent-tree.c	2009-03-24 08:12:14.000000000 +0900
+++ linux-2.6.29.btrfs/fs/btrfs/extent-tree.c	2009-03-24 15:34:12.000000000 +0900
@@ -4529,7 +4529,7 @@ again:
 			goto out_unlock;
 		}
 	}
-	wait_on_page_writeback(page);
+	btrfs_wait_on_page_writeback(page);
 
 	page_start = (u64)page->index << PAGE_CACHE_SHIFT;
 	page_end = page_start + PAGE_CACHE_SIZE - 1;
diff -Nrup linux-2.6.29.org/fs/btrfs/extent_io.c linux-2.6.29.btrfs/fs/btrfs/extent_io.c
--- linux-2.6.29.org/fs/btrfs/extent_io.c	2009-03-24 08:12:14.000000000 +0900
+++ linux-2.6.29.btrfs/fs/btrfs/extent_io.c	2009-03-24 15:34:30.000000000 +0900
@@ -2423,7 +2423,7 @@ retry:
 			if (wbc->sync_mode != WB_SYNC_NONE) {
 				if (PageWriteback(page))
 					flush_fn(data);
-				wait_on_page_writeback(page);
+				btrfs_wait_on_page_writeback(page);
 			}
 
 			if (PageWriteback(page) ||
diff -Nrup linux-2.6.29.org/fs/btrfs/file.c linux-2.6.29.btrfs/fs/btrfs/file.c
--- linux-2.6.29.org/fs/btrfs/file.c	2009-03-24 08:12:14.000000000 +0900
+++ linux-2.6.29.btrfs/fs/btrfs/file.c	2009-03-24 15:34:49.000000000 +0900
@@ -967,7 +967,7 @@ again:
 			err = -ENOMEM;
 			BUG_ON(1);
 		}
-		wait_on_page_writeback(pages[i]);
+		btrfs_wait_on_page_writeback(pages[i]);
 	}
 	if (start_pos < inode->i_size) {
 		struct btrfs_ordered_extent *ordered;
diff -Nrup linux-2.6.29.org/fs/btrfs/inode.c linux-2.6.29.btrfs/fs/btrfs/inode.c
--- linux-2.6.29.org/fs/btrfs/inode.c	2009-03-24 08:12:14.000000000 +0900
+++ linux-2.6.29.btrfs/fs/btrfs/inode.c	2009-03-24 15:35:23.000000000 +0900
@@ -2733,7 +2733,7 @@ again:
 			goto out_unlock;
 		}
 	}
-	wait_on_page_writeback(page);
+	btrfs_wait_on_page_writeback(page);
 
 	lock_extent(io_tree, page_start, page_end, GFP_NOFS);
 	set_page_extent_mapped(page);
@@ -4240,7 +4240,7 @@ static void btrfs_invalidatepage(struct
 	u64 page_start = page_offset(page);
 	u64 page_end = page_start + PAGE_CACHE_SIZE - 1;
 
-	wait_on_page_writeback(page);
+	btrfs_wait_on_page_writeback(page);
 	tree = &BTRFS_I(page->mapping->host)->io_tree;
 	if (offset) {
 		btrfs_releasepage(page, GFP_NOFS);
@@ -4322,7 +4322,7 @@ again:
 		/* page got truncated out from underneath us */
 		goto out_unlock;
 	}
-	wait_on_page_writeback(page);
+	btrfs_wait_on_page_writeback(page);
 
 	lock_extent(io_tree, page_start, page_end, GFP_NOFS);
 	set_page_extent_mapped(page);
diff -Nrup linux-2.6.29.org/fs/btrfs/ioctl.c linux-2.6.29.btrfs/fs/btrfs/ioctl.c
--- linux-2.6.29.org/fs/btrfs/ioctl.c	2009-03-24 08:12:14.000000000 +0900
+++ linux-2.6.29.btrfs/fs/btrfs/ioctl.c	2009-03-24 15:35:46.000000000 +0900
@@ -400,7 +400,7 @@ again:
 		}
 	}
 
-	wait_on_page_writeback(page);
+	btrfs_wait_on_page_writeback(page);
 
 	page_start = (u64)page->index << PAGE_CACHE_SHIFT;
 	page_end = page_start + PAGE_CACHE_SIZE - 1;
diff -Nrup linux-2.6.29.org/fs/btrfs/ordered-data.c linux-2.6.29.btrfs/fs/btrfs/ordered-data.c
--- linux-2.6.29.org/fs/btrfs/ordered-data.c	2009-03-24 08:12:14.000000000 +0900
+++ linux-2.6.29.btrfs/fs/btrfs/ordered-data.c	2009-03-25 11:04:32.000000000 +0900
@@ -21,6 +21,7 @@
 #include <linux/blkdev.h>
 #include <linux/writeback.h>
 #include <linux/pagevec.h>
+#include <linux/hash.h>
 #include "ctree.h"
 #include "transaction.h"
 #include "btrfs_inode.h"
@@ -673,6 +674,46 @@ int btrfs_fdatawrite_range(struct addres
 	return btrfs_writepages(mapping, &wbc);
 }
 
+static void process_timeout(unsigned long __data)
+{
+	wake_up_process((struct task_struct *)__data);
+}
+
+static int btrfs_sync_page(void *word)
+{
+	struct address_space *mapping;
+	struct page *page;
+	struct timer_list timer;
+
+	page = container_of((unsigned long *)word, struct page, flags);
+
+	smp_mb();
+	mapping = page->mapping;
+	if (mapping && mapping->a_ops && mapping->a_ops->sync_page)
+		mapping->a_ops->sync_page(page);
+	setup_timer(&timer, process_timeout, (unsigned long)current);
+	__mod_timer(&timer, jiffies + 1);
+	io_schedule();
+	del_timer_sync(&timer);
+	return 0;
+}
+
+static wait_queue_head_t *page_waitqueue(struct page *page)
+{
+	const struct zone *zone = page_zone(page);
+
+	return &zone->wait_table[hash_ptr(page, zone->wait_table_bits)];
+}
+
+void btrfs_wait_on_page_bit(struct page *page)
+{
+	DEFINE_WAIT_BIT(wait, &page->flags, PG_writeback);
+
+	if (test_bit(PG_writeback, &page->flags))
+		__wait_on_bit(page_waitqueue(page), &wait, btrfs_sync_page,
+			      TASK_UNINTERRUPTIBLE);
+}
+
 /**
  * taken from mm/filemap.c because it isn't exported
  *
@@ -710,7 +751,7 @@ int btrfs_wait_on_page_writeback_range(s
 			if (page->index > end)
 				continue;
 
-			wait_on_page_writeback(page);
+			btrfs_wait_on_page_writeback(page);
 			if (PageError(page))
 				ret = -EIO;
 		}
diff -Nrup linux-2.6.29.org/fs/btrfs/transaction.c linux-2.6.29.btrfs/fs/btrfs/transaction.c
--- linux-2.6.29.org/fs/btrfs/transaction.c	2009-03-24 08:12:14.000000000 +0900
+++ linux-2.6.29.btrfs/fs/btrfs/transaction.c	2009-03-24 15:37:19.000000000 +0900
@@ -352,7 +352,7 @@ int btrfs_write_and_wait_marked_extents(
 
 		if (PageWriteback(page)) {
 			if (PageDirty(page))
-				wait_on_page_writeback(page);
+				btrfs_wait_on_page_writeback(page);
 			else {
 				unlock_page(page);
 				page_cache_release(page);
@@ -380,12 +380,12 @@ int btrfs_write_and_wait_marked_extents(
 			continue;
 
 		if (PageDirty(page)) {
 			btree_lock_page_hook(page);
-			wait_on_page_writeback(page);
+			btrfs_wait_on_page_writeback(page);
 			err = write_one_page(page, 0);
 			if (err)
 				werr = err;
 		}
-		wait_on_page_writeback(page);
+		btrfs_wait_on_page_writeback(page);
 		page_cache_release(page);
 		cond_resched();
 	}
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
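[Editorial note: the core idea of btrfs_wait_on_page_bit() above — don't
just sleep until writeback completes, but wake every tick and kick
(unplug) the queue yourself — can be sketched in userspace. This is a toy
illustration only, not kernel code; PluggedQueue, wait_for_writeback, and
all names here are invented for the sketch.]

```python
import threading
import time

class PluggedQueue:
    """Toy stand-in for a plugged block queue: submitted work sits idle
    until somebody calls unplug() (in the kernel, kblockd would do this
    after a timeout if nobody else did)."""
    def __init__(self):
        self._pending = []
        self._lock = threading.Lock()
        self.completed = threading.Event()

    def submit(self, bio):
        with self._lock:
            self._pending.append(bio)

    def unplug(self):
        # Dispatch whatever has been queued and mark the "I/O" complete.
        with self._lock:
            if self._pending:
                self._pending.clear()
                self.completed.set()

def wait_for_writeback(q, tick=0.001):
    """Like btrfs_wait_on_page_bit(): instead of one indefinite sleep,
    wake at most one tick apart and kick the queue ourselves, so the
    waiter never depends on a background unplug timer firing."""
    while not q.completed.is_set():
        q.unplug()              # the per-tick kick
        q.completed.wait(tick)  # sleep no longer than one tick

q = PluggedQueue()
# A "kernel thread" submits the bio a little later, from another context,
# mirroring how btrfs issues write I/O outside the fsync() caller.
threading.Thread(target=lambda: (time.sleep(0.005), q.submit("bio"))).start()
wait_for_writeback(q)
print("writeback complete")
```

Because the waiter re-kicks on every wakeup, completion latency is bounded
by one tick rather than by whenever the background timer happens to fire.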
Chris Mason
2009-Mar-31 11:27 UTC
Re: [RFC] [PATCH] Btrfs: improve fsync/osync write performance
On Tue, 2009-03-31 at 14:18 +0900, Hisashi Hifumi wrote:
> Hi Chris.
>
> I noticed performance of fsync() and write() with O_SYNC flag on Btrfs is
> very slow as compared to ext3/4. I used blktrace to try to investigate
> the cause of this. One cause is that unplug is done by kblockd even if
> the I/O is issued through fsync() or write() with O_SYNC flag. kblockd's
> unplug timeout is 3msec, so unplug via kblockd can decrease I/O response.
> To increase fsync/osync write performance, speeding up unplug should be
> done here.
>
> Btrfs's write I/O is issued via kernel thread, not via user application
> context that calls fsync(). While waiting for page writeback,
> wait_on_page_writeback() can not unplug I/O sometimes on Btrfs because
> submit_bio is not called from user application context, so when
> submit_bio is called from kernel thread, wait_on_page_writeback() sleeps
> on io_schedule().

This is exactly right, and one of the uglier side effects of the async
helper kernel threads. I've been thinking for a while about a clean way
to fix it.

> I introduced btrfs_wait_on_page_writeback() in the following patch; this
> is a replacement of wait_on_page_writeback() for Btrfs. This does unplug
> every 1 tick while waiting for page writeback.
>
> I did a performance test using sysbench.
>
> [...]
>
> This patch has some effect for performance improvement.
>
> I think there are other reasons that should be fixed why fsync() or
> write() with O_SYNC flag is slow on Btrfs.

Very nice. Could I trouble you to try one more experiment? The other
way to fix this is to use WRITE_SYNC instead of WRITE. Could you
please hardcode WRITE_SYNC in the btrfs submit_bio paths and benchmark
that?

It doesn't cover as many cases as your patch, but it might have a lower
overall impact.

-chris
Chris Mason
2009-Apr-01 15:17 UTC
Re: [RFC] [PATCH] Btrfs: improve fsync/osync write performance
On Tue, 2009-03-31 at 14:18 +0900, Hisashi Hifumi wrote:
> Hi Chris.
>
> I noticed performance of fsync() and write() with O_SYNC flag on Btrfs is
> very slow as compared to ext3/4. I used blktrace to try to investigate
> the cause of this. One cause is that unplug is done by kblockd even if
> the I/O is issued through fsync() or write() with O_SYNC flag. kblockd's
> unplug timeout is 3msec, so unplug via kblockd can decrease I/O response.
> To increase fsync/osync write performance, speeding up unplug should be
> done here.

I realized today that all of the async thread handling btrfs does for
writes gives us plenty of time to queue up IO for the block device. If
that's true, we can just unplug the block device in the async helper
thread and get pretty good coverage for the problem you're describing.

Could you please try the patch below and see if it performs well? I did
some O_DIRECT testing on a 5 drive array, and tput jumped from 386MB/s
to 450MB/s for large writes.

Thanks again for digging through this problem.

-chris

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index dd06e18..bf377ab 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -146,7 +146,7 @@ static noinline int run_scheduled_bios(struct btrfs_device *device)
 	unsigned long num_run = 0;
 	unsigned long limit;
 
-	bdi = device->bdev->bd_inode->i_mapping->backing_dev_info;
+	bdi = blk_get_backing_dev_info(device->bdev);
 	fs_info = device->dev_root->fs_info;
 	limit = btrfs_async_submit_limit(fs_info);
 	limit = limit * 2 / 3;
@@ -231,6 +231,19 @@ loop_lock:
 	if (device->pending_bios)
 		goto loop_lock;
 	spin_unlock(&device->io_lock);
+
+	/*
+	 * IO has already been through a long path to get here.  Checksumming,
+	 * async helper threads, perhaps compression.  We've done a pretty
+	 * good job of collecting a batch of IO and should just unplug
+	 * the device right away.
+	 *
+	 * This will help anyone who is waiting on the IO, they might have
+	 * already unplugged, but managed to do so before the bio they
+	 * cared about found its way down here.
+	 */
+	if (bdi->unplug_io_fn)
+		bdi->unplug_io_fn(bdi, NULL);
 done:
 	return 0;
 }
Jens Axboe
2009-Apr-01 17:01 UTC
Re: [RFC] [PATCH] Btrfs: improve fsync/osync write performance
On Wed, Apr 01 2009, Chris Mason wrote:
> I realized today that all of the async thread handling btrfs does for
> writes gives us plenty of time to queue up IO for the block device. If
> that's true, we can just unplug the block device in the async helper
> thread and get pretty good coverage for the problem you're describing.
>
> Could you please try the patch below and see if it performs well? I did
> some O_DIRECT testing on a 5 drive array, and tput jumped from 386MB/s
> to 450MB/s for large writes.
>
> [...]
>
> +	if (bdi->unplug_io_fn)
> +		bdi->unplug_io_fn(bdi, NULL);

	blk_run_backing_dev(bdi, NULL);

:-)

--
Jens Axboe
Hisashi Hifumi
2009-Apr-02 02:02 UTC
Re: [RFC] [PATCH] Btrfs: improve fsync/osync write performance
At 20:27 09/03/31, Chris Mason wrote:
>On Tue, 2009-03-31 at 14:18 +0900, Hisashi Hifumi wrote:
>> [...]
>
>Very nice. Could I trouble you to try one more experiment? The other
>way to fix this is to use WRITE_SYNC instead of WRITE. Could you
>please hardcode WRITE_SYNC in the btrfs submit_bio paths and benchmark
>that?
>
>It doesn't cover as many cases as your patch, but it might have a lower
>overall impact.

Hi.

I wrote a hardcoded-WRITE_SYNC patch for the btrfs submit_bio paths, as
shown below, and ran the sysbench test with it. Later, I will try your
unplug patch.

diff -Nrup linux-2.6.29.org/fs/btrfs/disk-io.c linux-2.6.29.btrfs_sync/fs/btrfs/disk-io.c
--- linux-2.6.29.org/fs/btrfs/disk-io.c	2009-03-24 08:12:14.000000000 +0900
+++ linux-2.6.29.btrfs_sync/fs/btrfs/disk-io.c	2009-04-01 16:26:56.000000000 +0900
@@ -2068,7 +2068,7 @@ static int write_dev_supers(struct btrfs
 		}
 
 		if (i == last_barrier && do_barriers && device->barriers) {
-			ret = submit_bh(WRITE_BARRIER, bh);
+			ret = submit_bh(WRITE_BARRIER|WRITE_SYNC, bh);
 			if (ret == -EOPNOTSUPP) {
 				printk("btrfs: disabling barriers on dev %s\n",
 				       device->name);
@@ -2076,10 +2076,10 @@ static int write_dev_supers(struct btrfs
 				device->barriers = 0;
 				get_bh(bh);
 				lock_buffer(bh);
-				ret = submit_bh(WRITE, bh);
+				ret = submit_bh(WRITE_SYNC, bh);
 			}
 		} else {
-			ret = submit_bh(WRITE, bh);
+			ret = submit_bh(WRITE_SYNC, bh);
 		}
 
 		if (!ret && wait) {
diff -Nrup linux-2.6.29.org/fs/btrfs/extent_io.c linux-2.6.29.btrfs_sync/fs/btrfs/extent_io.c
--- linux-2.6.29.org/fs/btrfs/extent_io.c	2009-03-24 08:12:14.000000000 +0900
+++ linux-2.6.29.btrfs_sync/fs/btrfs/extent_io.c	2009-04-01 14:48:08.000000000 +0900
@@ -1851,8 +1851,11 @@ static int submit_one_bio(int rw, struct
 	if (tree->ops && tree->ops->submit_bio_hook)
 		tree->ops->submit_bio_hook(page->mapping->host, rw, bio,
 					   mirror_num, bio_flags);
-	else
+	else {
+		if (rw & WRITE)
+			rw = WRITE_SYNC;
 		submit_bio(rw, bio);
+	}
 	if (bio_flagged(bio, BIO_EOPNOTSUPP))
 		ret = -EOPNOTSUPP;
 	bio_put(bio);
diff -Nrup linux-2.6.29.org/fs/btrfs/volumes.c linux-2.6.29.btrfs_sync/fs/btrfs/volumes.c
--- linux-2.6.29.org/fs/btrfs/volumes.c	2009-03-24 08:12:14.000000000 +0900
+++ linux-2.6.29.btrfs_sync/fs/btrfs/volumes.c	2009-04-01 16:25:51.000000000 +0900
@@ -195,6 +195,8 @@ loop_lock:
 		BUG_ON(atomic_read(&cur->bi_cnt) == 0);
 
 		bio_get(cur);
+		if (cur->bi_rw & WRITE)
+			cur->bi_rw = WRITE_SYNC;
 		submit_bio(cur->bi_rw, cur);
 		bio_put(cur);
 		num_run++;
@@ -2815,8 +2817,11 @@ int btrfs_map_bio(struct btrfs_root *roo
 			bio->bi_bdev = dev->bdev;
 			if (async_submit)
 				schedule_bio(root, dev, rw, bio);
-			else
+			else {
+				if (rw & WRITE)
+					rw = WRITE_SYNC;
 				submit_bio(rw, bio);
+			}
 		} else {
 			bio->bi_bdev = root->fs_info->fs_devices->latest_bdev;
 			bio->bi_sector = logical >> 9;

# sysbench --num-threads=4 --max-requests=10000 --test=fileio --file-num=1
  --file-block-size=4K --file-total-size=128M --file-test-mode=rndwr
  --file-fsync-freq=5 run

The result was:

-2.6.29

Test execution summary:
    total time:                          619.6822s
    total number of events:              10003
    total time taken by event execution: 403.1020
    per-request statistics:
         min:                            0.0000s
         avg:                            0.0403s
         max:                            1.4584s
         approx. 95 percentile:          0.3761s

Threads fairness:
    events (avg/stddev):           2500.7500/48.48
    execution time (avg/stddev):   100.7755/7.92

-2.6.29-WRITE_SYNC-patched

Test execution summary:
    total time:                          596.8114s
    total number of events:              10004
    total time taken by event execution: 396.2378
    per-request statistics:
         min:                            0.0000s
         avg:                            0.0396s
         max:                            1.6926s
         approx. 95 percentile:          0.3434s

Threads fairness:
    events (avg/stddev):           2501.0000/58.28
    execution time (avg/stddev):   99.0595/2.84
Hisashi Hifumi
2009-Apr-02 06:25 UTC
Re: [RFC] [PATCH] Btrfs: improve fsync/osync write performance
At 00:17 09/04/02, Chris Mason wrote:
>On Tue, 2009-03-31 at 14:18 +0900, Hisashi Hifumi wrote:
>> [...]
>
>I realized today that all of the async thread handling btrfs does for
>writes gives us plenty of time to queue up IO for the block device. If
>that's true, we can just unplug the block device in the async helper
>thread and get pretty good coverage for the problem you're describing.
>
>Could you please try the patch below and see if it performs well? I did
>some O_DIRECT testing on a 5 drive array, and tput jumped from 386MB/s
>to 450MB/s for large writes.
>
>Thanks again for digging through this problem.
>
>-chris
>
>[...]

I tested your unplug patch.

# sysbench --num-threads=4 --max-requests=10000 --test=fileio --file-num=1
  --file-block-size=4K --file-total-size=128M --file-test-mode=rndwr
  --file-fsync-freq=5 run

-2.6.29

Test execution summary:
    total time:                          626.9416s
    total number of events:              10004
    total time taken by event execution: 442.5869
    per-request statistics:
         min:                            0.0000s
         avg:                            0.0442s
         max:                            1.4229s
         approx. 95 percentile:          0.3959s

Threads fairness:
    events (avg/stddev):           2501.0000/73.43
    execution time (avg/stddev):   110.6467/7.15

-2.6.29-patched

Operations performed:  0 Read, 10003 Write, 1996 Other = 11999 Total
Read 0b  Written 39.074Mb  Total transferred 39.074Mb  (68.269Kb/sec)
   17.07 Requests/sec executed

Test execution summary:
    total time:                          586.0944s
    total number of events:              10003
    total time taken by event execution: 347.5348
    per-request statistics:
         min:                            0.0000s
         avg:                            0.0347s
         max:                            2.2546s
         approx. 95 percentile:          0.3090s

Threads fairness:
    events (avg/stddev):           2500.7500/54.98
    execution time (avg/stddev):   86.8837/3.06

We do get some performance improvement from this patch.

What about the case of write() without O_SYNC? I am concerned about
reducing the block layer's optimization opportunities (merging, sorting)
when the I/O is not fsync or write with O_SYNC.
Chris Mason
2009-Apr-02 11:25 UTC
Re: [RFC] [PATCH] Btrfs: improve fsync/osync write performance
On Thu, 2009-04-02 at 15:25 +0900, Hisashi Hifumi wrote:
> I tested your unplug patch.
>
> [...]

Very nice.

> We do get some performance improvement from this patch.
>
> What about the case of write() without O_SYNC? I am concerned about
> reducing the block layer's optimization opportunities (merging, sorting)
> when the I/O is not fsync or write with O_SYNC.

The performance should still be good for normal workloads, mostly
because the async threads try to collect IO already.

Basically what happens is the bio is first sent to the checksumming
threads, which do a bunch of checksums but still queue things in such a
way that the IO is sent down in order. This takes some time.

Then the bios are put into the submit bio thread pool, which wakes up a
different process to send it down. It might mean slightly less merging
than before, but it should still give the elevator enough bios to work
with.

-chris
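[Editorial note: the batching argument above — that out-of-order arrival
from helper threads costs little as long as the block layer receives a
whole batch to sort and merge before dispatch — can be illustrated with a
toy elevator pass. This is purely illustrative; elevator_merge and the
(sector, length) request tuples are invented for the sketch, not the
kernel's actual elevator code.]

```python
def elevator_merge(batch):
    """Sort (sector, length) requests and merge physically adjacent ones,
    the way an I/O elevator coalesces a batch before dispatch."""
    merged = []
    for sector, length in sorted(batch):
        if merged and merged[-1][0] + merged[-1][1] == sector:
            # Back-merge: this request starts where the last one ends.
            merged[-1] = (merged[-1][0], merged[-1][1] + length)
        else:
            merged.append((sector, length))
    return merged

# Five 8-sector bios queued out of order by async helper threads; as a
# batch they still collapse into two contiguous dispatches.
batch = [(8, 8), (0, 8), (40, 8), (16, 8), (48, 8)]
print(elevator_merge(batch))  # prints [(0, 24), (40, 16)]
```

Submitted one at a time with an immediate unplug after each, the same five
bios could each be dispatched before their neighbor arrived; handing the
elevator the collected batch preserves the merge.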