Josef Bacik
2011-Apr-15 19:09 UTC
[RFC] Add a new file op for fsync to give fs's more control
Btrfs needs to be able to control how data is submitted in the case of
fsync to make it a little faster, and really we could get rid of holding
the i_mutex altogether as well.  So introduce a ->fsync_nolock helper that
pushes the responsibility of locking the inode and doing the
filemap_write_and_wait_range down into the fs so we can have better
control of how we submit the io and do our locking.  It looks like ext4
and probably xfs could get away with not taking the i_mutex either, so
they may benefit from this as well.

Really I could just change ->fsync() to do this and push everything down
into all the filesystems, but I wasn't sure how well that would be
received, so I'm taking this approach.  Thanks,

Josef
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
This adds the fsync_nolock file operation for filesystems that need a little
more control over how fsync is used.  Any filesystem that uses this is
responsible for their own locking and making sure the data for the range
provided is actually synced out.  Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
---
 Documentation/filesystems/vfs.txt |    4 ++++
 fs/sync.c                         |    6 +++++-
 include/linux/fs.h                |    1 +
 3 files changed, 10 insertions(+), 1 deletions(-)

diff --git a/Documentation/filesystems/vfs.txt b/Documentation/filesystems/vfs.txt
index 80815ed..fba2064 100644
--- a/Documentation/filesystems/vfs.txt
+++ b/Documentation/filesystems/vfs.txt
@@ -758,6 +758,7 @@ struct file_operations {
 	int (*fsync) (struct file *, int datasync);
 	int (*aio_fsync) (struct kiocb *, int datasync);
 	int (*fasync) (int, struct file *, int);
+	int (*fsync_nolock) (struct file *, loff_t, loff_t, int);
 	int (*lock) (struct file *, int, struct file_lock *);
 	ssize_t (*readv) (struct file *, const struct iovec *, unsigned long, loff_t *);
 	ssize_t (*writev) (struct file *, const struct iovec *, unsigned long, loff_t *);
@@ -815,6 +816,9 @@ otherwise noted.
   fasync: called by the fcntl(2) system call when asynchronous
 	(non-blocking) mode is enabled for a file
 
+  fsync_nolock: called by the fsync(2) system call, this is used in lieu of
+	->fsync if your fs does it's own locking for fsync.
+
   lock: called by the fcntl(2) system call for F_GETLK, F_SETLK, and F_SETLKW
 	commands
 
diff --git a/fs/sync.c b/fs/sync.c
index c38ec16..d0ff770 100644
--- a/fs/sync.c
+++ b/fs/sync.c
@@ -168,11 +168,15 @@ int vfs_fsync_range(struct file *file, loff_t start, loff_t end, int datasync)
 	struct address_space *mapping = file->f_mapping;
 	int err, ret;
 
-	if (!file->f_op || !file->f_op->fsync) {
+	if (!file->f_op ||
+	    (!file->f_op->fsync && !file->f_op->fsync_nolock)) {
 		ret = -EINVAL;
 		goto out;
 	}
 
+	if (file->f_op->fsync_nolock)
+		return file->f_op->fsync_nolock(file, start, end, datasync);
+
 	ret = filemap_write_and_wait_range(mapping, start, end);
 
 	/*
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 1b95af3..0764d6a 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1553,6 +1553,7 @@ struct file_operations {
 	int (*fsync) (struct file *, int datasync);
 	int (*aio_fsync) (struct kiocb *, int datasync);
 	int (*fasync) (int, struct file *, int);
+	int (*fsync_nolock) (struct file *, loff_t, loff_t, int);
 	int (*lock) (struct file *, int, struct file_lock *);
 	ssize_t (*sendpage) (struct file *, struct page *, int, size_t, loff_t *, int);
 	unsigned long (*get_unmapped_area)(struct file *, unsigned long, unsigned long, unsigned long, unsigned long);
--
1.7.2.3
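For readers who want to see the dispatch in the fs/sync.c hunk in isolation, here is a minimal userspace sketch of the same decision. Everything named toy_* is hypothetical and stands in for kernel types; the counters only exist so the two paths can be told apart. It is an illustration of the control flow under those assumptions, not kernel code.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Userspace stand-ins for illustration only; not the kernel's types. */
struct toy_file;

struct toy_file_ops {
	int (*fsync)(struct toy_file *f, int datasync);
	int (*fsync_nolock)(struct toy_file *f, long long start,
			    long long end, int datasync);
};

struct toy_file {
	const struct toy_file_ops *f_op;
	int generic_flushes;	/* flushes done by the generic layer */
	int generic_locks;	/* lock/unlock pairs taken by the generic layer */
	int fs_flushes;		/* flushes done by the filesystem itself */
};

/* Legacy-style op: relies on the caller having flushed and locked. */
static int toy_legacy_fsync(struct toy_file *f, int datasync)
{
	(void)f;
	(void)datasync;
	return 0;
}

/* New-style op: the filesystem flushes the range and does its own locking. */
static int toy_nolock_fsync(struct toy_file *f, long long start,
			    long long end, int datasync)
{
	(void)start;
	(void)end;
	(void)datasync;
	f->fs_flushes++;
	return 0;
}

static const struct toy_file_ops toy_legacy_ops = { .fsync = toy_legacy_fsync };
static const struct toy_file_ops toy_nolock_ops = { .fsync_nolock = toy_nolock_fsync };

/* Mirrors the order of checks in the patched vfs_fsync_range(). */
static int toy_vfs_fsync_range(struct toy_file *f, long long start,
			       long long end, int datasync)
{
	assert(f != NULL);

	if (!f->f_op || (!f->f_op->fsync && !f->f_op->fsync_nolock))
		return -EINVAL;

	/* The filesystem asked for full control: skip the generic work. */
	if (f->f_op->fsync_nolock)
		return f->f_op->fsync_nolock(f, start, end, datasync);

	f->generic_flushes++;	/* stands in for filemap_write_and_wait_range() */
	f->generic_locks++;	/* stands in for taking and dropping i_mutex */
	return f->f_op->fsync(f, datasync);
}
```

A file using toy_nolock_ops never touches the generic counters, which is the property the patch is after.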
Btrfs needs to be able to control how IO is submitted in the fsync case, so in
preparation of this work convert to the ->fsync_nolock file op.  Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
---
 fs/btrfs/ctree.h |    2 +-
 fs/btrfs/file.c  |   27 ++++++++++++++++++---------
 fs/btrfs/inode.c |    2 +-
 3 files changed, 20 insertions(+), 11 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index d5f043e..b409721 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -2567,7 +2567,7 @@ void btrfs_update_iflags(struct inode *inode);
 void btrfs_inherit_iflags(struct inode *inode, struct inode *dir);
 
 /* file.c */
-int btrfs_sync_file(struct file *file, int datasync);
+int btrfs_sync_file(struct file *file, loff_t start, loff_t end, int datasync);
 int btrfs_drop_extent_cache(struct inode *inode, u64 start, u64 end,
 			    int skip_pinned);
 int btrfs_check_file(struct btrfs_root *root, struct inode *inode);
diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index cd5e82e..d50eea8 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -1194,19 +1194,23 @@ int btrfs_release_file(struct inode *inode, struct file *filp)
  * important optimization for directories because holding the mutex prevents
  * new operations on the dir while we write to disk.
  */
-int btrfs_sync_file(struct file *file, int datasync)
+int btrfs_sync_file(struct file *file, loff_t start, loff_t end, int datasync)
 {
 	struct dentry *dentry = file->f_path.dentry;
 	struct inode *inode = dentry->d_inode;
 	struct btrfs_root *root = BTRFS_I(inode)->root;
+	int err;
 	int ret = 0;
 	struct btrfs_trans_handle *trans;
 
 	trace_btrfs_sync_file(file, datasync);
 
+	err = filemap_write_and_wait_range(inode->i_mapping, start, end);
+
+	mutex_lock(&inode->i_mutex);
+
 	/* we wait first, since the writeback may change the inode */
 	root->log_batch++;
-	/* the VFS called filemap_fdatawrite for us */
 	btrfs_wait_ordered_range(inode, 0, (u64)-1);
 	root->log_batch++;
 
@@ -1215,7 +1219,7 @@ int btrfs_sync_file(struct file *file, int datasync)
 	 * and see if its already been committed
 	 */
 	if (!BTRFS_I(inode)->last_trans)
-		goto out;
+		goto out_lock;
 
 	/*
 	 * if the last transaction that changed this file was before
@@ -1226,7 +1230,7 @@ int btrfs_sync_file(struct file *file, int datasync)
 	if (BTRFS_I(inode)->last_trans <
 	    root->fs_info->last_trans_committed) {
 		BTRFS_I(inode)->last_trans = 0;
-		goto out;
+		goto out_lock;
 	}
 
 	/*
@@ -1238,12 +1242,12 @@ int btrfs_sync_file(struct file *file, int datasync)
 	trans = btrfs_start_transaction(root, 0);
 	if (IS_ERR(trans)) {
 		ret = PTR_ERR(trans);
-		goto out;
+		goto out_lock;
 	}
 
 	ret = btrfs_log_dentry_safe(trans, root, dentry);
 	if (ret < 0)
-		goto out;
+		goto out_lock;
 
 	/* we've logged all the items and now have a consistent
 	 * version of the file in the log.  It is possible that
@@ -1255,7 +1259,7 @@ int btrfs_sync_file(struct file *file, int datasync)
 	 * file again, but that will end up using the synchronization
 	 * inside btrfs_sync_log to keep things safe.
 	 */
-	mutex_unlock(&dentry->d_inode->i_mutex);
+	mutex_unlock(&inode->i_mutex);
 
 	if (ret != BTRFS_NO_LOG_SYNC) {
 		if (ret > 0) {
@@ -1270,8 +1274,13 @@ int btrfs_sync_file(struct file *file, int datasync)
 	} else {
 		ret = btrfs_end_transaction(trans, root);
 	}
-	mutex_lock(&dentry->d_inode->i_mutex);
+	goto out;
+
+out_lock:
+	mutex_unlock(&inode->i_mutex);
 out:
+	if (!ret)
+		ret = err;
 	return ret > 0 ? -EIO : ret;
 }
 
@@ -1416,7 +1425,7 @@ const struct file_operations btrfs_file_operations = {
 	.mmap		= btrfs_file_mmap,
 	.open		= generic_file_open,
 	.release	= btrfs_release_file,
-	.fsync		= btrfs_sync_file,
+	.fsync_nolock	= btrfs_sync_file,
 	.fallocate	= btrfs_fallocate,
 	.unlocked_ioctl	= btrfs_ioctl,
 #ifdef CONFIG_COMPAT
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index b1e5b11..e80b999 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -7490,7 +7490,7 @@ static const struct file_operations btrfs_dir_file_operations = {
 	.compat_ioctl	= btrfs_ioctl,
 #endif
 	.release        = btrfs_release_file,
-	.fsync		= btrfs_sync_file,
+	.fsync_nolock	= btrfs_sync_file,
 };
 
 static struct extent_io_ops btrfs_extent_io_ops = {
--
1.7.2.3
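The locking change in btrfs_sync_file() boils down to two exit labels: early exits leave through out_lock and drop the mutex, while the slow path drops it before the log sync and jumps straight to out. The sketch below models just that shape in userspace, assuming a hypothetical toy_inode with an integer in place of i_mutex and a canned log_result in place of the real logging calls. It is meant to make the "unlock exactly once on every path" property easy to check, nothing more.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical stand-in for an inode; lock_depth models i_mutex. */
struct toy_inode {
	int lock_depth;	/* 1 while the "mutex" is held, 0 otherwise */
	int last_trans;	/* 0 means there is nothing to log */
	int log_result;	/* what the pretend logging step returns */
};

static void toy_lock(struct toy_inode *inode)
{
	assert(inode->lock_depth == 0);
	inode->lock_depth = 1;
}

static void toy_unlock(struct toy_inode *inode)
{
	assert(inode->lock_depth == 1);
	inode->lock_depth = 0;
}

/* flush_err models the result of the flush done before taking the lock. */
static int toy_sync_file(struct toy_inode *inode, int flush_err)
{
	int err = flush_err;
	int ret = 0;

	toy_lock(inode);

	if (!inode->last_trans)
		goto out_lock;		/* nothing to do, still must unlock */

	ret = inode->log_result;
	if (ret < 0)
		goto out_lock;		/* logging failed, still must unlock */

	/* Slow path: drop the lock before the expensive log sync. */
	toy_unlock(inode);
	/* ... the log sync or transaction commit would run here, unlocked ... */
	goto out;

out_lock:
	toy_unlock(inode);
out:
	if (!ret)
		ret = err;		/* report the flush error if nothing else failed */
	return ret > 0 ? -EIO : ret;
}
```

The assert() calls in toy_lock() and toy_unlock() would trip on a double unlock or a missed unlock, which is the class of mistake the out_lock label guards against.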
Christoph Hellwig
2011-Apr-15 19:24 UTC
Re: [RFC] Add a new file op for fsync to give fs's more control
Sorry, but this is too ugly to live.  If the reason for this really is
good enough we'll just need to push the filemap_write_and_wait_range
and i_mutex locking into every ->fsync instance.
Josef Bacik
2011-Apr-15 19:32 UTC
Re: [RFC] Add a new file op for fsync to give fs's more control
On 04/15/2011 03:24 PM, Christoph Hellwig wrote:
> Sorry, but this is too ugly to live.  If the reason for this really is
> good enough we'll just need to push the filemap_write_and_wait_range
> and i_mutex locking into every ->fsync instance.
>

So part of what makes small fsyncs slow in btrfs is all of our random
threads to make checksumming not suck.  So we submit IO which spreads it
out to helper threads to do the checksumming, and then when it returns
it gets handed off to endio threads that run the endio stuff.  This
works awesome with doing big writes and such, but if say we're an RPM
database and write a couple of kilobytes, this tends to suck because we
keep handing work off to other threads and waiting, so the scheduling
latencies really hurt.

So we'd like to be able to say "hey this is a small amount of io, let's
just do the checksumming in the current thread", and the same with
handling the endio stuff.  We can't do that currently because
filemap_write_and_wait_range is called before we get to fsync.  We'd
like to be able to control this so we can do the appropriate magic to do
the submission within the fsyncing thread's context in order to speed
things up a bit.

That plus the stuff I said about i_mutex.  Is that a good enough reason
to just push this down into all the filesystems?  Thanks,

Josef
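The idea in the message above (checksum small writes in the caller's thread, hand large ones to helpers) can be sketched as a size check at submission time. The names below are all hypothetical, the 4096-byte cutoff is an arbitrary illustration rather than a btrfs constant, and the checksum is a placeholder, not the one btrfs uses. The deferred branch only marks the write as not yet checksummed; a real implementation would queue it to a worker.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_INLINE_LIMIT 4096	/* hypothetical cutoff, not a btrfs constant */

enum toy_csum_path { TOY_CSUM_INLINE, TOY_CSUM_DEFERRED };

struct toy_write {
	const uint8_t *data;
	size_t len;
	uint32_t csum;		/* valid only when csum_done is set */
	int csum_done;
};

/* Placeholder checksum for the sketch; not the filesystem's algorithm. */
static uint32_t toy_csum(const uint8_t *p, size_t len)
{
	uint32_t sum = 0;

	for (size_t i = 0; i < len; i++)
		sum = sum * 31 + p[i];
	return sum;
}

/*
 * Small writes are checksummed in the caller's context, which avoids the
 * hand-off and wake-up latency; larger writes are left for helper threads.
 */
static enum toy_csum_path toy_submit(struct toy_write *w)
{
	assert(w != NULL && w->data != NULL);

	if (w->len <= TOY_INLINE_LIMIT) {
		w->csum = toy_csum(w->data, w->len);
		w->csum_done = 1;
		return TOY_CSUM_INLINE;
	}

	/* A real implementation would queue the write to a worker here. */
	w->csum_done = 0;
	return TOY_CSUM_DEFERRED;
}
```

The point of the thread is that such a decision can only be made if the filesystem, rather than the generic fsync path, is the one submitting the IO.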
Chris Mason
2011-Apr-15 19:34 UTC
Re: [RFC] Add a new file op for fsync to give fs's more control
Excerpts from Christoph Hellwig's message of 2011-04-15 15:24:12 -0400:
> Sorry, but this is too ugly to live.  If the reason for this really is
> good enough we'll just need to push the filemap_write_and_wait_range
> and i_mutex locking into every ->fsync instance.
>

Which part is too ugly to live?  The special op?  New parameters?

The unconditional taking of i_mutex hurts a lot, especially on directory
fsyncs, so I'd love to get rid of it.

-chris
Christoph Hellwig
2011-Apr-15 19:49 UTC
Re: [RFC] Add a new file op for fsync to give fs's more control
On Fri, Apr 15, 2011 at 03:34:57PM -0400, Chris Mason wrote:
> Excerpts from Christoph Hellwig's message of 2011-04-15 15:24:12 -0400:
> > Sorry, but this is too ugly to live.  If the reason for this really is
> > good enough we'll just need to push the filemap_write_and_wait_range
> > and i_mutex locking into every ->fsync instance.
> >
>
> Which part is too ugly to live?  The special op?  New parameters?

Two different fsync ops, when we could trivially do with one by pushing
things down.
liubo
2011-Apr-18 06:49 UTC
Re: [RFC] Add a new file op for fsync to give fs's more control
On 04/16/2011 03:32 AM, Josef Bacik wrote:
> On 04/15/2011 03:24 PM, Christoph Hellwig wrote:
>> Sorry, but this is too ugly to live.  If the reason for this really is
>> good enough we'll just need to push the filemap_write_and_wait_range
>> and i_mutex locking into every ->fsync instance.
>>
>
> So part of what makes small fsyncs slow in btrfs is all of our random
> threads to make checksumming not suck.  So we submit IO which spreads it
> out to helper threads to do the checksumming, and then when it returns
> it gets handed off to endio threads that run the endio stuff.  This
> works awesome with doing big writes and such, but if say we're an RPM
> database and write a couple of kilobytes, this tends to suck because we
> keep handing work off to other threads and waiting, so the scheduling
> latencies really hurt.
>
> So we'd like to be able to say "hey this is a small amount of io, let's
> just do the checksumming in the current thread", and the same with
> handling the endio stuff.  We can't do that currently because
> filemap_write_and_wait_range is called before we get to fsync.  We'd
> like to be able to control this so we can do the appropriate magic to do
> the submission within the fsyncing thread's context in order to speed
> things up a bit.
>
> That plus the stuff I said about i_mutex.  Is that a good enough reason
> to just push this down into all the filesystems?  Thanks,
>

Fine with the i_mutex.

I'm wondering whether it is worth doing so?

I've tested your patch with sysbench, and there is little improvement. :(

Sysbench args:
sysbench --test=fileio --num-threads=1 --file-num=10240 --file-block-size=1K --file-total-size=20M --file-test-mode=rndwr --file-io-mode=sync --file-extra-flags= run

10240 files, 2Kb each
==
fsync_nolock (patch):
Operations performed:  0 Read, 10000 Write, 1024000 Other = 1034000 Total
Read 0b  Written 9.7656Mb  Total transferred 9.7656Mb  (35.152Kb/sec)
   35.15 Requests/sec executed

fsync (orig):
Operations performed:  0 Read, 10000 Write, 1024000 Other = 1034000 Total
Read 0b  Written 9.7656Mb  Total transferred 9.7656Mb  (35.287Kb/sec)
   35.29 Requests/sec executed
==

Seems that the improvement from avoiding the thread hand-offs is not enough.

BTW, I'm trying to improve the fsync performance stuff, but mainly for
large files (>4G).  And I found that a large file will have a tremendous
amount of csum items that need to be flushed into the tree log during
fsync().  Btrfs now uses a brute force approach to ensure it gets the
most up-to-date copies of everything, and this results in bad
performance.  How to change the brute-force way is bugging me a lot...

thanks,
liubo
Josef Bacik
2011-Apr-18 14:10 UTC
Re: [RFC] Add a new file op for fsync to give fs's more control
On 04/18/2011 02:49 AM, liubo wrote:
> On 04/16/2011 03:32 AM, Josef Bacik wrote:
>> On 04/15/2011 03:24 PM, Christoph Hellwig wrote:
>>> Sorry, but this is too ugly to live.  If the reason for this really is
>>> good enough we'll just need to push the filemap_write_and_wait_range
>>> and i_mutex locking into every ->fsync instance.
>>>
>>
>> So part of what makes small fsyncs slow in btrfs is all of our random
>> threads to make checksumming not suck.  So we submit IO which spreads it
>> out to helper threads to do the checksumming, and then when it returns
>> it gets handed off to endio threads that run the endio stuff.  This
>> works awesome with doing big writes and such, but if say we're an RPM
>> database and write a couple of kilobytes, this tends to suck because we
>> keep handing work off to other threads and waiting, so the scheduling
>> latencies really hurt.
>>
>> So we'd like to be able to say "hey this is a small amount of io, let's
>> just do the checksumming in the current thread", and the same with
>> handling the endio stuff.  We can't do that currently because
>> filemap_write_and_wait_range is called before we get to fsync.  We'd
>> like to be able to control this so we can do the appropriate magic to do
>> the submission within the fsyncing thread's context in order to speed
>> things up a bit.
>>
>> That plus the stuff I said about i_mutex.  Is that a good enough reason
>> to just push this down into all the filesystems?  Thanks,
>>
>
> Fine with the i_mutex.
>
> I'm wondering whether it is worth doing so?
>
> I've tested your patch with sysbench, and there is little improvement. :(
>

Yeah it's not a huge change for us, there are other places we need to
work on, however things like ext4 could do well to not hold the i_mutex
over a transaction commit.  Just an example of how this could help us
all in general, not just btrfs.

> Sysbench args:
> sysbench --test=fileio --num-threads=1 --file-num=10240 --file-block-size=1K --file-total-size=20M --file-test-mode=rndwr --file-io-mode=sync --file-extra-flags= run
>
>
> 10240 files, 2Kb each
> ==
> fsync_nolock (patch):
> Operations performed:  0 Read, 10000 Write, 1024000 Other = 1034000 Total
> Read 0b  Written 9.7656Mb  Total transferred 9.7656Mb  (35.152Kb/sec)
>    35.15 Requests/sec executed
>
> fsync (orig):
> Operations performed:  0 Read, 10000 Write, 1024000 Other = 1034000 Total
> Read 0b  Written 9.7656Mb  Total transferred 9.7656Mb  (35.287Kb/sec)
>    35.29 Requests/sec executed
> ==
>
> Seems that the improvement from avoiding the thread hand-offs is not enough.
>
> BTW, I'm trying to improve the fsync performance stuff, but mainly for
> large files (>4G).  And I found that a large file will have a tremendous
> amount of csum items that need to be flushed into the tree log during
> fsync().  Btrfs now uses a brute force approach to ensure it gets the
> most up-to-date copies of everything, and this results in bad
> performance.  How to change the brute-force way is bugging me a lot...
>

Yeah there are some things that could be done for this, I'm going to be
spending a while here trying to squeeze as much performance out of fsync
as we can get, though first I'm going to start with small fsyncs since
that will be the most practical gain at the moment (think RPM
databases).  Thanks,

Josef
Chris Mason
2011-Apr-18 14:30 UTC
Re: [RFC] Add a new file op for fsync to give fs's more control
Excerpts from liubo's message of 2011-04-18 02:49:51 -0400:
> On 04/16/2011 03:32 AM, Josef Bacik wrote:
> > On 04/15/2011 03:24 PM, Christoph Hellwig wrote:
> >> Sorry, but this is too ugly to live.  If the reason for this really is
> >> good enough we'll just need to push the filemap_write_and_wait_range
> >> and i_mutex locking into every ->fsync instance.
> >>
> >
> > So part of what makes small fsyncs slow in btrfs is all of our random
> > threads to make checksumming not suck.  So we submit IO which spreads it
> > out to helper threads to do the checksumming, and then when it returns
> > it gets handed off to endio threads that run the endio stuff.  This
> > works awesome with doing big writes and such, but if say we're an RPM
> > database and write a couple of kilobytes, this tends to suck because we
> > keep handing work off to other threads and waiting, so the scheduling
> > latencies really hurt.
> >
> > So we'd like to be able to say "hey this is a small amount of io, let's
> > just do the checksumming in the current thread", and the same with
> > handling the endio stuff.  We can't do that currently because
> > filemap_write_and_wait_range is called before we get to fsync.  We'd
> > like to be able to control this so we can do the appropriate magic to do
> > the submission within the fsyncing thread's context in order to speed
> > things up a bit.
> >
> > That plus the stuff I said about i_mutex.  Is that a good enough reason
> > to just push this down into all the filesystems?  Thanks,
> >
>
> Fine with the i_mutex.
>
> I'm wondering whether it is worth doing so?
>
> I've tested your patch with sysbench, and there is little improvement. :(
>
> Sysbench args:
> sysbench --test=fileio --num-threads=1 --file-num=10240 --file-block-size=1K --file-total-size=20M --file-test-mode=rndwr --file-io-mode=sync --file-extra-flags= run

Btrfs is already dropping i_mutex in its fsync as much as it can.  It is
somewhat less efficient because it has to take it back again before
returning, but I don't think it will be a huge difference.

>
>
> 10240 files, 2Kb each
> ==
> fsync_nolock (patch):
> Operations performed:  0 Read, 10000 Write, 1024000 Other = 1034000 Total
> Read 0b  Written 9.7656Mb  Total transferred 9.7656Mb  (35.152Kb/sec)
>    35.15 Requests/sec executed
>
> fsync (orig):
> Operations performed:  0 Read, 10000 Write, 1024000 Other = 1034000 Total
> Read 0b  Written 9.7656Mb  Total transferred 9.7656Mb  (35.287Kb/sec)
>    35.29 Requests/sec executed
> ==
>
> Seems that the improvement from avoiding the thread hand-offs is not enough.
>
> BTW, I'm trying to improve the fsync performance stuff, but mainly for
> large files (>4G).  And I found that a large file will have a tremendous
> amount of csum items that need to be flushed into the tree log during
> fsync().  Btrfs now uses a brute force approach to ensure it gets the
> most up-to-date copies of everything, and this results in bad
> performance.  How to change the brute-force way is bugging me a lot...

The big problem with the fsync log is that we need to bump the
transaction id as we do tree log commits.  This will allow us to find
just the things that have changed since our last fsync.

The current code that relogs the entire inode every time also needs to
be more fine grained.  It is much better suited to small files than
large ones.

-chris
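One way to read the remark about bumping the transaction id on each tree log commit is as a generation counter: every item records the generation in which it last changed, and an fsync only logs items newer than the last logged generation. The sketch below is a toy model of that reading, with hypothetical names throughout; it is not how the btrfs tree log is implemented, and it ignores everything except the bookkeeping.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-item metadata: when did this item last change? */
struct toy_item {
	unsigned long long gen;
};

struct toy_log_state {
	unsigned long long cur_gen;		/* bumped on every log commit */
	unsigned long long last_logged_gen;	/* items at or below this are already logged */
};

/* Record a modification of an item in the current generation. */
static void toy_touch(const struct toy_log_state *s, struct toy_item *it)
{
	it->gen = s->cur_gen;
}

/*
 * Log only the items changed since the previous fsync, then bump the
 * generation so later changes can be told apart from logged ones.
 * Returns the number of items that had to be logged.
 */
static size_t toy_fsync_log(struct toy_log_state *s,
			    const struct toy_item *items, size_t n)
{
	size_t logged = 0;

	for (size_t i = 0; i < n; i++) {
		if (items[i].gen > s->last_logged_gen)
			logged++;	/* a real log would copy the item here */
	}

	s->last_logged_gen = s->cur_gen;
	s->cur_gen++;
	return logged;
}
```

Without the bump, every item touched in the same generation would look dirty on each fsync, which corresponds to the "relog the entire inode every time" behaviour described above.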