Darrick J. Wong
2012-Nov-01 07:58 UTC
[RFC PATCH v2 0/3] mm/fs: Implement faster stable page writes on filesystems
Hi all, This patchset makes some key modifications to the original ''stable page writes'' patchset. First, it provides users (devices and filesystems) of a backing_dev_info the ability to declare whether or not it is necessary to ensure that page contents cannot change during writeout, whereas the current code assumes that this is true. Second, it relaxes the wait_on_page_writeback calls so that they only occur if something needs it. Third, it fixes up (most) of the remaining filesystems to use this improved conditional-wait logic in the hopes of providing stable page writes on all filesystems. It is hoped that (for people not using checksumming devices, anyway) this patchset will give back unnecessary performance decreases since the original stable page write patchset went into 3.0. It seems possible, though, that iscsi and raid5 may wish to use the new stable page write support to enable zero-copy writeout. Unfortunately, it seems that ext3 is still broken wrt stable page writes. One workaround would be to use ext4 instead, or avoid the use of ext3.ko + DIF/DIX. Hopefully it doesn''t take long to sort out. Another thing I noticed is that there are several filesystems that call wait_on_page_writeback before returning VM_FAULT_LOCKED in their page_mkwrite delegates. It might be possible to convert some of these to wait_for_stable_pages unless there''s some other reason that we always want to wait for writeback. Finally, if a filesystem wants the VM to help it provide stable pages, it''s now possible to use the *_require_stable_pages() functions to turn that on. It might be useful for checksumming data blocks during write. This patchset has been lightly tested on 3.7.0-rc3 on x64. --D -- To unsubscribe, send a message with ''unsubscribe linux-mm'' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don''t email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
Darrick J. Wong
2012-Nov-01 07:58 UTC
[PATCH 1/3] bdi: Track users that require stable page writes
This creates a per-backing-device counter that tracks the number of users which require pages to be held immutable during writeout. Eventually it will be used to waive wait_for_page_writeback() if nobody requires stable pages. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> --- Documentation/ABI/testing/sysfs-class-bdi | 7 +++++ block/blk-integrity.c | 4 +++ include/linux/backing-dev.h | 16 ++++++++++++ include/linux/blkdev.h | 10 ++++++++ mm/backing-dev.c | 38 +++++++++++++++++++++++++++++ 5 files changed, 75 insertions(+) diff --git a/Documentation/ABI/testing/sysfs-class-bdi b/Documentation/ABI/testing/sysfs-class-bdi index 5f50097..218a618 100644 --- a/Documentation/ABI/testing/sysfs-class-bdi +++ b/Documentation/ABI/testing/sysfs-class-bdi @@ -48,3 +48,10 @@ max_ratio (read-write) most of the write-back cache. For example in case of an NFS mount that is prone to get stuck, or a FUSE mount which cannot be trusted to play fair. + +stable_pages_required (read-write) + + If set, the backing device requires that all pages comprising a write + request must not be changed until writeout is complete. The system + administrator can turn this on if the hardware does not do so already. + However, once enabled, this flag cannot be disabled. diff --git a/block/blk-integrity.c b/block/blk-integrity.c index da2a818..cf2dd95 100644 --- a/block/blk-integrity.c +++ b/block/blk-integrity.c @@ -420,6 +420,8 @@ int blk_integrity_register(struct gendisk *disk, struct blk_integrity *template) } else bi->name = bi_unsupported_name; + queue_require_stable_pages(disk->queue); + return 0; } EXPORT_SYMBOL(blk_integrity_register); @@ -438,6 +440,8 @@ void blk_integrity_unregister(struct gendisk *disk) if (!disk || !disk->integrity) return; + queue_unrequire_stable_pages(disk->queue); + bi = disk->integrity; kobject_uevent(&bi->kobj, KOBJ_REMOVE); diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h index 2a9a9ab..0554f5d 100644 --- a/include/linux/backing-dev.h +++ b/include/linux/backing-dev.h @@ -109,6 +109,7 @@ struct backing_dev_info { struct dentry *debug_dir; struct dentry *debug_stats; #endif + atomic_t stable_page_users; }; int bdi_init(struct backing_dev_info *bdi); @@ -307,6 +308,21 @@ long wait_iff_congested(struct zone *zone, int sync, long timeout); int pdflush_proc_obsolete(struct ctl_table *table, int write, void __user *buffer, size_t *lenp, loff_t *ppos); +static inline void bdi_require_stable_pages(struct backing_dev_info *bdi) +{ + atomic_inc(&bdi->stable_page_users); +} + +static inline void bdi_unrequire_stable_pages(struct backing_dev_info *bdi) +{ + atomic_dec(&bdi->stable_page_users); +} + +static inline bool bdi_cap_stable_pages_required(struct backing_dev_info *bdi) +{ + return atomic_read(&bdi->stable_page_users) > 0; +} + static inline bool bdi_cap_writeback_dirty(struct backing_dev_info *bdi) { return !(bdi->capabilities & BDI_CAP_NO_WRITEBACK); diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h index 1756001..bf927c0 100644 --- a/include/linux/blkdev.h +++ b/include/linux/blkdev.h @@ -458,6 +458,16 @@ struct request_queue { (1 << QUEUE_FLAG_SAME_COMP) | \ (1 << QUEUE_FLAG_ADD_RANDOM)) +static inline void queue_require_stable_pages(struct request_queue *q) +{ + bdi_require_stable_pages(&q->backing_dev_info); +} + +static inline void queue_unrequire_stable_pages(struct request_queue *q) +{ + bdi_unrequire_stable_pages(&q->backing_dev_info); +} + static inline void queue_lockdep_assert_held(struct request_queue *q) { if (q->queue_lock) diff --git a/mm/backing-dev.c b/mm/backing-dev.c index d3ca2b3..dd9f5ed 100644 --- a/mm/backing-dev.c +++ b/mm/backing-dev.c @@ -221,12 +221,48 @@ static ssize_t max_ratio_store(struct device *dev, } BDI_SHOW(max_ratio, bdi->max_ratio) +static ssize_t stable_pages_required_store(struct device *dev, + struct device_attribute *attr, + const char *buf, size_t count) +{ + struct backing_dev_info *bdi = dev_get_drvdata(dev); + unsigned int spw; + ssize_t ret; + + ret = kstrtouint(buf, 10, &spw); + if (ret < 0) + return ret; + + /* + * SPW could be enabled due to hw requirement, so don''t + * let users disable it. + */ + if (bdi_cap_stable_pages_required(bdi) && spw == 0) + return -EINVAL; + + if (spw != 0) + atomic_inc(&bdi->stable_page_users); + + return count; +} + +static ssize_t stable_pages_required_show(struct device *dev, + struct device_attribute *attr, + char *page) +{ + struct backing_dev_info *bdi = dev_get_drvdata(dev); + + return snprintf(page, PAGE_SIZE-1, "%d\n", + bdi_cap_stable_pages_required(bdi) ? 1 : 0); +} + #define __ATTR_RW(attr) __ATTR(attr, 0644, attr##_show, attr##_store) static struct device_attribute bdi_dev_attrs[] = { __ATTR_RW(read_ahead_kb), __ATTR_RW(min_ratio), __ATTR_RW(max_ratio), + __ATTR_RW(stable_pages_required), __ATTR_NULL, }; @@ -650,6 +686,8 @@ int bdi_init(struct backing_dev_info *bdi) bdi->write_bandwidth = INIT_BW; bdi->avg_write_bandwidth = INIT_BW; + atomic_set(&bdi->stable_page_users, 0); + err = fprop_local_init_percpu(&bdi->completions); if (err) { -- To unsubscribe, send a message with ''unsubscribe linux-mm'' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don''t email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
Darrick J. Wong
2012-Nov-01 07:58 UTC
[PATCH 2/3] mm: Only enforce stable page writes if the backing device requires it
Create a helper function to check if a backing device requires stable page writes and, if so, performs the necessary wait. Then, make it so that all points in the memory manager that handle making pages writable use the helper function. This should provide stable page write support to most filesystems, while eliminating unnecessary waiting for devices that don''t require the feature. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> --- fs/buffer.c | 2 +- fs/ext4/inode.c | 2 +- include/linux/pagemap.h | 1 + mm/filemap.c | 3 ++- mm/page-writeback.c | 11 +++++++++++ 5 files changed, 16 insertions(+), 3 deletions(-) diff --git a/fs/buffer.c b/fs/buffer.c index b5f0442..cac3007 100644 --- a/fs/buffer.c +++ b/fs/buffer.c @@ -2334,7 +2334,7 @@ int __block_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf, if (unlikely(ret < 0)) goto out_unlock; set_page_dirty(page); - wait_on_page_writeback(page); + wait_on_stable_page_write(page); return 0; out_unlock: unlock_page(page); diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c index b3c243b..948d68a 100644 --- a/fs/ext4/inode.c +++ b/fs/ext4/inode.c @@ -4814,7 +4814,7 @@ int ext4_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf) if (!walk_page_buffers(NULL, page_buffers(page), 0, len, NULL, ext4_bh_unmapped)) { /* Wait so that we don''t change page under IO */ - wait_on_page_writeback(page); + wait_on_stable_page_write(page); ret = VM_FAULT_LOCKED; goto out; } diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h index e42c762..c28da25 100644 --- a/include/linux/pagemap.h +++ b/include/linux/pagemap.h @@ -398,6 +398,7 @@ static inline void wait_on_page_writeback(struct page *page) } extern void end_page_writeback(struct page *page); +void wait_on_stable_page_write(struct page *page); /* * Add an arbitrary waiter to a page''s wait queue diff --git a/mm/filemap.c b/mm/filemap.c index 83efee7..ee46141 100644 --- a/mm/filemap.c +++ b/mm/filemap.c @@ -1728,6 +1728,7 @@ int filemap_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf) * see the dirty page and writeprotect it again. */ set_page_dirty(page); + wait_on_stable_page_write(page); out: sb_end_pagefault(inode->i_sb); return ret; @@ -2274,7 +2275,7 @@ repeat: return NULL; } found: - wait_on_page_writeback(page); + wait_on_stable_page_write(page); return page; } EXPORT_SYMBOL(grab_cache_page_write_begin); diff --git a/mm/page-writeback.c b/mm/page-writeback.c index 830893b..916dae1 100644 --- a/mm/page-writeback.c +++ b/mm/page-writeback.c @@ -2275,3 +2275,14 @@ int mapping_tagged(struct address_space *mapping, int tag) return radix_tree_tagged(&mapping->page_tree, tag); } EXPORT_SYMBOL(mapping_tagged); + +void wait_on_stable_page_write(struct page *page) +{ + struct backing_dev_info *bdi = page->mapping->backing_dev_info; + + if (!bdi_cap_stable_pages_required(bdi)) + return; + + wait_on_page_writeback(page); +} +EXPORT_SYMBOL_GPL(wait_on_stable_page_write); -- To unsubscribe, send a message with ''unsubscribe linux-mm'' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don''t email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
Darrick J. Wong
2012-Nov-01 07:58 UTC
[PATCH 3/3] fs: Fix remaining filesystems to wait for stable page writeback
Fix up the filesystems that provide their own ->page_mkwrite handlers to provide stable page writes if necessary. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> --- fs/9p/vfs_file.c | 1 + fs/afs/write.c | 4 ++-- fs/ceph/addr.c | 1 + fs/cifs/file.c | 1 + fs/ocfs2/mmap.c | 1 + fs/ubifs/file.c | 4 ++-- 6 files changed, 8 insertions(+), 4 deletions(-) diff --git a/fs/9p/vfs_file.c b/fs/9p/vfs_file.c index c2483e9..aa253f0 100644 --- a/fs/9p/vfs_file.c +++ b/fs/9p/vfs_file.c @@ -620,6 +620,7 @@ v9fs_vm_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf) lock_page(page); if (page->mapping != inode->i_mapping) goto out_unlock; + wait_on_stable_page_write(page); return VM_FAULT_LOCKED; out_unlock: diff --git a/fs/afs/write.c b/fs/afs/write.c index 9aa52d9..39eb2a4 100644 --- a/fs/afs/write.c +++ b/fs/afs/write.c @@ -758,7 +758,7 @@ int afs_page_mkwrite(struct vm_area_struct *vma, struct page *page) #ifdef CONFIG_AFS_FSCACHE fscache_wait_on_page_write(vnode->cache, page); #endif - + wait_on_stable_page_write(page); _leave(" = 0"); - return 0; + return VM_FAULT_LOCKED; } diff --git a/fs/ceph/addr.c b/fs/ceph/addr.c index 6690269..e9734bf 100644 --- a/fs/ceph/addr.c +++ b/fs/ceph/addr.c @@ -1208,6 +1208,7 @@ static int ceph_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf) set_page_dirty(page); up_read(&mdsc->snap_rwsem); ret = VM_FAULT_LOCKED; + wait_on_stable_page_write(page); } else { if (ret == -ENOMEM) ret = VM_FAULT_OOM; diff --git a/fs/cifs/file.c b/fs/cifs/file.c index edb25b4..a8770bf 100644 --- a/fs/cifs/file.c +++ b/fs/cifs/file.c @@ -2997,6 +2997,7 @@ cifs_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf) struct page *page = vmf->page; lock_page(page); + wait_on_stable_page_write(page); return VM_FAULT_LOCKED; } diff --git a/fs/ocfs2/mmap.c b/fs/ocfs2/mmap.c index 47a87dd..a0027b1 100644 --- a/fs/ocfs2/mmap.c +++ b/fs/ocfs2/mmap.c @@ -124,6 +124,7 @@ static int __ocfs2_page_mkwrite(struct file *file, struct buffer_head *di_bh, fsdata); BUG_ON(ret != len); ret = VM_FAULT_LOCKED; + wait_on_stable_page_write(page); out: return ret; } diff --git a/fs/ubifs/file.c b/fs/ubifs/file.c index 5bc7781..cb0d3aa 100644 --- a/fs/ubifs/file.c +++ b/fs/ubifs/file.c @@ -1522,8 +1522,8 @@ static int ubifs_vm_page_mkwrite(struct vm_area_struct *vma, ubifs_release_dirty_inode_budget(c, ui); } - unlock_page(page); - return 0; + wait_on_stable_page_write(page); + return VM_FAULT_LOCKED; out_unlock: unlock_page(page); -- To unsubscribe, send a message with ''unsubscribe linux-mm'' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don''t email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
Jan Kara
2012-Nov-01 12:36 UTC
Re: [PATCH 3/3] fs: Fix remaining filesystems to wait for stable page writeback
On Thu 01-11-12 00:58:29, Darrick J. Wong wrote:> Fix up the filesystems that provide their own ->page_mkwrite handlers to > provide stable page writes if necessary. > > Signed-off-by: Darrick J. Wong <darrick.wong-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org> > --- > fs/9p/vfs_file.c | 1 + > fs/afs/write.c | 4 ++-- > fs/ceph/addr.c | 1 + > fs/cifs/file.c | 1 + > fs/ocfs2/mmap.c | 1 + > fs/ubifs/file.c | 4 ++-- > 6 files changed, 8 insertions(+), 4 deletions(-) > > > diff --git a/fs/9p/vfs_file.c b/fs/9p/vfs_file.c > index c2483e9..aa253f0 100644 > --- a/fs/9p/vfs_file.c > +++ b/fs/9p/vfs_file.c > @@ -620,6 +620,7 @@ v9fs_vm_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf) > lock_page(page); > if (page->mapping != inode->i_mapping) > goto out_unlock; > + wait_on_stable_page_write(page); > > return VM_FAULT_LOCKED; > out_unlock: > diff --git a/fs/afs/write.c b/fs/afs/write.c > index 9aa52d9..39eb2a4 100644 > --- a/fs/afs/write.c > +++ b/fs/afs/write.c > @@ -758,7 +758,7 @@ int afs_page_mkwrite(struct vm_area_struct *vma, struct page *page) > #ifdef CONFIG_AFS_FSCACHE > fscache_wait_on_page_write(vnode->cache, page); > #endif > - > + wait_on_stable_page_write(page); > _leave(" = 0"); > - return 0; > + return VM_FAULT_LOCKED; > }Oh, I missed these two since I''ve got confused by fscache_wait_on_page_write().> diff --git a/fs/ocfs2/mmap.c b/fs/ocfs2/mmap.c > index 47a87dd..a0027b1 100644 > --- a/fs/ocfs2/mmap.c > +++ b/fs/ocfs2/mmap.c > @@ -124,6 +124,7 @@ static int __ocfs2_page_mkwrite(struct file *file, struct buffer_head *di_bh, > fsdata); > BUG_ON(ret != len); > ret = VM_FAULT_LOCKED; > + wait_on_stable_page_write(page); > out: > return ret; > }Actually, this is not so easy. ocfs2 doesn''t use grab_cache_page_write_begin() so you have to modify write_begin() path as well. And then you don''t have to modify __ocfs2_page_mkwrite() because it uses ocfs2_write_begin(). Preferably teach it to use grab_cache_page_write_begin()... And I think you are missing ncpfs. Because ncp_file_mmap does not set .mkwrite - it should use filemap_page_mkwrite() I think. Honza -- Jan Kara <jack-AlSwsSmVLrQ@public.gmane.org> SUSE Labs, CR
Jan Kara
2012-Nov-01 13:28 UTC
Re: [PATCH 2/3] mm: Only enforce stable page writes if the backing device requires it
On Thu 01-11-12 00:58:21, Darrick J. Wong wrote:> Create a helper function to check if a backing device requires stable page > writes and, if so, performs the necessary wait. Then, make it so that all > points in the memory manager that handle making pages writable use the helper > function. This should provide stable page write support to most filesystems, > while eliminating unnecessary waiting for devices that don''t require the > feature. > > Signed-off-by: Darrick J. Wong <darrick.wong-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org> > --- > fs/buffer.c | 2 +- > fs/ext4/inode.c | 2 +- > include/linux/pagemap.h | 1 + > mm/filemap.c | 3 ++- > mm/page-writeback.c | 11 +++++++++++ > 5 files changed, 16 insertions(+), 3 deletions(-) >..> diff --git a/mm/page-writeback.c b/mm/page-writeback.c > index 830893b..916dae1 100644 > --- a/mm/page-writeback.c > +++ b/mm/page-writeback.c > @@ -2275,3 +2275,14 @@ int mapping_tagged(struct address_space *mapping, int tag) > return radix_tree_tagged(&mapping->page_tree, tag); > } > EXPORT_SYMBOL(mapping_tagged); > + > +void wait_on_stable_page_write(struct page *page) > +{ > + struct backing_dev_info *bdi = page->mapping->backing_dev_info; > + > + if (!bdi_cap_stable_pages_required(bdi)) > + return; > + > + wait_on_page_writeback(page); > +} > +EXPORT_SYMBOL_GPL(wait_on_stable_page_write);Just one nit: Maybe "wait_if_stable_write()" would describe the function better? Otherwise the patch looks OK. Honza -- Jan Kara <jack-AlSwsSmVLrQ@public.gmane.org> SUSE Labs, CR
Jan Kara
2012-Nov-01 13:31 UTC
Re: [PATCH 1/3] bdi: Track users that require stable page writes
On Thu 01-11-12 00:58:13, Darrick J. Wong wrote:> This creates a per-backing-device counter that tracks the number of users which > require pages to be held immutable during writeout. Eventually it will be used > to waive wait_for_page_writeback() if nobody requires stable pages.As I wrote in another mail, maybe a combination of bdi and sb flag would make things simpler (less chances for errors). But I can live with this as well... Honza> > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> > --- > Documentation/ABI/testing/sysfs-class-bdi | 7 +++++ > block/blk-integrity.c | 4 +++ > include/linux/backing-dev.h | 16 ++++++++++++ > include/linux/blkdev.h | 10 ++++++++ > mm/backing-dev.c | 38 +++++++++++++++++++++++++++++ > 5 files changed, 75 insertions(+) > > > diff --git a/Documentation/ABI/testing/sysfs-class-bdi b/Documentation/ABI/testing/sysfs-class-bdi > index 5f50097..218a618 100644 > --- a/Documentation/ABI/testing/sysfs-class-bdi > +++ b/Documentation/ABI/testing/sysfs-class-bdi > @@ -48,3 +48,10 @@ max_ratio (read-write) > most of the write-back cache. For example in case of an NFS > mount that is prone to get stuck, or a FUSE mount which cannot > be trusted to play fair. > + > +stable_pages_required (read-write) > + > + If set, the backing device requires that all pages comprising a write > + request must not be changed until writeout is complete. The system > + administrator can turn this on if the hardware does not do so already. > + However, once enabled, this flag cannot be disabled. > diff --git a/block/blk-integrity.c b/block/blk-integrity.c > index da2a818..cf2dd95 100644 > --- a/block/blk-integrity.c > +++ b/block/blk-integrity.c > @@ -420,6 +420,8 @@ int blk_integrity_register(struct gendisk *disk, struct blk_integrity *template) > } else > bi->name = bi_unsupported_name; > > + queue_require_stable_pages(disk->queue); > + > return 0; > } > EXPORT_SYMBOL(blk_integrity_register); > @@ -438,6 +440,8 @@ void blk_integrity_unregister(struct gendisk *disk) > if (!disk || !disk->integrity) > return; > > + queue_unrequire_stable_pages(disk->queue); > + > bi = disk->integrity; > > kobject_uevent(&bi->kobj, KOBJ_REMOVE); > diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h > index 2a9a9ab..0554f5d 100644 > --- a/include/linux/backing-dev.h > +++ b/include/linux/backing-dev.h > @@ -109,6 +109,7 @@ struct backing_dev_info { > struct dentry *debug_dir; > struct dentry *debug_stats; > #endif > + atomic_t stable_page_users; > }; > > int bdi_init(struct backing_dev_info *bdi); > @@ -307,6 +308,21 @@ long wait_iff_congested(struct zone *zone, int sync, long timeout); > int pdflush_proc_obsolete(struct ctl_table *table, int write, > void __user *buffer, size_t *lenp, loff_t *ppos); > > +static inline void bdi_require_stable_pages(struct backing_dev_info *bdi) > +{ > + atomic_inc(&bdi->stable_page_users); > +} > + > +static inline void bdi_unrequire_stable_pages(struct backing_dev_info *bdi) > +{ > + atomic_dec(&bdi->stable_page_users); > +} > + > +static inline bool bdi_cap_stable_pages_required(struct backing_dev_info *bdi) > +{ > + return atomic_read(&bdi->stable_page_users) > 0; > +} > + > static inline bool bdi_cap_writeback_dirty(struct backing_dev_info *bdi) > { > return !(bdi->capabilities & BDI_CAP_NO_WRITEBACK); > diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h > index 1756001..bf927c0 100644 > --- a/include/linux/blkdev.h > +++ b/include/linux/blkdev.h > @@ -458,6 +458,16 @@ struct request_queue { > (1 << QUEUE_FLAG_SAME_COMP) | \ > (1 << QUEUE_FLAG_ADD_RANDOM)) > > +static inline void queue_require_stable_pages(struct request_queue *q) > +{ > + bdi_require_stable_pages(&q->backing_dev_info); > +} > + > +static inline void queue_unrequire_stable_pages(struct request_queue *q) > +{ > + bdi_unrequire_stable_pages(&q->backing_dev_info); > +} > + > static inline void queue_lockdep_assert_held(struct request_queue *q) > { > if (q->queue_lock) > diff --git a/mm/backing-dev.c b/mm/backing-dev.c > index d3ca2b3..dd9f5ed 100644 > --- a/mm/backing-dev.c > +++ b/mm/backing-dev.c > @@ -221,12 +221,48 @@ static ssize_t max_ratio_store(struct device *dev, > } > BDI_SHOW(max_ratio, bdi->max_ratio) > > +static ssize_t stable_pages_required_store(struct device *dev, > + struct device_attribute *attr, > + const char *buf, size_t count) > +{ > + struct backing_dev_info *bdi = dev_get_drvdata(dev); > + unsigned int spw; > + ssize_t ret; > + > + ret = kstrtouint(buf, 10, &spw); > + if (ret < 0) > + return ret; > + > + /* > + * SPW could be enabled due to hw requirement, so don''t > + * let users disable it. > + */ > + if (bdi_cap_stable_pages_required(bdi) && spw == 0) > + return -EINVAL; > + > + if (spw != 0) > + atomic_inc(&bdi->stable_page_users); > + > + return count; > +} > + > +static ssize_t stable_pages_required_show(struct device *dev, > + struct device_attribute *attr, > + char *page) > +{ > + struct backing_dev_info *bdi = dev_get_drvdata(dev); > + > + return snprintf(page, PAGE_SIZE-1, "%d\n", > + bdi_cap_stable_pages_required(bdi) ? 1 : 0); > +} > + > #define __ATTR_RW(attr) __ATTR(attr, 0644, attr##_show, attr##_store) > > static struct device_attribute bdi_dev_attrs[] = { > __ATTR_RW(read_ahead_kb), > __ATTR_RW(min_ratio), > __ATTR_RW(max_ratio), > + __ATTR_RW(stable_pages_required), > __ATTR_NULL, > }; > > @@ -650,6 +686,8 @@ int bdi_init(struct backing_dev_info *bdi) > bdi->write_bandwidth = INIT_BW; > bdi->avg_write_bandwidth = INIT_BW; > > + atomic_set(&bdi->stable_page_users, 0); > + > err = fprop_local_init_percpu(&bdi->completions); > > if (err) { > > -- > To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html-- Jan Kara <jack@suse.cz> SUSE Labs, CR -- To unsubscribe, send a message with ''unsubscribe linux-mm'' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don''t email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
Boaz Harrosh
2012-Nov-01 18:21 UTC
Re: [PATCH 1/3] bdi: Track users that require stable page writes
On 11/01/2012 12:58 AM, Darrick J. Wong wrote:> This creates a per-backing-device counter that tracks the number of users which > require pages to be held immutable during writeout. Eventually it will be used > to waive wait_for_page_writeback() if nobody requires stable pages. >There is two things I do not like: 1. Please remind me why we need the users counter? If the device needs it, it will always be needed. The below queue_unrequire_stable_pages call at blk_integrity_unregister only happens at device destruction time, no? It was also said that maybe individual filesystems would need stable pages where other FSs using the same BDI do not. But since the FS is at the driving seat in any case, it can just do wait_for_writeback() regardless and can care less about the bdi flag and the other FSs. Actually all those FSs already do this this, and do not need any help. So this reason is mute. 2. I hate the atomic_read for every mkwrite. I think we can do better, since as you noted it can never be turned off, only on, at init time. And because of 1. above it is not dynamic. I think I like your previous simple bool better. But I do like a lot the new request_queue API you added here. Thanks Boaz> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> > --- > Documentation/ABI/testing/sysfs-class-bdi | 7 +++++ > block/blk-integrity.c | 4 +++ > include/linux/backing-dev.h | 16 ++++++++++++ > include/linux/blkdev.h | 10 ++++++++ > mm/backing-dev.c | 38 +++++++++++++++++++++++++++++ > 5 files changed, 75 insertions(+) > > > diff --git a/Documentation/ABI/testing/sysfs-class-bdi b/Documentation/ABI/testing/sysfs-class-bdi > index 5f50097..218a618 100644 > --- a/Documentation/ABI/testing/sysfs-class-bdi > +++ b/Documentation/ABI/testing/sysfs-class-bdi > @@ -48,3 +48,10 @@ max_ratio (read-write) > most of the write-back cache. For example in case of an NFS > mount that is prone to get stuck, or a FUSE mount which cannot > be trusted to play fair. > + > +stable_pages_required (read-write) > + > + If set, the backing device requires that all pages comprising a write > + request must not be changed until writeout is complete. The system > + administrator can turn this on if the hardware does not do so already. > + However, once enabled, this flag cannot be disabled. > diff --git a/block/blk-integrity.c b/block/blk-integrity.c > index da2a818..cf2dd95 100644 > --- a/block/blk-integrity.c > +++ b/block/blk-integrity.c > @@ -420,6 +420,8 @@ int blk_integrity_register(struct gendisk *disk, struct blk_integrity *template) > } else > bi->name = bi_unsupported_name; > > + queue_require_stable_pages(disk->queue); > + > return 0; > } > EXPORT_SYMBOL(blk_integrity_register); > @@ -438,6 +440,8 @@ void blk_integrity_unregister(struct gendisk *disk) > if (!disk || !disk->integrity) > return; > > + queue_unrequire_stable_pages(disk->queue); > + > bi = disk->integrity; > > kobject_uevent(&bi->kobj, KOBJ_REMOVE); > diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h > index 2a9a9ab..0554f5d 100644 > --- a/include/linux/backing-dev.h > +++ b/include/linux/backing-dev.h > @@ -109,6 +109,7 @@ struct backing_dev_info { > struct dentry *debug_dir; > struct dentry *debug_stats; > #endif > + atomic_t stable_page_users; > }; > > int bdi_init(struct backing_dev_info *bdi); > @@ -307,6 +308,21 @@ long wait_iff_congested(struct zone *zone, int sync, long timeout); > int pdflush_proc_obsolete(struct ctl_table *table, int write, > void __user *buffer, size_t *lenp, loff_t *ppos); > > +static inline void bdi_require_stable_pages(struct backing_dev_info *bdi) > +{ > + atomic_inc(&bdi->stable_page_users); > +} > + > +static inline void bdi_unrequire_stable_pages(struct backing_dev_info *bdi) > +{ > + atomic_dec(&bdi->stable_page_users); > +} > + > +static inline bool bdi_cap_stable_pages_required(struct backing_dev_info *bdi) > +{ > + return atomic_read(&bdi->stable_page_users) > 0; > +} > + > static inline bool bdi_cap_writeback_dirty(struct backing_dev_info *bdi) > { > return !(bdi->capabilities & BDI_CAP_NO_WRITEBACK); > diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h > index 1756001..bf927c0 100644 > --- a/include/linux/blkdev.h > +++ b/include/linux/blkdev.h > @@ -458,6 +458,16 @@ struct request_queue { > (1 << QUEUE_FLAG_SAME_COMP) | \ > (1 << QUEUE_FLAG_ADD_RANDOM)) > > +static inline void queue_require_stable_pages(struct request_queue *q) > +{ > + bdi_require_stable_pages(&q->backing_dev_info); > +} > + > +static inline void queue_unrequire_stable_pages(struct request_queue *q) > +{ > + bdi_unrequire_stable_pages(&q->backing_dev_info); > +} > + > static inline void queue_lockdep_assert_held(struct request_queue *q) > { > if (q->queue_lock) > diff --git a/mm/backing-dev.c b/mm/backing-dev.c > index d3ca2b3..dd9f5ed 100644 > --- a/mm/backing-dev.c > +++ b/mm/backing-dev.c > @@ -221,12 +221,48 @@ static ssize_t max_ratio_store(struct device *dev, > } > BDI_SHOW(max_ratio, bdi->max_ratio) > > +static ssize_t stable_pages_required_store(struct device *dev, > + struct device_attribute *attr, > + const char *buf, size_t count) > +{ > + struct backing_dev_info *bdi = dev_get_drvdata(dev); > + unsigned int spw; > + ssize_t ret; > + > + ret = kstrtouint(buf, 10, &spw); > + if (ret < 0) > + return ret; > + > + /* > + * SPW could be enabled due to hw requirement, so don''t > + * let users disable it. > + */ > + if (bdi_cap_stable_pages_required(bdi) && spw == 0) > + return -EINVAL; > + > + if (spw != 0) > + atomic_inc(&bdi->stable_page_users); > + > + return count; > +} > + > +static ssize_t stable_pages_required_show(struct device *dev, > + struct device_attribute *attr, > + char *page) > +{ > + struct backing_dev_info *bdi = dev_get_drvdata(dev); > + > + return snprintf(page, PAGE_SIZE-1, "%d\n", > + bdi_cap_stable_pages_required(bdi) ? 1 : 0); > +} > + > #define __ATTR_RW(attr) __ATTR(attr, 0644, attr##_show, attr##_store) > > static struct device_attribute bdi_dev_attrs[] = { > __ATTR_RW(read_ahead_kb), > __ATTR_RW(min_ratio), > __ATTR_RW(max_ratio), > + __ATTR_RW(stable_pages_required), > __ATTR_NULL, > }; > > @@ -650,6 +686,8 @@ int bdi_init(struct backing_dev_info *bdi) > bdi->write_bandwidth = INIT_BW; > bdi->avg_write_bandwidth = INIT_BW; > > + atomic_set(&bdi->stable_page_users, 0); > + > err = fprop_local_init_percpu(&bdi->completions); > > if (err) { >-- To unsubscribe, send a message with ''unsubscribe linux-mm'' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don''t email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
Boaz Harrosh
2012-Nov-01 18:43 UTC
Re: [PATCH 3/3] fs: Fix remaining filesystems to wait for stable page writeback
On 11/01/2012 12:58 AM, Darrick J. Wong wrote:> Fix up the filesystems that provide their own ->page_mkwrite handlers to > provide stable page writes if necessary. > > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> > --- > fs/9p/vfs_file.c | 1 + > fs/afs/write.c | 4 ++-- > fs/ceph/addr.c | 1 + > fs/cifs/file.c | 1 + > fs/ocfs2/mmap.c | 1 + > fs/ubifs/file.c | 4 ++-- > 6 files changed, 8 insertions(+), 4 deletions(-) > > > diff --git a/fs/9p/vfs_file.c b/fs/9p/vfs_file.c > index c2483e9..aa253f0 100644 > --- a/fs/9p/vfs_file.c > +++ b/fs/9p/vfs_file.c > @@ -620,6 +620,7 @@ v9fs_vm_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf) > lock_page(page); > if (page->mapping != inode->i_mapping) > goto out_unlock; > + wait_on_stable_page_write(page); >Good god thanks, yes please ;-)> return VM_FAULT_LOCKED; > out_unlock: > diff --git a/fs/afs/write.c b/fs/afs/write.c > index 9aa52d9..39eb2a4 100644 > --- a/fs/afs/write.c > +++ b/fs/afs/write.c > @@ -758,7 +758,7 @@ int afs_page_mkwrite(struct vm_area_struct *vma, struct page *page)afs, is it not a network filesystem? which means that it has it''s own emulated none-block-device BDI, registered internally. So if you do need stable pages someone should call bdi_require_stable_pages() But again since it is a network filesystem I don''t see how it is needed, and/or it might be taken care of already.> #ifdef CONFIG_AFS_FSCACHE > fscache_wait_on_page_write(vnode->cache, page); > #endif > - > + wait_on_stable_page_write(page); > _leave(" = 0"); > - return 0; > + return VM_FAULT_LOCKED; > } > diff --git a/fs/ceph/addr.c b/fs/ceph/addr.cCEPH for sure has it''s own "emulated none-block-device BDI". This one is also a pure networking filesystem. And it already does what it needs to do with wait_on_writeback(). So i do not think you should touch CEPH> index 6690269..e9734bf 100644 > --- a/fs/ceph/addr.c > +++ b/fs/ceph/addr.c > @@ -1208,6 +1208,7 @@ static int ceph_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf) > set_page_dirty(page); > up_read(&mdsc->snap_rwsem); > ret = VM_FAULT_LOCKED; > + wait_on_stable_page_write(page); > } else { > if (ret == -ENOMEM) > ret = VM_FAULT_OOM; > diff --git a/fs/cifs/file.c b/fs/cifs/file.cCifs also self-BDI network filesystem, but> index edb25b4..a8770bf 100644 > --- a/fs/cifs/file.c > +++ b/fs/cifs/file.c > @@ -2997,6 +2997,7 @@ cifs_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf) > struct page *page = vmf->page; > > lock_page(page);It waits by locking the page, that''s cifs naive way of waiting for writeback> + wait_on_stable_page_write(page);Instead it could do better and not override page_mkwrite at all, and all it needs to do is call bdi_require_stable_pages() at it''s own registered BDI> return VM_FAULT_LOCKED; > } > > diff --git a/fs/ocfs2/mmap.c b/fs/ocfs2/mmap.c > index 47a87dd..a0027b1 100644 > --- a/fs/ocfs2/mmap.c > +++ b/fs/ocfs2/mmap.c > @@ -124,6 +124,7 @@ static int __ocfs2_page_mkwrite(struct file *file, struct buffer_head *di_bh, > fsdata); > BUG_ON(ret != len); > ret = VM_FAULT_LOCKED; > + wait_on_stable_page_write(page); > out: > return ret; > } > diff --git a/fs/ubifs/file.c b/fs/ubifs/file.c > index 5bc7781..cb0d3aa 100644 > --- a/fs/ubifs/file.c > +++ b/fs/ubifs/file.c > @@ -1522,8 +1522,8 @@ static int ubifs_vm_page_mkwrite(struct vm_area_struct *vma, > ubifs_release_dirty_inode_budget(c, ui); > } > > - unlock_page(page); > - return 0; > + wait_on_stable_page_write(page);ubifs has it''s special ubi block device. So someone needs to call bdi_require_stable_pages() for this to work. I think that here too. The existing code, like cifs, calls page_lock, as a way of waiting for writeback. So this is certainly not finished.> + return VM_FAULT_LOCKED; > > out_unlock: > unlock_page(page); >Cheers Boaz -- To unsubscribe, send a message with ''unsubscribe linux-mm'' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don''t email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
Darrick J. Wong
2012-Nov-01 18:57 UTC
Re: [PATCH 1/3] bdi: Track users that require stable page writes
On Thu, Nov 01, 2012 at 11:21:22AM -0700, Boaz Harrosh wrote:> On 11/01/2012 12:58 AM, Darrick J. Wong wrote: > > This creates a per-backing-device counter that tracks the number of users which > > require pages to be held immutable during writeout. Eventually it will be used > > to waive wait_for_page_writeback() if nobody requires stable pages. > > > > There is two things I do not like: > 1. Please remind me why we need the users counter? > If the device needs it, it will always be needed. The below > queue_unrequire_stable_pages call at blk_integrity_unregister > only happens at device destruction time, no? > > It was also said that maybe individual filesystems would need > stable pages where other FSs using the same BDI do not. But > since the FS is at the driving seat in any case, it can just > do wait_for_writeback() regardless and can care less about > the bdi flag and the other FSs. Actually all those FSs already > do this this, and do not need any help. So this reason is > mute.The counter exists so that a filesystem can forcibly enable stable page writes even if the underlying device doesn''t require it, because the generic fs/mm waiting only happens if stable_pages_required=1. The idea here was to allow a filesystem that needs stable page writes for its own purposes (i.e. data block checksumming) to be able to enforce the requirement even if the disk doesn''t care (or doesn''t exist). But maybe there are no such filesystems? <shrug> I don''t really like the idea that you can dirty a page during writeout, which means that the disk could write the before page, the after page, or some weird mix of the two, and all you get is a promise that the after page will get rewritten if the power doesn''t go out. Not to mention that it could happen again. :)> 2. I hate the atomic_read for every mkwrite. I think we can do > better, since as you noted it can never be turned off, only > on, at init time. And because of 1. above it is not dynamic. > I think I like your previous simple bool better.I doubt the counter would change much; I could probably change it to something less heavyweight if it''s really a problem.> But I do like a lot the new request_queue API you added here. > ThanksGood! --D> > Boaz > > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> > > --- > > Documentation/ABI/testing/sysfs-class-bdi | 7 +++++ > > block/blk-integrity.c | 4 +++ > > include/linux/backing-dev.h | 16 ++++++++++++ > > include/linux/blkdev.h | 10 ++++++++ > > mm/backing-dev.c | 38 +++++++++++++++++++++++++++++ > > 5 files changed, 75 insertions(+) > > > > > > diff --git a/Documentation/ABI/testing/sysfs-class-bdi b/Documentation/ABI/testing/sysfs-class-bdi > > index 5f50097..218a618 100644 > > --- a/Documentation/ABI/testing/sysfs-class-bdi > > +++ b/Documentation/ABI/testing/sysfs-class-bdi > > @@ -48,3 +48,10 @@ max_ratio (read-write) > > most of the write-back cache. For example in case of an NFS > > mount that is prone to get stuck, or a FUSE mount which cannot > > be trusted to play fair. > > + > > +stable_pages_required (read-write) > > + > > + If set, the backing device requires that all pages comprising a write > > + request must not be changed until writeout is complete. The system > > + administrator can turn this on if the hardware does not do so already. > > + However, once enabled, this flag cannot be disabled. > > diff --git a/block/blk-integrity.c b/block/blk-integrity.c > > index da2a818..cf2dd95 100644 > > --- a/block/blk-integrity.c > > +++ b/block/blk-integrity.c > > @@ -420,6 +420,8 @@ int blk_integrity_register(struct gendisk *disk, struct blk_integrity *template) > > } else > > bi->name = bi_unsupported_name; > > > > + queue_require_stable_pages(disk->queue); > > + > > return 0; > > } > > EXPORT_SYMBOL(blk_integrity_register); > > @@ -438,6 +440,8 @@ void blk_integrity_unregister(struct gendisk *disk) > > if (!disk || !disk->integrity) > > return; > > > > + queue_unrequire_stable_pages(disk->queue); > > + > > bi = disk->integrity; > > > > kobject_uevent(&bi->kobj, KOBJ_REMOVE); > > diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h > > index 2a9a9ab..0554f5d 100644 > > --- a/include/linux/backing-dev.h > > +++ b/include/linux/backing-dev.h > > @@ -109,6 +109,7 @@ struct backing_dev_info { > > struct dentry *debug_dir; > > struct dentry *debug_stats; > > #endif > > + atomic_t stable_page_users; > > }; > > > > int bdi_init(struct backing_dev_info *bdi); > > @@ -307,6 +308,21 @@ long wait_iff_congested(struct zone *zone, int sync, long timeout); > > int pdflush_proc_obsolete(struct ctl_table *table, int write, > > void __user *buffer, size_t *lenp, loff_t *ppos); > > > > +static inline void bdi_require_stable_pages(struct backing_dev_info *bdi) > > +{ > > + atomic_inc(&bdi->stable_page_users); > > +} > > + > > +static inline void bdi_unrequire_stable_pages(struct backing_dev_info *bdi) > > +{ > > + atomic_dec(&bdi->stable_page_users); > > +} > > + > > +static inline bool bdi_cap_stable_pages_required(struct backing_dev_info *bdi) > > +{ > > + return atomic_read(&bdi->stable_page_users) > 0; > > +} > > + > > static inline bool bdi_cap_writeback_dirty(struct backing_dev_info *bdi) > > { > > return !(bdi->capabilities & BDI_CAP_NO_WRITEBACK); > > diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h > > index 1756001..bf927c0 100644 > > --- a/include/linux/blkdev.h > > +++ b/include/linux/blkdev.h > > @@ -458,6 +458,16 @@ struct request_queue { > > (1 << QUEUE_FLAG_SAME_COMP) | \ > > (1 << QUEUE_FLAG_ADD_RANDOM)) > > > > +static inline void queue_require_stable_pages(struct request_queue *q) > > +{ > > + bdi_require_stable_pages(&q->backing_dev_info); > > +} > > + > > +static inline void queue_unrequire_stable_pages(struct request_queue *q) > > +{ > > + bdi_unrequire_stable_pages(&q->backing_dev_info); > > +} > > + > > static inline void queue_lockdep_assert_held(struct request_queue *q) > > { > > if (q->queue_lock) > > diff --git a/mm/backing-dev.c b/mm/backing-dev.c > > index d3ca2b3..dd9f5ed 100644 > > --- a/mm/backing-dev.c > > +++ b/mm/backing-dev.c > > @@ -221,12 +221,48 @@ static ssize_t max_ratio_store(struct device *dev, > > } > > BDI_SHOW(max_ratio, bdi->max_ratio) > > > > +static ssize_t stable_pages_required_store(struct device *dev, > > + struct device_attribute *attr, > > + const char *buf, size_t count) > > +{ > > + struct backing_dev_info *bdi = dev_get_drvdata(dev); > > + unsigned int spw; > > + ssize_t ret; > > + > > + ret = kstrtouint(buf, 10, &spw); > > + if (ret < 0) > > + return ret; > > + > > + /* > > + * SPW could be enabled due to hw requirement, so don''t > > + * let users disable it. > > + */ > > + if (bdi_cap_stable_pages_required(bdi) && spw == 0) > > + return -EINVAL; > > + > > + if (spw != 0) > > + atomic_inc(&bdi->stable_page_users); > > + > > + return count; > > +} > > + > > +static ssize_t stable_pages_required_show(struct device *dev, > > + struct device_attribute *attr, > > + char *page) > > +{ > > + struct backing_dev_info *bdi = dev_get_drvdata(dev); > > + > > + return snprintf(page, PAGE_SIZE-1, "%d\n", > > + bdi_cap_stable_pages_required(bdi) ? 1 : 0); > > +} > > + > > #define __ATTR_RW(attr) __ATTR(attr, 0644, attr##_show, attr##_store) > > > > static struct device_attribute bdi_dev_attrs[] = { > > __ATTR_RW(read_ahead_kb), > > __ATTR_RW(min_ratio), > > __ATTR_RW(max_ratio), > > + __ATTR_RW(stable_pages_required), > > __ATTR_NULL, > > }; > > > > @@ -650,6 +686,8 @@ int bdi_init(struct backing_dev_info *bdi) > > bdi->write_bandwidth = INIT_BW; > > bdi->avg_write_bandwidth = INIT_BW; > > > > + atomic_set(&bdi->stable_page_users, 0); > > + > > err = fprop_local_init_percpu(&bdi->completions); > > > > if (err) { > > > > > -- > To unsubscribe from this list: send the line "unsubscribe linux-ext4" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html-- To unsubscribe, send a message with ''unsubscribe linux-mm'' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don''t email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
Jeff Layton
2012-Nov-01 20:22 UTC
Re: [PATCH 3/3] fs: Fix remaining filesystems to wait for stable page writeback
On Thu, 1 Nov 2012 11:43:26 -0700 Boaz Harrosh <bharrosh@panasas.com> wrote:> On 11/01/2012 12:58 AM, Darrick J. Wong wrote: > > Fix up the filesystems that provide their own ->page_mkwrite handlers to > > provide stable page writes if necessary. > > > > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> > > --- > > fs/9p/vfs_file.c | 1 + > > fs/afs/write.c | 4 ++-- > > fs/ceph/addr.c | 1 + > > fs/cifs/file.c | 1 + > > fs/ocfs2/mmap.c | 1 + > > fs/ubifs/file.c | 4 ++-- > > 6 files changed, 8 insertions(+), 4 deletions(-) > > > > > > diff --git a/fs/9p/vfs_file.c b/fs/9p/vfs_file.c > > index c2483e9..aa253f0 100644 > > --- a/fs/9p/vfs_file.c > > +++ b/fs/9p/vfs_file.c > > @@ -620,6 +620,7 @@ v9fs_vm_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf) > > lock_page(page); > > if (page->mapping != inode->i_mapping) > > goto out_unlock; > > + wait_on_stable_page_write(page); > > > > Good god thanks, yes please ;-) > > > return VM_FAULT_LOCKED; > > out_unlock: > > diff --git a/fs/afs/write.c b/fs/afs/write.c > > index 9aa52d9..39eb2a4 100644 > > --- a/fs/afs/write.c > > +++ b/fs/afs/write.c > > @@ -758,7 +758,7 @@ int afs_page_mkwrite(struct vm_area_struct *vma, struct page *page) > > afs, is it not a network filesystem? which means that it has it''s own emulated none-block-device > BDI, registered internally. So if you do need stable pages someone should call > bdi_require_stable_pages() > > But again since it is a network filesystem I don''t see how it is needed, and/or it might be > taken care of already. > > > #ifdef CONFIG_AFS_FSCACHE > > fscache_wait_on_page_write(vnode->cache, page); > > #endif > > - > > + wait_on_stable_page_write(page); > > _leave(" = 0"); > > - return 0; > > + return VM_FAULT_LOCKED; > > } > > diff --git a/fs/ceph/addr.c b/fs/ceph/addr.c > > CEPH for sure has it''s own "emulated none-block-device BDI". This one is also > a pure networking filesystem. > > And it already does what it needs to do with wait_on_writeback(). > > So i do not think you should touch CEPH > > > index 6690269..e9734bf 100644 > > --- a/fs/ceph/addr.c > > +++ b/fs/ceph/addr.c > > @@ -1208,6 +1208,7 @@ static int ceph_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf) > > set_page_dirty(page); > > up_read(&mdsc->snap_rwsem); > > ret = VM_FAULT_LOCKED; > > + wait_on_stable_page_write(page); > > } else { > > if (ret == -ENOMEM) > > ret = VM_FAULT_OOM; > > diff --git a/fs/cifs/file.c b/fs/cifs/file.c > > Cifs also self-BDI network filesystem, but > > > index edb25b4..a8770bf 100644 > > --- a/fs/cifs/file.c > > +++ b/fs/cifs/file.c > > @@ -2997,6 +2997,7 @@ cifs_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf) > > struct page *page = vmf->page; > > > > lock_page(page); > > It waits by locking the page, that''s cifs naive way of waiting for writeback > > > + wait_on_stable_page_write(page); > > Instead it could do better and not override page_mkwrite at all, and all it needs > to do is call bdi_require_stable_pages() at it''s own registered BDI >Hmm...I don''t know... I''ve never been crazy about using the page lock for this, but in the absence of a better way to guarantee stable pages, it was what I ended up with at the time. cifs_writepages will hold the page lock until kernel_sendmsg returns. At that point the TCP layer will have copied off the page data so it''s safe to release it. With this change though, we''re going to end up blocking until the writeback flag clears, right? And I think that will happen when the reply comes in? So, we''ll end up blocking for much longer than is really necessary in page_mkwrite with this change. -- Jeff Layton <jlayton@samba.org> -- To unsubscribe, send a message with ''unsubscribe linux-mm'' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don''t email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
Boaz Harrosh
2012-Nov-01 22:23 UTC
Re: [PATCH 3/3] fs: Fix remaining filesystems to wait for stable page writeback
On 11/01/2012 01:22 PM, Jeff Layton wrote:> Hmm...I don''t know... > > I''ve never been crazy about using the page lock for this, but in the > absence of a better way to guarantee stable pages, it was what I ended > up with at the time. cifs_writepages will hold the page lock until > kernel_sendmsg returns. At that point the TCP layer will have copied > off the page data so it''s safe to release it. > > With this change though, we''re going to end up blocking until the > writeback flag clears, right? And I think that will happen when the > reply comes in? So, we''ll end up blocking for much longer than is > really necessary in page_mkwrite with this change. >Hmm OK, that is a very good point. In that case it is just a simple nack on Darrick''s hunk to cifs. cifs is fine and should not be touched Boaz -- To unsubscribe, send a message with ''unsubscribe linux-mm'' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don''t email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
Darrick J. Wong
2012-Nov-01 22:47 UTC
Re: [PATCH 3/3] fs: Fix remaining filesystems to wait for stable page writeback
On Thu, Nov 01, 2012 at 04:22:54PM -0400, Jeff Layton wrote:> On Thu, 1 Nov 2012 11:43:26 -0700 > Boaz Harrosh <bharrosh@panasas.com> wrote: > > > On 11/01/2012 12:58 AM, Darrick J. Wong wrote: > > > Fix up the filesystems that provide their own ->page_mkwrite handlers to > > > provide stable page writes if necessary. > > > > > > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> > > > --- > > > fs/9p/vfs_file.c | 1 + > > > fs/afs/write.c | 4 ++-- > > > fs/ceph/addr.c | 1 + > > > fs/cifs/file.c | 1 + > > > fs/ocfs2/mmap.c | 1 + > > > fs/ubifs/file.c | 4 ++-- > > > 6 files changed, 8 insertions(+), 4 deletions(-) > > > > > > > > > diff --git a/fs/9p/vfs_file.c b/fs/9p/vfs_file.c > > > index c2483e9..aa253f0 100644 > > > --- a/fs/9p/vfs_file.c > > > +++ b/fs/9p/vfs_file.c > > > @@ -620,6 +620,7 @@ v9fs_vm_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf) > > > lock_page(page); > > > if (page->mapping != inode->i_mapping) > > > goto out_unlock; > > > + wait_on_stable_page_write(page); > > > > > > > Good god thanks, yes please ;-) > > > > > return VM_FAULT_LOCKED; > > > out_unlock: > > > diff --git a/fs/afs/write.c b/fs/afs/write.c > > > index 9aa52d9..39eb2a4 100644 > > > --- a/fs/afs/write.c > > > +++ b/fs/afs/write.c > > > @@ -758,7 +758,7 @@ int afs_page_mkwrite(struct vm_area_struct *vma, struct page *page) > > > > afs, is it not a network filesystem? which means that it has it''s own emulated none-block-device > > BDI, registered internally. So if you do need stable pages someone should call > > bdi_require_stable_pages() > > > > But again since it is a network filesystem I don''t see how it is needed, and/or it might be > > taken care of already. > > > > > #ifdef CONFIG_AFS_FSCACHE > > > fscache_wait_on_page_write(vnode->cache, page); > > > #endif > > > - > > > + wait_on_stable_page_write(page); > > > _leave(" = 0"); > > > - return 0; > > > + return VM_FAULT_LOCKED; > > > } > > > diff --git a/fs/ceph/addr.c b/fs/ceph/addr.c > > > > CEPH for sure has it''s own "emulated none-block-device BDI". This one is also > > a pure networking filesystem. > > > > And it already does what it needs to do with wait_on_writeback(). > > > > So i do not think you should touch CEPH > > > > > index 6690269..e9734bf 100644 > > > --- a/fs/ceph/addr.c > > > +++ b/fs/ceph/addr.c > > > @@ -1208,6 +1208,7 @@ static int ceph_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf) > > > set_page_dirty(page); > > > up_read(&mdsc->snap_rwsem); > > > ret = VM_FAULT_LOCKED; > > > + wait_on_stable_page_write(page); > > > } else { > > > if (ret == -ENOMEM) > > > ret = VM_FAULT_OOM; > > > diff --git a/fs/cifs/file.c b/fs/cifs/file.c > > > > Cifs also self-BDI network filesystem, but > > > > > index edb25b4..a8770bf 100644 > > > --- a/fs/cifs/file.c > > > +++ b/fs/cifs/file.c > > > @@ -2997,6 +2997,7 @@ cifs_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf) > > > struct page *page = vmf->page; > > > > > > lock_page(page); > > > > It waits by locking the page, that''s cifs naive way of waiting for writeback > > > > > + wait_on_stable_page_write(page); > > > > Instead it could do better and not override page_mkwrite at all, and all it needs > > to do is call bdi_require_stable_pages() at it''s own registered BDI > > > > Hmm...I don''t know... > > I''ve never been crazy about using the page lock for this, but in the > absence of a better way to guarantee stable pages, it was what I ended > up with at the time. cifs_writepages will hold the page lock until > kernel_sendmsg returns. At that point the TCP layer will have copied > off the page data so it''s safe to release it. > > With this change though, we''re going to end up blocking until the > writeback flag clears, right? And I think that will happen when the > reply comes in? So, we''ll end up blocking for much longer than is > really necessary in page_mkwrite with this change.That''s a very good point to make-- network FSes can stop the stable-waiting after the request is sent. Can I interest you in a new page flag (PG_stable) that indicates when a page has to be held for stable write? Along with a modification to wait_on_stable_page_write that uses the new PG_stable flag instead of just writeback? Then, you can clear PG_stable right after the sendmsg() and release the page for further activity without having to overload the page lock. I wrote a patch that does exactly that as part of my work to defer the integrity checksumming until the last possible instant. However, I haven''t gotten that part to work yet, so I left the PG_stable patch out of this submission. On the other hand, it sounds like you could use it. --D> > -- > Jeff Layton <jlayton@samba.org> > -- > To unsubscribe from this list: send the line "unsubscribe linux-ext4" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html-- To unsubscribe, send a message with ''unsubscribe linux-mm'' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don''t email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
Boaz Harrosh
2012-Nov-01 22:56 UTC
Re: [PATCH 1/3] bdi: Track users that require stable page writes
On 11/01/2012 11:57 AM, Darrick J. Wong wrote:> On Thu, Nov 01, 2012 at 11:21:22AM -0700, Boaz Harrosh wrote: >> On 11/01/2012 12:58 AM, Darrick J. Wong wrote: >>> This creates a per-backing-device counter that tracks the number of users which >>> require pages to be held immutable during writeout. Eventually it will be used >>> to waive wait_for_page_writeback() if nobody requires stable pages. >>> >> >> There is two things I do not like: >> 1. Please remind me why we need the users counter? >> If the device needs it, it will always be needed. The below >> queue_unrequire_stable_pages call at blk_integrity_unregister >> only happens at device destruction time, no? >> >> It was also said that maybe individual filesystems would need >> stable pages where other FSs using the same BDI do not. But >> since the FS is at the driving seat in any case, it can just >> do wait_for_writeback() regardless and can care less about >> the bdi flag and the other FSs. Actually all those FSs already >> do this this, and do not need any help. So this reason is >> mute. > > The counter exists so that a filesystem can forcibly enable stable page writes > even if the underlying device doesn''t require it, because the generic fs/mm > waiting only happens if stable_pages_required=1. The idea here was to allow a > filesystem that needs stable page writes for its own purposes (i.e. data block > checksumming) to be able to enforce the requirement even if the disk doesn''t > care (or doesn''t exist). >But the filesystem does not need BDI flag to do that, It can just call wait_on_page_writeback() directly and or any other waiting like cifs does, and this way will not affect any other partitions of the same BDI. So this flag is never needed by the FS, it is always to service the device.> But maybe there are no such filesystems? >Exactly, all the FSs that do care, already take care of it.> <shrug> I don''t really like the idea that you can dirty a page during writeout, > which means that the disk could write the before page, the after page, or some > weird mix of the two, and all you get is a promise that the after page will get > rewritten if the power doesn''t go out. Not to mention that it could happen > again. :) >I guess that''s life. But what is the difference between a modification of a page that comes 1 nanosecond before the end_write_back, and another one that came 1 nano after end_write_back. To the disk it looks like the same load pattern. I do have in mind a way that we can minimize these redirty-while-writeback by 90% but I don''t care enough (And am not paid) to fix it, so the contention is currently worse then what it could be.>> 2. I hate the atomic_read for every mkwrite. I think we can do >> better, since as you noted it can never be turned off, only >> on, at init time. And because of 1. above it is not dynamic. >> I think I like your previous simple bool better. > > I doubt the counter would change much; I could probably change it to something > less heavyweight if it''s really a problem. >I hope you are convinced that a counter is not needed, and a simple bool like before is enough Thanks Boaz -- To unsubscribe, send a message with ''unsubscribe linux-mm'' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don''t email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
Jan Kara
2012-Nov-01 23:15 UTC
Re: [PATCH 1/3] bdi: Track users that require stable page writes
On Thu 01-11-12 15:56:34, Boaz Harrosh wrote:> On 11/01/2012 11:57 AM, Darrick J. Wong wrote: > > On Thu, Nov 01, 2012 at 11:21:22AM -0700, Boaz Harrosh wrote: > >> On 11/01/2012 12:58 AM, Darrick J. Wong wrote: > >>> This creates a per-backing-device counter that tracks the number of users which > >>> require pages to be held immutable during writeout. Eventually it will be used > >>> to waive wait_for_page_writeback() if nobody requires stable pages. > >>> > >> > >> There is two things I do not like: > >> 1. Please remind me why we need the users counter? > >> If the device needs it, it will always be needed. The below > >> queue_unrequire_stable_pages call at blk_integrity_unregister > >> only happens at device destruction time, no? > >> > >> It was also said that maybe individual filesystems would need > >> stable pages where other FSs using the same BDI do not. But > >> since the FS is at the driving seat in any case, it can just > >> do wait_for_writeback() regardless and can care less about > >> the bdi flag and the other FSs. Actually all those FSs already > >> do this this, and do not need any help. So this reason is > >> mute. > > > > The counter exists so that a filesystem can forcibly enable stable page writes > > even if the underlying device doesn''t require it, because the generic fs/mm > > waiting only happens if stable_pages_required=1. The idea here was to allow a > > filesystem that needs stable page writes for its own purposes (i.e. data block > > checksumming) to be able to enforce the requirement even if the disk doesn''t > > care (or doesn''t exist). > > > > But the filesystem does not need BDI flag to do that, It can just call > wait_on_page_writeback() directly and or any other waiting like cifs does, > and this way will not affect any other partitions of the same BDI. So this flag > is never needed by the FS, it is always to service the device.But if they use some generic VFS functions, these VFS functions need to know whether to wait or not - e.g. grab_cache_page_write_begin() or block_page_mkwrite().> > But maybe there are no such filesystems? > > > > Exactly, all the FSs that do care, already take care of it.I''m not exactly sure whether FSs that do care didn''t start to use generic functions which now always call wait_on_page_writeback() for quite a few releases... Honza -- Jan Kara <jack-AlSwsSmVLrQ@public.gmane.org> SUSE Labs, CR
Jeff Layton
2012-Nov-02 00:36 UTC
Re: [PATCH 3/3] fs: Fix remaining filesystems to wait for stable page writeback
On Thu, 1 Nov 2012 15:47:30 -0700 "Darrick J. Wong" <darrick.wong@oracle.com> wrote:> On Thu, Nov 01, 2012 at 04:22:54PM -0400, Jeff Layton wrote: > > On Thu, 1 Nov 2012 11:43:26 -0700 > > Boaz Harrosh <bharrosh@panasas.com> wrote: > > > > > On 11/01/2012 12:58 AM, Darrick J. Wong wrote: > > > > Fix up the filesystems that provide their own ->page_mkwrite handlers to > > > > provide stable page writes if necessary. > > > > > > > > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> > > > > --- > > > > fs/9p/vfs_file.c | 1 + > > > > fs/afs/write.c | 4 ++-- > > > > fs/ceph/addr.c | 1 + > > > > fs/cifs/file.c | 1 + > > > > fs/ocfs2/mmap.c | 1 + > > > > fs/ubifs/file.c | 4 ++-- > > > > 6 files changed, 8 insertions(+), 4 deletions(-) > > > > > > > > > > > > diff --git a/fs/9p/vfs_file.c b/fs/9p/vfs_file.c > > > > index c2483e9..aa253f0 100644 > > > > --- a/fs/9p/vfs_file.c > > > > +++ b/fs/9p/vfs_file.c > > > > @@ -620,6 +620,7 @@ v9fs_vm_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf) > > > > lock_page(page); > > > > if (page->mapping != inode->i_mapping) > > > > goto out_unlock; > > > > + wait_on_stable_page_write(page); > > > > > > > > > > Good god thanks, yes please ;-) > > > > > > > return VM_FAULT_LOCKED; > > > > out_unlock: > > > > diff --git a/fs/afs/write.c b/fs/afs/write.c > > > > index 9aa52d9..39eb2a4 100644 > > > > --- a/fs/afs/write.c > > > > +++ b/fs/afs/write.c > > > > @@ -758,7 +758,7 @@ int afs_page_mkwrite(struct vm_area_struct *vma, struct page *page) > > > > > > afs, is it not a network filesystem? which means that it has it''s own emulated none-block-device > > > BDI, registered internally. So if you do need stable pages someone should call > > > bdi_require_stable_pages() > > > > > > But again since it is a network filesystem I don''t see how it is needed, and/or it might be > > > taken care of already. > > > > > > > #ifdef CONFIG_AFS_FSCACHE > > > > fscache_wait_on_page_write(vnode->cache, page); > > > > #endif > > > > - > > > > + wait_on_stable_page_write(page); > > > > _leave(" = 0"); > > > > - return 0; > > > > + return VM_FAULT_LOCKED; > > > > } > > > > diff --git a/fs/ceph/addr.c b/fs/ceph/addr.c > > > > > > CEPH for sure has it''s own "emulated none-block-device BDI". This one is also > > > a pure networking filesystem. > > > > > > And it already does what it needs to do with wait_on_writeback(). > > > > > > So i do not think you should touch CEPH > > > > > > > index 6690269..e9734bf 100644 > > > > --- a/fs/ceph/addr.c > > > > +++ b/fs/ceph/addr.c > > > > @@ -1208,6 +1208,7 @@ static int ceph_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf) > > > > set_page_dirty(page); > > > > up_read(&mdsc->snap_rwsem); > > > > ret = VM_FAULT_LOCKED; > > > > + wait_on_stable_page_write(page); > > > > } else { > > > > if (ret == -ENOMEM) > > > > ret = VM_FAULT_OOM; > > > > diff --git a/fs/cifs/file.c b/fs/cifs/file.c > > > > > > Cifs also self-BDI network filesystem, but > > > > > > > index edb25b4..a8770bf 100644 > > > > --- a/fs/cifs/file.c > > > > +++ b/fs/cifs/file.c > > > > @@ -2997,6 +2997,7 @@ cifs_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf) > > > > struct page *page = vmf->page; > > > > > > > > lock_page(page); > > > > > > It waits by locking the page, that''s cifs naive way of waiting for writeback > > > > > > > + wait_on_stable_page_write(page); > > > > > > Instead it could do better and not override page_mkwrite at all, and all it needs > > > to do is call bdi_require_stable_pages() at it''s own registered BDI > > > > > > > Hmm...I don''t know... > > > > I''ve never been crazy about using the page lock for this, but in the > > absence of a better way to guarantee stable pages, it was what I ended > > up with at the time. cifs_writepages will hold the page lock until > > kernel_sendmsg returns. At that point the TCP layer will have copied > > off the page data so it''s safe to release it. > > > > With this change though, we''re going to end up blocking until the > > writeback flag clears, right? And I think that will happen when the > > reply comes in? So, we''ll end up blocking for much longer than is > > really necessary in page_mkwrite with this change. > > That''s a very good point to make-- network FSes can stop the stable-waiting > after the request is sent.Well, it depends... If the fs in question uses kernel_sendpage (or the equivalent) then the page will be inlined into the fraglist of the skb. If you use that, then you can''t just drop it after the send. It''s also possible that the fs doesn''t care about page stability at all (like if signatures aren''t being used). So I think you probably need to account for several different possibilities of "page stability lifetimes" here...> Can I interest you in a new page flag (PG_stable) > that indicates when a page has to be held for stable write? Along with a > modification to wait_on_stable_page_write that uses the new PG_stable flag > instead of just writeback? Then, you can clear PG_stable right after the > sendmsg() and release the page for further activity without having to overload > the page lock. > > I wrote a patch that does exactly that as part of my work to defer the > integrity checksumming until the last possible instant. However, I haven''t > gotten that part to work yet, so I left the PG_stable patch out of this > submission. On the other hand, it sounds like you could use it. >That sounds much more suitable for CIFS and possibly for others too. -- Jeff Layton <jlayton@samba.org> -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Apparently Analagous Threads
- [PATCH 1/5] Btrfs: fix bh leak on __btrfs_open_devices path
- [PATCH 06/15] genhd: Add return code to device_add_disk
- [PATCH 06/15] genhd: Add return code to device_add_disk
- [PATCH] fs: push file_update_time into ->page_mkwrite
- [PATCH 06/15] genhd: Add return code to device_add_disk