Darrick J. Wong
2018-Oct-05 00:44 UTC
[Ocfs2-devel] [PATCH 00/15] fs: fixes for serious clone/dedupe problems
Hi all, Dave, Eric, and I have been chasing a stale data exposure bug in the XFS reflink implementation, and tracked it down to reflink forgetting to do some of the file-extending activities that must happen for regular writes. We then started auditing the clone, dedupe, and copyfile code and realized that from a file contents perspective, clonerange isn't any different from a regular file write. Unfortunately, we also noticed that *unlike* a regular write, clonerange skips a ton of overflow checks, such as validating the ranges against s_maxbytes, MAX_NON_LFS, and RLIMIT_FSIZE. We also observed that cloning into a file did not strip security privileges (suid, capabilities) like a regular write would. I also noticed that xfs and ocfs2 need to dump the page cache before remapping blocks, not after. In fixing the range checking problems I also realized that both dedupe and copyfile tell userspace how much of the requested operation was acted upon. Since the range validation can shorten a clone request (or we can ENOSPC midway through), we might as well plumb the short operation reporting back through the VFS indirection code to userspace. So, here's the whole giant pile of patches[1] that fix all the problems. The patch "generic: test reflink side effects" recently sent to fstests exercises the fixes in this series. Tests are in [2]. --D [1] https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=djwong-devel [2] https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfstests-dev.git/log/?h=djwong-devel
Darrick J. Wong
2018-Oct-05 00:44 UTC
[Ocfs2-devel] [PATCH 01/15] xfs: add a per-xfs trace_printk macro
From: Darrick J. Wong <darrick.wong at oracle.com> Add a "xfs_tprintk" macro so that developers can use trace_printk to print out arbitrary debugging information with the XFS device name attached to the trace output. Signed-off-by: Darrick J. Wong <darrick.wong at oracle.com> --- fs/xfs/xfs_error.h | 5 +++++ 1 file changed, 5 insertions(+) diff --git a/fs/xfs/xfs_error.h b/fs/xfs/xfs_error.h index 246d3e989c6c..c3d9546b138c 100644 --- a/fs/xfs/xfs_error.h +++ b/fs/xfs/xfs_error.h @@ -99,4 +99,9 @@ extern int xfs_errortag_clearall(struct xfs_mount *mp); #define XFS_PTAG_SHUTDOWN_LOGERROR 0x00000040 #define XFS_PTAG_FSBLOCK_ZERO 0x00000080 +/* trace printk version of xfs_err and friends */ +#define xfs_tprintk(mp, fmt, args...) \ + trace_printk("dev %d:%d " fmt, MAJOR((mp)->m_super->s_dev), \ + MINOR((mp)->m_super->s_dev), ##args) + #endif /* __XFS_ERROR_H__ */
Darrick J. Wong
2018-Oct-05 00:44 UTC
[Ocfs2-devel] [PATCH 02/15] xfs: refactor clonerange preparation into a separate helper
From: Darrick J. Wong <darrick.wong at oracle.com> Refactor all the reflink preparation steps into a separate helper that we'll use to land all the upcoming fixes for insufficient input checks. Signed-off-by: Darrick J. Wong <darrick.wong at oracle.com> --- fs/xfs/xfs_reflink.c | 96 +++++++++++++++++++++++++++++++++++++------------- 1 file changed, 71 insertions(+), 25 deletions(-) diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c index 38f405415b88..80ca9b6793cd 100644 --- a/fs/xfs/xfs_reflink.c +++ b/fs/xfs/xfs_reflink.c @@ -1195,11 +1195,33 @@ xfs_iolock_two_inodes_and_break_layout( return 0; } +/* Unlock both inodes after they've been prepped for a range clone. */ +STATIC void +xfs_reflink_remap_unlock( + struct file *file_in, + struct file *file_out) +{ + struct inode *inode_in = file_inode(file_in); + struct xfs_inode *src = XFS_I(inode_in); + struct inode *inode_out = file_inode(file_out); + struct xfs_inode *dest = XFS_I(inode_out); + bool same_inode = (inode_in == inode_out); + + xfs_iunlock(dest, XFS_MMAPLOCK_EXCL); + if (!same_inode) + xfs_iunlock(src, XFS_MMAPLOCK_SHARED); + inode_unlock(inode_out); + if (!same_inode) + inode_unlock_shared(inode_in); +} + /* - * Link a range of blocks from one file to another. + * Prepare two files for range cloning. Upon a successful return both inodes + * will have the iolock and mmaplock held, the page cache of the out file + * will be truncated, and any leases on the out file will have been broken. */ -int -xfs_reflink_remap_range( +STATIC int +xfs_reflink_remap_prep( struct file *file_in, loff_t pos_in, struct file *file_out, @@ -1211,19 +1233,9 @@ xfs_reflink_remap_range( struct xfs_inode *src = XFS_I(inode_in); struct inode *inode_out = file_inode(file_out); struct xfs_inode *dest = XFS_I(inode_out); - struct xfs_mount *mp = src->i_mount; bool same_inode = (inode_in == inode_out); - xfs_fileoff_t sfsbno, dfsbno; - xfs_filblks_t fsblen; - xfs_extlen_t cowextsize; ssize_t ret; - if (!xfs_sb_version_hasreflink(&mp->m_sb)) - return -EOPNOTSUPP; - - if (XFS_FORCED_SHUTDOWN(mp)) - return -EIO; - /* Lock both files against IO */ ret = xfs_iolock_two_inodes_and_break_layout(inode_in, inode_out); if (ret) @@ -1254,8 +1266,6 @@ xfs_reflink_remap_range( if (ret) goto out_unlock; - trace_xfs_reflink_remap_range(src, pos_in, len, dest, pos_out); - /* * Clear out post-eof preallocations because we don't have page cache * backing the delayed allocations and they'll never get freed on @@ -1272,6 +1282,51 @@ xfs_reflink_remap_range( if (ret) goto out_unlock; + /* Zap any page cache for the destination file's range. */ + truncate_inode_pages_range(&inode_out->i_data, pos_out, + PAGE_ALIGN(pos_out + len) - 1); + return 0; +out_unlock: + xfs_reflink_remap_unlock(file_in, file_out); + return ret; +} + +/* + * Link a range of blocks from one file to another. + */ +int +xfs_reflink_remap_range( + struct file *file_in, + loff_t pos_in, + struct file *file_out, + loff_t pos_out, + u64 len, + bool is_dedupe) +{ + struct inode *inode_in = file_inode(file_in); + struct xfs_inode *src = XFS_I(inode_in); + struct inode *inode_out = file_inode(file_out); + struct xfs_inode *dest = XFS_I(inode_out); + struct xfs_mount *mp = src->i_mount; + xfs_fileoff_t sfsbno, dfsbno; + xfs_filblks_t fsblen; + xfs_extlen_t cowextsize; + ssize_t ret; + + if (!xfs_sb_version_hasreflink(&mp->m_sb)) + return -EOPNOTSUPP; + + if (XFS_FORCED_SHUTDOWN(mp)) + return -EIO; + + /* Prepare and then clone file data. */ + ret = xfs_reflink_remap_prep(file_in, pos_in, file_out, pos_out, + len, is_dedupe); + if (ret) + return ret; + + trace_xfs_reflink_remap_range(src, pos_in, len, dest, pos_out); + dfsbno = XFS_B_TO_FSBT(mp, pos_out); sfsbno = XFS_B_TO_FSBT(mp, pos_in); fsblen = XFS_B_TO_FSB(mp, len); @@ -1280,10 +1335,6 @@ xfs_reflink_remap_range( if (ret) goto out_unlock; - /* Zap any page cache for the destination file's range. */ - truncate_inode_pages_range(&inode_out->i_data, pos_out, - PAGE_ALIGN(pos_out + len) - 1); - /* * Carry the cowextsize hint from src to dest if we're sharing the * entire source file to the entire destination file, the source file @@ -1300,12 +1351,7 @@ xfs_reflink_remap_range( is_dedupe); out_unlock: - xfs_iunlock(dest, XFS_MMAPLOCK_EXCL); - if (!same_inode) - xfs_iunlock(src, XFS_MMAPLOCK_SHARED); - inode_unlock(inode_out); - if (!same_inode) - inode_unlock_shared(inode_in); + xfs_reflink_remap_unlock(file_in, file_out); if (ret) trace_xfs_reflink_remap_range_error(dest, ret, _RET_IP_); return ret;
Darrick J. Wong
2018-Oct-05 00:44 UTC
[Ocfs2-devel] [PATCH 03/15] xfs: zero posteof blocks when cloning above eof
From: Darrick J. Wong <darrick.wong at oracle.com> When we're reflinking between two files and the destination file range is well beyond the destination file's EOF marker, zero any posteof speculative preallocations in the destination file so that we don't expose stale disk contents. The previous strategy of trying to clear the preallocations does not work if the destination file has the PREALLOC flag set. Uncovered by shared/010. Reported-by: Zorro Lang <zlang at redhat.com> Bugzilla-id: https://bugzilla.kernel.org/show_bug.cgi?id=201259 Signed-off-by: Darrick J. Wong <darrick.wong at oracle.com> --- fs/xfs/xfs_reflink.c | 33 +++++++++++++++++++++++++-------- 1 file changed, 25 insertions(+), 8 deletions(-) diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c index 80ca9b6793cd..55da7e1154f4 100644 --- a/fs/xfs/xfs_reflink.c +++ b/fs/xfs/xfs_reflink.c @@ -1215,6 +1215,26 @@ xfs_reflink_remap_unlock( inode_unlock_shared(inode_in); } +/* + * If we're reflinking to a point past the destination file's EOF, we must + * zero any speculative post-EOF preallocations that sit between the old EOF + * and the destination file offset. + */ +static int +xfs_reflink_zero_posteof( + struct xfs_inode *ip, + loff_t pos) +{ + loff_t isize = i_size_read(VFS_I(ip)); + + if (pos <= isize) + return 0; + + trace_xfs_zero_eof(ip, isize, pos - isize); + return iomap_zero_range(VFS_I(ip), isize, pos - isize, NULL, + &xfs_iomap_ops); +} + /* * Prepare two files for range cloning. Upon a successful return both inodes * will have the iolock and mmaplock held, the page cache of the out file @@ -1267,15 +1287,12 @@ xfs_reflink_remap_prep( goto out_unlock; /* - * Clear out post-eof preallocations because we don't have page cache - * backing the delayed allocations and they'll never get freed on - * their own. + * Zero existing post-eof speculative preallocations in the destination + * file. */ - if (xfs_can_free_eofblocks(dest, true)) { - ret = xfs_free_eofblocks(dest); - if (ret) - goto out_unlock; - } + ret = xfs_reflink_zero_posteof(dest, pos_out); + if (ret) + goto out_unlock; /* Set flags and remap blocks. */ ret = xfs_reflink_set_inode_flag(src, dest);
Darrick J. Wong
2018-Oct-05 00:45 UTC
[Ocfs2-devel] [PATCH 04/15] xfs: update ctime and remove suid before cloning files
From: Darrick J. Wong <darrick.wong at oracle.com> Before cloning into a file, update the ctime and remove sensitive attributes like suid, just like we'd do for a regular file write. Signed-off-by: Darrick J. Wong <darrick.wong at oracle.com> --- fs/xfs/xfs_reflink.c | 25 +++++++++++++++++++++++++ 1 file changed, 25 insertions(+) diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c index 55da7e1154f4..d8f209bc8937 100644 --- a/fs/xfs/xfs_reflink.c +++ b/fs/xfs/xfs_reflink.c @@ -1239,6 +1239,7 @@ xfs_reflink_zero_posteof( * Prepare two files for range cloning. Upon a successful return both inodes * will have the iolock and mmaplock held, the page cache of the out file * will be truncated, and any leases on the out file will have been broken. + * This function borrows heavily from xfs_file_aio_write_checks. */ STATIC int xfs_reflink_remap_prep( @@ -1302,6 +1303,30 @@ xfs_reflink_remap_prep( /* Zap any page cache for the destination file's range. */ truncate_inode_pages_range(&inode_out->i_data, pos_out, PAGE_ALIGN(pos_out + len) - 1); + + /* If we're altering the file contents... */ + if (!is_dedupe) { + /* + * ...update the timestamps (which will grab the ilock again + * from xfs_fs_dirty_inode, so we have to call it before we + * take the ilock). + */ + if (!(file_out->f_mode & FMODE_NOCMTIME)) { + ret = file_update_time(file_out); + if (ret) + goto out_unlock; + } + + /* + * ...clear the security bits if the process is not being run + * by root. This keeps people from modifying setuid and setgid + * binaries. + */ + ret = file_remove_privs(file_out); + if (ret) + goto out_unlock; + } + return 0; out_unlock: xfs_reflink_remap_unlock(file_in, file_out);
Darrick J. Wong
2018-Oct-05 00:45 UTC
[Ocfs2-devel] [PATCH 05/15] vfs: check file ranges before cloning files
From: Darrick J. Wong <darrick.wong at oracle.com> Move the file range checks from vfs_clone_file_prep into a separate generic_clone_checks function so that all the checks are collected in a central location. This forms the basis for adding more checks from generic_write_checks that will make cloning's input checking more consistent with write input checking. Signed-off-by: Darrick J. Wong <darrick.wong at oracle.com> --- fs/ocfs2/refcounttree.c | 2 + fs/read_write.c | 55 ++++++++++---------------------------- fs/xfs/xfs_reflink.c | 2 + include/linux/fs.h | 9 ++++-- mm/filemap.c | 68 +++++++++++++++++++++++++++++++++++++++++++++++ 5 files changed, 90 insertions(+), 46 deletions(-) diff --git a/fs/ocfs2/refcounttree.c b/fs/ocfs2/refcounttree.c index 7869622af22a..11e4aad7b783 100644 --- a/fs/ocfs2/refcounttree.c +++ b/fs/ocfs2/refcounttree.c @@ -4842,7 +4842,7 @@ int ocfs2_reflink_remap_range(struct file *file_in, (OCFS2_I(inode_out)->ip_flags & OCFS2_INODE_SYSTEM_FILE)) goto out_unlock; - ret = vfs_clone_file_prep_inodes(inode_in, pos_in, inode_out, pos_out, + ret = vfs_clone_file_prep(file_in, pos_in, file_out, pos_out, &len, is_dedupe); if (ret <= 0) goto out_unlock; diff --git a/fs/read_write.c b/fs/read_write.c index 39b4a21dd933..973d3da78c09 100644 --- a/fs/read_write.c +++ b/fs/read_write.c @@ -1717,12 +1717,13 @@ static int clone_verify_area(struct file *file, loff_t pos, u64 len, bool write) * Returns: 0 for "nothing to clone", 1 for "something to clone", or * the usual negative error code. */ -int vfs_clone_file_prep_inodes(struct inode *inode_in, loff_t pos_in, - struct inode *inode_out, loff_t pos_out, - u64 *len, bool is_dedupe) +int vfs_clone_file_prep(struct file *file_in, loff_t pos_in, + struct file *file_out, loff_t pos_out, + u64 *len, bool is_dedupe) { - loff_t bs = inode_out->i_sb->s_blocksize; - loff_t blen; + struct inode *inode_in = file_inode(file_in); + struct inode *inode_out = file_inode(file_out); + uint64_t nlen; loff_t isize; bool same_inode = (inode_in == inode_out); int ret; @@ -1740,10 +1741,7 @@ int vfs_clone_file_prep_inodes(struct inode *inode_in, loff_t pos_in, if (!S_ISREG(inode_in->i_mode) || !S_ISREG(inode_out->i_mode)) return -EINVAL; - /* Are we going all the way to the end? */ isize = i_size_read(inode_in); - if (isize == 0) - return 0; /* Zero length dedupe exits immediately; reflink goes to EOF. */ if (*len == 0) { @@ -1754,37 +1752,15 @@ int vfs_clone_file_prep_inodes(struct inode *inode_in, loff_t pos_in, *len = isize - pos_in; } - /* Ensure offsets don't wrap and the input is inside i_size */ - if (pos_in + *len < pos_in || pos_out + *len < pos_out || - pos_in + *len > isize) - return -EINVAL; - - /* Don't allow dedupe past EOF in the dest file */ - if (is_dedupe) { - loff_t disize; - - disize = i_size_read(inode_out); - if (pos_out >= disize || pos_out + *len > disize) - return -EINVAL; - } - - /* If we're linking to EOF, continue to the block boundary. */ - if (pos_in + *len == isize) - blen = ALIGN(isize, bs) - pos_in; - else - blen = *len; - - /* Only reflink if we're aligned to block boundaries */ - if (!IS_ALIGNED(pos_in, bs) || !IS_ALIGNED(pos_in + blen, bs) || - !IS_ALIGNED(pos_out, bs) || !IS_ALIGNED(pos_out + blen, bs)) + /* Check that we don't violate system file offset limits. */ + nlen = *len; + ret = generic_clone_checks(file_in, pos_in, file_out, pos_out, &nlen, + is_dedupe); + if (ret) + return ret; + if (nlen != *len) return -EINVAL; - /* Don't allow overlapped reflink within the same file */ - if (same_inode) { - if (pos_out + blen > pos_in && pos_out < pos_in + blen) - return -EINVAL; - } - /* Wait for the completion of any pending IOs on both files */ inode_dio_wait(inode_in); if (!same_inode) @@ -1816,7 +1792,7 @@ int vfs_clone_file_prep_inodes(struct inode *inode_in, loff_t pos_in, return 1; } -EXPORT_SYMBOL(vfs_clone_file_prep_inodes); +EXPORT_SYMBOL(vfs_clone_file_prep); int vfs_clone_file_range(struct file *file_in, loff_t pos_in, struct file *file_out, loff_t pos_out, u64 len) @@ -1854,9 +1830,6 @@ int vfs_clone_file_range(struct file *file_in, loff_t pos_in, if (ret) return ret; - if (pos_in + len > i_size_read(inode_in)) - return -EINVAL; - ret = file_in->f_op->clone_file_range(file_in, pos_in, file_out, pos_out, len); if (!ret) { diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c index d8f209bc8937..1955e093e9ea 100644 --- a/fs/xfs/xfs_reflink.c +++ b/fs/xfs/xfs_reflink.c @@ -1277,7 +1277,7 @@ xfs_reflink_remap_prep( if (IS_DAX(inode_in) || IS_DAX(inode_out)) goto out_unlock; - ret = vfs_clone_file_prep_inodes(inode_in, pos_in, inode_out, pos_out, + ret = vfs_clone_file_prep(file_in, pos_in, file_out, pos_out, &len, is_dedupe); if (ret <= 0) goto out_unlock; diff --git a/include/linux/fs.h b/include/linux/fs.h index 6c0b4a1c22ff..2a4141d36ebf 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -1825,9 +1825,9 @@ extern ssize_t vfs_readv(struct file *, const struct iovec __user *, unsigned long, loff_t *, rwf_t); extern ssize_t vfs_copy_file_range(struct file *, loff_t , struct file *, loff_t, size_t, unsigned int); -extern int vfs_clone_file_prep_inodes(struct inode *inode_in, loff_t pos_in, - struct inode *inode_out, loff_t pos_out, - u64 *len, bool is_dedupe); +extern int vfs_clone_file_prep(struct file *file_in, loff_t pos_in, + struct file *file_out, loff_t pos_out, + u64 *count, bool is_dedupe); extern int vfs_clone_file_range(struct file *file_in, loff_t pos_in, struct file *file_out, loff_t pos_out, u64 len); extern int vfs_dedupe_file_range_compare(struct inode *src, loff_t srcoff, @@ -2978,6 +2978,9 @@ extern int sb_min_blocksize(struct super_block *, int); extern int generic_file_mmap(struct file *, struct vm_area_struct *); extern int generic_file_readonly_mmap(struct file *, struct vm_area_struct *); extern ssize_t generic_write_checks(struct kiocb *, struct iov_iter *); +extern int generic_clone_checks(struct file *file_in, loff_t pos_in, + struct file *file_out, loff_t pos_out, + uint64_t *count, bool is_dedupe); extern ssize_t generic_file_read_iter(struct kiocb *, struct iov_iter *); extern ssize_t __generic_file_write_iter(struct kiocb *, struct iov_iter *); extern ssize_t generic_file_write_iter(struct kiocb *, struct iov_iter *); diff --git a/mm/filemap.c b/mm/filemap.c index 52517f28e6f4..68ec91d05c7b 100644 --- a/mm/filemap.c +++ b/mm/filemap.c @@ -2974,6 +2974,74 @@ inline ssize_t generic_write_checks(struct kiocb *iocb, struct iov_iter *from) } EXPORT_SYMBOL(generic_write_checks); +/* + * Performs necessary checks before doing a clone. + * + * Can adjust amount of bytes to clone. + * Returns appropriate error code that caller should return or + * zero in case the clone should be allowed. + */ +int generic_clone_checks(struct file *file_in, loff_t pos_in, + struct file *file_out, loff_t pos_out, + uint64_t *req_count, bool is_dedupe) +{ + struct inode *inode_in = file_in->f_mapping->host; + struct inode *inode_out = file_out->f_mapping->host; + unsigned long limit = rlimit(RLIMIT_FSIZE); + uint64_t count = *req_count; + uint64_t bcount; + loff_t size_in, size_out; + loff_t bs = inode_out->i_sb->s_blocksize; + + /* The start of both ranges must be aligned to an fs block. */ + if (!IS_ALIGNED(pos_in, bs) || !IS_ALIGNED(pos_out, bs)) + return -EINVAL; + + /* Ensure offsets don't wrap. */ + if (pos_in + count < pos_in || pos_out + count < pos_out) + return -EINVAL; + + size_in = i_size_read(inode_in); + size_out = i_size_read(inode_out); + + /* Dedupe requires both ranges to be within EOF. */ + if (is_dedupe && + (pos_in >= size_in || pos_in + count > size_in || + pos_out >= size_out || pos_out + count > size_out)) + return -EINVAL; + + /* Ensure the infile range is within the infile. */ + if (pos_in >= size_in) + return -EINVAL; + count = min(count, size_in - (uint64_t)pos_in); + + /* + * If the user wanted us to link to the infile's EOF, round up to the + * next block boundary for this check. + * + * Otherwise, make sure the count is also block-aligned, having + * already confirmed the starting offsets' block alignment. + */ + if (pos_in + count == size_in) { + bcount = ALIGN(size_in, bs) - pos_in; + } else { + if (!IS_ALIGNED(count, bs)) + return -EINVAL; + + bcount = count; + } + + /* Don't allow overlapped cloning within the same file. */ + if (inode_in == inode_out && + pos_out + bcount > pos_in && + pos_out < pos_in + bcount) + return -EINVAL; + + *req_count = count; + return 0; +} +EXPORT_SYMBOL(generic_clone_checks); + int pagecache_write_begin(struct file *file, struct address_space *mapping, loff_t pos, unsigned len, unsigned flags, struct page **pagep, void **fsdata)
Darrick J. Wong
2018-Oct-05 00:45 UTC
[Ocfs2-devel] [PATCH 06/15] vfs: strengthen checking of file range inputs to clone/dedupe range
From: Darrick J. Wong <darrick.wong at oracle.com> Clone range is an optimization on a regular file write. File writes that extend the file length are subject to various constraints which are not checked by clonerange. This is a correctness problem, because we're never allowed to touch ranges that the page cache can't support (s_maxbytes); we're not supposed to deal with large offsets (MAX_NON_LFS) if O_LARGEFILE isn't set; and we must obey resource limits (RLIMIT_FSIZE). Therefore, add these checks to the new generic_clone_checks function so that we curtail unexpected behavior. Signed-off-by: Darrick J. Wong <darrick.wong at oracle.com> --- mm/filemap.c | 31 +++++++++++++++++++++++++++++++ 1 file changed, 31 insertions(+) diff --git a/mm/filemap.c b/mm/filemap.c index 68ec91d05c7b..f74391721234 100644 --- a/mm/filemap.c +++ b/mm/filemap.c @@ -3015,6 +3015,37 @@ int generic_clone_checks(struct file *file_in, loff_t pos_in, return -EINVAL; count = min(count, size_in - (uint64_t)pos_in); + /* Don't exceed RLMIT_FSIZE in the file we're writing into. */ + if (limit != RLIM_INFINITY) { + if (pos_out >= limit) { + send_sig(SIGXFSZ, current, 0); + return -EFBIG; + } + count = min(count, limit - (uint64_t)pos_out); + } + + /* Don't exceed the LFS limits. */ + if (unlikely(pos_out + count > MAX_NON_LFS && + !(file_out->f_flags & O_LARGEFILE))) { + if (pos_out >= MAX_NON_LFS) + return -EFBIG; + count = min(count, MAX_NON_LFS - (uint64_t)pos_out); + } + if (unlikely(pos_in + count > MAX_NON_LFS && + !(file_in->f_flags & O_LARGEFILE))) { + if (pos_in >= MAX_NON_LFS) + return -EFBIG; + count = min(count, MAX_NON_LFS - (uint64_t)pos_in); + } + + /* Don't operate on ranges the page cache doesn't support. */ + if (unlikely(pos_out >= inode_out->i_sb->s_maxbytes || + pos_in >= inode_in->i_sb->s_maxbytes)) + return -EFBIG; + + count = min(count, inode_out->i_sb->s_maxbytes - (uint64_t)pos_out); + count = min(count, inode_in->i_sb->s_maxbytes - (uint64_t)pos_in); + /* * If the user wanted us to link to the infile's EOF, round up to the * next block boundary for this check.
Darrick J. Wong
2018-Oct-05 00:45 UTC
[Ocfs2-devel] [PATCH 07/15] vfs: skip zero-length dedupe requests
From: Darrick J. Wong <darrick.wong at oracle.com> Don't bother calling the filesystem for a zero-length dedupe request; we can return zero and exit. Signed-off-by: Darrick J. Wong <darrick.wong at oracle.com> --- fs/read_write.c | 5 +++++ 1 file changed, 5 insertions(+) diff --git a/fs/read_write.c b/fs/read_write.c index 973d3da78c09..99b2f809180c 100644 --- a/fs/read_write.c +++ b/fs/read_write.c @@ -1966,6 +1966,11 @@ int vfs_dedupe_file_range_one(struct file *src_file, loff_t src_pos, if (!dst_file->f_op->dedupe_file_range) goto out_drop_write; + if (len == 0) { + ret = 0; + goto out_drop_write; + } + ret = dst_file->f_op->dedupe_file_range(src_file, src_pos, dst_file, dst_pos, len); out_drop_write:
Darrick J. Wong
2018-Oct-05 00:45 UTC
[Ocfs2-devel] [PATCH 08/15] vfs: change clone and dedupe range function pointers to return bytes completed
From: Darrick J. Wong <darrick.wong at oracle.com> Change the clone_file_range and dedupe_file_range functions to return the number of bytes they operated on. This is the precursor to allowing fs implementations to return short clone/dedupe results to the user, which will enable us to obey resource limits in a graceful manner. Signed-off-by: Darrick J. Wong <darrick.wong at oracle.com> --- fs/btrfs/ctree.h | 4 ++-- fs/btrfs/ioctl.c | 13 +++++++++---- fs/nfs/nfs4file.c | 4 ++-- fs/ocfs2/file.c | 18 ++++++++++++------ fs/overlayfs/file.c | 18 ++++++++++++------ fs/read_write.c | 31 ++++++++++++++++++------------- fs/xfs/xfs_file.c | 18 ++++++++++++------ include/linux/fs.h | 12 +++++++----- 8 files changed, 74 insertions(+), 44 deletions(-) diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h index 2cddfe7806a4..864651257142 100644 --- a/fs/btrfs/ctree.h +++ b/fs/btrfs/ctree.h @@ -3218,7 +3218,7 @@ void btrfs_get_block_group_info(struct list_head *groups_list, struct btrfs_ioctl_space_info *space); void btrfs_update_ioctl_balance_args(struct btrfs_fs_info *fs_info, struct btrfs_ioctl_balance_args *bargs); -int btrfs_dedupe_file_range(struct file *src_file, loff_t src_loff, +s64 btrfs_dedupe_file_range(struct file *src_file, loff_t src_loff, struct file *dst_file, loff_t dst_loff, u64 olen); @@ -3250,7 +3250,7 @@ int btrfs_dirty_pages(struct inode *inode, struct page **pages, size_t num_pages, loff_t pos, size_t write_bytes, struct extent_state **cached); int btrfs_fdatawrite_range(struct inode *inode, loff_t start, loff_t end); -int btrfs_clone_file_range(struct file *file_in, loff_t pos_in, +s64 btrfs_clone_file_range(struct file *file_in, loff_t pos_in, struct file *file_out, loff_t pos_out, u64 len); /* tree-defrag.c */ diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c index d60b6caf09e8..35ba974f1333 100644 --- a/fs/btrfs/ioctl.c +++ b/fs/btrfs/ioctl.c @@ -3627,13 +3627,14 @@ static int btrfs_extent_same(struct inode *src, u64 loff, u64 olen, return ret; } -int btrfs_dedupe_file_range(struct file *src_file, loff_t src_loff, +s64 btrfs_dedupe_file_range(struct file *src_file, loff_t src_loff, struct file *dst_file, loff_t dst_loff, u64 olen) { struct inode *src = file_inode(src_file); struct inode *dst = file_inode(dst_file); u64 bs = BTRFS_I(src)->root->fs_info->sb->s_blocksize; + int ret; if (WARN_ON_ONCE(bs < PAGE_SIZE)) { /* @@ -3644,7 +3645,8 @@ int btrfs_dedupe_file_range(struct file *src_file, loff_t src_loff, return -EINVAL; } - return btrfs_extent_same(src, src_loff, olen, dst, dst_loff); + ret = btrfs_extent_same(src, src_loff, olen, dst, dst_loff); + return ret < 0 ? ret : olen; } static int clone_finish_inode_update(struct btrfs_trans_handle *trans, @@ -4348,10 +4350,13 @@ static noinline int btrfs_clone_files(struct file *file, struct file *file_src, return ret; } -int btrfs_clone_file_range(struct file *src_file, loff_t off, +s64 btrfs_clone_file_range(struct file *src_file, loff_t off, struct file *dst_file, loff_t destoff, u64 len) { - return btrfs_clone_files(dst_file, src_file, off, len, destoff); + int ret; + + ret = btrfs_clone_files(dst_file, src_file, off, len, destoff); + return ret < 0 ? ret : len; } static long btrfs_ioctl_default_subvol(struct file *file, void __user *argp) diff --git a/fs/nfs/nfs4file.c b/fs/nfs/nfs4file.c index 4288a6ecaf75..f914861f844f 100644 --- a/fs/nfs/nfs4file.c +++ b/fs/nfs/nfs4file.c @@ -180,7 +180,7 @@ static long nfs42_fallocate(struct file *filep, int mode, loff_t offset, loff_t return nfs42_proc_allocate(filep, offset, len); } -static int nfs42_clone_file_range(struct file *src_file, loff_t src_off, +static s64 nfs42_clone_file_range(struct file *src_file, loff_t src_off, struct file *dst_file, loff_t dst_off, u64 count) { struct inode *dst_inode = file_inode(dst_file); @@ -240,7 +240,7 @@ static int nfs42_clone_file_range(struct file *src_file, loff_t src_off, inode_unlock(src_inode); } out: - return ret; + return ret < 0 ? ret : count; } #endif /* CONFIG_NFS_V4_2 */ diff --git a/fs/ocfs2/file.c b/fs/ocfs2/file.c index 9fa35cb6f6e0..c4b78ee4a593 100644 --- a/fs/ocfs2/file.c +++ b/fs/ocfs2/file.c @@ -2527,24 +2527,30 @@ static loff_t ocfs2_file_llseek(struct file *file, loff_t offset, int whence) return offset; } -static int ocfs2_file_clone_range(struct file *file_in, +static s64 ocfs2_file_clone_range(struct file *file_in, loff_t pos_in, struct file *file_out, loff_t pos_out, u64 len) { - return ocfs2_reflink_remap_range(file_in, pos_in, file_out, pos_out, - len, false); + int ret; + + ret = ocfs2_reflink_remap_range(file_in, pos_in, file_out, pos_out, + len, false); + return ret < 0 ? ret : len; } -static int ocfs2_file_dedupe_range(struct file *file_in, +static s64 ocfs2_file_dedupe_range(struct file *file_in, loff_t pos_in, struct file *file_out, loff_t pos_out, u64 len) { - return ocfs2_reflink_remap_range(file_in, pos_in, file_out, pos_out, - len, true); + int ret; + + ret = ocfs2_reflink_remap_range(file_in, pos_in, file_out, pos_out, + len, true); + return ret < 0 ? ret : len; } const struct inode_operations ocfs2_file_iops = { diff --git a/fs/overlayfs/file.c b/fs/overlayfs/file.c index aeaefd2a551b..6d792d817538 100644 --- a/fs/overlayfs/file.c +++ b/fs/overlayfs/file.c @@ -487,16 +487,21 @@ static ssize_t ovl_copy_file_range(struct file *file_in, loff_t pos_in, OVL_COPY); } -static int ovl_clone_file_range(struct file *file_in, loff_t pos_in, +static s64 ovl_clone_file_range(struct file *file_in, loff_t pos_in, struct file *file_out, loff_t pos_out, u64 len) { - return ovl_copyfile(file_in, pos_in, file_out, pos_out, len, 0, - OVL_CLONE); + int ret; + + ret = ovl_copyfile(file_in, pos_in, file_out, pos_out, len, 0, + OVL_CLONE); + return ret < 0 ? ret : len; } -static int ovl_dedupe_file_range(struct file *file_in, loff_t pos_in, +static s64 ovl_dedupe_file_range(struct file *file_in, loff_t pos_in, struct file *file_out, loff_t pos_out, u64 len) { + int ret; + /* * Don't copy up because of a dedupe request, this wouldn't make sense * most of the time (data would be duplicated instead of deduplicated). @@ -505,8 +510,9 @@ static int ovl_dedupe_file_range(struct file *file_in, loff_t pos_in, !ovl_inode_upper(file_inode(file_out))) return -EPERM; - return ovl_copyfile(file_in, pos_in, file_out, pos_out, len, 0, - OVL_DEDUPE); + ret = ovl_copyfile(file_in, pos_in, file_out, pos_out, len, 0, + OVL_DEDUPE); + return ret < 0 ? ret : len; } const struct file_operations ovl_file_operations = { diff --git a/fs/read_write.c b/fs/read_write.c index 99b2f809180c..f51751281454 100644 --- a/fs/read_write.c +++ b/fs/read_write.c @@ -1589,10 +1589,12 @@ ssize_t vfs_copy_file_range(struct file *file_in, loff_t pos_in, * more efficient if both clone and copy are supported (e.g. NFS). */ if (file_in->f_op->clone_file_range) { - ret = file_in->f_op->clone_file_range(file_in, pos_in, - file_out, pos_out, len); - if (ret == 0) { - ret = len; + s64 cloned; + + cloned = file_in->f_op->clone_file_range(file_in, pos_in, + file_out, pos_out, min(MAX_RW_COUNT, len)); + if (cloned >= 0) { + ret = cloned; goto done; } } @@ -1799,6 +1801,7 @@ int vfs_clone_file_range(struct file *file_in, loff_t pos_in, { struct inode *inode_in = file_inode(file_in); struct inode *inode_out = file_inode(file_out); + s64 cloned; int ret; if (S_ISDIR(inode_in->i_mode) || S_ISDIR(inode_out->i_mode)) @@ -1830,14 +1833,16 @@ int vfs_clone_file_range(struct file *file_in, loff_t pos_in, if (ret) return ret; - ret = file_in->f_op->clone_file_range(file_in, pos_in, + cloned = file_in->f_op->clone_file_range(file_in, pos_in, file_out, pos_out, len); - if (!ret) { - fsnotify_access(file_in); - fsnotify_modify(file_out); - } + if (cloned < 0) + return cloned; + else if (len && cloned != len) + return -EINVAL; - return ret; + fsnotify_access(file_in); + fsnotify_modify(file_out); + return 0; } EXPORT_SYMBOL(vfs_clone_file_range); @@ -1937,7 +1942,7 @@ int vfs_dedupe_file_range_compare(struct inode *src, loff_t srcoff, } EXPORT_SYMBOL(vfs_dedupe_file_range_compare); -int vfs_dedupe_file_range_one(struct file *src_file, loff_t src_pos, +s64 vfs_dedupe_file_range_one(struct file *src_file, loff_t src_pos, struct file *dst_file, loff_t dst_pos, u64 len) { s64 ret; @@ -1989,7 +1994,7 @@ int vfs_dedupe_file_range(struct file *file, struct file_dedupe_range *same) int i; int ret; u16 count = same->dest_count; - int deduped; + s64 deduped; if (!(file->f_mode & FMODE_READ)) return -EINVAL; @@ -2046,7 +2051,7 @@ int vfs_dedupe_file_range(struct file *file, struct file_dedupe_range *same) else if (deduped < 0) info->status = deduped; else - info->bytes_deduped = len; + info->bytes_deduped = deduped; next_fdput: fdput(dst_fd); diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c index 61a5ad2600e8..efa95e0d8cee 100644 --- a/fs/xfs/xfs_file.c +++ b/fs/xfs/xfs_file.c @@ -919,7 +919,7 @@ xfs_file_fallocate( return error; } -STATIC int +STATIC s64 xfs_file_clone_range( struct file *file_in, loff_t pos_in, @@ -927,11 +927,14 @@ xfs_file_clone_range( loff_t pos_out, u64 len) { - return xfs_reflink_remap_range(file_in, pos_in, file_out, pos_out, - len, false); + int ret; + + ret = xfs_reflink_remap_range(file_in, pos_in, file_out, pos_out, + len, false); + return ret < 0 ? ret : len; } -STATIC int +STATIC s64 xfs_file_dedupe_range( struct file *file_in, loff_t pos_in, @@ -939,8 +942,11 @@ xfs_file_dedupe_range( loff_t pos_out, u64 len) { - return xfs_reflink_remap_range(file_in, pos_in, file_out, pos_out, - len, true); + int ret; + + ret = xfs_reflink_remap_range(file_in, pos_in, file_out, pos_out, + len, true); + return ret < 0 ? ret : len; } STATIC int diff --git a/include/linux/fs.h b/include/linux/fs.h index 2a4141d36ebf..e5755340e825 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -1759,10 +1759,12 @@ struct file_operations { #endif ssize_t (*copy_file_range)(struct file *, loff_t, struct file *, loff_t, size_t, unsigned int); - int (*clone_file_range)(struct file *, loff_t, struct file *, loff_t, - u64); - int (*dedupe_file_range)(struct file *, loff_t, struct file *, loff_t, - u64); + s64 (*clone_file_range)(struct file *file_in, loff_t pos_in, + struct file *file_out, loff_t pos_out, + u64 count); + s64 (*dedupe_file_range)(struct file *file_in, loff_t pos_in, + struct file *file_out, loff_t pos_out, + u64 count); int (*fadvise)(struct file *, loff_t, loff_t, int); } __randomize_layout; @@ -1835,7 +1837,7 @@ extern int vfs_dedupe_file_range_compare(struct inode *src, loff_t srcoff, loff_t len, bool *is_same); extern int vfs_dedupe_file_range(struct file *file, struct file_dedupe_range *same); -extern int vfs_dedupe_file_range_one(struct file *src_file, loff_t src_pos, +extern s64 vfs_dedupe_file_range_one(struct file *src_file, loff_t src_pos, struct file *dst_file, loff_t dst_pos, u64 len);
Darrick J. Wong
2018-Oct-05 00:45 UTC
[Ocfs2-devel] [PATCH 09/15] vfs: pass operation flags to {clone, dedupe}_file_range implementations
From: Darrick J. Wong <darrick.wong at oracle.com> Pass operational flags to the per-filesystem clone and dedupe implementations. This enables the vfs to signal when it can deal with short clone and short dedupe operations. Signed-off-by: Darrick J. Wong <darrick.wong at oracle.com> --- fs/btrfs/ctree.h | 3 ++- fs/btrfs/ioctl.c | 3 ++- fs/nfs/nfs4file.c | 3 ++- fs/ocfs2/file.c | 3 ++- fs/ocfs2/refcounttree.c | 2 +- fs/overlayfs/file.c | 3 ++- fs/read_write.c | 9 ++++++--- fs/xfs/xfs_file.c | 3 ++- fs/xfs/xfs_reflink.c | 2 +- include/linux/fs.h | 10 ++++++++-- 10 files changed, 28 insertions(+), 13 deletions(-) diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h index 864651257142..e8c9b871709d 100644 --- a/fs/btrfs/ctree.h +++ b/fs/btrfs/ctree.h @@ -3251,7 +3251,8 @@ int btrfs_dirty_pages(struct inode *inode, struct page **pages, struct extent_state **cached); int btrfs_fdatawrite_range(struct inode *inode, loff_t start, loff_t end); s64 btrfs_clone_file_range(struct file *file_in, loff_t pos_in, - struct file *file_out, loff_t pos_out, u64 len); + struct file *file_out, loff_t pos_out, u64 len, + unsigned int flags); /* tree-defrag.c */ int btrfs_defrag_leaves(struct btrfs_trans_handle *trans, diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c index 35ba974f1333..b41a65622b93 100644 --- a/fs/btrfs/ioctl.c +++ b/fs/btrfs/ioctl.c @@ -4351,7 +4351,8 @@ static noinline int btrfs_clone_files(struct file *file, struct file *file_src, } s64 btrfs_clone_file_range(struct file *src_file, loff_t off, - struct file *dst_file, loff_t destoff, u64 len) + struct file *dst_file, loff_t destoff, u64 len, + unsigned int flags) { int ret; diff --git a/fs/nfs/nfs4file.c b/fs/nfs/nfs4file.c index f914861f844f..f8ff06fc1c73 100644 --- a/fs/nfs/nfs4file.c +++ b/fs/nfs/nfs4file.c @@ -181,7 +181,8 @@ static long nfs42_fallocate(struct file *filep, int mode, loff_t offset, loff_t } static s64 nfs42_clone_file_range(struct file *src_file, loff_t src_off, - struct file *dst_file, loff_t dst_off, u64 count) + struct file *dst_file, loff_t dst_off, u64 count, + unsigned int flags) { struct inode *dst_inode = file_inode(dst_file); struct nfs_server *server = NFS_SERVER(dst_inode); diff --git a/fs/ocfs2/file.c b/fs/ocfs2/file.c index c4b78ee4a593..1ee6d3ecdac6 100644 --- a/fs/ocfs2/file.c +++ b/fs/ocfs2/file.c @@ -2531,7 +2531,8 @@ static s64 ocfs2_file_clone_range(struct file *file_in, loff_t pos_in, struct file *file_out, loff_t pos_out, - u64 len) + u64 len, + unsigned int flags) { int ret; diff --git a/fs/ocfs2/refcounttree.c b/fs/ocfs2/refcounttree.c index 11e4aad7b783..3758954f2377 100644 --- a/fs/ocfs2/refcounttree.c +++ b/fs/ocfs2/refcounttree.c @@ -4843,7 +4843,7 @@ int ocfs2_reflink_remap_range(struct file *file_in, goto out_unlock; ret = vfs_clone_file_prep(file_in, pos_in, file_out, pos_out, - &len, is_dedupe); + &len, is_dedupe ? CLONERANGE_DEDUPE : 0); if (ret <= 0) goto out_unlock; diff --git a/fs/overlayfs/file.c b/fs/overlayfs/file.c index 6d792d817538..440cb7a82834 100644 --- a/fs/overlayfs/file.c +++ b/fs/overlayfs/file.c @@ -488,7 +488,8 @@ static ssize_t ovl_copy_file_range(struct file *file_in, loff_t pos_in, } static s64 ovl_clone_file_range(struct file *file_in, loff_t pos_in, - struct file *file_out, loff_t pos_out, u64 len) + struct file *file_out, loff_t pos_out, u64 len, + unsigned int flags) { int ret; diff --git a/fs/read_write.c b/fs/read_write.c index f51751281454..7cfff497263b 100644 --- a/fs/read_write.c +++ b/fs/read_write.c @@ -1592,7 +1592,8 @@ ssize_t vfs_copy_file_range(struct file *file_in, loff_t pos_in, s64 cloned; cloned = file_in->f_op->clone_file_range(file_in, pos_in, - file_out, pos_out, min(MAX_RW_COUNT, len)); + file_out, pos_out, min(MAX_RW_COUNT, len), + CLONERANGE_SHORT); if (cloned >= 0) { ret = cloned; goto done; @@ -1721,13 +1722,14 @@ static int clone_verify_area(struct file *file, loff_t pos, u64 len, bool write) */ int vfs_clone_file_prep(struct file *file_in, loff_t pos_in, struct file *file_out, loff_t pos_out, - u64 *len, bool is_dedupe) + u64 *len, unsigned int flags) { struct inode *inode_in = file_inode(file_in); struct inode *inode_out = file_inode(file_out); uint64_t nlen; loff_t isize; bool same_inode = (inode_in == inode_out); + bool is_dedupe = (flags & CLONERANGE_DEDUPE); int ret; /* Don't touch certain kinds of inodes */ @@ -1802,6 +1804,7 @@ int vfs_clone_file_range(struct file *file_in, loff_t pos_in, struct inode *inode_in = file_inode(file_in); struct inode *inode_out = file_inode(file_out); s64 cloned; + unsigned int flags = 0; int ret; if (S_ISDIR(inode_in->i_mode) || S_ISDIR(inode_out->i_mode)) @@ -1834,7 +1837,7 @@ int vfs_clone_file_range(struct file *file_in, loff_t pos_in, return ret; cloned = file_in->f_op->clone_file_range(file_in, pos_in, - file_out, pos_out, len); + file_out, pos_out, len, flags); if (cloned < 0) return cloned; else if (len && cloned != len) diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c index efa95e0d8cee..d5d6681ca714 100644 --- a/fs/xfs/xfs_file.c +++ b/fs/xfs/xfs_file.c @@ -925,7 +925,8 @@ xfs_file_clone_range( loff_t pos_in, struct file *file_out, loff_t pos_out, - u64 len) + u64 len, + unsigned int flags) { int ret; diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c index 1955e093e9ea..40684dd011ee 100644 --- a/fs/xfs/xfs_reflink.c +++ b/fs/xfs/xfs_reflink.c @@ -1278,7 +1278,7 @@ xfs_reflink_remap_prep( goto out_unlock; ret = vfs_clone_file_prep(file_in, pos_in, file_out, pos_out, - &len, is_dedupe); + &len, is_dedupe ? CLONERANGE_DEDUPE : 0); if (ret <= 0) goto out_unlock; diff --git a/include/linux/fs.h b/include/linux/fs.h index e5755340e825..ae5685c31270 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -1761,7 +1761,7 @@ struct file_operations { loff_t, size_t, unsigned int); s64 (*clone_file_range)(struct file *file_in, loff_t pos_in, struct file *file_out, loff_t pos_out, - u64 count); + u64 count, unsigned int flags); s64 (*dedupe_file_range)(struct file *file_in, loff_t pos_in, struct file *file_out, loff_t pos_out, u64 count); @@ -1827,9 +1827,15 @@ extern ssize_t vfs_readv(struct file *, const struct iovec __user *, unsigned long, loff_t *, rwf_t); extern ssize_t vfs_copy_file_range(struct file *, loff_t , struct file *, loff_t, size_t, unsigned int); +/* Caller can handle a shortened operation. */ +#define CLONERANGE_SHORT (1 << 0) +/* End operation at the source file's EOF. */ +#define CLONERANGE_EOF (1 << 1) +/* Operation is actually dedupe, not clone. */ +#define CLONERANGE_DEDUPE (1 << 2) extern int vfs_clone_file_prep(struct file *file_in, loff_t pos_in, struct file *file_out, loff_t pos_out, - u64 *count, bool is_dedupe); + u64 *count, unsigned int flags); extern int vfs_clone_file_range(struct file *file_in, loff_t pos_in, struct file *file_out, loff_t pos_out, u64 len); extern int vfs_dedupe_file_range_compare(struct inode *src, loff_t srcoff,
Darrick J. Wong
2018-Oct-05 00:45 UTC
[Ocfs2-devel] [PATCH 10/15] vfs: make cloning to source file eof more explicit
From: Darrick J. Wong <darrick.wong at oracle.com> Use the new CLONE_RANGE_EOF flag to explicitly declare that the caller wants the clone implementation to set *len appropriately once the files are locked. Signed-off-by: Darrick J. Wong <darrick.wong at oracle.com> --- fs/read_write.c | 14 +++++++++----- 1 file changed, 9 insertions(+), 5 deletions(-) diff --git a/fs/read_write.c b/fs/read_write.c index 7cfff497263b..4eaea52f70a8 100644 --- a/fs/read_write.c +++ b/fs/read_write.c @@ -1747,11 +1747,12 @@ int vfs_clone_file_prep(struct file *file_in, loff_t pos_in, isize = i_size_read(inode_in); - /* Zero length dedupe exits immediately; reflink goes to EOF. */ - if (*len == 0) { - if (is_dedupe || pos_in == isize) - return 0; - if (pos_in > isize) + /* + * If the caller asked to go all the way to the end of the source file, + * set *len now that we have the file locked. + */ + if ((flags & CLONERANGE_EOF) && *len == 0) { + if (pos_in >= isize) return -EINVAL; *len = isize - pos_in; } @@ -1836,6 +1837,9 @@ int vfs_clone_file_range(struct file *file_in, loff_t pos_in, if (ret) return ret; + if (len == 0) + flags |= CLONERANGE_EOF; + cloned = file_in->f_op->clone_file_range(file_in, pos_in, file_out, pos_out, len, flags); if (cloned < 0)
Darrick J. Wong
2018-Oct-05 00:45 UTC
[Ocfs2-devel] [PATCH 11/15] vfs: allow short clone and dedupe operations
From: Darrick J. Wong <darrick.wong at oracle.com> Allow the clone and dedupe prep function to shorten the request if the caller can handle it. Signed-off-by: Darrick J. Wong <darrick.wong at oracle.com> --- fs/read_write.c | 8 ++------ include/linux/fs.h | 2 +- mm/filemap.c | 19 +++++++++++++++---- 3 files changed, 18 insertions(+), 11 deletions(-) diff --git a/fs/read_write.c b/fs/read_write.c index 4eaea52f70a8..292d68c2f47c 100644 --- a/fs/read_write.c +++ b/fs/read_write.c @@ -1726,7 +1726,6 @@ int vfs_clone_file_prep(struct file *file_in, loff_t pos_in, { struct inode *inode_in = file_inode(file_in); struct inode *inode_out = file_inode(file_out); - uint64_t nlen; loff_t isize; bool same_inode = (inode_in == inode_out); bool is_dedupe = (flags & CLONERANGE_DEDUPE); @@ -1758,13 +1757,10 @@ int vfs_clone_file_prep(struct file *file_in, loff_t pos_in, } /* Check that we don't violate system file offset limits. */ - nlen = *len; - ret = generic_clone_checks(file_in, pos_in, file_out, pos_out, &nlen, - is_dedupe); + ret = generic_clone_checks(file_in, pos_in, file_out, pos_out, len, + flags); if (ret) return ret; - if (nlen != *len) - return -EINVAL; /* Wait for the completion of any pending IOs on both files */ inode_dio_wait(inode_in); diff --git a/include/linux/fs.h b/include/linux/fs.h index ae5685c31270..eb35363478e5 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -2988,7 +2988,7 @@ extern int generic_file_readonly_mmap(struct file *, struct vm_area_struct *); extern ssize_t generic_write_checks(struct kiocb *, struct iov_iter *); extern int generic_clone_checks(struct file *file_in, loff_t pos_in, struct file *file_out, loff_t pos_out, - uint64_t *count, bool is_dedupe); + uint64_t *count, unsigned int flags); extern ssize_t generic_file_read_iter(struct kiocb *, struct iov_iter *); extern ssize_t __generic_file_write_iter(struct kiocb *, struct iov_iter *); extern ssize_t generic_file_write_iter(struct kiocb *, struct iov_iter *); diff --git a/mm/filemap.c b/mm/filemap.c index f74391721234..013451b8017f 100644 --- a/mm/filemap.c +++ b/mm/filemap.c @@ -2983,7 +2983,7 @@ EXPORT_SYMBOL(generic_write_checks); */ int generic_clone_checks(struct file *file_in, loff_t pos_in, struct file *file_out, loff_t pos_out, - uint64_t *req_count, bool is_dedupe) + uint64_t *req_count, unsigned int flags) { struct inode *inode_in = file_in->f_mapping->host; struct inode *inode_out = file_out->f_mapping->host; @@ -3005,7 +3005,7 @@ int generic_clone_checks(struct file *file_in, loff_t pos_in, size_out = i_size_read(inode_out); /* Dedupe requires both ranges to be within EOF. */ - if (is_dedupe && + if ((flags & CLONERANGE_DEDUPE) && (pos_in >= size_in || pos_in + count > size_in || pos_out >= size_out || pos_out + count > size_out)) return -EINVAL; @@ -3056,8 +3056,12 @@ int generic_clone_checks(struct file *file_in, loff_t pos_in, if (pos_in + count == size_in) { bcount = ALIGN(size_in, bs) - pos_in; } else { - if (!IS_ALIGNED(count, bs)) - return -EINVAL; + if (!IS_ALIGNED(count, bs)) { + if (flags & CLONERANGE_SHORT) + count = ALIGN_DOWN(count, bs); + else + return -EINVAL; + } bcount = count; } @@ -3068,6 +3072,13 @@ int generic_clone_checks(struct file *file_in, loff_t pos_in, pos_out < pos_in + bcount) return -EINVAL; + /* + * We shortened the request but the caller can't deal with that, so + * bounce the request back to userspace. + */ + if (*req_count != count && !(flags & CLONERANGE_SHORT)) + return -EINVAL; + *req_count = count; return 0; }
Darrick J. Wong
2018-Oct-05 00:46 UTC
[Ocfs2-devel] [PATCH 12/15] vfs: implement opportunistic short dedupe
From: Darrick J. Wong <darrick.wong at oracle.com> For a given dedupe request, the bytes_deduped field in the control structure tells userspace if we managed to deduplicate some, but not all of, the requested regions starting from the file offsets supplied. However, due to sloppy coding, the current dedupe code returns FILE_DEDUPE_RANGE_DIFFERS if any part of the range is different. Fix this so that we can actually support partial request completion. Signed-off-by: Darrick J. Wong <darrick.wong at oracle.com> --- fs/read_write.c | 44 +++++++++++++++++++++++++++++++++++--------- include/linux/fs.h | 2 +- 2 files changed, 36 insertions(+), 10 deletions(-) diff --git a/fs/read_write.c b/fs/read_write.c index 292d68c2f47c..9be9f261edd2 100644 --- a/fs/read_write.c +++ b/fs/read_write.c @@ -1781,13 +1781,11 @@ int vfs_clone_file_prep(struct file *file_in, loff_t pos_in, * Check that the extents are the same. */ if (is_dedupe) { - bool is_same = false; - ret = vfs_dedupe_file_range_compare(inode_in, pos_in, - inode_out, pos_out, *len, &is_same); + inode_out, pos_out, len); if (ret) return ret; - if (!is_same) + if (*len == 0) return -EBADE; } @@ -1872,13 +1870,30 @@ static struct page *vfs_dedupe_get_page(struct inode *inode, loff_t offset) return page; } +static unsigned int vfs_dedupe_memcmp(const char *s1, const char *s2, + unsigned int cmp_len) +{ + const char *orig_s1 = s1; + const char *e1 = s1 + cmp_len; + const char *e2 = s2 + cmp_len; + + while (s1 < e1 && s2 < e2) { + if (*s1 != *s2) + break; + s1++; + s2++; + } + + return s1 - orig_s1; +} + /* * Compare extents of two files to see if they are the same. * Caller must have locked both inodes to prevent write races. */ int vfs_dedupe_file_range_compare(struct inode *src, loff_t srcoff, struct inode *dest, loff_t destoff, - loff_t len, bool *is_same) + loff_t *req_len) { loff_t src_poff; loff_t dest_poff; @@ -1886,8 +1901,11 @@ int vfs_dedupe_file_range_compare(struct inode *src, loff_t srcoff, void *dest_addr; struct page *src_page; struct page *dest_page; - loff_t cmp_len; + loff_t len = *req_len; + loff_t same_len = 0; bool same; + unsigned int cmp_len; + unsigned int cmp_same; int error; error = -EINVAL; @@ -1897,7 +1915,7 @@ int vfs_dedupe_file_range_compare(struct inode *src, loff_t srcoff, dest_poff = destoff & (PAGE_SIZE - 1); cmp_len = min(PAGE_SIZE - src_poff, PAGE_SIZE - dest_poff); - cmp_len = min(cmp_len, len); + cmp_len = min_t(loff_t, cmp_len, len); if (cmp_len <= 0) goto out_error; @@ -1919,7 +1937,10 @@ int vfs_dedupe_file_range_compare(struct inode *src, loff_t srcoff, flush_dcache_page(src_page); flush_dcache_page(dest_page); - if (memcmp(src_addr + src_poff, dest_addr + dest_poff, cmp_len)) + cmp_same = vfs_dedupe_memcmp(src_addr + src_poff, + dest_addr + dest_poff, cmp_len); + same_len += cmp_same; + if (cmp_same != cmp_len) same = false; kunmap_atomic(dest_addr); @@ -1937,7 +1958,12 @@ int vfs_dedupe_file_range_compare(struct inode *src, loff_t srcoff, len -= cmp_len; } - *is_same = same; + /* + * If less than the whole range matched, we have to back down to the + * nearest block boundary. + */ + if (*req_len != same_len) + *req_len = ALIGN_DOWN(same_len, dest->i_sb->s_blocksize); return 0; out_error: diff --git a/include/linux/fs.h b/include/linux/fs.h index eb35363478e5..490128b84d10 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -1840,7 +1840,7 @@ extern int vfs_clone_file_range(struct file *file_in, loff_t pos_in, struct file *file_out, loff_t pos_out, u64 len); extern int vfs_dedupe_file_range_compare(struct inode *src, loff_t srcoff, struct inode *dest, loff_t destoff, - loff_t len, bool *is_same); + loff_t *len); extern int vfs_dedupe_file_range(struct file *file, struct file_dedupe_range *same); extern s64 vfs_dedupe_file_range_one(struct file *src_file, loff_t src_pos,
Darrick J. Wong
2018-Oct-05 00:46 UTC
[Ocfs2-devel] [PATCH 13/15] ocfs2: truncate page cache for clone destination file before remapping
From: Darrick J. Wong <darrick.wong at oracle.com> When cloning blocks into another file, truncate the page cache before we start remapping blocks so that concurrent reads wait for us to finish. Signed-off-by: Darrick J. Wong <darrick.wong at oracle.com> --- fs/ocfs2/refcounttree.c | 10 ++++------ 1 file changed, 4 insertions(+), 6 deletions(-) diff --git a/fs/ocfs2/refcounttree.c b/fs/ocfs2/refcounttree.c index 3758954f2377..02283c38b5a4 100644 --- a/fs/ocfs2/refcounttree.c +++ b/fs/ocfs2/refcounttree.c @@ -4853,14 +4853,12 @@ int ocfs2_reflink_remap_range(struct file *file_in, down_write_nested(&OCFS2_I(inode_out)->ip_alloc_sem, SINGLE_DEPTH_NESTING); - ret = ocfs2_reflink_remap_blocks(inode_in, in_bh, pos_in, inode_out, - out_bh, pos_out, len); - /* Zap any page cache for the destination file's range. */ - if (!ret) - truncate_inode_pages_range(&inode_out->i_data, pos_out, - PAGE_ALIGN(pos_out + len) - 1); + truncate_inode_pages_range(&inode_out->i_data, pos_out, + PAGE_ALIGN(pos_out + len) - 1); + ret = ocfs2_reflink_remap_blocks(inode_in, in_bh, pos_in, inode_out, + out_bh, pos_out, len); up_write(&OCFS2_I(inode_in)->ip_alloc_sem); if (!same_inode) up_write(&OCFS2_I(inode_out)->ip_alloc_sem);
Darrick J. Wong
2018-Oct-05 00:46 UTC
[Ocfs2-devel] [PATCH 14/15] ocfs2: support partial clone range and dedupe range
From: Darrick J. Wong <darrick.wong at oracle.com> Change the ocfs2 remap code to allow for returning partial results. Signed-off-by: Darrick J. Wong <darrick.wong at oracle.com> --- fs/ocfs2/file.c | 14 ++++---------- fs/ocfs2/refcounttree.c | 35 ++++++++++++++++++++--------------- fs/ocfs2/refcounttree.h | 4 ++-- 3 files changed, 26 insertions(+), 27 deletions(-) diff --git a/fs/ocfs2/file.c b/fs/ocfs2/file.c index 1ee6d3ecdac6..4562111259b4 100644 --- a/fs/ocfs2/file.c +++ b/fs/ocfs2/file.c @@ -2534,11 +2534,8 @@ static s64 ocfs2_file_clone_range(struct file *file_in, u64 len, unsigned int flags) { - int ret; - - ret = ocfs2_reflink_remap_range(file_in, pos_in, file_out, pos_out, - len, false); - return ret < 0 ? ret : len; + return ocfs2_reflink_remap_range(file_in, pos_in, file_out, pos_out, + len, flags); } static s64 ocfs2_file_dedupe_range(struct file *file_in, @@ -2547,11 +2544,8 @@ static s64 ocfs2_file_dedupe_range(struct file *file_in, loff_t pos_out, u64 len) { - int ret; - - ret = ocfs2_reflink_remap_range(file_in, pos_in, file_out, pos_out, - len, true); - return ret < 0 ? ret : len; + return ocfs2_reflink_remap_range(file_in, pos_in, file_out, pos_out, + len, CLONERANGE_DEDUPE | CLONERANGE_SHORT); } const struct inode_operations ocfs2_file_iops = { diff --git a/fs/ocfs2/refcounttree.c b/fs/ocfs2/refcounttree.c index 02283c38b5a4..60be5c76db6d 100644 --- a/fs/ocfs2/refcounttree.c +++ b/fs/ocfs2/refcounttree.c @@ -4499,7 +4499,7 @@ static int ocfs2_reflink_update_dest(struct inode *dest, } /* Remap the range pos_in:len in s_inode to pos_out:len in t_inode. */ -static int ocfs2_reflink_remap_extent(struct inode *s_inode, +static s64 ocfs2_reflink_remap_extent(struct inode *s_inode, struct buffer_head *s_bh, loff_t pos_in, struct inode *t_inode, @@ -4514,6 +4514,7 @@ static int ocfs2_reflink_remap_extent(struct inode *s_inode, struct buffer_head *ref_root_bh = NULL; struct ocfs2_refcount_tree *ref_tree; struct ocfs2_super *osb; + s64 remapped = 0; loff_t pstart, plen; u32 p_cluster, num_clusters, slast, spos, tpos; unsigned int ext_flags; @@ -4597,18 +4598,20 @@ static int ocfs2_reflink_remap_extent(struct inode *s_inode, next_loop: spos += num_clusters; tpos += num_clusters; + remapped += ocfs2_clusters_to_bytes(t_inode->i_sb, + num_clusters); } -out: - return ret; + return remapped; out_unlock_refcount: ocfs2_unlock_refcount_tree(osb, ref_tree, 1); brelse(ref_root_bh); - return ret; +out: + return remapped > 0 ? remapped : ret; } /* Set up refcount tree and remap s_inode to t_inode. */ -static int ocfs2_reflink_remap_blocks(struct inode *s_inode, +static s64 ocfs2_reflink_remap_blocks(struct inode *s_inode, struct buffer_head *s_bh, loff_t pos_in, struct inode *t_inode, @@ -4620,7 +4623,7 @@ static int ocfs2_reflink_remap_blocks(struct inode *s_inode, struct ocfs2_super *osb; struct ocfs2_dinode *dis; struct ocfs2_dinode *dit; - int ret; + s64 ret; osb = OCFS2_SB(s_inode->i_sb); dis = (struct ocfs2_dinode *)s_bh->b_data; @@ -4692,7 +4695,7 @@ static int ocfs2_reflink_remap_blocks(struct inode *s_inode, /* Actually remap extents now. */ ret = ocfs2_reflink_remap_extent(s_inode, s_bh, pos_in, t_inode, t_bh, pos_out, len, &dealloc); - if (ret) { + if (ret < 0) { mlog_errno(ret); goto out; } @@ -4812,18 +4815,19 @@ static void ocfs2_reflink_inodes_unlock(struct inode *s_inode, } /* Link a range of blocks from one file to another. */ -int ocfs2_reflink_remap_range(struct file *file_in, +s64 ocfs2_reflink_remap_range(struct file *file_in, loff_t pos_in, struct file *file_out, loff_t pos_out, u64 len, - bool is_dedupe) + unsigned int flags) { struct inode *inode_in = file_inode(file_in); struct inode *inode_out = file_inode(file_out); struct ocfs2_super *osb = OCFS2_SB(inode_in->i_sb); struct buffer_head *in_bh = NULL, *out_bh = NULL; bool same_inode = (inode_in == inode_out); + s64 remapped = 0; ssize_t ret; if (!ocfs2_refcount_tree(osb)) @@ -4843,7 +4847,7 @@ int ocfs2_reflink_remap_range(struct file *file_in, goto out_unlock; ret = vfs_clone_file_prep(file_in, pos_in, file_out, pos_out, - &len, is_dedupe ? CLONERANGE_DEDUPE : 0); + &len, flags); if (ret <= 0) goto out_unlock; @@ -4857,12 +4861,13 @@ int ocfs2_reflink_remap_range(struct file *file_in, truncate_inode_pages_range(&inode_out->i_data, pos_out, PAGE_ALIGN(pos_out + len) - 1); - ret = ocfs2_reflink_remap_blocks(inode_in, in_bh, pos_in, inode_out, - out_bh, pos_out, len); + remapped = ocfs2_reflink_remap_blocks(inode_in, in_bh, pos_in, + inode_out, out_bh, pos_out, len); up_write(&OCFS2_I(inode_in)->ip_alloc_sem); if (!same_inode) up_write(&OCFS2_I(inode_out)->ip_alloc_sem); - if (ret) { + if (remapped < 0) { + ret = remapped; mlog_errno(ret); goto out_unlock; } @@ -4881,9 +4886,9 @@ int ocfs2_reflink_remap_range(struct file *file_in, } ocfs2_reflink_inodes_unlock(inode_in, in_bh, inode_out, out_bh); - return 0; + return remapped; out_unlock: ocfs2_reflink_inodes_unlock(inode_in, in_bh, inode_out, out_bh); - return ret; + return remapped > 0 ? remapped : ret; } diff --git a/fs/ocfs2/refcounttree.h b/fs/ocfs2/refcounttree.h index 4af55bf4b35b..410a3145a06a 100644 --- a/fs/ocfs2/refcounttree.h +++ b/fs/ocfs2/refcounttree.h @@ -115,11 +115,11 @@ int ocfs2_reflink_ioctl(struct inode *inode, const char __user *oldname, const char __user *newname, bool preserve); -int ocfs2_reflink_remap_range(struct file *file_in, +s64 ocfs2_reflink_remap_range(struct file *file_in, loff_t pos_in, struct file *file_out, loff_t pos_out, u64 len, - bool is_dedupe); + unsigned int flags); #endif /* OCFS2_REFCOUNTTREE_H */
Darrick J. Wong
2018-Oct-05 00:46 UTC
[Ocfs2-devel] [PATCH 15/15] xfs: support returning partial reflink results
From: Darrick J. Wong <darrick.wong at oracle.com> Back when the XFS reflink code only supported clone_file_range, we were only able to return zero or negative error codes to userspace. However, now that copy_file_range (which returns bytes copied) can use XFS' clone_file_range, we have the opportunity to return partial results. For example, if userspace sends a 1GB clone request and we run out of space halfway through, we at least can tell userspace that we completed 512M of that request like a regular write. Signed-off-by: Darrick J. Wong <darrick.wong at oracle.com> --- fs/xfs/xfs_file.c | 14 ++++---------- fs/xfs/xfs_reflink.c | 46 ++++++++++++++++++++++++++++++---------------- fs/xfs/xfs_reflink.h | 5 +++-- 3 files changed, 37 insertions(+), 28 deletions(-) diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c index d5d6681ca714..4f7aff6d3360 100644 --- a/fs/xfs/xfs_file.c +++ b/fs/xfs/xfs_file.c @@ -928,11 +928,8 @@ xfs_file_clone_range( u64 len, unsigned int flags) { - int ret; - - ret = xfs_reflink_remap_range(file_in, pos_in, file_out, pos_out, - len, false); - return ret < 0 ? ret : len; + return xfs_reflink_remap_range(file_in, pos_in, file_out, pos_out, + len, flags); } STATIC s64 @@ -943,11 +940,8 @@ xfs_file_dedupe_range( loff_t pos_out, u64 len) { - int ret; - - ret = xfs_reflink_remap_range(file_in, pos_in, file_out, pos_out, - len, true); - return ret < 0 ? ret : len; + return xfs_reflink_remap_range(file_in, pos_in, file_out, pos_out, + len, CLONERANGE_DEDUPE | CLONERANGE_SHORT); } STATIC int diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c index 40684dd011ee..784cebbd0157 100644 --- a/fs/xfs/xfs_reflink.c +++ b/fs/xfs/xfs_reflink.c @@ -1090,7 +1090,7 @@ xfs_reflink_remap_extent( /* * Iteratively remap one file's extents (and holes) to another's. */ -STATIC int +STATIC int64_t xfs_reflink_remap_blocks( struct xfs_inode *src, xfs_fileoff_t srcoff, @@ -1100,6 +1100,7 @@ xfs_reflink_remap_blocks( xfs_off_t new_isize) { struct xfs_bmbt_irec imap; + int64_t remapped = 0; int nimaps; int error = 0; xfs_filblks_t range_len; @@ -1142,13 +1143,14 @@ xfs_reflink_remap_blocks( srcoff += range_len; destoff += range_len; len -= range_len; + remapped += range_len; } - return 0; + return remapped; err: trace_xfs_reflink_remap_blocks_error(dest, error, _RET_IP_); - return error; + return remapped > 0 ? remapped : error; } /* @@ -1247,14 +1249,15 @@ xfs_reflink_remap_prep( loff_t pos_in, struct file *file_out, loff_t pos_out, - u64 len, - bool is_dedupe) + u64 *len, + unsigned int flags) { struct inode *inode_in = file_inode(file_in); struct xfs_inode *src = XFS_I(inode_in); struct inode *inode_out = file_inode(file_out); struct xfs_inode *dest = XFS_I(inode_out); bool same_inode = (inode_in == inode_out); + bool is_dedupe = (flags & CLONERANGE_DEDUPE); ssize_t ret; /* Lock both files against IO */ @@ -1278,7 +1281,7 @@ xfs_reflink_remap_prep( goto out_unlock; ret = vfs_clone_file_prep(file_in, pos_in, file_out, pos_out, - &len, is_dedupe ? CLONERANGE_DEDUPE : 0); + len, flags); if (ret <= 0) goto out_unlock; @@ -1302,7 +1305,7 @@ xfs_reflink_remap_prep( /* Zap any page cache for the destination file's range. */ truncate_inode_pages_range(&inode_out->i_data, pos_out, - PAGE_ALIGN(pos_out + len) - 1); + PAGE_ALIGN(pos_out + *len) - 1); /* If we're altering the file contents... */ if (!is_dedupe) { @@ -1336,14 +1339,14 @@ xfs_reflink_remap_prep( /* * Link a range of blocks from one file to another. */ -int +s64 xfs_reflink_remap_range( struct file *file_in, loff_t pos_in, struct file *file_out, loff_t pos_out, u64 len, - bool is_dedupe) + unsigned int flags) { struct inode *inode_in = file_inode(file_in); struct xfs_inode *src = XFS_I(inode_in); @@ -1352,8 +1355,10 @@ xfs_reflink_remap_range( struct xfs_mount *mp = src->i_mount; xfs_fileoff_t sfsbno, dfsbno; xfs_filblks_t fsblen; + s64 remapped; xfs_extlen_t cowextsize; - ssize_t ret; + int ret; + bool is_dedupe = (flags & CLONERANGE_DEDUPE); if (!xfs_sb_version_hasreflink(&mp->m_sb)) return -EOPNOTSUPP; @@ -1363,19 +1368,25 @@ xfs_reflink_remap_range( /* Prepare and then clone file data. */ ret = xfs_reflink_remap_prep(file_in, pos_in, file_out, pos_out, - len, is_dedupe); + &len, flags); if (ret) return ret; trace_xfs_reflink_remap_range(src, pos_in, len, dest, pos_out); + if (len == 0) + goto done; + dfsbno = XFS_B_TO_FSBT(mp, pos_out); sfsbno = XFS_B_TO_FSBT(mp, pos_in); fsblen = XFS_B_TO_FSB(mp, len); - ret = xfs_reflink_remap_blocks(src, sfsbno, dest, dfsbno, fsblen, + remapped = xfs_reflink_remap_blocks(src, sfsbno, dest, dfsbno, fsblen, pos_out + len); - if (ret) + if (remapped < 0) { + ret = remapped; goto out_unlock; + } + remapped = min_t(int64_t, len, XFS_FSB_TO_B(mp, remapped)); /* * Carry the cowextsize hint from src to dest if we're sharing the @@ -1391,11 +1402,14 @@ xfs_reflink_remap_range( ret = xfs_reflink_update_dest(dest, pos_out + len, cowextsize, is_dedupe); - + if (ret) + goto out_unlock; +done: + xfs_reflink_remap_unlock(file_in, file_out); + return remapped; out_unlock: xfs_reflink_remap_unlock(file_in, file_out); - if (ret) - trace_xfs_reflink_remap_range_error(dest, ret, _RET_IP_); + trace_xfs_reflink_remap_range_error(dest, ret, _RET_IP_); return ret; } diff --git a/fs/xfs/xfs_reflink.h b/fs/xfs/xfs_reflink.h index c585ad9552b2..b53470904373 100644 --- a/fs/xfs/xfs_reflink.h +++ b/fs/xfs/xfs_reflink.h @@ -27,8 +27,9 @@ extern int xfs_reflink_cancel_cow_range(struct xfs_inode *ip, xfs_off_t offset, extern int xfs_reflink_end_cow(struct xfs_inode *ip, xfs_off_t offset, xfs_off_t count); extern int xfs_reflink_recover_cow(struct xfs_mount *mp); -extern int xfs_reflink_remap_range(struct file *file_in, loff_t pos_in, - struct file *file_out, loff_t pos_out, u64 len, bool is_dedupe); +extern s64 xfs_reflink_remap_range(struct file *file_in, loff_t pos_in, + struct file *file_out, loff_t pos_out, u64 len, + unsigned int flags); extern int xfs_reflink_inode_has_shared_extents(struct xfs_trans *tp, struct xfs_inode *ip, bool *has_shared); extern int xfs_reflink_clear_inode_flag(struct xfs_inode *ip,
Dave Chinner
2018-Oct-05 01:17 UTC
[Ocfs2-devel] [PATCH 00/15] fs: fixes for serious clone/dedupe problems
On Thu, Oct 04, 2018 at 05:44:34PM -0700, Darrick J. Wong wrote:> Hi all, > > Dave, Eric, and I have been chasing a stale data exposure bug in the XFS > reflink implementation, and tracked it down to reflink forgetting to do > some of the file-extending activities that must happen for regular > writes. > > We then started auditing the clone, dedupe, and copyfile code and > realized that from a file contents perspective, clonerange isn't any > different from a regular file write. Unfortunately, we also noticed > that *unlike* a regular write, clonerange skips a ton of overflow > checks, such as validating the ranges against s_maxbytes, MAX_NON_LFS, > and RLIMIT_FSIZE. We also observed that cloning into a file did not > strip security privileges (suid, capabilities) like a regular write > would. I also noticed that xfs and ocfs2 need to dump the page cache > before remapping blocks, not after. > > In fixing the range checking problems I also realized that both dedupe > and copyfile tell userspace how much of the requested operation was > acted upon. Since the range validation can shorten a clone request (or > we can ENOSPC midway through), we might as well plumb the short > operation reporting back through the VFS indirection code to userspace. > > So, here's the whole giant pile of patches[1] that fix all the problems. > The patch "generic: test reflink side effects" recently sent to fstests > exercises the fixes in this series. Tests are in [2].Hmmm. I've got a couple of patches to fix dedupe/reflink partial EOF block data corruptions, too. I'll have to see how they fit into this new series - combined they add this code just after the call to vfs_clone_file_prep_inodes(): .... + u64 blkmask = i_blocksize(inode_in) - 1; .... + /* + * If the dedupe data matches, chop off the partial EOF block + * from the source file so we don't try to dedupe the partial + * EOF block. + */ + if (is_dedupe) { + len &= ~blkmask; + } else if (len & blkmask) { + /* + * The user is attempting to share a partial EOF block, + * if it's inside the destination EOF then reject it + */ + if (pos_out + len < i_size_read(inode_out)) { + ret = -EINVAL; + goto out_unlock; + } + } It might be better to put these in with the eof-zeroing patch then add all the other changes on top? Let me post them separately, as they may be candidates for 4.19-rc7 along with the eof zeroing. Cheers, Dave. -- Dave Chinner david at fromorbit.com