Mauricio Faria de Oliveira
2020-Oct-06 00:48 UTC
[Ocfs2-devel] [PATCH v5 0/4] ext4/jbd2: data=journal: write-protect pages on transaction commit
Hey Jan, Andreas, Ted et al. This series fixes the issue that buffers writeably mapped to userspace can be modified during transaction commit, in between checksumming and write-out to the journal, thus cause inconsistency in the journal that prevents recovery/mount after kernel crash or power loss. It's really ideas and patience from Jan guiding me how to write/fix it. Huge thanks! Although the synthetic test case in [v2] demonstrates the bug and that the fix works, there's stress-ng tests that still causes inconsistency. Unfortunately I couldn't look more into it yet; per discussion in [v4] cover letter I'll send it as-is now, since it already fixes something, and will continue to analyze. It might be something else/another fix. There's also some fstests that _apparently_ become more flaky w/ the patchset applied, but don't seem to be mmap() related.. so I'll look at them more carefully too. The set of consistent failures (ie, that happen most of the time) is the same between original/patched builds, on both data=ordered and data=journal (different for each mode, ofc.) data=ordered: Failures: ext4/045 generic/044 generic/045 generic/046 generic/051 generic/223 generic/388 generic/465 generic/475 generic/553 generic/554 generic/555 generic/565 generic/611 data=journal: Failures: ext4/045 generic/051 generic/223 generic/347 generic/388 generic/441 generic/475 generic/553 generic/554 generic/555 generic/565 generic/611 There's a small change to OCFS2 in patch 2, which has been tested w/ stress-ng's filesystem and io stressor classes; no regressions found. # mkfs.ocfs2 --mount local $DEV # mount $DEV $MNT # cd $MNT # stress-ng --sequential 0 --class filesystem,io The only changes from v4 are style-change suggestions from Andreas applied to patches 02/04 and where we set OCFS2 journal callbacks; plus Reviewed-By: tags. Tested on v5.9-rc7'ish and next-20200930; build tested on -rc8'ish and next-20201002 today. cheers, Mauricio [v4] https://urldefense.com/v3/__https://lore.kernel.org/linux-ext4/20200928194103.244692-1-mfo at canonical.com/__;!!GqivPVa7Brio!KLi18zziDAgi-C8ol7nWzoPGTJ3JSrOijN1MmSjdZ3Zw9wnZNnCNYgoJq7LjaeoH9QUyCg$ [v3] https://urldefense.com/v3/__https://lore.kernel.org/linux-ext4/20200910193127.276214-1-mfo at canonical.com/__;!!GqivPVa7Brio!KLi18zziDAgi-C8ol7nWzoPGTJ3JSrOijN1MmSjdZ3Zw9wnZNnCNYgoJq7LjaeodPYbH2w$ [v2] https://urldefense.com/v3/__https://lore.kernel.org/linux-ext4/20200810010210.3305322-1-mfo at canonical.com/__;!!GqivPVa7Brio!KLi18zziDAgi-C8ol7nWzoPGTJ3JSrOijN1MmSjdZ3Zw9wnZNnCNYgoJq7LjaeocoAghAA$ [v1] https://urldefense.com/v3/__https://lore.kernel.org/linux-ext4/20200423233705.5878-1-mfo at canonical.com/__;!!GqivPVa7Brio!KLi18zziDAgi-C8ol7nWzoPGTJ3JSrOijN1MmSjdZ3Zw9wnZNnCNYgoJq7LjaepKEikFzg$ Mauricio Faria de Oliveira (4): jbd2: introduce/export functions jbd2_journal_submit|finish_inode_data_buffers() jbd2, ext4, ocfs2: introduce/use journal callbacks j_submit|finish_inode_data_buffers() ext4: data=journal: fixes for ext4_page_mkwrite() ext4: data=journal: write-protect pages on j_submit_inode_data_buffers() fs/ext4/inode.c | 62 ++++++++++++++++++++++++++----- fs/ext4/super.c | 87 ++++++++++++++++++++++++++++++++++++++++++++ fs/jbd2/commit.c | 62 ++++++++++++++++--------------- fs/jbd2/journal.c | 2 + fs/ocfs2/journal.c | 4 ++ include/linux/jbd2.h | 29 ++++++++++++++- 6 files changed, 206 insertions(+), 40 deletions(-) -- 2.17.1
Mauricio Faria de Oliveira
2020-Oct-06 00:48 UTC
[Ocfs2-devel] [PATCH v5 1/4] jbd2: introduce/export functions jbd2_journal_submit|finish_inode_data_buffers()
Export functions that implement the current behavior done for an inode in journal_submit|finish_inode_data_buffers(). No functional change. Signed-off-by: Mauricio Faria de Oliveira <mfo at canonical.com> Suggested-by: Jan Kara <jack at suse.cz> Reviewed-by: Jan Kara <jack at suse.cz> Reviewed-by: Andreas Dilger <adilger at dilger.ca> --- fs/jbd2/commit.c | 36 ++++++++++++++++-------------------- fs/jbd2/journal.c | 2 ++ include/linux/jbd2.h | 4 ++++ 3 files changed, 22 insertions(+), 20 deletions(-) diff --git a/fs/jbd2/commit.c b/fs/jbd2/commit.c index 6d2da8ad0e6f..f79b86b4241f 100644 --- a/fs/jbd2/commit.c +++ b/fs/jbd2/commit.c @@ -187,19 +187,17 @@ static int journal_wait_on_commit_record(journal_t *journal, * use writepages() because with delayed allocation we may be doing * block allocation in writepages(). */ -static int journal_submit_inode_data_buffers(struct address_space *mapping, - loff_t dirty_start, loff_t dirty_end) +int jbd2_journal_submit_inode_data_buffers(struct jbd2_inode *jinode) { - int ret; + struct address_space *mapping = jinode->i_vfs_inode->i_mapping; struct writeback_control wbc = { .sync_mode = WB_SYNC_ALL, .nr_to_write = mapping->nrpages * 2, - .range_start = dirty_start, - .range_end = dirty_end, + .range_start = jinode->i_dirty_start, + .range_end = jinode->i_dirty_end, }; - ret = generic_writepages(mapping, &wbc); - return ret; + return generic_writepages(mapping, &wbc); } /* @@ -215,16 +213,11 @@ static int journal_submit_data_buffers(journal_t *journal, { struct jbd2_inode *jinode; int err, ret = 0; - struct address_space *mapping; spin_lock(&journal->j_list_lock); list_for_each_entry(jinode, &commit_transaction->t_inode_list, i_list) { - loff_t dirty_start = jinode->i_dirty_start; - loff_t dirty_end = jinode->i_dirty_end; - if (!(jinode->i_flags & JI_WRITE_DATA)) continue; - mapping = jinode->i_vfs_inode->i_mapping; jinode->i_flags |= JI_COMMIT_RUNNING; spin_unlock(&journal->j_list_lock); /* @@ -234,8 +227,7 @@ static int journal_submit_data_buffers(journal_t *journal, * only allocated blocks here. */ trace_jbd2_submit_inode_data(jinode->i_vfs_inode); - err = journal_submit_inode_data_buffers(mapping, dirty_start, - dirty_end); + err = jbd2_journal_submit_inode_data_buffers(jinode); if (!ret) ret = err; spin_lock(&journal->j_list_lock); @@ -248,6 +240,15 @@ static int journal_submit_data_buffers(journal_t *journal, return ret; } +int jbd2_journal_finish_inode_data_buffers(struct jbd2_inode *jinode) +{ + struct address_space *mapping = jinode->i_vfs_inode->i_mapping; + + return filemap_fdatawait_range_keep_errors(mapping, + jinode->i_dirty_start, + jinode->i_dirty_end); +} + /* * Wait for data submitted for writeout, refile inodes to proper * transaction if needed. @@ -262,16 +263,11 @@ static int journal_finish_inode_data_buffers(journal_t *journal, /* For locking, see the comment in journal_submit_data_buffers() */ spin_lock(&journal->j_list_lock); list_for_each_entry(jinode, &commit_transaction->t_inode_list, i_list) { - loff_t dirty_start = jinode->i_dirty_start; - loff_t dirty_end = jinode->i_dirty_end; - if (!(jinode->i_flags & JI_WAIT_DATA)) continue; jinode->i_flags |= JI_COMMIT_RUNNING; spin_unlock(&journal->j_list_lock); - err = filemap_fdatawait_range_keep_errors( - jinode->i_vfs_inode->i_mapping, dirty_start, - dirty_end); + err = jbd2_journal_finish_inode_data_buffers(jinode); if (!ret) ret = err; spin_lock(&journal->j_list_lock); diff --git a/fs/jbd2/journal.c b/fs/jbd2/journal.c index 17fdc482f554..c0600405e7a2 100644 --- a/fs/jbd2/journal.c +++ b/fs/jbd2/journal.c @@ -91,6 +91,8 @@ EXPORT_SYMBOL(jbd2_journal_try_to_free_buffers); EXPORT_SYMBOL(jbd2_journal_force_commit); EXPORT_SYMBOL(jbd2_journal_inode_ranged_write); EXPORT_SYMBOL(jbd2_journal_inode_ranged_wait); +EXPORT_SYMBOL(jbd2_journal_submit_inode_data_buffers); +EXPORT_SYMBOL(jbd2_journal_finish_inode_data_buffers); EXPORT_SYMBOL(jbd2_journal_init_jbd_inode); EXPORT_SYMBOL(jbd2_journal_release_jbd_inode); EXPORT_SYMBOL(jbd2_journal_begin_ordered_truncate); diff --git a/include/linux/jbd2.h b/include/linux/jbd2.h index 08f904943ab2..2865a5475888 100644 --- a/include/linux/jbd2.h +++ b/include/linux/jbd2.h @@ -1421,6 +1421,10 @@ extern int jbd2_journal_inode_ranged_write(handle_t *handle, extern int jbd2_journal_inode_ranged_wait(handle_t *handle, struct jbd2_inode *inode, loff_t start_byte, loff_t length); +extern int jbd2_journal_submit_inode_data_buffers( + struct jbd2_inode *jinode); +extern int jbd2_journal_finish_inode_data_buffers( + struct jbd2_inode *jinode); extern int jbd2_journal_begin_ordered_truncate(journal_t *journal, struct jbd2_inode *inode, loff_t new_size); extern void jbd2_journal_init_jbd_inode(struct jbd2_inode *jinode, struct inode *inode); -- 2.17.1
Mauricio Faria de Oliveira
2020-Oct-06 00:48 UTC
[Ocfs2-devel] [PATCH v5 2/4] jbd2, ext4, ocfs2: introduce/use journal callbacks j_submit|finish_inode_data_buffers()
Introduce journal callbacks to allow different behaviors for an inode in journal_submit|finish_inode_data_buffers(). The existing users of the current behavior (ext4, ocfs2) are adapted to use the previously exported functions that implement the current behavior. Users are callers of jbd2_journal_inode_ranged_write|wait(), which adds the inode to the transaction's inode list with the JI_WRITE|WAIT_DATA flags. Only ext4 and ocfs2 in-tree. Both CONFIG_EXT4_FS and CONFIG_OCSFS2_FS select CONFIG_JBD2, which builds fs/jbd2/commit.c and journal.c that define and export the functions, so we can call directly in ext4/ocfs2. Signed-off-by: Mauricio Faria de Oliveira <mfo at canonical.com> Suggested-by: Jan Kara <jack at suse.cz> Reviewed-by: Jan Kara <jack at suse.cz> Reviewed-by: Andreas Dilger <adilger at dilger.ca> --- fs/ext4/super.c | 4 ++++ fs/jbd2/commit.c | 30 ++++++++++++++++++------------ fs/ocfs2/journal.c | 4 ++++ include/linux/jbd2.h | 25 ++++++++++++++++++++++++- 4 files changed, 50 insertions(+), 13 deletions(-) diff --git a/fs/ext4/super.c b/fs/ext4/super.c index ea425b49b345..a14c1ed39aa3 100644 --- a/fs/ext4/super.c +++ b/fs/ext4/super.c @@ -4646,6 +4646,10 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent) set_task_ioprio(sbi->s_journal->j_task, journal_ioprio); sbi->s_journal->j_commit_callback = ext4_journal_commit_callback; + sbi->s_journal->j_submit_inode_data_buffers + jbd2_journal_submit_inode_data_buffers; + sbi->s_journal->j_finish_inode_data_buffers + jbd2_journal_finish_inode_data_buffers; no_journal: if (!test_opt(sb, NO_MBCACHE)) { diff --git a/fs/jbd2/commit.c b/fs/jbd2/commit.c index f79b86b4241f..6252b4c50666 100644 --- a/fs/jbd2/commit.c +++ b/fs/jbd2/commit.c @@ -197,6 +197,12 @@ int jbd2_journal_submit_inode_data_buffers(struct jbd2_inode *jinode) .range_end = jinode->i_dirty_end, }; + /* + * submit the inode data buffers. We use writepage + * instead of writepages. Because writepages can do + * block allocation with delalloc. We need to write + * only allocated blocks here. + */ return generic_writepages(mapping, &wbc); } @@ -220,16 +226,13 @@ static int journal_submit_data_buffers(journal_t *journal, continue; jinode->i_flags |= JI_COMMIT_RUNNING; spin_unlock(&journal->j_list_lock); - /* - * submit the inode data buffers. We use writepage - * instead of writepages. Because writepages can do - * block allocation with delalloc. We need to write - * only allocated blocks here. - */ + /* submit the inode data buffers. */ trace_jbd2_submit_inode_data(jinode->i_vfs_inode); - err = jbd2_journal_submit_inode_data_buffers(jinode); - if (!ret) - ret = err; + if (journal->j_submit_inode_data_buffers) { + err = journal->j_submit_inode_data_buffers(jinode); + if (!ret) + ret = err; + } spin_lock(&journal->j_list_lock); J_ASSERT(jinode->i_transaction == commit_transaction); jinode->i_flags &= ~JI_COMMIT_RUNNING; @@ -267,9 +270,12 @@ static int journal_finish_inode_data_buffers(journal_t *journal, continue; jinode->i_flags |= JI_COMMIT_RUNNING; spin_unlock(&journal->j_list_lock); - err = jbd2_journal_finish_inode_data_buffers(jinode); - if (!ret) - ret = err; + /* wait for the inode data buffers writeout. */ + if (journal->j_finish_inode_data_buffers) { + err = journal->j_finish_inode_data_buffers(jinode); + if (!ret) + ret = err; + } spin_lock(&journal->j_list_lock); jinode->i_flags &= ~JI_COMMIT_RUNNING; smp_mb(); diff --git a/fs/ocfs2/journal.c b/fs/ocfs2/journal.c index b425f0b01dce..b9a9d69dde7e 100644 --- a/fs/ocfs2/journal.c +++ b/fs/ocfs2/journal.c @@ -883,6 +883,10 @@ int ocfs2_journal_init(struct ocfs2_journal *journal, int *dirty) OCFS2_JOURNAL_DIRTY_FL); journal->j_journal = j_journal; + journal->j_journal->j_submit_inode_data_buffers + jbd2_journal_submit_inode_data_buffers; + journal->j_journal->j_finish_inode_data_buffers + jbd2_journal_finish_inode_data_buffers; journal->j_inode = inode; journal->j_bh = bh; diff --git a/include/linux/jbd2.h b/include/linux/jbd2.h index 2865a5475888..4aaa408c0ca7 100644 --- a/include/linux/jbd2.h +++ b/include/linux/jbd2.h @@ -629,7 +629,9 @@ struct transaction_s struct journal_head *t_shadow_list; /* - * List of inodes whose data we've modified in data=ordered mode. + * List of inodes associated with the transaction; e.g., ext4 uses + * this to track inodes in data=ordered and data=journal mode that + * need special handling on transaction commit; also used by ocfs2. * [j_list_lock] */ struct list_head t_inode_list; @@ -1111,6 +1113,27 @@ struct journal_s void (*j_commit_callback)(journal_t *, transaction_t *); + /** + * @j_submit_inode_data_buffers: + * + * This function is called for all inodes associated with the + * committing transaction marked with JI_WRITE_DATA flag + * before we start to write out the transaction to the journal. + */ + int (*j_submit_inode_data_buffers) + (struct jbd2_inode *); + + /** + * @j_finish_inode_data_buffers: + * + * This function is called for all inodes associated with the + * committing transaction marked with JI_WAIT_DATA flag + * after we have written the transaction to the journal + * but before we write out the commit block. + */ + int (*j_finish_inode_data_buffers) + (struct jbd2_inode *); + /* * Journal statistics */ -- 2.17.1
Mauricio Faria de Oliveira
2020-Oct-06 00:48 UTC
[Ocfs2-devel] [PATCH v5 3/4] ext4: data=journal: fixes for ext4_page_mkwrite()
These are two fixes for data journalling required by the next patch, discovered while testing it. First, the optimization to return early if all buffers are mapped is not appropriate for the next patch: The inode _must_ be added to the transaction's list in data=journal mode (so to write-protect pages on commit) thus we cannot return early there. Second, once that optimization to reduce transactions was disabled for data=journal mode, more transactions happened, and occasionally hit this warning message: 'JBD2: Spotted dirty metadata buffer'. Reason is, block_page_mkwrite() will set_buffer_dirty() before do_journal_get_write_access() that is there to prevent it. This issue was masked by the optimization. So, on data=journal use __block_write_begin() instead. This also requires page locking and len recalculation. (see block_page_mkwrite() for implementation details.) Finally, as Jan noted there is little sharing between data=journal and other modes in ext4_page_mkwrite(). However, a prototype of ext4_journalled_page_mkwrite() showed there still would be lots of duplicated lines (tens of) that didn't seem worth it. Thus this patch ends up with an ugly goto to skip all non-data journalling code (to avoid long indentations, but that can be changed..) in the beginning, and just a conditional in the transaction section. Well, we skip a common part to data journalling which is the page truncated check, but we do it again after ext4_journal_start() when we re-acquire the page lock (so not to acquire the page lock twice needlessly for data journalling.) Signed-off-by: Mauricio Faria de Oliveira <mfo at canonical.com> Suggested-by: Jan Kara <jack at suse.cz> Reviewed-by: Jan Kara <jack at suse.cz> Reviewed-by: Andreas Dilger <adilger at dilger.ca> --- fs/ext4/inode.c | 51 ++++++++++++++++++++++++++++++++++++++++++------- 1 file changed, 44 insertions(+), 7 deletions(-) diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c index bf596467c234..ac153e340a6f 100644 --- a/fs/ext4/inode.c +++ b/fs/ext4/inode.c @@ -5977,9 +5977,17 @@ vm_fault_t ext4_page_mkwrite(struct vm_fault *vmf) if (err) goto out_ret; + /* + * On data journalling we skip straight to the transaction handle: + * there's no delalloc; page truncated will be checked later; the + * early return w/ all buffers mapped (calculates size/len) can't + * be used; and there's no dioread_nolock, so only ext4_get_block. + */ + if (ext4_should_journal_data(inode)) + goto retry_alloc; + /* Delalloc case is easy... */ if (test_opt(inode->i_sb, DELALLOC) && - !ext4_should_journal_data(inode) && !ext4_nonda_switch(inode->i_sb)) { do { err = block_page_mkwrite(vma, vmf, @@ -6005,6 +6013,9 @@ vm_fault_t ext4_page_mkwrite(struct vm_fault *vmf) /* * Return if we have all the buffers mapped. This avoids the need to do * journal_start/journal_stop which can block and take a long time + * + * This cannot be done for data journalling, as we have to add the + * inode to the transaction's list to writeprotect pages on commit. */ if (page_has_buffers(page)) { if (!ext4_walk_page_buffers(NULL, page_buffers(page), @@ -6029,16 +6040,42 @@ vm_fault_t ext4_page_mkwrite(struct vm_fault *vmf) ret = VM_FAULT_SIGBUS; goto out; } - err = block_page_mkwrite(vma, vmf, get_block); - if (!err && ext4_should_journal_data(inode)) { - if (ext4_walk_page_buffers(handle, page_buffers(page), 0, - PAGE_SIZE, NULL, do_journal_get_write_access)) { + /* + * Data journalling can't use block_page_mkwrite() because it + * will set_buffer_dirty() before do_journal_get_write_access() + * thus might hit warning messages for dirty metadata buffers. + */ + if (!ext4_should_journal_data(inode)) { + err = block_page_mkwrite(vma, vmf, get_block); + } else { + lock_page(page); + size = i_size_read(inode); + /* Page got truncated from under us? */ + if (page->mapping != mapping || page_offset(page) > size) { unlock_page(page); - ret = VM_FAULT_SIGBUS; + ret = VM_FAULT_NOPAGE; ext4_journal_stop(handle); goto out; } - ext4_set_inode_state(inode, EXT4_STATE_JDATA); + + if (page->index == size >> PAGE_SHIFT) + len = size & ~PAGE_MASK; + else + len = PAGE_SIZE; + + err = __block_write_begin(page, 0, len, ext4_get_block); + if (!err) { + if (ext4_walk_page_buffers(handle, page_buffers(page), + 0, len, NULL, do_journal_get_write_access)) { + unlock_page(page); + ret = VM_FAULT_SIGBUS; + ext4_journal_stop(handle); + goto out; + } + ext4_set_inode_state(inode, EXT4_STATE_JDATA); + } else { + unlock_page(page); + } } ext4_journal_stop(handle); if (err == -ENOSPC && ext4_should_retry_alloc(inode->i_sb, &retries)) -- 2.17.1
Mauricio Faria de Oliveira
2020-Oct-06 00:48 UTC
[Ocfs2-devel] [PATCH v5 4/4] ext4: data=journal: write-protect pages on j_submit_inode_data_buffers()
This implements journal callbacks j_submit|finish_inode_data_buffers() with different behavior for data=journal: to write-protect pages under commit, preventing changes to buffers writeably mapped to userspace. If a buffer's content changes between commit's checksum calculation and write-out to disk, it can cause journal recovery/mount failures upon a kernel crash or power loss. [ 27.334874] EXT4-fs: Warning: mounting with data=journal disables delayed allocation, dioread_nolock, and O_DIRECT support! [ 27.339492] JBD2: Invalid checksum recovering data block 8705 in log [ 27.342716] JBD2: recovery failed [ 27.343316] EXT4-fs (loop0): error loading journal mount: /ext4: can't read superblock on /dev/loop0. In j_submit_inode_data_buffers() we write-protect the inode's pages with write_cache_pages() and redirty w/ writepage callback if needed. In j_finish_inode_data_buffers() there is nothing do to. And in order to use the callbacks, inodes are added to the inode list in transaction in __ext4_journalled_writepage() and ext4_page_mkwrite(). In ext4_page_mkwrite() we must make sure that the buffers are attached to the transaction as jbddirty with write_end_fn(), as already done in __ext4_journalled_writepage(). Signed-off-by: Mauricio Faria de Oliveira <mfo at canonical.com> Reported-by: Dann Frazier <dann.frazier at canonical.com> Reported-by: kernel test robot <lkp at intel.com> # wbc.nr_to_write Suggested-by: Jan Kara <jack at suse.cz> Reviewed-by: Jan Kara <jack at suse.cz> --- fs/ext4/inode.c | 25 +++++++++----- fs/ext4/super.c | 87 +++++++++++++++++++++++++++++++++++++++++++++++-- 2 files changed, 101 insertions(+), 11 deletions(-) diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c index ac153e340a6f..af5de62c1214 100644 --- a/fs/ext4/inode.c +++ b/fs/ext4/inode.c @@ -1910,6 +1910,9 @@ static int __ext4_journalled_writepage(struct page *page, err = ext4_walk_page_buffers(handle, page_bufs, 0, len, NULL, write_end_fn); } + if (ret == 0) + ret = err; + err = ext4_jbd2_inode_add_write(handle, inode, 0, len); if (ret == 0) ret = err; EXT4_I(inode)->i_datasync_tid = handle->h_transaction->t_tid; @@ -6052,10 +6055,8 @@ vm_fault_t ext4_page_mkwrite(struct vm_fault *vmf) size = i_size_read(inode); /* Page got truncated from under us? */ if (page->mapping != mapping || page_offset(page) > size) { - unlock_page(page); ret = VM_FAULT_NOPAGE; - ext4_journal_stop(handle); - goto out; + goto out_error; } if (page->index == size >> PAGE_SHIFT) @@ -6065,13 +6066,15 @@ vm_fault_t ext4_page_mkwrite(struct vm_fault *vmf) err = __block_write_begin(page, 0, len, ext4_get_block); if (!err) { + ret = VM_FAULT_SIGBUS; if (ext4_walk_page_buffers(handle, page_buffers(page), - 0, len, NULL, do_journal_get_write_access)) { - unlock_page(page); - ret = VM_FAULT_SIGBUS; - ext4_journal_stop(handle); - goto out; - } + 0, len, NULL, do_journal_get_write_access)) + goto out_error; + if (ext4_walk_page_buffers(handle, page_buffers(page), + 0, len, NULL, write_end_fn)) + goto out_error; + if (ext4_jbd2_inode_add_write(handle, inode, 0, len)) + goto out_error; ext4_set_inode_state(inode, EXT4_STATE_JDATA); } else { unlock_page(page); @@ -6086,6 +6089,10 @@ vm_fault_t ext4_page_mkwrite(struct vm_fault *vmf) up_read(&EXT4_I(inode)->i_mmap_sem); sb_end_pagefault(inode->i_sb); return ret; +out_error: + unlock_page(page); + ext4_journal_stop(handle); + goto out; } vm_fault_t ext4_filemap_fault(struct vm_fault *vmf) diff --git a/fs/ext4/super.c b/fs/ext4/super.c index a14c1ed39aa3..a2fc62a6d3b7 100644 --- a/fs/ext4/super.c +++ b/fs/ext4/super.c @@ -472,6 +472,89 @@ static void ext4_journal_commit_callback(journal_t *journal, transaction_t *txn) spin_unlock(&sbi->s_md_lock); } +/* + * This writepage callback for write_cache_pages() + * takes care of a few cases after page cleaning. + * + * write_cache_pages() already checks for dirty pages + * and calls clear_page_dirty_for_io(), which we want, + * to write protect the pages. + * + * However, we may have to redirty a page (see below.) + */ +static int ext4_journalled_writepage_callback(struct page *page, + struct writeback_control *wbc, + void *data) +{ + transaction_t *transaction = (transaction_t *) data; + struct buffer_head *bh, *head; + struct journal_head *jh; + + bh = head = page_buffers(page); + do { + /* + * We have to redirty a page in these cases: + * 1) If buffer is dirty, it means the page was dirty because it + * contains a buffer that needs checkpointing. So the dirty bit + * needs to be preserved so that checkpointing writes the buffer + * properly. + * 2) If buffer is not part of the committing transaction + * (we may have just accidentally come across this buffer because + * inode range tracking is not exact) or if the currently running + * transaction already contains this buffer as well, dirty bit + * needs to be preserved so that the buffer gets writeprotected + * properly on running transaction's commit. + */ + jh = bh2jh(bh); + if (buffer_dirty(bh) || + (jh && (jh->b_transaction != transaction || + jh->b_next_transaction))) { + redirty_page_for_writepage(wbc, page); + goto out; + } + } while ((bh = bh->b_this_page) != head); + +out: + return AOP_WRITEPAGE_ACTIVATE; +} + +static int ext4_journalled_submit_inode_data_buffers(struct jbd2_inode *jinode) +{ + struct address_space *mapping = jinode->i_vfs_inode->i_mapping; + struct writeback_control wbc = { + .sync_mode = WB_SYNC_ALL, + .nr_to_write = LONG_MAX, + .range_start = jinode->i_dirty_start, + .range_end = jinode->i_dirty_end, + }; + + return write_cache_pages(mapping, &wbc, + ext4_journalled_writepage_callback, + jinode->i_transaction); +} + +static int ext4_journal_submit_inode_data_buffers(struct jbd2_inode *jinode) +{ + int ret; + + if (ext4_should_journal_data(jinode->i_vfs_inode)) + ret = ext4_journalled_submit_inode_data_buffers(jinode); + else + ret = jbd2_journal_submit_inode_data_buffers(jinode); + + return ret; +} + +static int ext4_journal_finish_inode_data_buffers(struct jbd2_inode *jinode) +{ + int ret = 0; + + if (!ext4_should_journal_data(jinode->i_vfs_inode)) + ret = jbd2_journal_finish_inode_data_buffers(jinode); + + return ret; +} + static bool system_going_down(void) { return system_state == SYSTEM_HALT || system_state == SYSTEM_POWER_OFF @@ -4647,9 +4730,9 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent) sbi->s_journal->j_commit_callback = ext4_journal_commit_callback; sbi->s_journal->j_submit_inode_data_buffers - jbd2_journal_submit_inode_data_buffers; + ext4_journal_submit_inode_data_buffers; sbi->s_journal->j_finish_inode_data_buffers - jbd2_journal_finish_inode_data_buffers; + ext4_journal_finish_inode_data_buffers; no_journal: if (!test_opt(sb, NO_MBCACHE)) { -- 2.17.1