Jan Kara
2014-Oct-10 14:23 UTC
[Ocfs2-devel] [PATCH 0/2 v2] Fix data corruption when blocksize < pagesize for mmapped data
Hello,

this is the second version of the patches fixing data corruption in mmapped data when blocksize < pagesize, as exercised by the xfstests generic/030 test. The patchset fixes XFS and ext4. I've checked btrfs and it doesn't need fixing because it doesn't support blocksize < pagesize; if that ever changes, btrfs will likely need similar treatment. ocfs2, ext2, and ext3 are OK since they happily allocate blocks during writeback. For other filesystems like gfs2, ubifs, nilfs, ceph, ... I'm not sure whether they support blocksize < pagesize at all. NFS is also interesting since it may care, but I don't understand its ->page_mkwrite() handler well enough to judge.

Changes since v1:
- changed the helper function name and moved it to mm/truncate.c
- I originally thought we could make the helper function update i_size to simplify the interface, but that is actually impossible due to generic_write_end() lock ordering constraints
- used round_up() instead of ALIGN()
- taught truncate_setsize() to use the helper function

Honza
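As an aside, the rounding arithmetic involved when a page straddles i_size can be shown with a minimal userspace sketch (the sizes and the local round_up definition are assumptions made for the example; this is not code from the patchset):

#include <stdio.h>

/* Power-of-two round up, mirroring the kernel's round_up() macro. */
#define round_up(x, y) ((((x) - 1) | ((y) - 1)) + 1)

int main(void)
{
	unsigned long long page_size = 4096, block_size = 1024;
	unsigned long long i_size = 10 * page_size + 1500;	/* file ends mid-block, mid-page */

	/* Blocks of the last page that start at or after this offset lie
	 * entirely beyond i_size, which is the region the series has to keep
	 * from exposing stale mmap-written data when blocksize < pagesize. */
	unsigned long long first_block_past_isize = round_up(i_size, block_size);
	unsigned long long page_end = round_up(i_size, page_size);

	printf("blocks beyond i_size in the last page: [%llu, %llu)\n",
	       first_block_past_isize, page_end);
	return 0;
}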
Jan Kara
2014-Oct-10 14:23 UTC
[Ocfs2-devel] [PATCH 1/2 RESEND] bdi: Fix hung task on sync
From: Derek Basehore <dbasehore at chromium.org> bdi_wakeup_thread_delayed() used the mod_delayed_work() function to schedule work to writeback dirty inodes. The problem with this is that it can delay work that is scheduled for immediate execution, such as the work from sync_inodes_sb(). This can happen since mod_delayed_work() can now steal work from a work_queue. This fixes the problem by using queue_delayed_work() instead. This is a regression caused by 839a8e8660b6 "writeback: replace custom worker pool implementation with unbound workqueue". The reason that this causes a problem is that laptop-mode will change the delay, dirty_writeback_centisecs, to 60000 (10 minutes) by default. In the case that bdi_wakeup_thread_delayed() races with sync_inodes_sb(), sync will be stopped for 10 minutes and trigger a hung task. Even if dirty_writeback_centisecs is not long enough to cause a hung task, we still don't want to delay sync for that long. We fix the problem by using queue_delayed_work() when we want to schedule writeback sometime in future. This function doesn't change the timer if it is already armed. For the same reason, we also change bdi_writeback_workfn() to immediately queue the work again in the case that the work_list is not empty. The same problem can happen if the sync work is run on the rescue worker. Fixes: 839a8e8660b6777e7fe4e80af1a048aebe2b5977 CC: stable at vger.kernel.org Reviewed-by: Tejun Heo <tj at kernel.org> Signed-off-by: Derek Basehore <dbasehore at chromium.org> Signed-off-by: Jan Kara <jack at suse.cz> --- fs/fs-writeback.c | 8 ++++---- mm/backing-dev.c | 5 ++++- 2 files changed, 8 insertions(+), 5 deletions(-) diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c index e0259a163f98..8277b76be983 100644 --- a/fs/fs-writeback.c +++ b/fs/fs-writeback.c @@ -1047,10 +1047,10 @@ void bdi_writeback_workfn(struct work_struct *work) trace_writeback_pages_written(pages_written); } - if (!list_empty(&bdi->work_list) || - (wb_has_dirty_io(wb) && dirty_writeback_interval)) - queue_delayed_work(bdi_wq, &wb->dwork, - msecs_to_jiffies(dirty_writeback_interval * 10)); + if (!list_empty(&bdi->work_list)) + mod_delayed_work(bdi_wq, &wb->dwork, 0); + else if (wb_has_dirty_io(wb) && dirty_writeback_interval) + bdi_wakeup_thread_delayed(bdi); current->flags &= ~PF_SWAPWRITE; } diff --git a/mm/backing-dev.c b/mm/backing-dev.c index ce682f7a4f29..fab8401fc54e 100644 --- a/mm/backing-dev.c +++ b/mm/backing-dev.c @@ -288,13 +288,16 @@ int bdi_has_dirty_io(struct backing_dev_info *bdi) * Note, we wouldn't bother setting up the timer, but this function is on the * fast-path (used by '__mark_inode_dirty()'), so we save few context switches * by delaying the wake-up. + * + * We have to be careful not to postpone flush work if it is scheduled for + * earlier. Thus we use queue_delayed_work(). */ void bdi_wakeup_thread_delayed(struct backing_dev_info *bdi) { unsigned long timeout; timeout = msecs_to_jiffies(dirty_writeback_interval * 10); - mod_delayed_work(bdi_wq, &bdi->wb.dwork, timeout); + queue_delayed_work(bdi_wq, &bdi->wb.dwork, timeout); } /* -- 1.8.1.4
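The fix relies on the differing semantics of mod_delayed_work() and queue_delayed_work(). A toy userspace model of those semantics (the struct and function bodies below are invented for illustration; they are not the workqueue implementation):

#include <stdio.h>

/* Stand-in for a delayed_work item: a pending flag plus an expiry time. */
struct dwork {
	int armed;
	unsigned long expires;
};

/* mod_delayed_work(): always (re)arms the timer to now + delay, even if the
 * work was already queued for earlier execution - this is what delayed sync. */
static void mod_delayed_work(struct dwork *w, unsigned long now, unsigned long delay)
{
	w->armed = 1;
	w->expires = now + delay;
}

/* queue_delayed_work(): arms the timer only if the work is not already
 * pending, so work queued for immediate execution is left alone. */
static void queue_delayed_work(struct dwork *w, unsigned long now, unsigned long delay)
{
	if (!w->armed) {
		w->armed = 1;
		w->expires = now + delay;
	}
}

int main(void)
{
	struct dwork w = { .armed = 1, .expires = 0 };	/* sync work queued for "now" */
	unsigned long laptop_mode_delay = 600000;	/* 10 minutes in ms */

	mod_delayed_work(&w, 0, laptop_mode_delay);
	printf("mod_delayed_work:   sync pushed out to t=%lu\n", w.expires);

	w.expires = 0;					/* sync work queued for "now" again */
	queue_delayed_work(&w, 0, laptop_mode_delay);
	printf("queue_delayed_work: sync still runs at t=%lu\n", w.expires);
	return 0;
}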
Jan Kara
2014-Oct-10 14:23 UTC
[Ocfs2-devel] [PATCH] block: free q->flush_rq in blk_init_allocated_queue error paths
From: Dave Jones <davej at redhat.com> Commit 7982e90c3a57 ("block: fix q->flush_rq NULL pointer crash on dm-mpath flush") moved an allocation to blk_init_allocated_queue(), but neglected to free that allocation on the error paths that follow. Signed-off-by: Dave Jones <davej at fedoraproject.org> Acked-by: Mike Snitzer <snitzer at redhat.com> Signed-off-by: Jens Axboe <axboe at fb.com> Signed-off-by: Linus Torvalds <torvalds at linux-foundation.org> --- block/blk-core.c | 8 ++++++-- 1 file changed, 6 insertions(+), 2 deletions(-) diff --git a/block/blk-core.c b/block/blk-core.c index 4cd5ffc18442..bfe16d5af9f9 100644 --- a/block/blk-core.c +++ b/block/blk-core.c @@ -713,7 +713,7 @@ blk_init_allocated_queue(struct request_queue *q, request_fn_proc *rfn, return NULL; if (blk_init_rl(&q->root_rl, q, GFP_KERNEL)) - return NULL; + goto fail; q->request_fn = rfn; q->prep_rq_fn = NULL; @@ -737,12 +737,16 @@ blk_init_allocated_queue(struct request_queue *q, request_fn_proc *rfn, /* init elevator */ if (elevator_init(q, NULL)) { mutex_unlock(&q->sysfs_lock); - return NULL; + goto fail; } mutex_unlock(&q->sysfs_lock); return q; + +fail: + kfree(q->flush_rq); + return NULL; } EXPORT_SYMBOL(blk_init_allocated_queue); -- 1.8.1.4
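The change follows the standard goto-based error cleanup idiom: once q->flush_rq exists, every later failure path must release it before returning NULL. A self-contained sketch of the idiom (the struct and function names below are invented for illustration and are not the block layer's):

#include <stdlib.h>

struct queue {
	char *flush_rq;
	char *elevator;
};

/* fail_step simulates which initialization step fails. */
static struct queue *init_queue(int fail_step)
{
	struct queue *q = calloc(1, sizeof(*q));
	if (!q)
		return NULL;

	q->flush_rq = malloc(64);		/* allocated early, like q->flush_rq */
	if (!q->flush_rq)
		goto fail_q;

	if (fail_step == 1)			/* e.g. blk_init_rl() failing */
		goto fail_flush;

	q->elevator = malloc(64);
	if (!q->elevator)			/* e.g. elevator_init() failing */
		goto fail_flush;

	return q;

fail_flush:
	free(q->flush_rq);			/* what the added "fail:" label provides */
fail_q:
	free(q);
	return NULL;
}

int main(void)
{
	struct queue *q = init_queue(1);	/* exercises the cleanup path */
	if (!q)
		return 0;
	free(q->elevator);
	free(q->flush_rq);
	free(q);
	return 0;
}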
From: Shaohua Li <shaohua.li at intel.com>

This patch reverts commit 35ae66e0a09ab70ed ("block: Make rq_affinity = 1 work as expected"). The purpose is to avoid an unnecessary IPI.

Let's take an example. My test box has CPUs 0-7 on one socket. Say a request is added from CPU 1 and blk_complete_request() occurs at CPU 7. Without the reverted patch, the softirq will be done at CPU 7. With it, an IPI will be directed to CPU 0, and the softirq will be done at CPU 0. In this case, doing the softirq at CPU 0 or at CPU 7 makes no difference from a cache sharing point of view, and we can avoid an IPI by doing it on CPU 7.

An immediate concern is that this is just like QUEUE_FLAG_SAME_FORCE, but it actually is not. blk_complete_request() runs in the interrupt handler, and currently the I/O controller doesn't support multiple interrupts (I checked several LSI cards and AHCI), so only one CPU can run blk_complete_request(). This is still quite different from QUEUE_FLAG_SAME_FORCE. Since only one CPU runs the softirq, the only difference with the patch below is that the softirq does not always run on the first CPU of a group.

Signed-off-by: Shaohua Li <shaohua.li at intel.com>
Signed-off-by: Jens Axboe <jaxboe at fusionio.com>
--- block/blk-softirq.c | 16 +++++++++++++--- 1 file changed, 13 insertions(+), 3 deletions(-) diff --git a/block/blk-softirq.c b/block/blk-softirq.c index 487addc85bb5..58340d0cb23a 100644 --- a/block/blk-softirq.c +++ b/block/blk-softirq.c @@ -103,7 +103,7 @@ static struct notifier_block __cpuinitdata blk_cpu_notifier = { void __blk_complete_request(struct request *req) { - int ccpu, cpu; + int ccpu, cpu, group_cpu = NR_CPUS; struct request_queue *q = req->q; unsigned long flags; @@ -117,12 +117,22 @@ void __blk_complete_request(struct request *req) */ if (test_bit(QUEUE_FLAG_SAME_COMP, &q->queue_flags) && req->cpu != -1) { ccpu = req->cpu; - if (!test_bit(QUEUE_FLAG_SAME_FORCE, &q->queue_flags)) + if (!test_bit(QUEUE_FLAG_SAME_FORCE, &q->queue_flags)) { ccpu = blk_cpu_to_group(ccpu); + group_cpu = blk_cpu_to_group(cpu); + } } else ccpu = cpu; - if (ccpu == cpu) { + /* + * If current CPU and requested CPU are in the same group, running + * softirq in current CPU. One might concern this is just like + * QUEUE_FLAG_SAME_FORCE, but actually not. blk_complete_request() is + * running in interrupt handler, and currently I/O controller doesn't + * support multiple interrupts, so current CPU is unique actually. This + * avoids IPI sending from current CPU to the first CPU of a group. + */ + if (ccpu == cpu || ccpu == group_cpu) { struct list_head *list; do_local: list = &__get_cpu_var(blk_cpu_done); -- 1.8.1.4
Jan Kara
2014-Oct-10 14:23 UTC
[Ocfs2-devel] [PATCH] block: Make rq_affinity = 1 work as expected
From: Tao Ma <boyu.mt at taobao.com>

Commit 5757a6d76c introduced a new rq_affinity = 2 so that the request is completed on the __make_request CPU, but it made the old rq_affinity = 1 stop working. The root cause is that if 'cpu' and 'req->cpu' are in the same group and cpu != req->cpu, ccpu will be the same as group_cpu, so the completion will be executed on 'cpu', not 'group_cpu'.

This patch fixes the problem by simply removing group_cpu, and the code is more explicit now. If ccpu == cpu, we complete on cpu; otherwise we raise_blk_irq to ccpu.

Cc: Christoph Hellwig <hch at infradead.org>
Cc: Roland Dreier <roland at purestorage.com>
Cc: Dan Williams <dan.j.williams at intel.com>
Cc: Jens Axboe <jaxboe at fusionio.com>
Signed-off-by: Tao Ma <boyu.mt at taobao.com>
Reviewed-by: Shaohua Li <shaohua.li at intel.com>
Signed-off-by: Jens Axboe <jaxboe at fusionio.com>
--- block/blk-softirq.c | 8 +++----- 1 file changed, 3 insertions(+), 5 deletions(-) diff --git a/block/blk-softirq.c b/block/blk-softirq.c index 475fab809a80..487addc85bb5 100644 --- a/block/blk-softirq.c +++ b/block/blk-softirq.c @@ -103,7 +103,7 @@ static struct notifier_block __cpuinitdata blk_cpu_notifier = { void __blk_complete_request(struct request *req) { - int ccpu, cpu, group_cpu = NR_CPUS; + int ccpu, cpu; struct request_queue *q = req->q; unsigned long flags; @@ -117,14 +117,12 @@ void __blk_complete_request(struct request *req) */ if (test_bit(QUEUE_FLAG_SAME_COMP, &q->queue_flags) && req->cpu != -1) { ccpu = req->cpu; - if (!test_bit(QUEUE_FLAG_SAME_FORCE, &q->queue_flags)) { ccpu = blk_cpu_to_group(ccpu); - group_cpu = blk_cpu_to_group(cpu); - } + if (!test_bit(QUEUE_FLAG_SAME_FORCE, &q->queue_flags)) ccpu = blk_cpu_to_group(ccpu); } else ccpu = cpu; - if (ccpu == cpu || ccpu == group_cpu) { + if (ccpu == cpu) { struct list_head *list; do_local: list = &__get_cpu_var(blk_cpu_done); -- 1.8.1.4
From: Dan Williams <dan.j.williams at intel.com> Some systems benefit from completions always being steered to the strict requester cpu rather than the looser "per-socket" steering that blk_cpu_to_group() attempts by default. This is because the first CPU in the group mask ends up being completely overloaded with work, while the others (including the original submitter) has power left to spare. Allow the strict mode to be set by writing '2' to the sysfs control file. This is identical to the scheme used for the nomerges file, where '2' is a more aggressive setting than just being turned on. echo 2 > /sys/block/<bdev>/queue/rq_affinity Cc: Christoph Hellwig <hch at infradead.org> Cc: Roland Dreier <roland at purestorage.com> Tested-by: Dave Jiang <dave.jiang at intel.com> Signed-off-by: Dan Williams <dan.j.williams at intel.com> Signed-off-by: Jens Axboe <jaxboe at fusionio.com> --- Documentation/block/queue-sysfs.txt | 10 +++++++--- block/blk-core.c | 6 ++---- block/blk-softirq.c | 11 +++++++---- block/blk-sysfs.c | 13 +++++++++---- include/linux/blkdev.h | 3 ++- 5 files changed, 27 insertions(+), 16 deletions(-) diff --git a/Documentation/block/queue-sysfs.txt b/Documentation/block/queue-sysfs.txt index f65274081c8d..d8147b336c35 100644 --- a/Documentation/block/queue-sysfs.txt +++ b/Documentation/block/queue-sysfs.txt @@ -45,9 +45,13 @@ device. rq_affinity (RW) ---------------- -If this option is enabled, the block layer will migrate request completions -to the CPU that originally submitted the request. For some workloads -this provides a significant reduction in CPU cycles due to caching effects. +If this option is '1', the block layer will migrate request completions to the +cpu "group" that originally submitted the request. For some workloads this +provides a significant reduction in CPU cycles due to caching effects. + +For storage configurations that need to maximize distribution of completion +processing setting this option to '2' forces the completion to run on the +requesting cpu (bypassing the "group" aggregation logic). 
scheduler (RW) -------------- diff --git a/block/blk-core.c b/block/blk-core.c index a56485292062..b3228255304d 100644 --- a/block/blk-core.c +++ b/block/blk-core.c @@ -1279,10 +1279,8 @@ get_rq: init_request_from_bio(req, bio); if (test_bit(QUEUE_FLAG_SAME_COMP, &q->queue_flags) || - bio_flagged(bio, BIO_CPU_AFFINE)) { - req->cpu = blk_cpu_to_group(get_cpu()); - put_cpu(); - } + bio_flagged(bio, BIO_CPU_AFFINE)) + req->cpu = smp_processor_id(); plug = current->plug; if (plug) { diff --git a/block/blk-softirq.c b/block/blk-softirq.c index ee9c21602228..475fab809a80 100644 --- a/block/blk-softirq.c +++ b/block/blk-softirq.c @@ -103,22 +103,25 @@ static struct notifier_block __cpuinitdata blk_cpu_notifier = { void __blk_complete_request(struct request *req) { + int ccpu, cpu, group_cpu = NR_CPUS; struct request_queue *q = req->q; unsigned long flags; - int ccpu, cpu, group_cpu; BUG_ON(!q->softirq_done_fn); local_irq_save(flags); cpu = smp_processor_id(); - group_cpu = blk_cpu_to_group(cpu); /* * Select completion CPU */ - if (test_bit(QUEUE_FLAG_SAME_COMP, &q->queue_flags) && req->cpu != -1) + if (test_bit(QUEUE_FLAG_SAME_COMP, &q->queue_flags) && req->cpu != -1) { ccpu = req->cpu; - else + if (!test_bit(QUEUE_FLAG_SAME_FORCE, &q->queue_flags)) { + ccpu = blk_cpu_to_group(ccpu); + group_cpu = blk_cpu_to_group(cpu); + } + } else ccpu = cpu; if (ccpu == cpu || ccpu == group_cpu) { diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c index d935bd859c87..0ee17b5e7fb6 100644 --- a/block/blk-sysfs.c +++ b/block/blk-sysfs.c @@ -244,8 +244,9 @@ static ssize_t queue_nomerges_store(struct request_queue *q, const char *page, static ssize_t queue_rq_affinity_show(struct request_queue *q, char *page) { bool set = test_bit(QUEUE_FLAG_SAME_COMP, &q->queue_flags); + bool force = test_bit(QUEUE_FLAG_SAME_FORCE, &q->queue_flags); - return queue_var_show(set, page); + return queue_var_show(set << force, page); } static ssize_t @@ -257,10 +258,14 @@ queue_rq_affinity_store(struct request_queue *q, const char *page, size_t count) ret = queue_var_store(&val, page, count); spin_lock_irq(q->queue_lock); - if (val) + if (val) { queue_flag_set(QUEUE_FLAG_SAME_COMP, q); - else - queue_flag_clear(QUEUE_FLAG_SAME_COMP, q); + if (val == 2) + queue_flag_set(QUEUE_FLAG_SAME_FORCE, q); + } else { + queue_flag_clear(QUEUE_FLAG_SAME_COMP, q); + queue_flag_clear(QUEUE_FLAG_SAME_FORCE, q); + } spin_unlock_irq(q->queue_lock); #endif return ret; diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h index c0cd9a2f22ef..0e67c45b3bc9 100644 --- a/include/linux/blkdev.h +++ b/include/linux/blkdev.h @@ -392,7 +392,7 @@ struct request_queue { #define QUEUE_FLAG_ELVSWITCH 6 /* don't use elevator, just do FIFO */ #define QUEUE_FLAG_BIDI 7 /* queue supports bidi requests */ #define QUEUE_FLAG_NOMERGES 8 /* disable merge attempts */ -#define QUEUE_FLAG_SAME_COMP 9 /* force complete on same CPU */ +#define QUEUE_FLAG_SAME_COMP 9 /* complete on same CPU-group */ #define QUEUE_FLAG_FAIL_IO 10 /* fake timeout */ #define QUEUE_FLAG_STACKABLE 11 /* supports request stacking */ #define QUEUE_FLAG_NONROT 12 /* non-rotational device (SSD) */ @@ -402,6 +402,7 @@ struct request_queue { #define QUEUE_FLAG_NOXMERGES 15 /* No extended merges */ #define QUEUE_FLAG_ADD_RANDOM 16 /* Contributes to random pool */ #define QUEUE_FLAG_SECDISCARD 17 /* supports SECDISCARD */ +#define QUEUE_FLAG_SAME_FORCE 18 /* force complete on same CPU */ #define QUEUE_FLAG_DEFAULT ((1 << QUEUE_FLAG_IO_STAT) | \ (1 << QUEUE_FLAG_STACKABLE) | \ -- 1.8.1.4
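To make the new sysfs encoding concrete, here is a small standalone model of how the store/show logic in the patch maps the written value onto the two queue flags and back (the struct below is invented for illustration; it is not the request_queue code):

#include <stdio.h>

struct queue_flags {
	int same_comp;		/* QUEUE_FLAG_SAME_COMP: complete on the same CPU group */
	int same_force;		/* QUEUE_FLAG_SAME_FORCE: force the exact submitting CPU */
};

/* Models queue_rq_affinity_store(): '1' sets only SAME_COMP, '2' sets both,
 * '0' clears both. */
static void rq_affinity_store(struct queue_flags *q, int val)
{
	if (val) {
		q->same_comp = 1;
		if (val == 2)
			q->same_force = 1;
	} else {
		q->same_comp = 0;
		q->same_force = 0;
	}
}

/* Models queue_rq_affinity_show(): "set << force" reads back as 0, 1 or 2. */
static int rq_affinity_show(const struct queue_flags *q)
{
	return q->same_comp << q->same_force;
}

int main(void)
{
	struct queue_flags q = { 0, 0 };
	int vals[] = { 1, 2, 0 };

	for (unsigned int i = 0; i < sizeof(vals) / sizeof(vals[0]); i++) {
		rq_affinity_store(&q, vals[i]);
		printf("echo %d > rq_affinity  ->  reads back %d\n",
		       vals[i], rq_affinity_show(&q));
	}
	return 0;
}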
Jan Kara
2014-Oct-10 14:23 UTC
[Ocfs2-devel] [PATCH] ext3: Fix deadlock in data=journal mode when fs is frozen
When ext3 is used in data=journal mode, syncing filesystem makes sure all the data is committed in the journal but the data doesn't have to be checkpointed. ext3_freeze() then takes care of checkpointing all the data so all buffer heads are clean but pages can still have dangling dirty bits. So when flusher thread comes later when filesystem is frozen, it tries to write back dirty pages, ext3_journalled_writepage() tries to start a transaction and hangs waiting for frozen fs causing a deadlock because a holder of s_umount semaphore may be waiting for flusher thread to complete. The fix is luckily relatively easy. We don't have to start a transaction in ext3_journalled_writepage() when a page is just dirty (and doesn't have PageChecked set) because in that case all buffers should be already mapped (mapping must happen before writing a buffer to the journal) and it is enough to write them out. This optimization also solves the deadlock because block_write_full_page() will just find out there's no buffer to write and do nothing. Signed-off-by: Jan Kara <jack at suse.cz> --- fs/ext3/inode.c | 33 +++++++++++++++++---------------- 1 file changed, 17 insertions(+), 16 deletions(-) JFYI: I've added this patch to my tree. diff --git a/fs/ext3/inode.c b/fs/ext3/inode.c index f5157d0d1b43..695abe738a24 100644 --- a/fs/ext3/inode.c +++ b/fs/ext3/inode.c @@ -1716,17 +1716,17 @@ static int ext3_journalled_writepage(struct page *page, WARN_ON_ONCE(IS_RDONLY(inode) && !(EXT3_SB(inode->i_sb)->s_mount_state & EXT3_ERROR_FS)); - if (ext3_journal_current_handle()) - goto no_write; - trace_ext3_journalled_writepage(page); - handle = ext3_journal_start(inode, ext3_writepage_trans_blocks(inode)); - if (IS_ERR(handle)) { - ret = PTR_ERR(handle); - goto no_write; - } - if (!page_has_buffers(page) || PageChecked(page)) { + if (ext3_journal_current_handle()) + goto no_write; + + handle = ext3_journal_start(inode, + ext3_writepage_trans_blocks(inode)); + if (IS_ERR(handle)) { + ret = PTR_ERR(handle); + goto no_write; + } /* * It's mmapped pagecache. Add buffers and journal it. There * doesn't seem much point in redirtying the page here. @@ -1749,17 +1749,18 @@ static int ext3_journalled_writepage(struct page *page, atomic_set(&EXT3_I(inode)->i_datasync_tid, handle->h_transaction->t_tid); unlock_page(page); + err = ext3_journal_stop(handle); + if (!ret) + ret = err; } else { /* - * It may be a page full of checkpoint-mode buffers. We don't - * really know unless we go poke around in the buffer_heads. - * But block_write_full_page will do the right thing. + * It is a page full of checkpoint-mode buffers. Go and write + * them. They should have been already mapped when they went + * to the journal so provide NULL get_block function to catch + * errors. */ - ret = block_write_full_page(page, ext3_get_block, wbc); + ret = block_write_full_page(page, NULL, wbc); } - err = ext3_journal_stop(handle); - if (!ret) - ret = err; out: return ret; -- 1.8.1.4
When doing filesystem wide sync, there's no need to force transaction commit separately for each inode because ext3_sync_fs() takes care of forcing commit at the end. Most of the time this slowness doesn't manifest because previous WB_SYNC_NONE writeback doesn't leave much to write but when there are processes aggressively creating new files and several filesystems to sync, the sync slowness can be noticeable. In the following test script sync(1) takes around 6 minutes when there are two ext3 filesystems mounted on a standard SATA drive. After this patch sync takes a couple of seconds so we have about two orders of magnitude improvement. function run_writers { for (( i = 0; i < 10; i++ )); do mkdir $1/dir$i for (( j = 0; j < 40000; j++ )); do dd if=/dev/zero of=$1/dir$i/$j bs=4k count=4 &>/dev/null done & done } for dir in "$@"; do run_writers $dir done sleep 40 time sync Signed-off-by: Jan Kara <jack at suse.cz> --- fs/ext3/inode.c | 7 ++++++- 1 file changed, 6 insertions(+), 1 deletion(-) diff --git a/fs/ext3/inode.c b/fs/ext3/inode.c index 384b6ebb655f..14c2faf3c312 100644 --- a/fs/ext3/inode.c +++ b/fs/ext3/inode.c @@ -3253,7 +3253,12 @@ int ext3_write_inode(struct inode *inode, struct writeback_control *wbc) return -EIO; } - if (wbc->sync_mode != WB_SYNC_ALL) + /* + * No need to force transaction in WB_SYNC_NONE mode. Also + * ext3_sync_fs() will force the commit after everything is + * written. + */ + if (wbc->sync_mode != WB_SYNC_ALL || wbc->for_sync) return 0; return ext3_force_commit(inode->i_sb); -- 1.8.1.4
Jan Kara
2014-Oct-10 14:23 UTC
[Ocfs2-devel] [PATCH] ext4: Avoid lock inversion between i_mmap_mutex and transaction start
When DAX is enabled, it uses i_mmap_mutex as a protection against truncate during page fault. This inevitably forces i_mmap_mutex to rank outside of a transaction start and thus we have to avoid calling pagecache purging operations when transaction is started. Signed-off-by: Jan Kara <jack at suse.cz> --- fs/ext4/inode.c | 14 ++++++++++---- 1 file changed, 10 insertions(+), 4 deletions(-) diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c index 8a064734e6eb..494a8645d63e 100644 --- a/fs/ext4/inode.c +++ b/fs/ext4/inode.c @@ -3631,13 +3631,19 @@ int ext4_punch_hole(struct inode *inode, loff_t offset, loff_t length) if (IS_SYNC(inode)) ext4_handle_sync(handle); - /* Now release the pages again to reduce race window */ + inode->i_mtime = inode->i_ctime = ext4_current_time(inode); + ext4_mark_inode_dirty(handle, inode); + ext4_journal_stop(handle); + + /* + * Now release the pages again to reduce race window. This has to happen + * outside of a transaction to avoid lock inversion on i_mmap_mutex + * when DAX is enabled. + */ if (last_block_offset > first_block_offset) truncate_pagecache_range(inode, first_block_offset, last_block_offset); - - inode->i_mtime = inode->i_ctime = ext4_current_time(inode); - ext4_mark_inode_dirty(handle, inode); + goto out_dio; out_stop: ext4_journal_stop(handle); out_dio: -- 1.8.1.4
Jan Kara
2014-Oct-10 14:23 UTC
[Ocfs2-devel] [PATCH 1/2] ext4: Don't check quota format when there are no quota files
The check whether quota format is set even though there are no quota files with journalled quota is pointless and it actually makes it impossible to turn off journalled quotas (as there's no way to unset journalled quota format). Just remove the check. CC: stable at vger.kernel.org Signed-off-by: Jan Kara <jack at suse.cz> --- fs/ext4/super.c | 7 ------- 1 file changed, 7 deletions(-) diff --git a/fs/ext4/super.c b/fs/ext4/super.c index 0b28b36e7915..3ff598f3d4af 100644 --- a/fs/ext4/super.c +++ b/fs/ext4/super.c @@ -1712,13 +1712,6 @@ static int parse_options(char *options, struct super_block *sb, "not specified"); return 0; } - } else { - if (sbi->s_jquota_fmt) { - ext4_msg(sb, KERN_ERR, "journaled quota format " - "specified with no journaling " - "enabled"); - return 0; - } } #endif if (test_opt(sb, DIOREAD_NOLOCK)) { -- 1.8.1.4
Jan Kara
2014-Oct-10 14:23 UTC
[Ocfs2-devel] [PATCH 1/2] ext4: Fix block zeroing when punching holes in indirect block files
free_holes_block() passed a local variable as the block pointer to ext4_clear_blocks(). Thus ext4_clear_blocks() zeroed out this local variable instead of the proper place in the inode / indirect block. We later zero out the proper place in the inode / indirect block but don't dirty the inode / buffer again, which can lead to subtle issues (e.g. some changes to the inode can be lost).

Signed-off-by: Jan Kara <jack at suse.cz>
--- fs/ext4/indirect.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/fs/ext4/indirect.c b/fs/ext4/indirect.c index 8a57e9fcd1b9..9d381707a6fc 100644 --- a/fs/ext4/indirect.c +++ b/fs/ext4/indirect.c @@ -1329,8 +1329,8 @@ static int free_hole_blocks(handle_t *handle, struct inode *inode, if (level == 0 || (bh && all_zeroes((__le32 *)bh->b_data, (__le32 *)bh->b_data + addr_per_block))) { - ext4_free_data(handle, inode, parent_bh, &blk, &blk+1); - *i_data = 0; + ext4_free_data(handle, inode, parent_bh, + i_data, i_data + 1); } brelse(bh); bh = NULL; -- 1.8.1.4
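The bug class is easy to reproduce outside the filesystem: zeroing through a pointer to a local copy leaves the real array untouched. A minimal standalone sketch (the helper below is a stand-in loosely modeled on ext4_clear_blocks(); it is not the ext4 code):

#include <stdio.h>
#include <string.h>

/* Zero the block pointers in the range [first, last). */
static void clear_blocks(unsigned int *first, unsigned int *last)
{
	memset(first, 0, (char *)last - (char *)first);
}

int main(void)
{
	unsigned int i_data[3] = { 11, 22, 33 };
	unsigned int blk = i_data[1];		/* local copy, as in the buggy code */

	/* Buggy pattern: only the local copy is zeroed; i_data[1] keeps its
	 * old value and nothing marks the containing buffer dirty. */
	clear_blocks(&blk, &blk + 1);
	printf("after buggy clear: %u\n", i_data[1]);	/* still 22 */

	/* Fixed pattern: pass a pointer into the real array, as the patch
	 * does with i_data / i_data + 1. */
	clear_blocks(&i_data[1], &i_data[1] + 1);
	printf("after fixed clear: %u\n", i_data[1]);	/* now 0 */
	return 0;
}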
Jan Kara
2014-Oct-10 14:23 UTC
[Ocfs2-devel] [PATCH] ext4: Fix buffer double free in ext4_alloc_branch()
Error recovery in ext4_alloc_branch() calls ext4_forget() even for buffer corresponding to indirect block it did not allocate. This leads to brelse() being called twice for that buffer (once from ext4_forget() and once from cleanup in ext4_ind_map_blocks()) leading to buffer use count misaccounting. Eventually (but often much later because there are other users of the buffer) we will see messages like: VFS: brelse: Trying to free free buffer Another manifestation of this problem is an error: JBD2 unexpected failure: jbd2_journal_revoke: !buffer_revoked(bh); inconsistent data on disk The fix is easy - don't forget buffer we did not allocate. Also add an explanatory comment because the indexing at ext4_alloc_branch() is somewhat subtle. Signed-off-by: Jan Kara <jack at suse.cz> --- fs/ext4/indirect.c | 8 +++++++- 1 file changed, 7 insertions(+), 1 deletion(-) diff --git a/fs/ext4/indirect.c b/fs/ext4/indirect.c index 594009f5f523..3b91d240da4d 100644 --- a/fs/ext4/indirect.c +++ b/fs/ext4/indirect.c @@ -389,7 +389,13 @@ static int ext4_alloc_branch(handle_t *handle, struct inode *inode, return 0; failed: for (; i >= 0; i--) { - if (i != indirect_blks && branch[i].bh) + /* + * We want to ext4_forget() only freshly allocated indirect + * blocks. Buffer for new_blocks[i-1] is at branch[i].bh and + * buffer at branch[0].bh is indirect block / inode already + * existing before ext4_alloc_branch() was called. + */ + if (i > 0 && i != indirect_blks && branch[i].bh) ext4_forget(handle, 1, inode, branch[i].bh, branch[i].bh->b_blocknr); ext4_free_blocks(handle, inode, NULL, new_blocks[i], -- 1.8.1.4
Jan Kara
2014-Oct-10 14:23 UTC
[Ocfs2-devel] [PATCH] ext4: Fix jbd2 warning under heavy xattr load
When heavily exercising xattr code the assertion that jbd2_journal_dirty_metadata() shouldn't return error was triggered: WARNING: at /srv/autobuild-ceph/gitbuilder.git/build/fs/jbd2/transaction.c:1237 jbd2_journal_dirty_metadata+0x1ba/0x260() CPU: 0 PID: 8877 Comm: ceph-osd Tainted: G W 3.10.0-ceph-00049-g68d04c9 #1 Hardware name: Dell Inc. PowerEdge R410/01V648, BIOS 1.6.3 02/07/2011 ffffffff81a1d3c8 ffff880214469928 ffffffff816311b0 ffff880214469968 ffffffff8103fae0 ffff880214469958 ffff880170a9dc30 ffff8802240fbe80 0000000000000000 ffff88020b366000 ffff8802256e7510 ffff880214469978 Call Trace: [<ffffffff816311b0>] dump_stack+0x19/0x1b [<ffffffff8103fae0>] warn_slowpath_common+0x70/0xa0 [<ffffffff8103fb2a>] warn_slowpath_null+0x1a/0x20 [<ffffffff81267c2a>] jbd2_journal_dirty_metadata+0x1ba/0x260 [<ffffffff81245093>] __ext4_handle_dirty_metadata+0xa3/0x140 [<ffffffff812561f3>] ext4_xattr_release_block+0x103/0x1f0 [<ffffffff81256680>] ext4_xattr_block_set+0x1e0/0x910 [<ffffffff8125795b>] ext4_xattr_set_handle+0x38b/0x4a0 [<ffffffff810a319d>] ? trace_hardirqs_on+0xd/0x10 [<ffffffff81257b32>] ext4_xattr_set+0xc2/0x140 [<ffffffff81258547>] ext4_xattr_user_set+0x47/0x50 [<ffffffff811935ce>] generic_setxattr+0x6e/0x90 [<ffffffff81193ecb>] __vfs_setxattr_noperm+0x7b/0x1c0 [<ffffffff811940d4>] vfs_setxattr+0xc4/0xd0 [<ffffffff8119421e>] setxattr+0x13e/0x1e0 [<ffffffff811719c7>] ? __sb_start_write+0xe7/0x1b0 [<ffffffff8118f2e8>] ? mnt_want_write_file+0x28/0x60 [<ffffffff8118c65c>] ? fget_light+0x3c/0x130 [<ffffffff8118f2e8>] ? mnt_want_write_file+0x28/0x60 [<ffffffff8118f1f8>] ? __mnt_want_write+0x58/0x70 [<ffffffff811946be>] SyS_fsetxattr+0xbe/0x100 [<ffffffff816407c2>] system_call_fastpath+0x16/0x1b The reason for the warning is that buffer_head passed into jbd2_journal_dirty_metadata() didn't have journal_head attached. This is caused by the following race of two ext4_xattr_release_block() calls: CPU1 CPU2 ext4_xattr_release_block() ext4_xattr_release_block() lock_buffer(bh); /* False */ if (BHDR(bh)->h_refcount == cpu_to_le32(1)) } else { le32_add_cpu(&BHDR(bh)->h_refcount, -1); unlock_buffer(bh); lock_buffer(bh); /* True */ if (BHDR(bh)->h_refcount == cpu_to_le32(1)) get_bh(bh); ext4_free_blocks() ... jbd2_journal_forget() jbd2_journal_unfile_buffer() -> JH is gone error = ext4_handle_dirty_xattr_block(handle, inode, bh); -> triggers the warning We fix the problem by moving ext4_handle_dirty_xattr_block() under the buffer lock. Sadly this cannot be done in nojournal mode as that function can call sync_dirty_buffer() which would deadlock. Luckily in nojournal mode the race is harmless (we only dirty already freed buffer) and thus for nojournal mode we leave the dirtying outside of the buffer lock. Reported-by: Sage Weil <sage at inktank.com> Signed-off-by: Jan Kara <jack at suse.cz> --- fs/ext4/xattr.c | 23 +++++++++++++++++++---- 1 file changed, 19 insertions(+), 4 deletions(-) diff --git a/fs/ext4/xattr.c b/fs/ext4/xattr.c index e175e94116ac..55e611c1513c 100644 --- a/fs/ext4/xattr.c +++ b/fs/ext4/xattr.c @@ -517,8 +517,8 @@ static void ext4_xattr_update_super_block(handle_t *handle, } /* - * Release the xattr block BH: If the reference count is > 1, decrement - * it; otherwise free the block. + * Release the xattr block BH: If the reference count is > 1, decrement it; + * otherwise free the block. 
*/ static void ext4_xattr_release_block(handle_t *handle, struct inode *inode, @@ -538,16 +538,31 @@ ext4_xattr_release_block(handle_t *handle, struct inode *inode, if (ce) mb_cache_entry_free(ce); get_bh(bh); + unlock_buffer(bh); ext4_free_blocks(handle, inode, bh, 0, 1, EXT4_FREE_BLOCKS_METADATA | EXT4_FREE_BLOCKS_FORGET); - unlock_buffer(bh); } else { le32_add_cpu(&BHDR(bh)->h_refcount, -1); if (ce) mb_cache_entry_release(ce); + /* + * Beware of this ugliness: Releasing of xattr block references + * from different inodes can race and so we have to protect + * from a race where someone else frees the block (and releases + * its journal_head) before we are done dirtying the buffer. In + * nojournal mode this race is harmless and we actually cannot + * call ext4_handle_dirty_xattr_block() with locked buffer as + * that function can call sync_dirty_buffer() so for that case + * we handle the dirtying after unlocking the buffer. + */ + if (ext4_handle_valid(handle)) + error = ext4_handle_dirty_xattr_block(handle, inode, + bh); unlock_buffer(bh); - error = ext4_handle_dirty_xattr_block(handle, inode, bh); + if (!ext4_handle_valid(handle)) + error = ext4_handle_dirty_xattr_block(handle, inode, + bh); if (IS_SYNC(inode)) ext4_handle_sync(handle); dquot_free_block(inode, EXT4_C2B(EXT4_SB(inode->i_sb), 1)); -- 1.8.1.4
Jan Kara
2014-Oct-10 14:23 UTC
[Ocfs2-devel] [PATCH] ext4: Fix zeroing of page during writeback
Tail of a page straddling inode size must be zeroed when being written out due to POSIX requirement that modifications of mmaped page beyond inode size must not be written to the file. ext4_bio_write_page() did this only for blocks fully beyond inode size but didn't properly zero blocks partially beyond inode size. Fix this. The problem has been uncovered by mmap_11-4 test in openposix test suite (part of LTP). Reported-by: Xiaoguang Wang <wangxg.fnst at cn.fujitsu.com> Fixes: 5a0dc7365c240 Fixes: bd2d0210cf22f CC: stable at vger.kernel.org Signed-off-by: Jan Kara <jack at suse.cz> --- fs/ext4/page-io.c | 24 +++++++++++------------- 1 file changed, 11 insertions(+), 13 deletions(-) diff --git a/fs/ext4/page-io.c b/fs/ext4/page-io.c index c18d95b50540..b936388b834f 100644 --- a/fs/ext4/page-io.c +++ b/fs/ext4/page-io.c @@ -418,6 +418,17 @@ int ext4_bio_write_page(struct ext4_io_submit *io, ClearPageError(page); /* + * Comments copied from block_write_full_page_endio: + * + * The page straddles i_size. It must be zeroed out on each and every + * writepage invocation because it may be mmapped. "A file is mapped + * in multiples of the page size. For a file that is not a multiple of + * the page size, the remaining memory is zeroed when mapped, and + * writes to that region are not written out to the file." + */ + if (len < PAGE_CACHE_SIZE) + zero_user_segment(page, len, PAGE_CACHE_SIZE); + /* * In the first loop we prepare and mark buffers to submit. We have to * mark all buffers in the page before submitting so that * end_page_writeback() cannot be called from ext4_bio_end_io() when IO @@ -428,19 +439,6 @@ int ext4_bio_write_page(struct ext4_io_submit *io, do { block_start = bh_offset(bh); if (block_start >= len) { - /* - * Comments copied from block_write_full_page_endio: - * - * The page straddles i_size. It must be zeroed out on - * each and every writepage invocation because it may - * be mmapped. "A file is mapped in multiples of the - * page size. For a file that is not a multiple of - * the page size, the remaining memory is zeroed when - * mapped, and writes to that region are not written - * out to the file." - */ - zero_user_segment(page, block_start, - block_start + blocksize); clear_buffer_dirty(bh); set_buffer_uptodate(bh); continue; -- 1.8.1.4
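A standalone sketch of the zeroing rule the patch enforces for a page straddling i_size (the buffer and sizes are invented for the example; in the kernel this is done with zero_user_segment() on the page cache page):

#include <stdio.h>
#include <string.h>

#define EXAMPLE_PAGE_SIZE 4096u

/* Zero everything in the page beyond "len" valid bytes, mirroring the
 * zero_user_segment(page, len, PAGE_CACHE_SIZE) call added by the patch. */
static void zero_page_tail(unsigned char *page, size_t len)
{
	if (len < EXAMPLE_PAGE_SIZE)
		memset(page + len, 0, EXAMPLE_PAGE_SIZE - len);
}

int main(void)
{
	unsigned char page[EXAMPLE_PAGE_SIZE];
	unsigned long long i_size = 10ULL * EXAMPLE_PAGE_SIZE + 1000;	/* file ends mid-page */
	size_t len = (size_t)(i_size % EXAMPLE_PAGE_SIZE);		/* valid bytes in last page */

	memset(page, 0xaa, sizeof(page));	/* stale data left behind by an mmap write */
	zero_page_tail(page, len);
	printf("byte 999 = %#x, byte 1000 = %#x\n", page[999], page[1000]);
	return 0;
}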
Signed-off-by: Jan Kara <jack at suse.cz> --- fs/ext4/namei.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/fs/ext4/namei.c b/fs/ext4/namei.c index 0486fbafb808..45a0d2b8953a 100644 --- a/fs/ext4/namei.c +++ b/fs/ext4/namei.c @@ -2559,6 +2559,7 @@ int ext4_orphan_add(handle_t *handle, struct inode *inode) int err = 0, rc; bool dirty = false; + return 0; if (!sbi->s_journal) return 0; @@ -2616,7 +2617,7 @@ int ext4_orphan_add(handle_t *handle, struct inode *inode) * list entries can cause panics at unmount time. */ mutex_lock(&sbi->s_orphan_lock); - list_del(&EXT4_I(inode)->i_orphan); + list_del_init(&EXT4_I(inode)->i_orphan); mutex_unlock(&sbi->s_orphan_lock); } } @@ -2641,6 +2642,7 @@ int ext4_orphan_del(handle_t *handle, struct inode *inode) struct ext4_iloc iloc; int err = 0; + return 0; if (!sbi->s_journal && !(sbi->s_mount_state & EXT4_ORPHAN_FS)) return 0; -- 1.8.1.4
When doing filesystem wide sync, there's no need to force transaction commit (or synchronously write inode buffer) separately for each inode because ext4_sync_fs() takes care of forcing commit at the end (VFS takes care of flushing buffer cache, respectively). Most of the time this slowness doesn't manifest because previous WB_SYNC_NONE writeback doesn't leave much to write but when there are processes aggressively creating new files and several filesystems to sync, the sync slowness can be noticeable. In the following test script sync(1) takes around 6 minutes when there are two ext4 filesystems mounted on a standard SATA drive. After this patch sync takes a couple of seconds so we have about two orders of magnitude improvement. function run_writers { for (( i = 0; i < 10; i++ )); do mkdir $1/dir$i for (( j = 0; j < 40000; j++ )); do dd if=/dev/zero of=$1/dir$i/$j bs=4k count=4 &>/dev/null done & done } for dir in "$@"; do run_writers $dir done sleep 40 time sync Signed-off-by: Jan Kara <jack at suse.cz> --- fs/ext4/inode.c | 13 +++++++++++-- 1 file changed, 11 insertions(+), 2 deletions(-) diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c index 6e39895a91b8..7850584b0679 100644 --- a/fs/ext4/inode.c +++ b/fs/ext4/inode.c @@ -4443,7 +4443,12 @@ int ext4_write_inode(struct inode *inode, struct writeback_control *wbc) return -EIO; } - if (wbc->sync_mode != WB_SYNC_ALL) + /* + * No need to force transaction in WB_SYNC_NONE mode. Also + * ext4_sync_fs() will force the commit after everything is + * written. + */ + if (wbc->sync_mode != WB_SYNC_ALL || wbc->for_sync) return 0; err = ext4_force_commit(inode->i_sb); @@ -4453,7 +4458,11 @@ int ext4_write_inode(struct inode *inode, struct writeback_control *wbc) err = __ext4_get_inode_loc(inode, &iloc, 0); if (err) return err; - if (wbc->sync_mode == WB_SYNC_ALL) + /* + * sync(2) will flush the whole buffer cache. No need to do + * it here separately for each inode. + */ + if (wbc->sync_mode == WB_SYNC_ALL && !wbc->for_sync) sync_dirty_buffer(iloc.bh); if (buffer_req(iloc.bh) && !buffer_uptodate(iloc.bh)) { EXT4_ERROR_INODE_BLOCK(inode, iloc.bh->b_blocknr, -- 1.8.1.4
Jan Kara
2014-Oct-10 14:23 UTC
[Ocfs2-devel] [PATCH for 3.14-stable] fanotify: fix double free of pending permission events
commit 5838d4442bd5971687b72221736222637e03140d upstream. Commit 85816794240b ("fanotify: Fix use after free for permission events") introduced a double free issue for permission events which are pending in group's notification queue while group is being destroyed. These events are freed from fanotify_handle_event() but they are not removed from groups notification queue and thus they get freed again from fsnotify_flush_notify(). Fix the problem by removing permission events from notification queue before freeing them if we skip processing access response. Also expand comments in fanotify_release() to explain group shutdown in detail. Fixes: 85816794240b9659e66e4d9b0df7c6e814e5f603 Signed-off-by: Jan Kara <jack at suse.cz> Reported-by: Douglas Leeder <douglas.leeder at sophos.com> Tested-by: Douglas Leeder <douglas.leeder at sophos.com> Reported-by: Heinrich Schuchard <xypron.glpk at gmx.de> Cc: <stable at vger.kernel.org> Signed-off-by: Andrew Morton <akpm at linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds at linux-foundation.org> Signed-off-by: Jan Kara <jack at suse.cz> --- fs/notify/fanotify/fanotify.c | 9 ++++++++- fs/notify/fanotify/fanotify_user.c | 12 ++++++++++++ fs/notify/notification.c | 18 +++++++++++++++++- include/linux/fsnotify_backend.h | 2 ++ 4 files changed, 39 insertions(+), 2 deletions(-) diff --git a/fs/notify/fanotify/fanotify.c b/fs/notify/fanotify/fanotify.c index dc638f786d5c..aadf397be4d9 100644 --- a/fs/notify/fanotify/fanotify.c +++ b/fs/notify/fanotify/fanotify.c @@ -70,8 +70,15 @@ static int fanotify_get_response_from_access(struct fsnotify_group *group, wait_event(group->fanotify_data.access_waitq, event->response || atomic_read(&group->fanotify_data.bypass_perm)); - if (!event->response) /* bypass_perm set */ + if (!event->response) { /* bypass_perm set */ + /* + * Event was canceled because group is being destroyed. Remove + * it from group's event list because we are responsible for + * freeing the permission event. + */ + fsnotify_remove_event(group, &event->fse); return 0; + } /* userspace responded, convert to something usable */ switch (event->response) { diff --git a/fs/notify/fanotify/fanotify_user.c b/fs/notify/fanotify/fanotify_user.c index 287a22c04149..ceb130dd3ac8 100644 --- a/fs/notify/fanotify/fanotify_user.c +++ b/fs/notify/fanotify/fanotify_user.c @@ -385,6 +385,11 @@ static int fanotify_release(struct inode *ignored, struct file *file) #ifdef CONFIG_FANOTIFY_ACCESS_PERMISSIONS struct fanotify_response_event *re, *lre; + /* + * There may be still new events arriving in the notification queue + * but since userspace cannot use fanotify fd anymore, no event can + * enter or leave access_list by now. + */ mutex_lock(&group->fanotify_data.access_mutex); atomic_inc(&group->fanotify_data.bypass_perm); @@ -400,6 +405,13 @@ static int fanotify_release(struct inode *ignored, struct file *file) } mutex_unlock(&group->fanotify_data.access_mutex); + /* + * Since bypass_perm is set, newly queued events will not wait for + * access response. Wake up the already sleeping ones now. + * synchronize_srcu() in fsnotify_destroy_group() will wait for all + * processes sleeping in fanotify_handle_event() waiting for access + * response and thus also for all permission events to be freed. 
+ */ wake_up(&group->fanotify_data.access_waitq); #endif diff --git a/fs/notify/notification.c b/fs/notify/notification.c index 1e58402171a5..25a07c70f1c9 100644 --- a/fs/notify/notification.c +++ b/fs/notify/notification.c @@ -73,7 +73,8 @@ void fsnotify_destroy_event(struct fsnotify_group *group, /* Overflow events are per-group and we don't want to free them */ if (!event || event->mask == FS_Q_OVERFLOW) return; - + /* If the event is still queued, we have a problem... */ + WARN_ON(!list_empty(&event->list)); group->ops->free_event(event); } @@ -125,6 +126,21 @@ queue: } /* + * Remove @event from group's notification queue. It is the responsibility of + * the caller to destroy the event. + */ +void fsnotify_remove_event(struct fsnotify_group *group, + struct fsnotify_event *event) +{ + mutex_lock(&group->notification_mutex); + if (!list_empty(&event->list)) { + list_del_init(&event->list); + group->q_len--; + } + mutex_unlock(&group->notification_mutex); +} + +/* * Remove and return the first event from the notification list. It is the * responsibility of the caller to destroy the obtained event */ diff --git a/include/linux/fsnotify_backend.h b/include/linux/fsnotify_backend.h index 64cf3ef50696..6191c2edb8c7 100644 --- a/include/linux/fsnotify_backend.h +++ b/include/linux/fsnotify_backend.h @@ -326,6 +326,8 @@ extern int fsnotify_add_notify_event(struct fsnotify_group *group, struct fsnotify_event *event, int (*merge)(struct list_head *, struct fsnotify_event *)); +/* Remove passed event from groups notification queue */ +extern void fsnotify_remove_event(struct fsnotify_group *group, struct fsnotify_event *event); /* true if the group notification queue is empty */ extern bool fsnotify_notify_queue_is_empty(struct fsnotify_group *group); /* return, but do not dequeue the first event on the notification queue */ -- 1.8.1.4
Jan Kara
2014-Oct-10 14:23 UTC
[Ocfs2-devel] [PATCH] fs: Avoid userspace mounting anon_inodefs filesystem
anon_inodefs filesystem is a kernel internal filesystem userspace shouldn't mess with. Remove registration of it so userspace cannot even try to mount it (which would fail anyway because the filesystem is MS_NOUSER). This fixes an oops triggered by trinity when it tried mounting anon_inodefs which overwrote anon_inode_inode pointer while other CPU has been in anon_inode_getfile() between ihold() and d_instantiate(). Thus effectively creating dentry pointing to an inode without holding a reference to it. Reported-by: Sasha Levin <sasha.levin at oracle.com> Signed-off-by: Jan Kara <jack at suse.cz> --- fs/anon_inodes.c | 3 --- 1 file changed, 3 deletions(-) diff --git a/fs/anon_inodes.c b/fs/anon_inodes.c index 24084732b1d0..4b4543b8b894 100644 --- a/fs/anon_inodes.c +++ b/fs/anon_inodes.c @@ -177,9 +177,6 @@ static int __init anon_inode_init(void) { int error; - error = register_filesystem(&anon_inode_fs_type); - if (error) - goto err_exit; anon_inode_mnt = kern_mount(&anon_inode_fs_type); if (IS_ERR(anon_inode_mnt)) { error = PTR_ERR(anon_inode_mnt); -- 1.8.1.4
Jan Kara
2014-Oct-10 14:23 UTC
[Ocfs2-devel] [PATCH 1/2] jbd2: Avoid pointless scanning of checkpoint lists
Yuanhan has reported that when he is running fsync(2) heavy workload creating new files over ramdisk, significant amount of time is spent in __jbd2_journal_clean_checkpoint_list() trying to clean old transactions (but they cannot be cleaned up because flusher hasn't yet checkpointed those buffers). The workload can be generated by: fs_mark -d /fs/ram0/1 -D 2 -N 2560 -n 1000000 -L 1 -S 1 -s 4096 Reduce the amount of scanning by stopping to scan the transaction list once we find a transaction that cannot be checkpointed. Note that this way of cleaning is still enough to keep freeing space in the journal after fully checkpointed transactions. Reported-and-tested-by: Yuanhan Liu <yuanhan.liu at linux.intel.com> Signed-off-by: Jan Kara <jack at suse.cz> --- fs/jbd2/checkpoint.c | 32 ++++++++++++++++++-------------- 1 file changed, 18 insertions(+), 14 deletions(-) diff --git a/fs/jbd2/checkpoint.c b/fs/jbd2/checkpoint.c index 7f34f4716165..e39b2d0e1079 100644 --- a/fs/jbd2/checkpoint.c +++ b/fs/jbd2/checkpoint.c @@ -478,7 +478,6 @@ int jbd2_cleanup_journal_tail(journal_t *journal) * Find all the written-back checkpoint buffers in the given list and * release them. * - * Called with the journal locked. * Called with j_list_lock held. * Returns number of buffers reaped (for debug) */ @@ -498,12 +497,12 @@ static int journal_clean_one_cp_list(struct journal_head *jh, int *released) jh = next_jh; next_jh = jh->b_cpnext; ret = __try_to_free_cp_buf(jh); - if (ret) { - freed++; - if (ret == 2) { - *released = 1; - return freed; - } + if (!ret) + return freed; + freed++; + if (ret == 2) { + *released = 1; + return freed; } /* * This function only frees up some memory @@ -523,7 +522,6 @@ static int journal_clean_one_cp_list(struct journal_head *jh, int *released) * * Find all the written-back checkpoint buffers in the journal and release them. * - * Called with the journal locked. * Called with j_list_lock held. * Returns number of buffers reaped (for debug) */ @@ -531,7 +529,8 @@ static int journal_clean_one_cp_list(struct journal_head *jh, int *released) int __jbd2_journal_clean_checkpoint_list(journal_t *journal) { transaction_t *transaction, *last_transaction, *next_transaction; - int ret = 0; + int ret; + int freed = 0; int released; transaction = journal->j_checkpoint_transactions; @@ -543,17 +542,21 @@ int __jbd2_journal_clean_checkpoint_list(journal_t *journal) do { transaction = next_transaction; next_transaction = transaction->t_cpnext; - ret += journal_clean_one_cp_list(transaction-> + ret = journal_clean_one_cp_list(transaction-> t_checkpoint_list, &released); /* * This function only frees up some memory if possible so we * dont have an obligation to finish processing. Bail out if * preemption requested: */ - if (need_resched()) + if (need_resched()) { + freed += ret; goto out; - if (released) + } + if (released) { + freed += ret; continue; + } /* * It is essential that we are as careful as in the case of * t_checkpoint_list with removing the buffer from the list as @@ -561,11 +564,12 @@ int __jbd2_journal_clean_checkpoint_list(journal_t *journal) */ ret += journal_clean_one_cp_list(transaction-> t_checkpoint_io_list, &released); - if (need_resched()) + freed += ret; + if (need_resched() || !ret) goto out; } while (transaction != last_transaction); out: - return ret; + return freed; } /* -- 1.8.1.4
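The heart of the optimization is terminating the walk at the first entry that cannot be released yet, instead of scanning the whole list on every call. A toy standalone model (the list structure is invented for illustration; it is not the jbd2 checkpoint list):

#include <stdio.h>

struct cp_entry {
	int written_back;		/* can this entry be released? */
	struct cp_entry *next;
};

/* Release leading written-back entries and stop at the first one that is
 * still pinned - the same early return the patch adds when nothing more
 * can be freed. */
static int clean_cp_list(struct cp_entry *head)
{
	int freed = 0;

	for (struct cp_entry *p = head; p; p = p->next) {
		if (!p->written_back)
			return freed;
		freed++;
	}
	return freed;
}

int main(void)
{
	struct cp_entry c3 = { 0, NULL };	/* not yet written back: scan stops here */
	struct cp_entry c2 = { 1, &c3 };
	struct cp_entry c1 = { 1, &c2 };

	printf("released %d checkpointed entries\n", clean_cp_list(&c1));
	return 0;
}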
Jan Kara
2014-Oct-10 14:23 UTC
[Ocfs2-devel] [PATCH] jbd2: Optimize jbd2_log_do_checkpoint() a bit
When we discover written out buffer in transaction checkpoint list we don't have to recheck validity of a transaction. Either this is the last buffer in a transaction - and then we are done - or this isn't and then we can just take another buffer from the checkpoint list without dropping j_list_lock. Signed-off-by: Jan Kara <jack at suse.cz> --- fs/jbd2/checkpoint.c | 13 ++++++++----- 1 file changed, 8 insertions(+), 5 deletions(-) diff --git a/fs/jbd2/checkpoint.c b/fs/jbd2/checkpoint.c index 993a187527f3..3722e2e53638 100644 --- a/fs/jbd2/checkpoint.c +++ b/fs/jbd2/checkpoint.c @@ -343,12 +343,15 @@ restart: if (!buffer_dirty(bh)) { if (unlikely(buffer_write_io_error(bh)) && !result) result = -EIO; - get_bh(bh); BUFFER_TRACE(bh, "remove from checkpoint"); - __jbd2_journal_remove_checkpoint(jh); - spin_unlock(&journal->j_list_lock); - __brelse(bh); - goto retry; + if (__jbd2_journal_remove_checkpoint(jh)) { + /* + * This was the last buffer attached to the + * transaction. We are done. + */ + goto out; + } + continue; } /* * Important: we are about to write the buffer, and -- 1.8.1.4
Signed-off-by: Jan Kara <jack at suse.cz> --- kernel/locking/lockdep.c | 707 +++++++++++++++++++++++++++-------------------- 1 file changed, 402 insertions(+), 305 deletions(-) diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c index d24e4339b46d..b15e7dec55f6 100644 --- a/kernel/locking/lockdep.c +++ b/kernel/locking/lockdep.c @@ -77,6 +77,26 @@ module_param(lock_stat, int, 0644); */ static arch_spinlock_t lockdep_lock = (arch_spinlock_t)__ARCH_SPIN_LOCK_UNLOCKED; +static void sprint_ip_sym(char *buf, unsigned long ip) +{ + sprintf(buf, "[<%p>] %pS\n", (void *) ip, (void *) ip); +} + +static void trace_print_stack_trace(struct stack_trace *trace, int spaces) +{ + int i, n; + char buf[256]; + + if (!trace->entries) + return; + + for (i = 0; i < trace->nr_entries; i++) { + n = sprintf(buf, "%*c", 1 + spaces, ' '); + sprint_ip_sym(buf + n, trace->entries[i]); + trace_printk(buf); + } +} + static int graph_lock(void) { arch_spin_lock(&lockdep_lock); @@ -382,9 +402,9 @@ static unsigned long stack_trace[MAX_STACK_TRACE_ENTRIES]; static void print_lockdep_off(const char *bug_msg) { - printk(KERN_DEBUG "%s\n", bug_msg); - printk(KERN_DEBUG "turning off the locking correctness validator.\n"); - printk(KERN_DEBUG "Please attach the output of /proc/lock_stat to the bug report\n"); + trace_printk("%s\n", bug_msg); + trace_printk("turning off the locking correctness validator.\n"); + trace_printk("Please attach the output of /proc/lock_stat to the bug report\n"); } static int save_trace(struct stack_trace *trace) @@ -417,7 +437,7 @@ static int save_trace(struct stack_trace *trace) return 0; print_lockdep_off("BUG: MAX_STACK_TRACE_ENTRIES too low!"); - dump_stack(); + trace_dump_stack(0); return 0; } @@ -506,7 +526,7 @@ void get_usage_chars(struct lock_class *class, char usage[LOCK_USAGE_CHARS]) usage[i] = '\0'; } -static void __print_lock_name(struct lock_class *class) +static void __sprint_lock_name(char *buf, struct lock_class *class) { char str[KSYM_NAME_LEN]; const char *name; @@ -514,28 +534,28 @@ static void __print_lock_name(struct lock_class *class) name = class->name; if (!name) { name = __get_key_name(class->key, str); - printk("%s", name); + strcpy(buf, name); } else { - printk("%s", name); + strcpy(buf, name); if (class->name_version > 1) - printk("#%d", class->name_version); + sprintf(buf + strlen(buf), "#%d", class->name_version); if (class->subclass) - printk("/%d", class->subclass); + sprintf(buf + strlen(buf), "/%d", class->subclass); } } -static void print_lock_name(struct lock_class *class) +static void sprint_lock_name(char *buf, struct lock_class *class) { char usage[LOCK_USAGE_CHARS]; get_usage_chars(class, usage); - printk(" ("); - __print_lock_name(class); - printk("){%s}", usage); + strcpy(buf, " ("); + __sprint_lock_name(buf, class); + sprintf(buf + strlen(buf), "){%s}", usage); } -static void print_lockdep_cache(struct lockdep_map *lock) +static void sprint_lockdep_cache(char *buf, struct lockdep_map *lock) { const char *name; char str[KSYM_NAME_LEN]; @@ -544,14 +564,14 @@ static void print_lockdep_cache(struct lockdep_map *lock) if (!name) name = __get_key_name(lock->key->subkeys, str); - printk("%s", name); + strcpy(buf, name); } -static void print_lock(struct held_lock *hlock) +static void sprint_lock(char *buf, struct held_lock *hlock) { - print_lock_name(hlock_class(hlock)); - printk(", at: "); - print_ip_sym(hlock->acquire_ip); + sprint_lock_name(buf, hlock_class(hlock)); + strcat(buf, ", at: "); + sprint_ip_sym(buf + strlen(buf), hlock->acquire_ip); } 
static void lockdep_print_held_locks(struct task_struct *curr) @@ -559,21 +579,24 @@ static void lockdep_print_held_locks(struct task_struct *curr) int i, depth = curr->lockdep_depth; if (!depth) { - printk("no locks held by %s/%d.\n", curr->comm, task_pid_nr(curr)); + trace_printk("no locks held by %s/%d.\n", curr->comm, task_pid_nr(curr)); return; } - printk("%d lock%s held by %s/%d:\n", + trace_printk("%d lock%s held by %s/%d:\n", depth, depth > 1 ? "s" : "", curr->comm, task_pid_nr(curr)); for (i = 0; i < depth; i++) { - printk(" #%d: ", i); - print_lock(curr->held_locks + i); + char buf[256]; + + sprintf(buf, " #%d: ", i); + sprint_lock(buf + strlen(buf), curr->held_locks + i); + trace_printk(buf); } } static void print_kernel_ident(void) { - printk("%s %.*s %s\n", init_utsname()->release, + trace_printk("%s %.*s %s\n", init_utsname()->release, (int)strcspn(init_utsname()->version, " "), init_utsname()->version, print_tainted()); @@ -669,11 +692,11 @@ look_up_lock_class(struct lockdep_map *lock, unsigned int subclass) if (unlikely(subclass >= MAX_LOCKDEP_SUBCLASSES)) { debug_locks_off(); - printk(KERN_ERR + trace_printk( "BUG: looking up invalid subclass: %u\n", subclass); - printk(KERN_ERR + trace_printk(KERN_ERR "turning off the locking correctness validator.\n"); - dump_stack(); + trace_dump_stack(0); return NULL; } @@ -737,10 +760,10 @@ register_lock_class(struct lockdep_map *lock, unsigned int subclass, int force) */ if (!static_obj(lock->key)) { debug_locks_off(); - printk("INFO: trying to register non-static key.\n"); - printk("the code is fine but needs lockdep annotation.\n"); - printk("turning off the locking correctness validator.\n"); - dump_stack(); + trace_printk("INFO: trying to register non-static key.\n"); + trace_printk("the code is fine but needs lockdep annotation.\n"); + trace_printk("turning off the locking correctness validator.\n"); + trace_dump_stack(0); return NULL; } @@ -772,7 +795,7 @@ register_lock_class(struct lockdep_map *lock, unsigned int subclass, int force) raw_local_irq_restore(flags); print_lockdep_off("BUG: MAX_LOCKDEP_KEYS too low!"); - dump_stack(); + trace_dump_stack(0); return NULL; } class = lock_classes + nr_lock_classes++; @@ -798,11 +821,11 @@ register_lock_class(struct lockdep_map *lock, unsigned int subclass, int force) graph_unlock(); raw_local_irq_restore(flags); - printk("\nnew class %p: %s", class->key, class->name); + trace_printk("\nnew class %p: %s", class->key, class->name); if (class->name_version > 1) - printk("#%d", class->name_version); - printk("\n"); - dump_stack(); + trace_printk("#%d", class->name_version); + trace_printk("\n"); + trace_dump_stack(0); raw_local_irq_save(flags); if (!graph_lock()) { @@ -842,7 +865,7 @@ static struct lock_list *alloc_list_entry(void) return NULL; print_lockdep_off("BUG: MAX_LOCKDEP_ENTRIES too low!"); - dump_stack(); + trace_dump_stack(0); return NULL; } return list_entries + nr_list_entries++; @@ -1078,12 +1101,15 @@ static inline int __bfs_backwards(struct lock_list *src_entry, static noinline int print_circular_bug_entry(struct lock_list *target, int depth) { + char buf[256]; + if (debug_locks_silent) return 0; - printk("\n-> #%u", depth); - print_lock_name(target->class); - printk(":\n"); - print_stack_trace(&target->trace, 6); + sprintf(buf, "\n-> #%u", depth); + sprint_lock_name(buf + strlen(buf), target->class); + strcat(buf, ":\n"); + trace_printk(buf); + trace_print_stack_trace(&target->trace, 6); return 0; } @@ -1096,6 +1122,7 @@ print_circular_lock_scenario(struct held_lock *src, 
struct lock_class *source = hlock_class(src); struct lock_class *target = hlock_class(tgt); struct lock_class *parent = prt->class; + char buf[256]; /* * A direct locking problem where unsafe_class lock is taken @@ -1111,31 +1138,36 @@ print_circular_lock_scenario(struct held_lock *src, * from the safe_class lock to the unsafe_class lock. */ if (parent != source) { - printk("Chain exists of:\n "); - __print_lock_name(source); - printk(" --> "); - __print_lock_name(parent); - printk(" --> "); - __print_lock_name(target); - printk("\n\n"); + trace_printk("Chain exists of:\n "); + __sprint_lock_name(buf, source); + strcat(buf, " --> "); + __sprint_lock_name(buf + strlen(buf), parent); + strcat(buf, " --> "); + __sprint_lock_name(buf + strlen(buf), target); + strcat(buf, "\n\n"); + trace_printk(buf); } - printk(" Possible unsafe locking scenario:\n\n"); - printk(" CPU0 CPU1\n"); - printk(" ---- ----\n"); - printk(" lock("); - __print_lock_name(target); - printk(");\n"); - printk(" lock("); - __print_lock_name(parent); - printk(");\n"); - printk(" lock("); - __print_lock_name(target); - printk(");\n"); - printk(" lock("); - __print_lock_name(source); - printk(");\n"); - printk("\n *** DEADLOCK ***\n\n"); + trace_printk(" Possible unsafe locking scenario:\n\n"); + trace_printk(" CPU0 CPU1\n"); + trace_printk(" ---- ----\n"); + strcpy(buf, " lock("); + __sprint_lock_name(buf + strlen(buf), target); + strcat(buf, ");\n"); + trace_printk(buf); + strcpy(buf, " lock("); + __sprint_lock_name(buf + strlen(buf), parent); + strcat(buf, ");\n"); + trace_printk(buf); + strcpy(buf, " lock("); + __sprint_lock_name(buf + strlen(buf), target); + strcat(buf, ");\n"); + trace_printk(buf); + strcpy(buf, " lock("); + __sprint_lock_name(buf + strlen(buf), source); + strcat(buf, ");\n"); + trace_printk(buf); + trace_printk("\n *** DEADLOCK ***\n\n"); } /* @@ -1148,22 +1180,25 @@ print_circular_bug_header(struct lock_list *entry, unsigned int depth, struct held_lock *check_tgt) { struct task_struct *curr = current; + char buf[256]; if (debug_locks_silent) return 0; - printk("\n"); - printk("======================================================\n"); - printk("[ INFO: possible circular locking dependency detected ]\n"); + trace_printk("\n"); + trace_printk("======================================================\n"); + trace_printk("[ INFO: possible circular locking dependency detected ]\n"); print_kernel_ident(); - printk("-------------------------------------------------------\n"); - printk("%s/%d is trying to acquire lock:\n", + trace_printk("-------------------------------------------------------\n"); + sprintf(buf, "%s/%d is trying to acquire lock:\n", curr->comm, task_pid_nr(curr)); - print_lock(check_src); - printk("\nbut task is already holding lock:\n"); - print_lock(check_tgt); - printk("\nwhich lock already depends on the new lock.\n\n"); - printk("\nthe existing dependency chain (in reverse order) is:\n"); + sprint_lock(buf + strlen(buf), check_src); + trace_printk(buf); + trace_printk("\nbut task is already holding lock:\n"); + sprint_lock(buf, check_tgt); + trace_printk(buf); + trace_printk("\nwhich lock already depends on the new lock.\n\n"); + trace_printk("\nthe existing dependency chain (in reverse order) is:\n"); print_circular_bug_entry(entry, depth); @@ -1203,14 +1238,14 @@ static noinline int print_circular_bug(struct lock_list *this, parent = get_lock_parent(parent); } - printk("\nother info that might help us debug this:\n\n"); + trace_printk("\nother info that might help us debug this:\n\n"); 
print_circular_lock_scenario(check_src, check_tgt, first_parent); lockdep_print_held_locks(curr); - printk("\nstack backtrace:\n"); - dump_stack(); + trace_printk("\nstack backtrace:\n"); + trace_dump_stack(0); return 0; } @@ -1367,25 +1402,28 @@ find_usage_backwards(struct lock_list *root, enum lock_usage_bit bit, static void print_lock_class_header(struct lock_class *class, int depth) { int bit; + char buf[256]; - printk("%*s->", depth, ""); - print_lock_name(class); - printk(" ops: %lu", class->ops); - printk(" {\n"); + sprintf(buf, "%*s->", depth, ""); + sprint_lock_name(buf + strlen(buf), class); + sprintf(buf + strlen(buf), " ops: %lu {\n", class->ops); + trace_printk(buf); for (bit = 0; bit < LOCK_USAGE_STATES; bit++) { if (class->usage_mask & (1 << bit)) { int len = depth; - len += printk("%*s %s", depth, "", usage_str[bit]); - len += printk(" at:\n"); - print_stack_trace(class->usage_traces + bit, len); + len += sprintf(buf, "%*s %s", depth, "", usage_str[bit]); + len += sprintf(buf + strlen(buf), " at:\n"); + trace_printk(buf); + trace_print_stack_trace(class->usage_traces + bit, len); } } - printk("%*s }\n", depth, ""); + trace_printk("%*s }\n", depth, ""); - printk("%*s ... key at: ",depth,""); - print_ip_sym((unsigned long)class->key); + sprintf(buf, "%*s ... key at: ",depth,""); + sprint_ip_sym(buf + strlen(buf), (unsigned long)class->key); + trace_printk(buf); } /* @@ -1403,12 +1441,12 @@ print_shortest_lock_dependencies(struct lock_list *leaf, do { print_lock_class_header(entry->class, depth); - printk("%*s ... acquired at:\n", depth, ""); - print_stack_trace(&entry->trace, 2); - printk("\n"); + trace_printk("%*s ... acquired at:\n", depth, ""); + trace_print_stack_trace(&entry->trace, 2); + trace_printk("\n"); if (depth == 0 && (entry != root)) { - printk("lockdep:%s bad path found in chain graph\n", __func__); + trace_printk("lockdep:%s bad path found in chain graph\n", __func__); break; } @@ -1428,6 +1466,7 @@ print_irq_lock_scenario(struct lock_list *safe_entry, struct lock_class *safe_class = safe_entry->class; struct lock_class *unsafe_class = unsafe_entry->class; struct lock_class *middle_class = prev_class; + char buf[256]; if (middle_class == safe_class) middle_class = next_class; @@ -1446,33 +1485,39 @@ print_irq_lock_scenario(struct lock_list *safe_entry, * from the safe_class lock to the unsafe_class lock. 
*/ if (middle_class != unsafe_class) { - printk("Chain exists of:\n "); - __print_lock_name(safe_class); - printk(" --> "); - __print_lock_name(middle_class); - printk(" --> "); - __print_lock_name(unsafe_class); - printk("\n\n"); + trace_printk("Chain exists of:\n"); + strcpy(buf, " "); + __sprint_lock_name(buf + strlen(buf), safe_class); + strcat(buf, " --> "); + __sprint_lock_name(buf + strlen(buf), middle_class); + strcat(buf, " --> "); + __sprint_lock_name(buf + strlen(buf), unsafe_class); + strcat(buf, "\n\n"); + trace_printk(buf); } - printk(" Possible interrupt unsafe locking scenario:\n\n"); - printk(" CPU0 CPU1\n"); - printk(" ---- ----\n"); - printk(" lock("); - __print_lock_name(unsafe_class); - printk(");\n"); - printk(" local_irq_disable();\n"); - printk(" lock("); - __print_lock_name(safe_class); - printk(");\n"); - printk(" lock("); - __print_lock_name(middle_class); - printk(");\n"); - printk(" <Interrupt>\n"); - printk(" lock("); - __print_lock_name(safe_class); - printk(");\n"); - printk("\n *** DEADLOCK ***\n\n"); + trace_printk(" Possible interrupt unsafe locking scenario:\n\n"); + trace_printk(" CPU0 CPU1\n"); + trace_printk(" ---- ----\n"); + strcpy(buf, " lock("); + __sprint_lock_name(buf + strlen(buf), unsafe_class); + strcat(buf, ");\n"); + trace_printk(buf); + trace_printk(" local_irq_disable();\n"); + strcpy(buf, " lock("); + __sprint_lock_name(buf + strlen(buf), safe_class); + strcat(buf, ");\n"); + trace_printk(buf); + strcpy(buf, " lock("); + __sprint_lock_name(buf + strlen(buf), middle_class); + strcat(buf, ");\n"); + trace_printk(buf); + trace_printk(" <Interrupt>\n"); + strcpy(buf, " lock("); + __sprint_lock_name(buf + strlen(buf), safe_class); + strcat(buf, ");\n"); + trace_printk(buf); + trace_printk("\n *** DEADLOCK ***\n\n"); } static int @@ -1487,65 +1532,74 @@ print_bad_irq_dependency(struct task_struct *curr, enum lock_usage_bit bit2, const char *irqclass) { + char buf[256]; + if (!debug_locks_off_graph_unlock() || debug_locks_silent) return 0; - printk("\n"); - printk("======================================================\n"); - printk("[ INFO: %s-safe -> %s-unsafe lock order detected ]\n", + trace_printk("\n"); + trace_printk("======================================================\n"); + trace_printk("[ INFO: %s-safe -> %s-unsafe lock order detected ]\n", irqclass, irqclass); print_kernel_ident(); - printk("------------------------------------------------------\n"); - printk("%s/%d [HC%u[%lu]:SC%u[%lu]:HE%u:SE%u] is trying to acquire:\n", + trace_printk("------------------------------------------------------\n"); + trace_printk("%s/%d [HC%u[%lu]:SC%u[%lu]:HE%u:SE%u] is trying to acquire:\n", curr->comm, task_pid_nr(curr), curr->hardirq_context, hardirq_count() >> HARDIRQ_SHIFT, curr->softirq_context, softirq_count() >> SOFTIRQ_SHIFT, curr->hardirqs_enabled, curr->softirqs_enabled); - print_lock(next); - - printk("\nand this task is already holding:\n"); - print_lock(prev); - printk("which would create a new lock dependency:\n"); - print_lock_name(hlock_class(prev)); - printk(" ->"); - print_lock_name(hlock_class(next)); - printk("\n"); - - printk("\nbut this new dependency connects a %s-irq-safe lock:\n", + sprint_lock(buf, next); + trace_printk(buf); + + trace_printk("\nand this task is already holding:\n"); + sprint_lock(buf, prev); + trace_printk(buf); + trace_printk("which would create a new lock dependency:\n"); + sprint_lock_name(buf, hlock_class(prev)); + strcat(buf, " ->"); + sprint_lock_name(buf + strlen(buf), hlock_class(next)); + 
strcat(buf, "\n"); + trace_printk(buf); + + trace_printk("\nbut this new dependency connects a %s-irq-safe lock:\n", irqclass); - print_lock_name(backwards_entry->class); - printk("\n... which became %s-irq-safe at:\n", irqclass); + sprint_lock_name(buf, backwards_entry->class); + strcat(buf, "\n"); + trace_printk(buf); + trace_printk("... which became %s-irq-safe at:\n", irqclass); - print_stack_trace(backwards_entry->class->usage_traces + bit1, 1); + trace_print_stack_trace(backwards_entry->class->usage_traces + bit1, 1); - printk("\nto a %s-irq-unsafe lock:\n", irqclass); - print_lock_name(forwards_entry->class); - printk("\n... which became %s-irq-unsafe at:\n", irqclass); - printk("..."); + trace_printk("\nto a %s-irq-unsafe lock:\n", irqclass); + sprint_lock_name(buf, forwards_entry->class); + strcat(buf, "\n"); + trace_printk(buf); + trace_printk("... which became %s-irq-unsafe at:\n", irqclass); + trace_printk("..."); - print_stack_trace(forwards_entry->class->usage_traces + bit2, 1); + trace_print_stack_trace(forwards_entry->class->usage_traces + bit2, 1); - printk("\nother info that might help us debug this:\n\n"); + trace_printk("\nother info that might help us debug this:\n\n"); print_irq_lock_scenario(backwards_entry, forwards_entry, hlock_class(prev), hlock_class(next)); lockdep_print_held_locks(curr); - printk("\nthe dependencies between %s-irq-safe lock", irqclass); - printk(" and the holding lock:\n"); + trace_printk("\nthe dependencies between %s-irq-safe lock", irqclass); + trace_printk(" and the holding lock:\n"); if (!save_trace(&prev_root->trace)) return 0; print_shortest_lock_dependencies(backwards_entry, prev_root); - printk("\nthe dependencies between the lock to be acquired"); - printk(" and %s-irq-unsafe lock:\n", irqclass); + trace_printk("\nthe dependencies between the lock to be acquired"); + trace_printk(" and %s-irq-unsafe lock:\n", irqclass); if (!save_trace(&next_root->trace)) return 0; print_shortest_lock_dependencies(forwards_entry, next_root); - printk("\nstack backtrace:\n"); - dump_stack(); + trace_printk("\nstack backtrace:\n"); + trace_dump_stack(0); return 0; } @@ -1699,44 +1753,51 @@ print_deadlock_scenario(struct held_lock *nxt, { struct lock_class *next = hlock_class(nxt); struct lock_class *prev = hlock_class(prv); - - printk(" Possible unsafe locking scenario:\n\n"); - printk(" CPU0\n"); - printk(" ----\n"); - printk(" lock("); - __print_lock_name(prev); - printk(");\n"); - printk(" lock("); - __print_lock_name(next); - printk(");\n"); - printk("\n *** DEADLOCK ***\n\n"); - printk(" May be due to missing lock nesting notation\n\n"); + char buf[256]; + + trace_printk(" Possible unsafe locking scenario:\n\n"); + trace_printk(" CPU0\n"); + trace_printk(" ----\n"); + strcpy(buf, " lock("); + __sprint_lock_name(buf + strlen(buf), prev); + strcat(buf, ");\n"); + trace_printk(buf); + strcpy(buf, " lock("); + __sprint_lock_name(buf + strlen(buf), next); + strcat(buf, ");\n"); + trace_printk(buf); + trace_printk("\n *** DEADLOCK ***\n\n"); + trace_printk(" May be due to missing lock nesting notation\n\n"); } static int print_deadlock_bug(struct task_struct *curr, struct held_lock *prev, struct held_lock *next) { + char buf[256]; + if (!debug_locks_off_graph_unlock() || debug_locks_silent) return 0; - printk("\n"); - printk("=============================================\n"); - printk("[ INFO: possible recursive locking detected ]\n"); + trace_printk("\n"); + trace_printk("=============================================\n"); + trace_printk("[ INFO: 
possible recursive locking detected ]\n"); print_kernel_ident(); - printk("---------------------------------------------\n"); - printk("%s/%d is trying to acquire lock:\n", + trace_printk("---------------------------------------------\n"); + trace_printk("%s/%d is trying to acquire lock:\n", curr->comm, task_pid_nr(curr)); - print_lock(next); - printk("\nbut task is already holding lock:\n"); - print_lock(prev); + sprint_lock(buf, next); + trace_printk(buf); + trace_printk("\nbut task is already holding lock:\n"); + sprint_lock(buf, prev); + trace_printk(buf); - printk("\nother info that might help us debug this:\n"); + trace_printk("\nother info that might help us debug this:\n"); print_deadlock_scenario(next, prev); lockdep_print_held_locks(curr); - printk("\nstack backtrace:\n"); - dump_stack(); + trace_printk("\nstack backtrace:\n"); + trace_dump_stack(0); return 0; } @@ -1894,13 +1955,16 @@ check_prev_add(struct task_struct *curr, struct held_lock *prev, * Debugging printouts: */ if (verbose(hlock_class(prev)) || verbose(hlock_class(next))) { + char buf[256]; + graph_unlock(); - printk("\n new dependency: "); - print_lock_name(hlock_class(prev)); - printk(" => "); - print_lock_name(hlock_class(next)); - printk("\n"); - dump_stack(); + strcpy(buf, "\n new dependency: "); + sprint_lock_name(buf + strlen(buf), hlock_class(prev)); + strcat(buf, " => "); + sprint_lock_name(buf + strlen(buf), hlock_class(next)); + strcat(buf, "\n"); + trace_printk(buf); + trace_dump_stack(0); return graph_lock(); } return 1; @@ -2025,7 +2089,7 @@ static inline int lookup_chain_cache(struct task_struct *curr, cache_hit: debug_atomic_inc(chain_lookup_hits); if (very_verbose(class)) - printk("\nhash chain already cached, key: " + trace_printk("\nhash chain already cached, key: " "%016Lx tail class: [%p] %s\n", (unsigned long long)chain_key, class->key, class->name); @@ -2033,7 +2097,7 @@ cache_hit: } } if (very_verbose(class)) - printk("\nnew hash chain, key: %016Lx tail class: [%p] %s\n", + trace_printk("\nnew hash chain, key: %016Lx tail class: [%p] %s\n", (unsigned long long)chain_key, class->key, class->name); /* * Allocate a new chain entry from the static array, and add @@ -2055,7 +2119,7 @@ cache_hit: return 0; print_lockdep_off("BUG: MAX_LOCKDEP_CHAINS too low!"); - dump_stack(); + trace_dump_stack(0); return 0; } chain = lock_chains + nr_lock_chains++; @@ -2203,55 +2267,61 @@ static void print_usage_bug_scenario(struct held_lock *lock) { struct lock_class *class = hlock_class(lock); - - printk(" Possible unsafe locking scenario:\n\n"); - printk(" CPU0\n"); - printk(" ----\n"); - printk(" lock("); - __print_lock_name(class); - printk(");\n"); - printk(" <Interrupt>\n"); - printk(" lock("); - __print_lock_name(class); - printk(");\n"); - printk("\n *** DEADLOCK ***\n\n"); + char buf[256]; + + trace_printk(" Possible unsafe locking scenario:\n\n"); + trace_printk(" CPU0\n"); + trace_printk(" ----\n"); + strcpy(buf, " lock("); + __sprint_lock_name(buf + strlen(buf), class); + strcat(buf, ");\n"); + trace_printk(buf); + trace_printk(" <Interrupt>\n"); + strcpy(buf, " lock("); + __sprint_lock_name(buf + strlen(buf), class); + strcat(buf, ");\n"); + trace_printk(buf); + trace_printk("\n *** DEADLOCK ***\n\n"); } static int print_usage_bug(struct task_struct *curr, struct held_lock *this, enum lock_usage_bit prev_bit, enum lock_usage_bit new_bit) { + char buf[256]; + if (!debug_locks_off_graph_unlock() || debug_locks_silent) return 0; - printk("\n"); - printk("=================================\n"); - 
printk("[ INFO: inconsistent lock state ]\n"); + trace_printk("\n"); + trace_printk("=================================\n"); + trace_printk("[ INFO: inconsistent lock state ]\n"); print_kernel_ident(); - printk("---------------------------------\n"); + trace_printk("---------------------------------\n"); - printk("inconsistent {%s} -> {%s} usage.\n", + trace_printk("inconsistent {%s} -> {%s} usage.\n", usage_str[prev_bit], usage_str[new_bit]); - printk("%s/%d [HC%u[%lu]:SC%u[%lu]:HE%u:SE%u] takes:\n", + trace_printk("%s/%d [HC%u[%lu]:SC%u[%lu]:HE%u:SE%u] takes:\n", curr->comm, task_pid_nr(curr), trace_hardirq_context(curr), hardirq_count() >> HARDIRQ_SHIFT, trace_softirq_context(curr), softirq_count() >> SOFTIRQ_SHIFT, trace_hardirqs_enabled(curr), trace_softirqs_enabled(curr)); - print_lock(this); + sprint_lock(buf, this); + trace_printk(buf); - printk("{%s} state was registered at:\n", usage_str[prev_bit]); - print_stack_trace(hlock_class(this)->usage_traces + prev_bit, 1); + trace_printk("{%s} state was registered at:\n", usage_str[prev_bit]); + trace_print_stack_trace(hlock_class(this)->usage_traces + prev_bit, 1); print_irqtrace_events(curr); - printk("\nother info that might help us debug this:\n"); + trace_printk("\nother info that might help us debug this:\n"); print_usage_bug_scenario(this); lockdep_print_held_locks(curr); - printk("\nstack backtrace:\n"); - dump_stack(); + trace_printk("\nstack backtrace:\n"); + trace_dump_stack(0); return 0; } @@ -2285,32 +2355,36 @@ print_irq_inversion_bug(struct task_struct *curr, struct lock_list *entry = other; struct lock_list *middle = NULL; int depth; + char buf[256]; if (!debug_locks_off_graph_unlock() || debug_locks_silent) return 0; - printk("\n"); - printk("=========================================================\n"); - printk("[ INFO: possible irq lock inversion dependency detected ]\n"); + trace_printk("\n"); + trace_printk("=========================================================\n"); + trace_printk("[ INFO: possible irq lock inversion dependency detected ]\n"); print_kernel_ident(); - printk("---------------------------------------------------------\n"); - printk("%s/%d just changed the state of lock:\n", + trace_printk("---------------------------------------------------------\n"); + trace_printk("%s/%d just changed the state of lock:\n", curr->comm, task_pid_nr(curr)); - print_lock(this); + sprint_lock(buf, this); + trace_printk(buf); if (forwards) - printk("but this lock took another, %s-unsafe lock in the past:\n", irqclass); + trace_printk("but this lock took another, %s-unsafe lock in the past:\n", irqclass); else - printk("but this lock was taken by another, %s-safe lock in the past:\n", irqclass); - print_lock_name(other->class); - printk("\n\nand interrupts could create inverse lock ordering between them.\n\n"); + trace_printk("but this lock was taken by another, %s-safe lock in the past:\n", irqclass); + sprint_lock_name(buf, other->class); + strcat(buf, "\n"); + trace_printk(buf); + trace_printk("\nand interrupts could create inverse lock ordering between them.\n\n"); - printk("\nother info that might help us debug this:\n"); + trace_printk("\nother info that might help us debug this:\n"); /* Find a middle lock (if one exists) */ depth = get_lock_depth(other); do { if (depth == 0 && (entry != root)) { - printk("lockdep:%s bad path found in chain graph\n", __func__); + trace_printk("lockdep:%s bad path found in chain graph\n", __func__); break; } middle = entry; @@ -2326,13 +2400,13 @@ print_irq_inversion_bug(struct 
task_struct *curr, lockdep_print_held_locks(curr); - printk("\nthe shortest dependencies between 2nd lock and 1st lock:\n"); + trace_printk("\nthe shortest dependencies between 2nd lock and 1st lock:\n"); if (!save_trace(&root->trace)) return 0; print_shortest_lock_dependencies(other, root); - printk("\nstack backtrace:\n"); - dump_stack(); + trace_printk("\nstack backtrace:\n"); + trace_dump_stack(0); return 0; } @@ -2387,15 +2461,21 @@ check_usage_backwards(struct task_struct *curr, struct held_lock *this, void print_irqtrace_events(struct task_struct *curr) { - printk("irq event stamp: %u\n", curr->irq_events); - printk("hardirqs last enabled at (%u): ", curr->hardirq_enable_event); - print_ip_sym(curr->hardirq_enable_ip); - printk("hardirqs last disabled at (%u): ", curr->hardirq_disable_event); - print_ip_sym(curr->hardirq_disable_ip); - printk("softirqs last enabled at (%u): ", curr->softirq_enable_event); - print_ip_sym(curr->softirq_enable_ip); - printk("softirqs last disabled at (%u): ", curr->softirq_disable_event); - print_ip_sym(curr->softirq_disable_ip); + char buf[256]; + + trace_printk("irq event stamp: %u\n", curr->irq_events); + sprintf(buf, "hardirqs last enabled at (%u): ", curr->hardirq_enable_event); + sprint_ip_sym(buf + strlen(buf), curr->hardirq_enable_ip); + trace_printk(buf); + sprintf(buf, "hardirqs last disabled at (%u): ", curr->hardirq_disable_event); + sprint_ip_sym(buf + strlen(buf), curr->hardirq_disable_ip); + trace_printk(buf); + sprintf(buf, "softirqs last enabled at (%u): ", curr->softirq_enable_event); + sprint_ip_sym(buf + strlen(buf), curr->softirq_enable_ip); + trace_printk(buf); + sprintf(buf, "softirqs last disabled at (%u): ", curr->softirq_disable_event); + sprint_ip_sym(buf + strlen(buf), curr->softirq_disable_ip); + trace_printk(buf); } static int HARDIRQ_verbose(struct lock_class *class) @@ -2937,10 +3017,13 @@ static int mark_lock(struct task_struct *curr, struct held_lock *this, * We must printk outside of the graph_lock: */ if (ret == 2) { - printk("\nmarked lock as {%s}:\n", usage_str[new_bit]); - print_lock(this); + char buf[256]; + + trace_printk("\nmarked lock as {%s}:\n", usage_str[new_bit]); + sprint_lock(buf, this); + trace_printk(buf); print_irqtrace_events(curr); - dump_stack(); + trace_dump_stack(0); } return ret; @@ -2982,7 +3065,7 @@ void lockdep_init_map(struct lockdep_map *lock, const char *name, * Sanity check, the lock-class key must be persistent: */ if (!static_obj(key)) { - printk("BUG: key %p not in .data!\n", key); + trace_printk("BUG: key %p not in .data!\n", key); /* * What it says above ^^^^^, I suggest you read it. 
*/ @@ -3007,31 +3090,34 @@ print_lock_nested_lock_not_held(struct task_struct *curr, struct held_lock *hlock, unsigned long ip) { + char buf[256]; + if (!debug_locks_off()) return 0; if (debug_locks_silent) return 0; - printk("\n"); - printk("==================================\n"); - printk("[ BUG: Nested lock was not taken ]\n"); + trace_printk("\n"); + trace_printk("==================================\n"); + trace_printk("[ BUG: Nested lock was not taken ]\n"); print_kernel_ident(); - printk("----------------------------------\n"); + trace_printk("----------------------------------\n"); - printk("%s/%d is trying to lock:\n", curr->comm, task_pid_nr(curr)); - print_lock(hlock); + trace_printk("%s/%d is trying to lock:\n", curr->comm, task_pid_nr(curr)); + sprint_lock(buf, hlock); + trace_printk(buf); - printk("\nbut this task is not holding:\n"); - printk("%s\n", hlock->nest_lock->name); + trace_printk("\nbut this task is not holding:\n"); + trace_printk("%s\n", hlock->nest_lock->name); - printk("\nstack backtrace:\n"); - dump_stack(); + trace_printk("\nstack backtrace:\n"); + trace_dump_stack(0); - printk("\nother info that might help us debug this:\n"); + trace_printk("\nother info that might help us debug this:\n"); lockdep_print_held_locks(curr); - printk("\nstack backtrace:\n"); - dump_stack(); + trace_printk("\nstack backtrace:\n"); + trace_dump_stack(0); return 0; } @@ -3081,11 +3167,11 @@ static int __lock_acquire(struct lockdep_map *lock, unsigned int subclass, } atomic_inc((atomic_t *)&class->ops); if (very_verbose(class)) { - printk("\nacquire class [%p] %s", class->key, class->name); + trace_printk("\nacquire class [%p] %s", class->key, class->name); if (class->name_version > 1) - printk("#%d", class->name_version); - printk("\n"); - dump_stack(); + trace_printk("#%d", class->name_version); + trace_printk("\n"); + trace_dump_stack(0); } /* @@ -3192,12 +3278,12 @@ static int __lock_acquire(struct lockdep_map *lock, unsigned int subclass, if (unlikely(curr->lockdep_depth >= MAX_LOCK_DEPTH)) { debug_locks_off(); print_lockdep_off("BUG: MAX_LOCK_DEPTH too low!"); - printk(KERN_DEBUG "depth: %i max: %lu!\n", + trace_printk("depth: %i max: %lu!\n", curr->lockdep_depth, MAX_LOCK_DEPTH); lockdep_print_held_locks(current); debug_show_all_locks(); - dump_stack(); + trace_dump_stack(0); return 0; } @@ -3212,27 +3298,31 @@ static int print_unlock_imbalance_bug(struct task_struct *curr, struct lockdep_map *lock, unsigned long ip) { + char buf[256]; + if (!debug_locks_off()) return 0; if (debug_locks_silent) return 0; - printk("\n"); - printk("=====================================\n"); - printk("[ BUG: bad unlock balance detected! ]\n"); + trace_printk("\n"); + trace_printk("=====================================\n"); + trace_printk("[ BUG: bad unlock balance detected! 
]\n"); print_kernel_ident(); - printk("-------------------------------------\n"); - printk("%s/%d is trying to release lock (", + trace_printk("-------------------------------------\n"); + sprintf(buf, "%s/%d is trying to release lock (", curr->comm, task_pid_nr(curr)); - print_lockdep_cache(lock); - printk(") at:\n"); - print_ip_sym(ip); - printk("but there are no more locks to release!\n"); - printk("\nother info that might help us debug this:\n"); + sprint_lockdep_cache(buf + strlen(buf), lock); + strcat(buf, ") at:\n"); + trace_printk(buf); + sprint_ip_sym(buf, ip); + trace_printk(buf); + trace_printk("but there are no more locks to release!\n"); + trace_printk("\nother info that might help us debug this:\n"); lockdep_print_held_locks(curr); - printk("\nstack backtrace:\n"); - dump_stack(); + trace_printk("\nstack backtrace:\n"); + trace_dump_stack(0); return 0; } @@ -3532,11 +3622,11 @@ static void check_flags(unsigned long flags) if (irqs_disabled_flags(flags)) { if (DEBUG_LOCKS_WARN_ON(current->hardirqs_enabled)) { - printk("possible reason: unannotated irqs-off.\n"); + trace_printk("possible reason: unannotated irqs-off.\n"); } } else { if (DEBUG_LOCKS_WARN_ON(!current->hardirqs_enabled)) { - printk("possible reason: unannotated irqs-on.\n"); + trace_printk("possible reason: unannotated irqs-on.\n"); } } @@ -3657,27 +3747,31 @@ static int print_lock_contention_bug(struct task_struct *curr, struct lockdep_map *lock, unsigned long ip) { + char buf[256]; + if (!debug_locks_off()) return 0; if (debug_locks_silent) return 0; - printk("\n"); - printk("=================================\n"); - printk("[ BUG: bad contention detected! ]\n"); + trace_printk("\n"); + trace_printk("=================================\n"); + trace_printk("[ BUG: bad contention detected! ]\n"); print_kernel_ident(); - printk("---------------------------------\n"); - printk("%s/%d is trying to contend lock (", + trace_printk("---------------------------------\n"); + sprintf(buf, "%s/%d is trying to contend lock (", curr->comm, task_pid_nr(curr)); - print_lockdep_cache(lock); - printk(") at:\n"); - print_ip_sym(ip); - printk("but there are no locks held!\n"); - printk("\nother info that might help us debug this:\n"); + sprint_lockdep_cache(buf + strlen(buf), lock); + strcat(buf, ") at:\n"); + trace_printk(buf); + sprint_ip_sym(buf, ip); + trace_printk(buf); + trace_printk("but there are no locks held!\n"); + trace_printk("\nother info that might help us debug this:\n"); lockdep_print_held_locks(curr); - printk("\nstack backtrace:\n"); - dump_stack(); + trace_printk("\nstack backtrace:\n"); + trace_dump_stack(0); return 0; } @@ -4033,23 +4127,26 @@ static void print_freed_lock_bug(struct task_struct *curr, const void *mem_from, const void *mem_to, struct held_lock *hlock) { + char buf[256]; + if (!debug_locks_off()) return; if (debug_locks_silent) return; - printk("\n"); - printk("=========================\n"); - printk("[ BUG: held lock freed! ]\n"); + trace_printk("\n"); + trace_printk("=========================\n"); + trace_printk("[ BUG: held lock freed! 
]\n"); print_kernel_ident(); - printk("-------------------------\n"); - printk("%s/%d is freeing memory %p-%p, with a lock still held there!\n", + trace_printk("-------------------------\n"); + trace_printk("%s/%d is freeing memory %p-%p, with a lock still held there!\n", curr->comm, task_pid_nr(curr), mem_from, mem_to-1); - print_lock(hlock); + sprint_lock(buf, hlock); + trace_printk(buf); lockdep_print_held_locks(curr); - printk("\nstack backtrace:\n"); - dump_stack(); + trace_printk("\nstack backtrace:\n"); + trace_dump_stack(0); } static inline int not_in_range(const void* mem_from, unsigned long mem_len, @@ -4096,15 +4193,15 @@ static void print_held_locks_bug(void) if (debug_locks_silent) return; - printk("\n"); - printk("=====================================\n"); - printk("[ BUG: %s/%d still has locks held! ]\n", + trace_printk("\n"); + trace_printk("=====================================\n"); + trace_printk("[ BUG: %s/%d still has locks held! ]\n", current->comm, task_pid_nr(current)); print_kernel_ident(); - printk("-------------------------------------\n"); + trace_printk("-------------------------------------\n"); lockdep_print_held_locks(current); - printk("\nstack backtrace:\n"); - dump_stack(); + trace_printk("\nstack backtrace:\n"); + trace_dump_stack(0); } void debug_check_no_locks_held(void) @@ -4195,12 +4292,12 @@ asmlinkage __visible void lockdep_sys_exit(void) if (unlikely(curr->lockdep_depth)) { if (!debug_locks_off()) return; - printk("\n"); - printk("================================================\n"); - printk("[ BUG: lock held when returning to user space! ]\n"); + trace_printk("\n"); + trace_printk("================================================\n"); + trace_printk("[ BUG: lock held when returning to user space! ]\n"); print_kernel_ident(); - printk("------------------------------------------------\n"); - printk("%s/%d is leaving the kernel with locks still held!\n", + trace_printk("------------------------------------------------\n"); + trace_printk("%s/%d is leaving the kernel with locks still held!\n", curr->comm, curr->pid); lockdep_print_held_locks(curr); } @@ -4215,14 +4312,14 @@ void lockdep_rcu_suspicious(const char *file, const int line, const char *s) return; #endif /* #ifdef CONFIG_PROVE_RCU_REPEATEDLY */ /* Note: the following can be executed concurrently, so be careful. */ - printk("\n"); - printk("===============================\n"); - printk("[ INFO: suspicious RCU usage. ]\n"); + trace_printk("\n"); + trace_printk("===============================\n"); + trace_printk("[ INFO: suspicious RCU usage. ]\n"); print_kernel_ident(); - printk("-------------------------------\n"); - printk("%s:%d %s!\n", file, line, s); - printk("\nother info that might help us debug this:\n\n"); - printk("\n%srcu_scheduler_active = %d, debug_locks = %d\n", + trace_printk("-------------------------------\n"); + trace_printk("%s:%d %s!\n", file, line, s); + trace_printk("\nother info that might help us debug this:\n\n"); + trace_printk("\n%srcu_scheduler_active = %d, debug_locks = %d\n", !rcu_lockdep_current_cpu_online() ? "RCU used illegally from offline CPU!\n" : !rcu_is_watching() @@ -4249,10 +4346,10 @@ void lockdep_rcu_suspicious(const char *file, const int line, const char *s) * rcu_read_lock_bh() and so on from extended quiescent states. 
*/ if (!rcu_is_watching()) - printk("RCU used illegally from extended quiescent state!\n"); + trace_printk("RCU used illegally from extended quiescent state!\n"); lockdep_print_held_locks(curr); - printk("\nstack backtrace:\n"); - dump_stack(); + trace_printk("\nstack backtrace:\n"); + trace_dump_stack(0); } EXPORT_SYMBOL_GPL(lockdep_rcu_suspicious); -- 1.8.1.4
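The conversion above follows a single pattern: wherever a report line used to be assembled from several consecutive printk() calls, it is now formatted into a local buffer with sprint_*-style helpers and handed to trace_printk() in one go, because each trace_printk() call produces its own trace record and split lines would interleave with output from other CPUs. A minimal userspace sketch of that pattern is below; emit() and sprint_lock_name() are hypothetical stand-ins for trace_printk() and the sprint_* helpers presumably introduced elsewhere in this debugging series.

#include <stdio.h>
#include <string.h>

static void emit(const char *line)                 /* stand-in for trace_printk() */
{
	fputs(line, stdout);
}

static void sprint_lock_name(char *buf, const char *name)  /* stand-in helper */
{
	sprintf(buf, "%s", name);
}

int main(void)
{
	char buf[256];

	/* build the whole line first ... */
	strcpy(buf, " lock(");
	sprint_lock_name(buf + strlen(buf), "hrtimer_bases.lock");
	strcat(buf, ");\n");
	/* ... then emit it as one record instead of several partial ones */
	emit(buf);
	return 0;
}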
Jan Kara
2014-Oct-10 14:23 UTC
[Ocfs2-devel] [PATCH] mm: Fixup pagecache_isize_extended() definitions for !CONFIG_MMU
For !CONFIG_MMU systems we defined pagecache_isize_extended() in both include/linux/mm.h and mm/truncate.c, which causes a compilation error. Although pagecache_isize_extended() doesn't do anything useful for !CONFIG_MMU systems, it could do something in the future and its overhead isn't huge. So don't try to be too smart and remove the !CONFIG_MMU definition of pagecache_isize_extended(). Signed-off-by: Jan Kara <jack at suse.cz> --- include/linux/mm.h | 7 ------- 1 file changed, 7 deletions(-) diff --git a/include/linux/mm.h b/include/linux/mm.h index f0e53e5a3b17..5005464fe012 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -1155,14 +1155,7 @@ static inline void unmap_shared_mapping_range(struct address_space *mapping, extern void truncate_pagecache(struct inode *inode, loff_t new); extern void truncate_setsize(struct inode *inode, loff_t newsize); -#ifdef CONFIG_MMU void pagecache_isize_extended(struct inode *inode, loff_t from, loff_t to); -#else -static inline void pagecache_isize_extended(struct inode *inode, loff_t from, - loff_t to) -{ -} -#endif void truncate_pagecache_range(struct inode *inode, loff_t offset, loff_t end); int truncate_inode_page(struct address_space *mapping, struct page *page); int generic_error_remove_page(struct address_space *mapping, struct page *page); -- 1.8.1.4
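The build failure is the usual one for a config-dependent inline stub in a header colliding with an unconditional out-of-line definition: with !CONFIG_MMU the same translation unit ends up with two definitions of the same function. A reduced sketch of the clash with made-up names, not the real kernel headers:

/* Sketch only, intentionally does not compile.  Think of the first
 * definition as the !CONFIG_MMU branch of the header and the second as
 * the .c file that now provides the function unconditionally; both land
 * in the same translation unit, so the compiler rejects the duplicate
 * definition and the build fails. */
static inline void isize_extended(long from, long to)
{
}

void isize_extended(long from, long to)
{
	/* real work lives here */
}

Keeping only the declaration in the header, as the patch does, removes the duplicate.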
Jan Kara
2014-Oct-10 14:23 UTC
[Ocfs2-devel] [PATCH] ncpfs: fix rmdir returns Device or resource busy
From: Dave Chiluk <chiluk at canonical.com> 1d2ef5901483004d74947bbf78d5146c24038fe7 caused a regression in ncpfs such that directories could no longer be removed. This was because ncp_rmdir checked to see if a dentry could be unhashed before allowing it to be removed. Since 1d2ef5901483004d74947bbf78d5146c24038fe7 introduced a change that incremented dentry->d_count, causing it to always be greater than 1, the unhash would always fail, and thus the error path in ncp_rmdir was always taken. Removing this error path is safe as unhashing is still accomplished by calls to dput from vfs_rmdir. Signed-off-by: Dave Chiluk <chiluk at canonical.com> Signed-off-by: Petr Vandrovec <petr at vandrovec.name> Signed-off-by: Al Viro <viro at zeniv.linux.org.uk> Acked-by: Jan Kara <jack at suse.cz> --- fs/ncpfs/dir.c | 9 --------- 1 file changed, 9 deletions(-) diff --git a/fs/ncpfs/dir.c b/fs/ncpfs/dir.c index 816326093656..6792ce11f2bf 100644 --- a/fs/ncpfs/dir.c +++ b/fs/ncpfs/dir.c @@ -1029,15 +1029,6 @@ static int ncp_rmdir(struct inode *dir, struct dentry *dentry) DPRINTK("ncp_rmdir: removing %s/%s\n", dentry->d_parent->d_name.name, dentry->d_name.name); - /* - * fail with EBUSY if there are still references to this - * directory. - */ - dentry_unhash(dentry); - error = -EBUSY; - if (!d_unhashed(dentry)) - goto out; - len = sizeof(__name); error = ncp_io2vol(server, __name, &len, dentry->d_name.name, dentry->d_name.len, !ncp_preserve_case(dir)); -- 1.8.1.4
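For readers without the VFS details in their head, the shape of the failure is simple: the extra reference held across ->rmdir() means the dentry can never be unhashed from inside ncp_rmdir(), so the EBUSY branch is hit unconditionally. A small userspace model of that logic (the struct and helpers below are simplified stand-ins for the dentry API, not the kernel code):

#include <stdbool.h>
#include <stdio.h>

struct dentry {
	int d_count;
	bool hashed;
};

/* simplified: only drop the dentry from the hash if nobody else uses it */
static void dentry_unhash(struct dentry *d)
{
	if (d->d_count == 1)
		d->hashed = false;
}

static bool d_unhashed(const struct dentry *d)
{
	return !d->hashed;
}

int main(void)
{
	/* after the VFS change there is always one extra reference here */
	struct dentry dir = { .d_count = 2, .hashed = true };

	dentry_unhash(&dir);
	if (!d_unhashed(&dir))
		printf("rmdir bails out with EBUSY even though the directory is unused\n");
	return 0;
}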
Global quota files are accessed from different nodes. Thus we cannot cache the offset of the quota structure in the quota file after we drop our node reference count to it, because after that moment the quota structure may be freed and reallocated elsewhere by a different node, resulting in corruption of the quota file. Fix the problem by clearing dq_off when we are releasing the dquot structure. We also remove the DQ_READ_B handling because it is useless - DQ_ACTIVE_B is set iff DQ_READ_B is set. CC: stable at vger.kernel.org CC: Goldwyn Rodrigues <rgoldwyn at suse.de> CC: Mark Fasheh <mfasheh at suse.de> Signed-off-by: Jan Kara <jack at suse.cz> --- fs/ocfs2/quota_global.c | 27 +++++++++++++++++---------- fs/ocfs2/quota_local.c | 4 ---- 2 files changed, 17 insertions(+), 14 deletions(-) diff --git a/fs/ocfs2/quota_global.c b/fs/ocfs2/quota_global.c index aaa50611ec66..d7b5108789e2 100644 --- a/fs/ocfs2/quota_global.c +++ b/fs/ocfs2/quota_global.c @@ -717,6 +717,12 @@ static int ocfs2_release_dquot(struct dquot *dquot) */ if (status < 0) mlog_errno(status); + /* + * Clear dq_off so that we search for the structure in quota file next + * time we acquire it. The structure might be deleted and reallocated + * elsewhere by another node while our dquot structure is on freelist. + */ + dquot->dq_off = 0; clear_bit(DQ_ACTIVE_B, &dquot->dq_flags); out_trans: ocfs2_commit_trans(osb, handle); @@ -756,16 +762,17 @@ static int ocfs2_acquire_dquot(struct dquot *dquot) status = ocfs2_lock_global_qf(info, 1); if (status < 0) goto out; - if (!test_bit(DQ_READ_B, &dquot->dq_flags)) { - status = ocfs2_qinfo_lock(info, 0); - if (status < 0) - goto out_dq; - status = qtree_read_dquot(&info->dqi_gi, dquot); - ocfs2_qinfo_unlock(info, 0); - if (status < 0) - goto out_dq; - } - set_bit(DQ_READ_B, &dquot->dq_flags); + status = ocfs2_qinfo_lock(info, 0); + if (status < 0) + goto out_dq; + /* + * We always want to read dquot structure from disk because we don't + * know what happened with it while it was on freelist. + */ + status = qtree_read_dquot(&info->dqi_gi, dquot); + ocfs2_qinfo_unlock(info, 0); + if (status < 0) + goto out_dq; OCFS2_DQUOT(dquot)->dq_use_count++; OCFS2_DQUOT(dquot)->dq_origspace = dquot->dq_dqb.dqb_curspace; diff --git a/fs/ocfs2/quota_local.c b/fs/ocfs2/quota_local.c index 2e4344be3b96..2001862bf2b1 100644 --- a/fs/ocfs2/quota_local.c +++ b/fs/ocfs2/quota_local.c @@ -1303,10 +1303,6 @@ int ocfs2_local_release_dquot(handle_t *handle, struct dquot *dquot) ocfs2_journal_dirty(handle, od->dq_chunk->qc_headerbh); out: - /* Clear the read bit so that next time someone uses this - * dquot he reads fresh info from disk and allocates local - * dquot structure */ - clear_bit(DQ_READ_B, &dquot->dq_flags); return status; } -- 1.8.1.4
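The core of the fix is an invalidate-on-release rule: a cached location inside a cluster-visible file is only trustworthy while this node holds a reference, so the offset is forgotten in ->release_dquot() and looked up again on the next acquire. A compact sketch of that rule with hypothetical types, not the ocfs2 structures:

#include <stdio.h>

struct cached_dquot {
	long dq_off;		/* cached offset in the quota file, 0 = unknown */
};

static long lookup_offset_on_disk(void)
{
	/* stand-in for searching the global quota tree on disk */
	return 4096;
}

static void acquire(struct cached_dquot *dq)
{
	if (!dq->dq_off)
		dq->dq_off = lookup_offset_on_disk();	/* re-searched after every release */
	printf("using quota structure at offset %ld\n", dq->dq_off);
}

static void release(struct cached_dquot *dq)
{
	/* another node may free and reallocate the structure once our
	 * reference is gone, so forget where it lived */
	dq->dq_off = 0;
}

int main(void)
{
	struct cached_dquot dq = { .dq_off = 0 };

	acquire(&dq);
	release(&dq);
	acquire(&dq);		/* looks the offset up again instead of trusting the cache */
	return 0;
}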
Signed-off-by: Jan Kara <jack at suse.cz> --- kernel/printk/printk.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c index ea2d5f6962ed..a39f4129f848 100644 --- a/kernel/printk/printk.c +++ b/kernel/printk/printk.c @@ -1718,7 +1718,6 @@ asmlinkage int vprintk_emit(int facility, int level, logbuf_cpu = UINT_MAX; raw_spin_unlock(&logbuf_lock); - lockdep_on(); local_irq_restore(flags); /* If called from the scheduler, we can not call up(). */ @@ -1738,6 +1737,9 @@ asmlinkage int vprintk_emit(int facility, int level, if (console_trylock_for_printk()) console_unlock(); preempt_enable(); + local_irq_save(flags); + lockdep_on(); + local_irq_restore(flags); return printed_len; } -- 1.8.1.4
Jan Kara
2014-Oct-10 14:23 UTC
[Ocfs2-devel] [PATCH] printk: debug: Slow down printing to 9600 bauds
Signed-off-by: Jan Kara <jack at suse.cz> --- include/trace/events/printk.h | 42 ++++++++++++++++ kernel/printk/printk.c | 112 ++++++++++++++++++++++++++++++++++++++++-- 2 files changed, 151 insertions(+), 3 deletions(-) diff --git a/include/trace/events/printk.h b/include/trace/events/printk.h index c008bc99f9fa..7ba97681960c 100644 --- a/include/trace/events/printk.h +++ b/include/trace/events/printk.h @@ -22,6 +22,48 @@ TRACE_EVENT(console, TP_printk("%s", __get_str(msg)) ); + +DECLARE_EVENT_CLASS(printk_class, + TP_PROTO(int u), + TP_ARGS(u), + TP_STRUCT__entry( + __field( int, u) + ), + TP_fast_assign( + __entry->u = u; + ), + TP_printk("arg=%d", __entry->u) +); + +DEFINE_EVENT(printk_class, printk_hand_over, + TP_PROTO(int u), + TP_ARGS(u) +); + +DEFINE_EVENT(printk_class, printk_ask_help, + TP_PROTO(int u), + TP_ARGS(u) +); + +DEFINE_EVENT(printk_class, printk_thread_sleep, + TP_PROTO(int u), + TP_ARGS(u) +); + +DEFINE_EVENT(printk_class, printk_thread_woken, + TP_PROTO(int u), + TP_ARGS(u) +); + +DEFINE_EVENT(printk_class, printk_thread_locked, + TP_PROTO(int u), + TP_ARGS(u) +); + +DEFINE_EVENT(printk_class, printk_printing, + TP_PROTO(int u), + TP_ARGS(u) +); #endif /* _TRACE_PRINTK_H */ /* This part must be outside protection */ diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c index a06ba16ba0d4..4e64abc45159 100644 --- a/kernel/printk/printk.c +++ b/kernel/printk/printk.c @@ -23,6 +23,7 @@ #include <linux/console.h> #include <linux/init.h> #include <linux/jiffies.h> +#include <linux/stacktrace.h> #include <linux/nmi.h> #include <linux/module.h> #include <linux/moduleparam.h> @@ -90,6 +91,8 @@ EXPORT_SYMBOL_GPL(console_drivers); /* This gets set if the currently printing task wants to hand over printing */ static int printk_handover; +static int thread_woken; + /* * Number of kernel threads for offloading printing. We need at least two so * that they can hand over printing from one to another one and thus switch @@ -1308,6 +1311,22 @@ static void call_console_drivers(int level, const char *text, size_t len) } } +static void printk_echo(char *fmt, ...) +{ + unsigned long flags; + va_list args; + char buf[128]; + int len; + + len = sprintf(buf, "P%d ", task_pid_nr(current)); + va_start(args, fmt); + len += vsprintf(buf + len, fmt, args); + va_end(args); + local_irq_save(flags); + call_console_drivers(0, buf, len); + local_irq_restore(flags); +} + /* * Zap console related locks when oopsing. 
Only zap at most once * every 10 seconds, to leave time for slow consoles to print a @@ -1512,6 +1531,7 @@ asmlinkage int vprintk_emit(int facility, int level, bool in_sched = false; /* cpu currently holding logbuf_lock in this function */ static volatile unsigned int logbuf_cpu = UINT_MAX; + bool irq_off = irqs_disabled(); if (level == SCHED_MESSAGE_LOGLEVEL) { level = -1; @@ -1566,6 +1586,8 @@ asmlinkage int vprintk_emit(int facility, int level, if (in_sched) text_len = scnprintf(text, sizeof(textbuf), KERN_WARNING "[sched_delayed] "); + if (irq_off) + text[text_len++] = 'X'; text_len += vscnprintf(text + text_len, sizeof(textbuf) - text_len, fmt, args); @@ -2030,6 +2052,31 @@ out: raw_spin_unlock_irqrestore(&logbuf_lock, flags); } +static struct task_struct *resched_task = NULL; +static unsigned long last_traced; + +static void echo_trace(struct task_struct *task) +{ + struct stack_trace trace; + unsigned long entries[16]; + int i; + + trace.nr_entries = 0; + trace.max_entries = 16; + trace.entries = entries; + trace.skip = 0; + + save_stack_trace_tsk(task, &trace); + for (i = 0; i < trace.nr_entries; i++) + printk_echo("%pS\n", (void *)entries[i]); + last_traced = jiffies; +} + +static void ipi_stacktrace(void *info) +{ + echo_trace(current); +} + /* * Returns true iff there is other cpu waiting to take over printing. This * function also takes are of setting printk_handover if we want to hand over @@ -2043,16 +2090,33 @@ static bool cpu_stop_printing(int printed_chars) if (!printk_offload_chars || printed_chars < printk_offload_chars) return false; /* Someone is sleeping on console_sem? Give away to him. */ - if (sema_contended(&console_sem)) + if (sema_contended(&console_sem)) { + printk_echo("Handing over printing\n"); + trace_printk_hand_over(0); return true; + } if (!printk_handover) { + unsigned long flags; + printk_handover = 1; + spin_lock_irqsave(&print_queue.lock, flags); + if (!list_empty(&print_queue.task_list)) + resched_task = list_entry(print_queue.task_list.next, wait_queue_t, task_list)->private; + spin_unlock_irqrestore(&print_queue.lock, flags); + printk_echo("Asking for help %p (%d)\n", resched_task, (resched_task ? 
task_pid_nr(resched_task) : 0)); + trace_printk_ask_help(0); /* * Paired with barrier in prepare_to_wait_exclusive() in * printing_task() */ smp_mb(); wake_up(&print_queue); + if (resched_task) { + printk_echo("Task state %ld, cpu %u, cur cpu %u\n", + resched_task->state, task_cpu(resched_task), + task_cpu(current)); + on_each_cpu(ipi_stacktrace, NULL, 1); + } } return false; } @@ -2088,6 +2152,9 @@ void console_unlock(void) return; } + if (oops_in_progress) + printk_echo("Oops!\n"); + trace_printk_printing(0); console_may_schedule = 0; /* flush buffered message fragment immediately to console */ @@ -2148,9 +2215,24 @@ skip: raw_spin_unlock(&logbuf_lock); stop_critical_timings(); /* don't trace print latency */ + if (printk_handover) { + if (sema_contended(&console_sem)) + printk_echo("B "); + else if (thread_woken) + printk_echo("A "); + else if (resched_task) { + printk_echo("X%ld ", resched_task->state); + if (time_is_before_jiffies(last_traced + HZ)) { + smp_call_function_single( + task_cpu(resched_task), + ipi_stacktrace, NULL, 1); + } + } + } call_console_drivers(level, text, len); start_critical_timings(); printed_chars += len; + mdelay(len); local_irq_restore(flags); } @@ -2159,6 +2241,7 @@ skip: exclusive_console = NULL; printk_handover = 0; + resched_task = NULL; console_locked = 0; mutex_release(&console_lock_dep_map, 1, _RET_IP_); up(&console_sem); @@ -2499,23 +2582,46 @@ static int printing_task(void *arg) DEFINE_WAIT(wait); while (1) { - prepare_to_wait_exclusive(&print_queue, &wait, + prepare_to_wait(&print_queue, &wait, TASK_INTERRUPTIBLE); - if (!printk_handover) + if (!printk_handover) { + trace_printk_thread_sleep(0); schedule(); + } finish_wait(&print_queue, &wait); + trace_printk_thread_woken(0); + thread_woken = 1; console_lock(); + thread_woken = 0; + trace_printk_thread_locked(0); console_unlock(); } return 0; } +static void do_print(struct work_struct *work) +{ + char buf[75]; + int i; + + sprintf(buf, "%p: aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa\n", work); + for (i = 0; i < 15; i++) + printk(buf); +} + +static struct delayed_work print_work[100]; + static int __init printk_late_init(void) { struct console *con; int i; struct task_struct *task; + for (i = 0; i < 100; i++) { + INIT_DELAYED_WORK(&print_work[i], do_print); + schedule_delayed_work(&print_work[i], HZ * 180); + } + for_each_console(con) { if (!keep_bootcon && con->flags & CON_BOOT) { printk(KERN_INFO "turn off boot console %s%d\n", -- 1.8.1.4
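The actual throttle in this debug patch is the mdelay(len) added after call_console_drivers(): at 9600 baud a character (8 data bits plus start and stop bit) occupies the line for roughly a millisecond, so sleeping one millisecond per printed byte approximates a slow serial console. A small userspace sketch of that arithmetic; the baud and framing values are the assumptions, nothing here is kernel API.

#include <stdio.h>

int main(void)
{
	const long baud = 9600;			/* bits per second */
	const long bits_per_char = 10;		/* 8 data bits + start + stop */
	const long chars_per_sec = baud / bits_per_char;	/* ~960 */
	const long ms_per_char = 1000 / chars_per_sec;		/* ~1 ms */
	int len = 80;				/* length of one printed line */

	printf("a %d-char line ties up the console for ~%ld ms\n",
	       len, len * ms_per_char);
	return 0;
}

At 80 characters per line that is already around 80 ms spent inside a single printk(), which is exactly the kind of stall the tracing and hand-over instrumentation above is meant to expose.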
Jan Kara
2014-Oct-10 14:23 UTC
[Ocfs2-devel] [PATCH] printk: enable interrupts before calling console_trylock_for_printk()
We need interrupts disabled when calling console_trylock_for_printk() only so that cpu id we pass to can_use_console() remains valid (for other things console_sem provides all the exclusion we need and deadlocks on console_sem due to interrupts are impossible because we use down_trylock()). However if we are rescheduled, we are guaranteed to run on an online cpu so we can easily just get the cpu id in can_use_console(). We can lose a bit of performance when we enable interrupts in vprintk_emit() and then disable them again in console_unlock() but OTOH it can somewhat reduce interrupt latency caused by console_unlock(). We differ from (reverted) commit 939f04bec1a4 in that we avoid calling console_unlock() from vprintk_emit() with lockdep enabled as that has unveiled quite some bugs leading to system freezes during boot (e.g. https://lkml.org/lkml/2014/5/30/242, https://lkml.org/lkml/2014/6/28/521). Tested-by: Andreas Bombe <aeb at debian.org> Signed-off-by: Jan Kara <jack at suse.cz> --- kernel/printk/printk.c | 31 +++++++++++++++++++++---------- 1 file changed, 21 insertions(+), 10 deletions(-) Andrew, can you please queue this patch? Thanks. diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c index 13e839dbca07..fe4154d83fe4 100644 --- a/kernel/printk/printk.c +++ b/kernel/printk/printk.c @@ -1416,10 +1416,9 @@ static int have_callable_console(void) /* * Can we actually use the console at this time on this cpu? * - * Console drivers may assume that per-cpu resources have - * been allocated. So unless they're explicitly marked as - * being able to cope (CON_ANYTIME) don't call them until - * this CPU is officially up. + * Console drivers may assume that per-cpu resources have been allocated. So + * unless they're explicitly marked as being able to cope (CON_ANYTIME) don't + * call them until this CPU is officially up. */ static inline int can_use_console(unsigned int cpu) { @@ -1432,8 +1431,10 @@ static inline int can_use_console(unsigned int cpu) * console_lock held, and 'console_locked' set) if it * is successful, false otherwise. */ -static int console_trylock_for_printk(unsigned int cpu) +static int console_trylock_for_printk(void) { + unsigned int cpu = smp_processor_id(); + if (!console_trylock()) return 0; /* @@ -1608,7 +1609,8 @@ asmlinkage int vprintk_emit(int facility, int level, */ if (!oops_in_progress && !lockdep_recursing(current)) { recursion_bug = 1; - goto out_restore_irqs; + local_irq_restore(flags); + return 0; } zap_locks(); } @@ -1716,21 +1718,30 @@ asmlinkage int vprintk_emit(int facility, int level, logbuf_cpu = UINT_MAX; raw_spin_unlock(&logbuf_lock); + lockdep_on(); + local_irq_restore(flags); /* If called from the scheduler, we can not call up(). */ if (!in_sched) { + lockdep_off(); + /* + * Disable preemption to avoid being preempted while holding + * console_sem which would prevent anyone from printing to + * console + */ + preempt_disable(); + /* * Try to acquire and then immediately release the console * semaphore. The release will print out buffers and wake up * /dev/kmsg and syslog() users. */ - if (console_trylock_for_printk(this_cpu)) + if (console_trylock_for_printk()) console_unlock(); + preempt_enable(); + lockdep_on(); } - lockdep_on(); -out_restore_irqs: - local_irq_restore(flags); return printed_len; } EXPORT_SYMBOL(vprintk_emit); -- 1.8.1.4
Jan Kara
2014-Oct-10 14:23 UTC
[Ocfs2-devel] [PATCH] quota: Fix race between dqput() and dquot_scan_active()
Currently last dqput() can race with dquot_scan_active() causing it to call callback for an already deactivated dquot. The race is as follows: CPU1 CPU2 dqput() spin_lock(&dq_list_lock); if (atomic_read(&dquot->dq_count) > 1) { - not taken if (test_bit(DQ_ACTIVE_B, &dquot->dq_flags)) { spin_unlock(&dq_list_lock); ->release_dquot(dquot); if (atomic_read(&dquot->dq_count) > 1) - not taken dquot_scan_active() spin_lock(&dq_list_lock); if (!test_bit(DQ_ACTIVE_B, &dquot->dq_flags)) - not taken atomic_inc(&dquot->dq_count); spin_unlock(&dq_list_lock); - proceeds to release dquot ret = fn(dquot, priv); - called for inactive dquot Fix the problem by making sure possible ->release_dquot() is finished by the time we call the callback and new calls to it will notice reference dquot_scan_active() has taken and bail out. Signed-off-by: Jan Kara <jack at suse.cz> --- fs/quota/dquot.c | 14 +++++++++++--- 1 file changed, 11 insertions(+), 3 deletions(-) This is the last patch needed to make ocfs2 quotas rock solid in my testing. I will carry it in my tree and push it to Linus soon. diff --git a/fs/quota/dquot.c b/fs/quota/dquot.c index 831d49a4111f..cfc8dcc16043 100644 --- a/fs/quota/dquot.c +++ b/fs/quota/dquot.c @@ -581,9 +581,17 @@ int dquot_scan_active(struct super_block *sb, dqstats_inc(DQST_LOOKUPS); dqput(old_dquot); old_dquot = dquot; - ret = fn(dquot, priv); - if (ret < 0) - goto out; + /* + * ->release_dquot() can be racing with us. Our reference + * protects us from new calls to it so just wait for any + * outstanding call and recheck the DQ_ACTIVE_B after that. + */ + wait_on_dquot(dquot); + if (test_bit(DQ_ACTIVE_B, &dquot->dq_flags)) { + ret = fn(dquot, priv); + if (ret < 0) + goto out; + } spin_lock(&dq_list_lock); /* We are safe to continue now because our dquot could not * be moved out of the inuse list while we hold the reference */ -- 1.8.1.4
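The fix is the classic wait-then-recheck pattern: take your own reference, wait until any racing release has finished, and only then test the active bit and invoke the callback. The userspace model below mirrors that shape with a mutex and condition variable; the names are stand-ins for wait_on_dquot() and DQ_ACTIVE_B, not the quota code itself.

#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>

struct dq {
	pthread_mutex_t lock;
	pthread_cond_t  released;
	bool active;
	bool releasing;
};

static void scan_one(struct dq *dq, void (*fn)(struct dq *))
{
	pthread_mutex_lock(&dq->lock);
	while (dq->releasing)			/* wait_on_dquot() equivalent */
		pthread_cond_wait(&dq->released, &dq->lock);
	if (dq->active)				/* re-check the active bit */
		fn(dq);
	pthread_mutex_unlock(&dq->lock);
}

static void print_usage(struct dq *dq)
{
	(void)dq;
	printf("dquot still active, safe to pass to the callback\n");
}

int main(void)
{
	static struct dq dq = {
		.lock = PTHREAD_MUTEX_INITIALIZER,
		.released = PTHREAD_COND_INITIALIZER,
		.active = true,
		.releasing = false,
	};

	scan_one(&dq, print_usage);
	return 0;
}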
Jan Kara
2014-Oct-10 14:23 UTC
[Ocfs2-devel] [PATCH] scsi: Keep interrupts disabled while submitting requests
scsi_request_fn() can be called from softirq context during IO completion. If it enables interrupts there, HW interrupts can interrupt softirq processing and queue more IO completion work which can eventually lead to softlockup reports because IO completion softirq runs for too long. Keep interrupts disabled in scsi_request_fn(). Signed-off-by: Jan Kara <jack at suse.cz> --- drivers/scsi/scsi_lib.c | 11 ++++++----- 1 file changed, 6 insertions(+), 5 deletions(-) diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c index f7e316368c99..44b867e9adc9 100644 --- a/drivers/scsi/scsi_lib.c +++ b/drivers/scsi/scsi_lib.c @@ -1481,7 +1481,8 @@ static void scsi_softirq_done(struct request *rq) * * Returns: Nothing * - * Lock status: IO request lock assumed to be held when called. + * Lock status: IO request lock assumed to be held when called, interrupts + * must be disabled. */ static void scsi_request_fn(struct request_queue *q) __releases(q->queue_lock) @@ -1563,7 +1564,7 @@ static void scsi_request_fn(struct request_queue *q) * XXX(hch): This is rather suboptimal, scsi_dispatch_cmd will * take the lock again. */ - spin_unlock_irq(shost->host_lock); + spin_unlock(shost->host_lock); /* * Finally, initialize any error handling parameters, and set up @@ -1575,7 +1576,7 @@ static void scsi_request_fn(struct request_queue *q) * Dispatch the command to the low-level driver. */ rtn = scsi_dispatch_cmd(cmd); - spin_lock_irq(q->queue_lock); + spin_lock(q->queue_lock); if (rtn) goto out_delay; } @@ -1583,7 +1584,7 @@ static void scsi_request_fn(struct request_queue *q) return; not_ready: - spin_unlock_irq(shost->host_lock); + spin_unlock(shost->host_lock); /* * lock q, handle tag, requeue req, and decrement device_busy. We @@ -1593,7 +1594,7 @@ static void scsi_request_fn(struct request_queue *q) * cases (host limits or settings) should run the queue at some * later time. */ - spin_lock_irq(q->queue_lock); + spin_lock(q->queue_lock); blk_requeue_request(q, req); sdev->device_busy--; out_delay: -- 1.8.1.4
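The change boils down to picking the spin-lock variant that matches the calling context: scsi_request_fn() may be entered from softirq completion with interrupts already disabled, so the inner unlock/lock pairs must not touch the interrupt state. The fragment below is an illustrative kernel-style sketch, not compilable on its own; it assumes, like the function above, that q->queue_lock is held and interrupts are off on entry.

/* wrong: the _irq variants unconditionally re-enable interrupts on
 * unlock, punching a hole into the softirq processing we were called from */
spin_unlock_irq(shost->host_lock);
/* ... dispatch the command ... */
spin_lock_irq(q->queue_lock);

/* right: the plain variants leave the caller's interrupt state untouched */
spin_unlock(shost->host_lock);
/* ... dispatch the command ... */
spin_lock(q->queue_lock);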
Jan Kara
2014-Oct-10 14:23 UTC
[Ocfs2-devel] [PATCH] sync: don't block the flusher thread waiting on IO
From: Dave Chinner <dchinner at redhat.com> When sync does its WB_SYNC_ALL writeback, it issues data IO and then immediately waits for IO completion. This is done in the context of the flusher thread, and hence completely ties up the flusher thread for the backing device until all the dirty inodes have been synced. On filesystems that are dirtying inodes constantly and quickly, this means the flusher thread can be tied up for minutes per sync call and hence badly affect system level write IO performance as the page cache cannot be cleaned quickly. We already have a wait loop for IO completion for sync(2), so cut this out of the flusher thread and delegate it to wait_sb_inodes(). Hence we can do rapid IO submission, and then wait for it all to complete. Effect of sync on fsmark before the patch: FSUse% Count Size Files/sec App Overhead ..... 0 640000 4096 35154.6 1026984 0 720000 4096 36740.3 1023844 0 800000 4096 36184.6 916599 0 880000 4096 1282.7 1054367 0 960000 4096 3951.3 918773 0 1040000 4096 40646.2 996448 0 1120000 4096 43610.1 895647 0 1200000 4096 40333.1 921048 And a single sync pass took: real 0m52.407s user 0m0.000s sys 0m0.090s After the patch, there is no impact on fsmark results, and each individual sync(2) operation run concurrently with the same fsmark workload takes roughly 7s: real 0m6.930s user 0m0.000s sys 0m0.039s IOWs, sync is 7-8x faster on a busy filesystem and does not have an adverse impact on ongoing async data write operations. Signed-off-by: Dave Chinner <dchinner at redhat.com> Reviewed-by: Jan Kara <jack at suse.cz> Signed-off-by: Linus Torvalds <torvalds at linux-foundation.org> --- fs/fs-writeback.c | 9 +++++++-- include/linux/writeback.h | 1 + 2 files changed, 8 insertions(+), 2 deletions(-) diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c index 3be57189efd5..a85ac4e33436 100644 --- a/fs/fs-writeback.c +++ b/fs/fs-writeback.c @@ -45,6 +45,7 @@ struct wb_writeback_work { unsigned int for_kupdate:1; unsigned int range_cyclic:1; unsigned int for_background:1; + unsigned int for_sync:1; /* sync(2) WB_SYNC_ALL writeback */ enum wb_reason reason; /* why was writeback initiated? */ struct list_head list; /* pending work list */ @@ -443,9 +444,11 @@ __writeback_single_inode(struct inode *inode, struct writeback_control *wbc) /* * Make sure to wait on the data before writing out the metadata. * This is important for filesystems that modify metadata on data - * I/O completion. + * I/O completion. We don't do it for sync(2) writeback because it has a + * separate, external IO completion path and ->sync_fs for guaranteeing + * inode metadata is written back correctly. */ - if (wbc->sync_mode == WB_SYNC_ALL) { + if (wbc->sync_mode == WB_SYNC_ALL && !wbc->for_sync) { int err = filemap_fdatawait(mapping); if (ret == 0) ret = err; @@ -578,6 +581,7 @@ static long writeback_sb_inodes(struct super_block *sb, .tagged_writepages = work->tagged_writepages, .for_kupdate = work->for_kupdate, .for_background = work->for_background, + .for_sync = work->for_sync, .range_cyclic = work->range_cyclic, .range_start = 0, .range_end = LLONG_MAX, @@ -1362,6 +1366,7 @@ void sync_inodes_sb(struct super_block *sb) .range_cyclic = 0, .done = &done, .reason = WB_REASON_SYNC, + .for_sync = 1, }; /* Nothing to do?
*/ diff --git a/include/linux/writeback.h b/include/linux/writeback.h index 579a5007c696..abfe11787af3 100644 --- a/include/linux/writeback.h +++ b/include/linux/writeback.h @@ -78,6 +78,7 @@ struct writeback_control { unsigned tagged_writepages:1; /* tag-and-write to avoid livelock */ unsigned for_reclaim:1; /* Invoked from the page allocator */ unsigned range_cyclic:1; /* range_start is cyclic */ + unsigned for_sync:1; /* sync(2) WB_SYNC_ALL writeback */ }; /* -- 1.8.1.4
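The same submit-everything-then-wait idea can be seen from userspace with sync_file_range(2): one pass kicks off writeback for every file without blocking, a second pass waits for it, instead of waiting file by file. This is only an analogue of what the patch does inside the flusher (sync_file_range gives none of sync(2)'s metadata or disk-cache guarantees), and error handling is trimmed for brevity.

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	int i;

	/* pass 1: start writeback on every file, do not wait */
	for (i = 1; i < argc; i++) {
		int fd = open(argv[i], O_RDONLY);

		if (fd < 0)
			continue;
		sync_file_range(fd, 0, 0, SYNC_FILE_RANGE_WRITE);
		close(fd);
	}

	/* pass 2: now wait for all the IO submitted above to finish */
	for (i = 1; i < argc; i++) {
		int fd = open(argv[i], O_RDONLY);

		if (fd < 0)
			continue;
		sync_file_range(fd, 0, 0, SYNC_FILE_RANGE_WAIT_AFTER);
		close(fd);
	}
	return 0;
}

Run it as ./a.out file1 file2 ... after dirtying the files; the waiting is concentrated in the second loop instead of being interleaved with submission.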
Jan Kara
2014-Oct-10 14:23 UTC
[Ocfs2-devel] [PATCH] timer: Fix lock inversion between hrtimer_bases.lock and scheduler locks
clockevents_increase_min_delta() calls printk() from under hrtimer_bases.lock. That causes lock inversion on scheduler locks because printk() can call into the scheduler. Lockdep puts it as: =====================================================[ INFO: possible circular locking dependency detected ] 3.15.0-rc8-06195-g939f04b #2 Not tainted ------------------------------------------------------- trinity-main/74 is trying to acquire lock: (&port_lock_key){-.....}, at: [<811c60be>] serial8250_console_write+0x8c/0x10c but task is already holding lock: (hrtimer_bases.lock){-.-...}, at: [<8103caeb>] hrtimer_try_to_cancel+0x13/0x66 which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #5 (hrtimer_bases.lock){-.-...}: [<8104a942>] lock_acquire+0x92/0x101 [<8142f11d>] _raw_spin_lock_irqsave+0x2e/0x3e [<8103c918>] __hrtimer_start_range_ns+0x1c/0x197 [<8107ec20>] perf_swevent_start_hrtimer.part.41+0x7a/0x85 [<81080792>] task_clock_event_start+0x3a/0x3f [<810807a4>] task_clock_event_add+0xd/0x14 [<8108259a>] event_sched_in+0xb6/0x17a [<810826a2>] group_sched_in+0x44/0x122 [<81082885>] ctx_sched_in.isra.67+0x105/0x11f [<810828e6>] perf_event_sched_in.isra.70+0x47/0x4b [<81082bf6>] __perf_install_in_context+0x8b/0xa3 [<8107eb8e>] remote_function+0x12/0x2a [<8105f5af>] smp_call_function_single+0x2d/0x53 [<8107e17d>] task_function_call+0x30/0x36 [<8107fb82>] perf_install_in_context+0x87/0xbb [<810852c9>] SYSC_perf_event_open+0x5c6/0x701 [<810856f9>] SyS_perf_event_open+0x17/0x19 [<8142f8ee>] syscall_call+0x7/0xb -> #4 (&ctx->lock){......}: [<8104a942>] lock_acquire+0x92/0x101 [<8142f04c>] _raw_spin_lock+0x21/0x30 [<81081df3>] __perf_event_task_sched_out+0x1dc/0x34f [<8142cacc>] __schedule+0x4c6/0x4cb [<8142cae0>] schedule+0xf/0x11 [<8142f9a6>] work_resched+0x5/0x30 -> #3 (&rq->lock){-.-.-.}: [<8104a942>] lock_acquire+0x92/0x101 [<8142f04c>] _raw_spin_lock+0x21/0x30 [<81040873>] __task_rq_lock+0x33/0x3a [<8104184c>] wake_up_new_task+0x25/0xc2 [<8102474b>] do_fork+0x15c/0x2a0 [<810248a9>] kernel_thread+0x1a/0x1f [<814232a2>] rest_init+0x1a/0x10e [<817af949>] start_kernel+0x303/0x308 [<817af2ab>] i386_start_kernel+0x79/0x7d -> #2 (&p->pi_lock){-.-...}: [<8104a942>] lock_acquire+0x92/0x101 [<8142f11d>] _raw_spin_lock_irqsave+0x2e/0x3e [<810413dd>] try_to_wake_up+0x1d/0xd6 [<810414cd>] default_wake_function+0xb/0xd [<810461f3>] __wake_up_common+0x39/0x59 [<81046346>] __wake_up+0x29/0x3b [<811b8733>] tty_wakeup+0x49/0x51 [<811c3568>] uart_write_wakeup+0x17/0x19 [<811c5dc1>] serial8250_tx_chars+0xbc/0xfb [<811c5f28>] serial8250_handle_irq+0x54/0x6a [<811c5f57>] serial8250_default_handle_irq+0x19/0x1c [<811c56d8>] serial8250_interrupt+0x38/0x9e [<810510e7>] handle_irq_event_percpu+0x5f/0x1e2 [<81051296>] handle_irq_event+0x2c/0x43 [<81052cee>] handle_level_irq+0x57/0x80 [<81002a72>] handle_irq+0x46/0x5c [<810027df>] do_IRQ+0x32/0x89 [<8143036e>] common_interrupt+0x2e/0x33 [<8142f23c>] _raw_spin_unlock_irqrestore+0x3f/0x49 [<811c25a4>] uart_start+0x2d/0x32 [<811c2c04>] uart_write+0xc7/0xd6 [<811bc6f6>] n_tty_write+0xb8/0x35e [<811b9beb>] tty_write+0x163/0x1e4 [<811b9cd9>] redirected_tty_write+0x6d/0x75 [<810b6ed6>] vfs_write+0x75/0xb0 [<810b7265>] SyS_write+0x44/0x77 [<8142f8ee>] syscall_call+0x7/0xb -> #1 (&tty->write_wait){-.....}: [<8104a942>] lock_acquire+0x92/0x101 [<8142f11d>] _raw_spin_lock_irqsave+0x2e/0x3e [<81046332>] __wake_up+0x15/0x3b [<811b8733>] tty_wakeup+0x49/0x51 [<811c3568>] uart_write_wakeup+0x17/0x19 [<811c5dc1>] serial8250_tx_chars+0xbc/0xfb 
[<811c5f28>] serial8250_handle_irq+0x54/0x6a [<811c5f57>] serial8250_default_handle_irq+0x19/0x1c [<811c56d8>] serial8250_interrupt+0x38/0x9e [<810510e7>] handle_irq_event_percpu+0x5f/0x1e2 [<81051296>] handle_irq_event+0x2c/0x43 [<81052cee>] handle_level_irq+0x57/0x80 [<81002a72>] handle_irq+0x46/0x5c [<810027df>] do_IRQ+0x32/0x89 [<8143036e>] common_interrupt+0x2e/0x33 [<8142f23c>] _raw_spin_unlock_irqrestore+0x3f/0x49 [<811c25a4>] uart_start+0x2d/0x32 [<811c2c04>] uart_write+0xc7/0xd6 [<811bc6f6>] n_tty_write+0xb8/0x35e [<811b9beb>] tty_write+0x163/0x1e4 [<811b9cd9>] redirected_tty_write+0x6d/0x75 [<810b6ed6>] vfs_write+0x75/0xb0 [<810b7265>] SyS_write+0x44/0x77 [<8142f8ee>] syscall_call+0x7/0xb -> #0 (&port_lock_key){-.....}: [<8104a62d>] __lock_acquire+0x9ea/0xc6d [<8104a942>] lock_acquire+0x92/0x101 [<8142f11d>] _raw_spin_lock_irqsave+0x2e/0x3e [<811c60be>] serial8250_console_write+0x8c/0x10c [<8104e402>] call_console_drivers.constprop.31+0x87/0x118 [<8104f5d5>] console_unlock+0x1d7/0x398 [<8104fb70>] vprintk_emit+0x3da/0x3e4 [<81425f76>] printk+0x17/0x19 [<8105bfa0>] clockevents_program_min_delta+0x104/0x116 [<8105c548>] clockevents_program_event+0xe7/0xf3 [<8105cc1c>] tick_program_event+0x1e/0x23 [<8103c43c>] hrtimer_force_reprogram+0x88/0x8f [<8103c49e>] __remove_hrtimer+0x5b/0x79 [<8103cb21>] hrtimer_try_to_cancel+0x49/0x66 [<8103cb4b>] hrtimer_cancel+0xd/0x18 [<8107f102>] perf_swevent_cancel_hrtimer.part.60+0x2b/0x30 [<81080705>] task_clock_event_stop+0x20/0x64 [<81080756>] task_clock_event_del+0xd/0xf [<81081350>] event_sched_out+0xab/0x11e [<810813e0>] group_sched_out+0x1d/0x66 [<81081682>] ctx_sched_out+0xaf/0xbf [<81081e04>] __perf_event_task_sched_out+0x1ed/0x34f [<8142cacc>] __schedule+0x4c6/0x4cb [<8142cae0>] schedule+0xf/0x11 [<8142f9a6>] work_resched+0x5/0x30 other info that might help us debug this: Chain exists of: &port_lock_key --> &ctx->lock --> hrtimer_bases.lock Possible unsafe locking scenario: CPU0 CPU1 ---- ---- lock(hrtimer_bases.lock); lock(&ctx->lock); lock(hrtimer_bases.lock); lock(&port_lock_key); *** DEADLOCK *** 4 locks held by trinity-main/74: #0: (&rq->lock){-.-.-.}, at: [<8142c6f3>] __schedule+0xed/0x4cb #1: (&ctx->lock){......}, at: [<81081df3>] __perf_event_task_sched_out+0x1dc/0x34f #2: (hrtimer_bases.lock){-.-...}, at: [<8103caeb>] hrtimer_try_to_cancel+0x13/0x66 #3: (console_lock){+.+...}, at: [<8104fb5d>] vprintk_emit+0x3c7/0x3e4 stack backtrace: CPU: 0 PID: 74 Comm: trinity-main Not tainted 3.15.0-rc8-06195-g939f04b #2 00000000 81c3a310 8b995c14 81426f69 8b995c44 81425a99 8161f671 8161f570 8161f538 8161f559 8161f538 8b995c78 8b142bb0 00000004 8b142fdc 8b142bb0 8b995ca8 8104a62d 8b142fac 000016f2 81c3a310 00000001 00000001 00000003 Call Trace: [<81426f69>] dump_stack+0x16/0x18 [<81425a99>] print_circular_bug+0x18f/0x19c [<8104a62d>] __lock_acquire+0x9ea/0xc6d [<8104a942>] lock_acquire+0x92/0x101 [<811c60be>] ? serial8250_console_write+0x8c/0x10c [<811c6032>] ? wait_for_xmitr+0x76/0x76 [<8142f11d>] _raw_spin_lock_irqsave+0x2e/0x3e [<811c60be>] ? serial8250_console_write+0x8c/0x10c [<811c60be>] serial8250_console_write+0x8c/0x10c [<8104af87>] ? lock_release+0x191/0x223 [<811c6032>] ? 
wait_for_xmitr+0x76/0x76 [<8104e402>] call_console_drivers.constprop.31+0x87/0x118 [<8104f5d5>] console_unlock+0x1d7/0x398 [<8104fb70>] vprintk_emit+0x3da/0x3e4 [<81425f76>] printk+0x17/0x19 [<8105bfa0>] clockevents_program_min_delta+0x104/0x116 [<8105cc1c>] tick_program_event+0x1e/0x23 [<8103c43c>] hrtimer_force_reprogram+0x88/0x8f [<8103c49e>] __remove_hrtimer+0x5b/0x79 [<8103cb21>] hrtimer_try_to_cancel+0x49/0x66 [<8103cb4b>] hrtimer_cancel+0xd/0x18 [<8107f102>] perf_swevent_cancel_hrtimer.part.60+0x2b/0x30 [<81080705>] task_clock_event_stop+0x20/0x64 [<81080756>] task_clock_event_del+0xd/0xf [<81081350>] event_sched_out+0xab/0x11e [<810813e0>] group_sched_out+0x1d/0x66 [<81081682>] ctx_sched_out+0xaf/0xbf [<81081e04>] __perf_event_task_sched_out+0x1ed/0x34f [<8104416d>] ? __dequeue_entity+0x23/0x27 [<81044505>] ? pick_next_task_fair+0xb1/0x120 [<8142cacc>] __schedule+0x4c6/0x4cb [<81047574>] ? trace_hardirqs_off_caller+0xd7/0x108 [<810475b0>] ? trace_hardirqs_off+0xb/0xd [<81056346>] ? rcu_irq_exit+0x64/0x77 Fix the problem by using printk_deferred() which does not call into the scheduler. Reported-by: Fengguang Wu <fengguang.wu at intel.com> Signed-off-by: Jan Kara <jack at suse.cz> --- kernel/time/clockevents.c | 10 ++++++---- 1 file changed, 6 insertions(+), 4 deletions(-) diff --git a/kernel/time/clockevents.c b/kernel/time/clockevents.c index ad362c260ef4..9c94c19f1305 100644 --- a/kernel/time/clockevents.c +++ b/kernel/time/clockevents.c @@ -146,7 +146,8 @@ static int clockevents_increase_min_delta(struct clock_event_device *dev) { /* Nothing to do if we already reached the limit */ if (dev->min_delta_ns >= MIN_DELTA_LIMIT) { - printk(KERN_WARNING "CE: Reprogramming failure. Giving up\n"); + printk_deferred(KERN_WARNING + "CE: Reprogramming failure. Giving up\n"); dev->next_event.tv64 = KTIME_MAX; return -ETIME; } @@ -159,9 +160,10 @@ static int clockevents_increase_min_delta(struct clock_event_device *dev) if (dev->min_delta_ns > MIN_DELTA_LIMIT) dev->min_delta_ns = MIN_DELTA_LIMIT; - printk(KERN_WARNING "CE: %s increased min_delta_ns to %llu nsec\n", - dev->name ? dev->name : "?", - (unsigned long long) dev->min_delta_ns); + printk_deferred(KERN_WARNING + "CE: %s increased min_delta_ns to %llu nsec\n", + dev->name ? dev->name : "?", + (unsigned long long) dev->min_delta_ns); return 0; } -- 1.8.1.4
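A minimal kernel-context sketch of the rule applied here, with assumed names (example_lock stands in for hrtimer_bases.lock): when a message must be emitted while holding a lock that the console path may also depend on, printk_deferred() only logs the message and hands the console work off to irq_work, so no scheduler locks are taken under the caller's lock.

#include <linux/printk.h>
#include <linux/spinlock.h>

/* Sketch with assumed names - not the kernel's actual code. */
static DEFINE_RAW_SPINLOCK(example_lock);

static void warn_under_lock(unsigned long long delta_ns)
{
	unsigned long flags;

	raw_spin_lock_irqsave(&example_lock, flags);
	/*
	 * A plain printk() here could end up in serial8250_console_write()
	 * and wake_up(), i.e. back into scheduler locks. printk_deferred()
	 * stores the message and postpones the console handover.
	 */
	printk_deferred(KERN_WARNING "example: min delta grew to %llu ns\n",
			delta_ns);
	raw_spin_unlock_irqrestore(&example_lock, flags);
}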
Jan Kara
2014-Oct-10 14:23 UTC
[Ocfs2-devel] [PATCH] udf: Avoid infinite loop when processing indirect ICBs
We did not implement any bound on the number of indirect ICBs we follow when loading an inode. Thus a corrupted medium could cause the kernel to go into an infinite loop, possibly causing a stack overflow. Fix the possible stack overflow by removing the recursion from __udf_read_inode() and limiting the number of indirect ICBs we follow to avoid infinite loops. Signed-off-by: Jan Kara <jack at suse.cz> --- fs/udf/inode.c | 35 +++++++++++++++++++++-------------- 1 file changed, 21 insertions(+), 14 deletions(-) diff --git a/fs/udf/inode.c b/fs/udf/inode.c index 68cc7b144c26..a6a40536ebf1 100644 --- a/fs/udf/inode.c +++ b/fs/udf/inode.c @@ -1270,6 +1270,13 @@ update_time: return 0; } +/* + * Maximum length of linked list formed by ICB hierarchy. The chosen number is + * arbitrary - just that we hopefully don't limit any real use of rewritten + * inode on write-once media but avoid looping for too long on corrupted media. + */ +#define UDF_MAX_ICB_NESTING 1024 + static void __udf_read_inode(struct inode *inode) { struct buffer_head *bh = NULL; @@ -1279,7 +1286,9 @@ static void __udf_read_inode(struct inode *inode) struct udf_inode_info *iinfo = UDF_I(inode); struct udf_sb_info *sbi = UDF_SB(inode->i_sb); unsigned int link_count; + unsigned int indirections = 0; +reread: /* * Set defaults, but the inode is still incomplete! * Note: get_new_inode() sets the following on a new inode: @@ -1317,28 +1326,26 @@ static void __udf_read_inode(struct inode *inode) ibh = udf_read_ptagged(inode->i_sb, &iinfo->i_location, 1, &ident); if (ident == TAG_IDENT_IE && ibh) { - struct buffer_head *nbh = NULL; struct kernel_lb_addr loc; struct indirectEntry *ie; ie = (struct indirectEntry *)ibh->b_data; loc = lelb_to_cpu(ie->indirectICB.extLocation); - if (ie->indirectICB.extLength && - (nbh = udf_read_ptagged(inode->i_sb, &loc, 0, - &ident))) { - if (ident == TAG_IDENT_FE || - ident == TAG_IDENT_EFE) { - memcpy(&iinfo->i_location, - &loc, - sizeof(struct kernel_lb_addr)); - brelse(bh); - brelse(ibh); - brelse(nbh); - __udf_read_inode(inode); + if (ie->indirectICB.extLength) { + brelse(bh); + brelse(ibh); + memcpy(&iinfo->i_location, &loc, + sizeof(struct kernel_lb_addr)); + if (++indirections > UDF_MAX_ICB_NESTING) { + udf_err(inode->i_sb, + "too many ICBs in ICB hierarchy" + " (max %d supported)\n", + UDF_MAX_ICB_NESTING); + make_bad_inode(inode); return; } - brelse(nbh); + goto reread; } } brelse(ibh); -- 1.8.1.4
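A minimal, self-contained userspace sketch of the general pattern the patch applies, with made-up structures: follow on-disk indirection iteratively with an explicit cap rather than recursing, so that corrupted input can neither overflow the stack nor loop forever.

#include <stddef.h>
#include <stdio.h>

#define MAX_NESTING 1024	/* arbitrary cap, mirrors UDF_MAX_ICB_NESTING */

struct node {
	struct node *indirect;	/* stand-in for an indirect ICB reference */
};

/* Return the final node, or NULL if the chain is too deep (corrupt input). */
static struct node *resolve(struct node *n)
{
	unsigned int indirections = 0;

	while (n && n->indirect) {
		if (++indirections > MAX_NESTING)
			return NULL;	/* bail out instead of looping forever */
		n = n->indirect;
	}
	return n;
}

int main(void)
{
	struct node c = { NULL }, b = { &c }, a = { &b };

	printf("resolved to %p (expected %p)\n", (void *)resolve(&a), (void *)&c);
	return 0;
}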
Jan Kara
2014-Oct-10 14:23 UTC
[Ocfs2-devel] [PATCH] udf: Print error when inode is loaded
Signed-off-by: Jan Kara <jack at suse.cz> --- fs/udf/super.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/fs/udf/super.c b/fs/udf/super.c index 5401fc33f5cc..479875155d77 100644 --- a/fs/udf/super.c +++ b/fs/udf/super.c @@ -962,7 +962,7 @@ struct inode *udf_find_metadata_inode_efe(struct super_block *sb, metadata_fe = udf_iget(sb, &addr); if (IS_ERR(metadata_fe)) { - udf_warn(sb, "metadata inode efe not found\n"); + udf_warn(sb, "metadata inode efe not found: %d\n", PTR_ERR(metadata_fe)); return metadata_fe; } if (UDF_I(metadata_fe)->i_alloc_type != ICBTAG_FLAG_AD_SHORT) { -- 1.8.1.4
Jan Kara
2014-Oct-10 14:23 UTC
[Ocfs2-devel] [PATCH] vfs: Allocate anon_inode_inode in anon_inode_init()
Currently we allocate anon_inode_inode in anon_inodefs_mount(). This is somewhat fragile: if that function ever gets called again, it will overwrite the anon_inode_inode pointer. So move the initialization of anon_inode_inode to anon_inode_init(). Signed-off-by: Jan Kara <jack at suse.cz> --- fs/anon_inodes.c | 22 +++++++--------------- 1 file changed, 7 insertions(+), 15 deletions(-) diff --git a/fs/anon_inodes.c b/fs/anon_inodes.c index 4b4543b8b894..7f34f7702204 100644 --- a/fs/anon_inodes.c +++ b/fs/anon_inodes.c @@ -41,19 +41,8 @@ static const struct dentry_operations anon_inodefs_dentry_operations = { static struct dentry *anon_inodefs_mount(struct file_system_type *fs_type, int flags, const char *dev_name, void *data) { - struct dentry *root; - root = mount_pseudo(fs_type, "anon_inode:", NULL, + return mount_pseudo(fs_type, "anon_inode:", NULL, &anon_inodefs_dentry_operations, ANON_INODE_FS_MAGIC); - if (!IS_ERR(root)) { - struct super_block *s = root->d_sb; - anon_inode_inode = alloc_anon_inode(s); - if (IS_ERR(anon_inode_inode)) { - dput(root); - deactivate_locked_super(s); - root = ERR_CAST(anon_inode_inode); - } - } - return root; } static struct file_system_type anon_inode_fs_type = { @@ -180,12 +169,15 @@ static int __init anon_inode_init(void) anon_inode_mnt = kern_mount(&anon_inode_fs_type); if (IS_ERR(anon_inode_mnt)) { error = PTR_ERR(anon_inode_mnt); - goto err_unregister_filesystem; + goto err_exit; + } + anon_inode_inode = alloc_anon_inode(anon_inode_mnt->mnt_sb); + if (IS_ERR(anon_inode_inode)) { + error = PTR_ERR(anon_inode_inode); + goto err_exit; } return 0; -err_unregister_filesystem: - unregister_filesystem(&anon_inode_fs_type); err_exit: panic(KERN_ERR "anon_inode_init() failed (%d)\n", error); } -- 1.8.1.4
Jan Kara
2014-Oct-10 14:23 UTC
[Ocfs2-devel] [PATCH 1/2] vfs: Fix data corruption when blocksize < pagesize for mmapped data
->page_mkwrite() is used by filesystems to allocate blocks under a page which is becoming writeably mmapped in some process' address space. This allows a filesystem to return a page fault if there is not enough space available, user exceeds quota or similar problem happens, rather than silently discarding data later when writepage is called. However VFS fails to call ->page_mkwrite() in all the cases where filesystems need it when blocksize < pagesize. For example when blocksize = 1024, pagesize = 4096 the following is problematic: ftruncate(fd, 0); pwrite(fd, buf, 1024, 0); map = mmap(NULL, 1024, PROT_WRITE, MAP_SHARED, fd, 0); map[0] = 'a'; ----> page_mkwrite() for index 0 is called ftruncate(fd, 10000); /* or even pwrite(fd, buf, 1, 10000) */ mremap(map, 1024, 10000, 0); map[4095] = 'a'; ----> no page_mkwrite() called At the moment ->page_mkwrite() is called, filesystem can allocate only one block for the page because i_size == 1024. Otherwise it would create blocks beyond i_size which is generally undesirable. But later at ->writepage() time, we also need to store data at offset 4095 but we don't have block allocated for it. This patch introduces a helper function filesystems can use to have ->page_mkwrite() called at all the necessary moments. Signed-off-by: Jan Kara <jack at suse.cz> --- fs/buffer.c | 3 +++ include/linux/mm.h | 8 ++++++++ mm/truncate.c | 57 ++++++++++++++++++++++++++++++++++++++++++++++++++++++ 3 files changed, 68 insertions(+) diff --git a/fs/buffer.c b/fs/buffer.c index 8f05111bbb8b..3ba5a6a1bc5f 100644 --- a/fs/buffer.c +++ b/fs/buffer.c @@ -2080,6 +2080,7 @@ int generic_write_end(struct file *file, struct address_space *mapping, struct page *page, void *fsdata) { struct inode *inode = mapping->host; + loff_t old_size = inode->i_size; int i_size_changed = 0; copied = block_write_end(file, mapping, pos, len, copied, page, fsdata); @@ -2099,6 +2100,8 @@ int generic_write_end(struct file *file, struct address_space *mapping, unlock_page(page); page_cache_release(page); + if (old_size < pos) + pagecache_isize_extended(inode, old_size, pos); /* * Don't mark the inode dirty under page lock. First, it unnecessarily * makes the holding time of page lock longer. Second, it forces lock diff --git a/include/linux/mm.h b/include/linux/mm.h index 8981cc882ed2..f0e53e5a3b17 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -1155,6 +1155,14 @@ static inline void unmap_shared_mapping_range(struct address_space *mapping, extern void truncate_pagecache(struct inode *inode, loff_t new); extern void truncate_setsize(struct inode *inode, loff_t newsize); +#ifdef CONFIG_MMU +void pagecache_isize_extended(struct inode *inode, loff_t from, loff_t to); +#else +static inline void pagecache_isize_extended(struct inode *inode, loff_t from, + loff_t to) +{ +} +#endif void truncate_pagecache_range(struct inode *inode, loff_t offset, loff_t end); int truncate_inode_page(struct address_space *mapping, struct page *page); int generic_error_remove_page(struct address_space *mapping, struct page *page); diff --git a/mm/truncate.c b/mm/truncate.c index 96d167372d89..261eaf6e5a19 100644 --- a/mm/truncate.c +++ b/mm/truncate.c @@ -20,6 +20,7 @@ #include <linux/buffer_head.h> /* grr. 
try_to_release_page, do_invalidatepage */ #include <linux/cleancache.h> +#include <linux/rmap.h> #include "internal.h" static void clear_exceptional_entry(struct address_space *mapping, @@ -719,12 +720,68 @@ EXPORT_SYMBOL(truncate_pagecache); */ void truncate_setsize(struct inode *inode, loff_t newsize) { + loff_t oldsize = inode->i_size; + i_size_write(inode, newsize); + if (newsize > oldsize) + pagecache_isize_extended(inode, oldsize, newsize); truncate_pagecache(inode, newsize); } EXPORT_SYMBOL(truncate_setsize); /** + * pagecache_isize_extended - update pagecache after extension of i_size + * @inode: inode for which i_size was extended + * @from: original inode size + * @to: new inode size + * + * Handle extension of inode size either caused by extending truncate or by + * write starting after current i_size. We mark the page straddling current + * i_size RO so that page_mkwrite() is called on the nearest write access to + * the page. This way filesystem can be sure that page_mkwrite() is called on + * the page before user writes to the page via mmap after the i_size has been + * changed. + * + * The function must be called after i_size is updated so that page fault + * coming after we unlock the page will already see the new i_size. + * The function must be called while we still hold i_mutex - this not only + * makes sure i_size is stable but also that userspace cannot observe new + * i_size value before we are prepared to store mmap writes at new inode size. + */ +void pagecache_isize_extended(struct inode *inode, loff_t from, loff_t to) +{ + int bsize = 1 << inode->i_blkbits; + loff_t rounded_from; + struct page *page; + pgoff_t index; + + WARN_ON(!mutex_is_locked(&inode->i_mutex)); + WARN_ON(to > inode->i_size); + + if (from >= to || bsize == PAGE_CACHE_SIZE) + return; + /* Page straddling @from will not have any hole block created? */ + rounded_from = round_up(from, bsize); + if (to <= rounded_from || !(rounded_from & (PAGE_CACHE_SIZE - 1))) + return; + + index = from >> PAGE_CACHE_SHIFT; + page = find_lock_page(inode->i_mapping, index); + /* Page not cached? Nothing to do */ + if (!page) + return; + /* + * See clear_page_dirty_for_io() for details why set_page_dirty() + * is needed. + */ + if (page_mkclean(page)) + set_page_dirty(page); + unlock_page(page); + page_cache_release(page); +} +EXPORT_SYMBOL(pagecache_isize_extended); + +/** * truncate_pagecache_range - unmap and remove pagecache that is hole-punched * @inode: inode * @lstart: offset of beginning of hole -- 1.8.1.4
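A minimal userspace sketch of the problematic sequence from the commit message above, assuming a filesystem with 1k blocks and 4k pages and a made-up file path; MREMAP_MAYMOVE is used so the sketch also works when the mapping cannot grow in place. Without the fix, the store at offset 4095 never triggers ->page_mkwrite(), so no block gets allocated under it and the data can be lost at writeback time.

#define _GNU_SOURCE
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	char buf[1024];
	char *map;
	int fd = open("/mnt/test/file", O_RDWR | O_CREAT | O_TRUNC, 0644);

	if (fd < 0)
		return 1;
	memset(buf, 'x', sizeof(buf));
	ftruncate(fd, 0);
	pwrite(fd, buf, 1024, 0);		/* one 1k block within i_size */
	map = mmap(NULL, 1024, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (map == MAP_FAILED)
		return 1;
	map[0] = 'a';				/* ->page_mkwrite() for index 0 */
	ftruncate(fd, 10000);			/* extend i_size past the page */
	map = mremap(map, 1024, 10000, MREMAP_MAYMOVE);
	if (map == MAP_FAILED)
		return 1;
	map[4095] = 'a';			/* no ->page_mkwrite() without the fix */
	munmap(map, 10000);
	close(fd);
	return 0;
}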
Jan Kara
2014-Oct-10 14:23 UTC
[Ocfs2-devel] [PATCH RESEND] vfs: Return EINVAL for default SEEK_HOLE, SEEK_DATA implementation
The generic implementation of SEEK_HOLE & SEEK_DATA in generic_file_llseek_size() and default_llseek() behaved as if everything within i_size is data and everything beyond i_size is a hole. That makes sense at first sight (and is definitely a valid implementation of the spec) but on closer inspection it isn't very useful. If anyone bothers looking for holes / data, they are better off being told we don't really know, so that they can fall back to their chosen backup strategy (think of e.g. cp(1)). This is a userspace API change, but the kernel interface has only been there for two years, so hopefully userspace is still prepared to handle an EINVAL return value. Signed-off-by: Jan Kara <jack at suse.cz> --- fs/read_write.c | 38 ++------------------------------------ 1 file changed, 2 insertions(+), 36 deletions(-) Al, what do you think about this change? diff --git a/fs/read_write.c b/fs/read_write.c index edc5746a902a..71525d75742f 100644 --- a/fs/read_write.c +++ b/fs/read_write.c @@ -111,22 +111,8 @@ generic_file_llseek_size(struct file *file, loff_t offset, int whence, spin_unlock(&file->f_lock); return offset; case SEEK_DATA: - /* - * In the generic case the entire file is data, so as long as - * offset isn't at the end of the file then the offset is data. - */ - if (offset >= eof) - return -ENXIO; - break; case SEEK_HOLE: - /* - * There is a virtual hole at the end of the file, so as long as - * offset isn't i_size or larger, return i_size. - */ - if (offset >= eof) - return -ENXIO; - offset = eof; - break; + return -EINVAL; } return vfs_setpos(file, offset, maxsize); @@ -214,28 +200,8 @@ loff_t default_llseek(struct file *file, loff_t offset, int whence) offset += file->f_pos; break; case SEEK_DATA: - /* - * In the generic case the entire file is data, so as - * long as offset isn't at the end of the file then the - * offset is data. - */ - if (offset >= inode->i_size) { - retval = -ENXIO; - goto out; - } - break; case SEEK_HOLE: - /* - * There is a virtual hole at the end of the file, so - * as long as offset isn't i_size or larger, return - * i_size. - */ - if (offset >= inode->i_size) { - retval = -ENXIO; - goto out; - } - offset = inode->i_size; - break; + return -EINVAL; } retval = -EINVAL; if (offset >= 0 || unsigned_offsets(file)) { -- 1.8.1.4
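A small userspace sketch of the fallback the commit message has in mind for tools like cp(1), using a made-up file path: probe SEEK_DATA and, when the filesystem reports EINVAL (as the generic implementation now does), simply treat the whole file as data.

#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	int fd = open("/path/to/sparse-file", O_RDONLY);	/* hypothetical path */
	off_t data;

	if (fd < 0)
		return 1;
	data = lseek(fd, 0, SEEK_DATA);
	if (data == (off_t)-1 && errno == EINVAL) {
		/* Filesystem cannot report holes: treat the whole file as data. */
		printf("no hole information, copying everything\n");
	} else if (data == (off_t)-1 && errno == ENXIO) {
		/* No data at or after the given offset (e.g. offset beyond EOF). */
		printf("no data at or after offset 0\n");
	} else if (data != (off_t)-1) {
		printf("first data at offset %lld\n", (long long)data);
	}
	close(fd);
	return 0;
}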
Jan Kara
2014-Oct-10 14:23 UTC
[Ocfs2-devel] [PATCH] writeback: plug writeback at a high level
From: Dave Chinner <dchinner at redhat.com> tl;dr: 3 lines of code, 86% better fsmark throughput consuming 13% less CPU and 43% lower runtime. Doing writeback on lots of little files causes terrible IOPS storms because of the per-mapping writeback plugging we do. This essentially causes immediate dispatch of IO for each mapping, regardless of the context in which writeback is occurring. IOWs, running a concurrent write-lots-of-small-4k-files workload using fsmark on XFS results in a huge number of IOPS being issued for data writes. Metadata writes are sorted and plugged at a high level by XFS, so aggregate nicely into large IOs. However, data writeback IOs are dispatched in individual 4k IOs - even when the blocks of two consecutively written files are adjacent - because the underlying block device is fast enough not to congest on such IO. This behaviour is not SSD related - anything with hardware caches is going to see the same benefits as the IO rates are limited only by how fast adjacent IOs can be sent to the hardware caches for aggregation. Hence the speed of the physical device is irrelevant to this common writeback workload (happens every time you untar a tarball!) - performance is limited by the overhead of dispatching individual IOs from a single writeback thread.

Test VM: 16p, 16GB RAM, 2xSSD in RAID0, 500TB sparse XFS filesystem, metadata CRCs enabled.

Test: $ ./fs_mark -D 10000 -S0 -n 10000 -s 4096 -L 120 -d /mnt/scratch/0 -d /mnt/scratch/1 -d /mnt/scratch/2 -d /mnt/scratch/3 -d /mnt/scratch/4 -d /mnt/scratch/5 -d /mnt/scratch/6 -d /mnt/scratch/7

Result:
              wall      sys       create rate     Physical write IO
              time      CPU       (avg files/s)   IOPS      Bandwidth
              -----     -----     -------------   ------    ---------
unpatched     5m54s     15m32s    32,500+/-2200   28,000    150MB/s
patched       3m19s     13m28s    52,900+/-1800    1,500    280MB/s
improvement   -43.8%    -13.3%      +62.7%        -94.6%    +86.6%

Signed-off-by: Dave Chinner <dchinner at redhat.com> Signed-off-by: Jan Kara <jack at suse.cz> --- fs/fs-writeback.c | 5 +++++ 1 file changed, 5 insertions(+) diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c index 279292ba9403..d935fd3796ba 100644 --- a/fs/fs-writeback.c +++ b/fs/fs-writeback.c @@ -599,6 +599,9 @@ static long generic_writeback_inodes(struct wb_writeback_work *work) unsigned long end_time = jiffies + HZ / 10; long write_chunk; long wrote = 0; /* count both pages and inodes */ + struct blk_plug plug; + + blk_start_plug(&plug); spin_lock(&wb->list_lock); while (1) { @@ -688,6 +691,8 @@ static long generic_writeback_inodes(struct wb_writeback_work *work) out: spin_unlock(&wb->list_lock); + blk_finish_plug(&plug); + return wrote; } -- 1.8.1.4
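A kernel-context sketch of the plugging idiom being added, with hypothetical helper names (struct example_item, submit_one_item()): batch all IO submitted inside a loop on the task's plug list so the block layer can merge adjacent requests before dispatch.

#include <linux/blkdev.h>

struct example_item;					/* hypothetical */
extern void submit_one_item(struct example_item *item);	/* hypothetical: issues small IOs */

static void writeback_many(struct example_item **items, int nr)
{
	struct blk_plug plug;
	int i;

	blk_start_plug(&plug);		/* hold submitted IO on a per-task list ... */
	for (i = 0; i < nr; i++)
		submit_one_item(items[i]);
	blk_finish_plug(&plug);		/* ... then dispatch it, allowing merges */
}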
Jan Kara
2014-Oct-10 14:23 UTC
[Ocfs2-devel] [PATCH] x86: Fixup lockdep complaint caused by io apic code
0day kernel testing guys have reported following lockdep complaint: =====================================================[ INFO: possible circular locking dependency detected ] 3.15.0-rc5-00567-gbafe980 #1 Not tainted ------------------------------------------------------- swapper/1 is trying to acquire lock: (&irq_desc_lock_class){-.-...}, at: [<8107dc8c>] __irq_get_desc_lock+0x3c/0x70 but task is already holding lock: (&port_lock_key){......}, at: [<815f5b27>] serial8250_startup+0x337/0x720 which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #2 (&port_lock_key){......}: <this stack looks somewhat bogus but we do end up taking port->lock when printing to serial console> [<810750e5>] lock_acquire+0x85/0x190 [<81baed9d>] _raw_spin_lock_irqsave+0x4d/0x60 [<8106eb1c>] down_trylock+0xc/0x30 [<8107b795>] console_trylock+0x15/0xb0 [<8107be8f>] vprintk_emit+0x14f/0x4d0 [<81b969b9>] printk+0x38/0x3a [<82137f78>] print_ICs+0x5b/0x3e7 [<8212bb41>] do_one_initcall+0x8b/0x128 [<8212bd7d>] kernel_init_freeable+0x19f/0x236 [<81b9238b>] kernel_init+0xb/0xd0 [<81bb0080>] ret_from_kernel_thread+0x20/0x30 -> #1 (i8259A_lock){-.....}: [<810750e5>] lock_acquire+0x85/0x190 [<81baed9d>] _raw_spin_lock_irqsave+0x4d/0x60 [<81005af1>] unmask_8259A_irq+0x11/0x60 [<81005b4b>] enable_8259A_irq+0xb/0x10 [<8107fffb>] irq_enable+0x2b/0x40 [<8108005d>] irq_startup+0x4d/0x60 [<8107f2bc>] __setup_irq+0x39c/0x460 [<8107f433>] setup_irq+0x33/0x80 [<8212db15>] setup_default_timer_irq+0xf/0x11 [<8212db2d>] hpet_time_init+0x16/0x18 [<8212daff>] x86_late_time_init+0x9/0x10 [<8212ba3d>] start_kernel+0x331/0x3aa [<8212b380>] i386_start_kernel+0x12e/0x131 -> #0 (&irq_desc_lock_class){-.-...}: [<810743c2>] __lock_acquire+0x19c2/0x1b20 [<810750e5>] lock_acquire+0x85/0x190 [<81baed9d>] _raw_spin_lock_irqsave+0x4d/0x60 [<8107dc8c>] __irq_get_desc_lock+0x3c/0x70 [<8107eb1e>] __disable_irq_nosync+0x1e/0x50 [<8107eb58>] disable_irq_nosync+0x8/0x10 [<815f5c78>] serial8250_startup+0x488/0x720 [<815f205e>] uart_startup.part.4+0x6e/0x1e0 [<815f2a40>] uart_open+0xe0/0x140 [<815e4b51>] tty_open+0x141/0x510 [<81118bc0>] chrdev_open+0x60/0x140 [<8111372c>] do_dentry_open+0x14c/0x230 [<8111459e>] finish_open+0x2e/0x40 [<8112132a>] do_last+0x4aa/0xd30 [<81121c5a>] path_openat+0xaa/0x610 [<811221ec>] do_filp_open+0x2c/0x70 [<81114a81>] do_sys_open+0x111/0x210 [<81114b9d>] SyS_open+0x1d/0x20 [<8212bda4>] kernel_init_freeable+0x1c6/0x236 [<81b9238b>] kernel_init+0xb/0xd0 [<81bb0080>] ret_from_kernel_thread+0x20/0x30 I believe the core of the problem is that print_PIC() calls printk() from under i8259A_lock. I've checked and that doesn't seem to happen anywhere else in the kernel. So let's move printk() from under that lock. As a side note this problem has existed for a long time but it was uncovered by my patch resulting in extended lockdep coverage of printk code. 
Reported-by: Jet Chen <jet.chen at intel.com> Signed-off-by: Jan Kara <jack at suse.cz> --- arch/x86/kernel/apic/io_apic.c | 19 +++++++++---------- 1 file changed, 9 insertions(+), 10 deletions(-) diff --git a/arch/x86/kernel/apic/io_apic.c b/arch/x86/kernel/apic/io_apic.c index 6ad4658de705..b815a4c9e5e5 100644 --- a/arch/x86/kernel/apic/io_apic.c +++ b/arch/x86/kernel/apic/io_apic.c @@ -1770,7 +1770,7 @@ __apicdebuginit(void) print_local_APICs(int maxcpu) __apicdebuginit(void) print_PIC(void) { - unsigned int v; + unsigned int v0, v1, v2; unsigned long flags; if (!legacy_pic->nr_legacy_irqs) @@ -1780,24 +1780,23 @@ __apicdebuginit(void) print_PIC(void) raw_spin_lock_irqsave(&i8259A_lock, flags); - v = inb(0xa1) << 8 | inb(0x21); - printk(KERN_DEBUG "... PIC IMR: %04x\n", v); - - v = inb(0xa0) << 8 | inb(0x20); - printk(KERN_DEBUG "... PIC IRR: %04x\n", v); + v0 = inb(0xa1) << 8 | inb(0x21); + v1 = inb(0xa0) << 8 | inb(0x20); outb(0x0b,0xa0); outb(0x0b,0x20); - v = inb(0xa0) << 8 | inb(0x20); + v2 = inb(0xa0) << 8 | inb(0x20); outb(0x0a,0xa0); outb(0x0a,0x20); raw_spin_unlock_irqrestore(&i8259A_lock, flags); - printk(KERN_DEBUG "... PIC ISR: %04x\n", v); + printk(KERN_DEBUG "... PIC IMR: %04x\n", v0); + printk(KERN_DEBUG "... PIC IRR: %04x\n", v1); + printk(KERN_DEBUG "... PIC ISR: %04x\n", v2); - v = inb(0x4d1) << 8 | inb(0x4d0); - printk(KERN_DEBUG "... PIC ELCR: %04x\n", v); + v0 = inb(0x4d1) << 8 | inb(0x4d0); + printk(KERN_DEBUG "... PIC ELCR: %04x\n", v0); } static int __initdata show_lapic = 1; -- 1.8.1.4
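A minimal userspace analogue of the restructuring above: snapshot the values you need while holding the lock, drop the lock, and only then call into the (potentially lock-hungry) logging path.

#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t state_lock = PTHREAD_MUTEX_INITIALIZER;
static unsigned int reg_a, reg_b;	/* stand-ins for device registers */

static void dump_state(void)
{
	unsigned int a, b;

	pthread_mutex_lock(&state_lock);
	a = reg_a;			/* snapshot under the lock ... */
	b = reg_b;
	pthread_mutex_unlock(&state_lock);

	printf("state: a=%04x b=%04x\n", a, b);	/* ... log outside it */
}

int main(void)
{
	reg_a = 0x00ff;
	reg_b = 0x1234;
	dump_state();
	return 0;
}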
Jan Kara
2014-Oct-10 14:23 UTC
[Ocfs2-devel] [PATCH 2/2 RESEND] bdi: Avoid oops on device removal
After 839a8e8660b67 "writeback: replace custom worker pool implementation with unbound workqueue" when device is removed while we are writing to it we crash in bdi_writeback_workfn() -> set_worker_desc() because bdi->dev is NULL. This can happen because even though bdi_unregister() cancels all pending flushing work, nothing really prevents new ones from being queued from balance_dirty_pages() or other places. Fix the problem by clearing BDI_registered bit in bdi_unregister() and checking it before scheduling of any flushing work. Fixes: 839a8e8660b6777e7fe4e80af1a048aebe2b5977 CC: stable at vger.kernel.org Reviewed-by: Tejun Heo <tj at kernel.org> Signed-off-by: Jan Kara <jack at suse.cz> --- fs/fs-writeback.c | 23 ++++++++++++++++++----- include/linux/backing-dev.h | 2 +- mm/backing-dev.c | 13 +++++++++---- 3 files changed, 28 insertions(+), 10 deletions(-) diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c index 8277b76be983..c24e7b8b521b 100644 --- a/fs/fs-writeback.c +++ b/fs/fs-writeback.c @@ -94,16 +94,29 @@ static inline struct inode *wb_inode(struct list_head *head) #define CREATE_TRACE_POINTS #include <trace/events/writeback.h> +static void bdi_wakeup_thread(struct backing_dev_info *bdi) +{ + spin_lock_bh(&bdi->wb_lock); + if (test_bit(BDI_registered, &bdi->state)) + mod_delayed_work(bdi_wq, &bdi->wb.dwork, 0); + spin_unlock_bh(&bdi->wb_lock); +} + static void bdi_queue_work(struct backing_dev_info *bdi, struct wb_writeback_work *work) { trace_writeback_queue(bdi, work); spin_lock_bh(&bdi->wb_lock); + if (!test_bit(BDI_registered, &bdi->state)) { + if (work->done) + complete(work->done); + goto out_unlock; + } list_add_tail(&work->list, &bdi->work_list); - spin_unlock_bh(&bdi->wb_lock); - mod_delayed_work(bdi_wq, &bdi->wb.dwork, 0); +out_unlock: + spin_unlock_bh(&bdi->wb_lock); } static void @@ -119,7 +132,7 @@ __bdi_start_writeback(struct backing_dev_info *bdi, long nr_pages, work = kzalloc(sizeof(*work), GFP_ATOMIC); if (!work) { trace_writeback_nowork(bdi); - mod_delayed_work(bdi_wq, &bdi->wb.dwork, 0); + bdi_wakeup_thread(bdi); return; } @@ -166,7 +179,7 @@ void bdi_start_background_writeback(struct backing_dev_info *bdi) * writeback as soon as there is no other work to do. */ trace_writeback_wake_background(bdi); - mod_delayed_work(bdi_wq, &bdi->wb.dwork, 0); + bdi_wakeup_thread(bdi); } /* @@ -1025,7 +1038,7 @@ void bdi_writeback_workfn(struct work_struct *work) current->flags |= PF_SWAPWRITE; if (likely(!current_is_workqueue_rescuer() || - list_empty(&bdi->bdi_list))) { + !test_bit(BDI_registered, &bdi->state))) { /* * The normal path. Keep writing back @bdi until its * work_list is empty. 
Note that this path is also taken diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h index 24819001f5c8..e488e9459a93 100644 --- a/include/linux/backing-dev.h +++ b/include/linux/backing-dev.h @@ -95,7 +95,7 @@ struct backing_dev_info { unsigned int max_ratio, max_prop_frac; struct bdi_writeback wb; /* default writeback info for this bdi */ - spinlock_t wb_lock; /* protects work_list */ + spinlock_t wb_lock; /* protects work_list & wb.dwork scheduling */ struct list_head work_list; diff --git a/mm/backing-dev.c b/mm/backing-dev.c index fab8401fc54e..09d9591b7708 100644 --- a/mm/backing-dev.c +++ b/mm/backing-dev.c @@ -297,7 +297,10 @@ void bdi_wakeup_thread_delayed(struct backing_dev_info *bdi) unsigned long timeout; timeout = msecs_to_jiffies(dirty_writeback_interval * 10); - queue_delayed_work(bdi_wq, &bdi->wb.dwork, timeout); + spin_lock_bh(&bdi->wb_lock); + if (test_bit(BDI_registered, &bdi->state)) + queue_delayed_work(bdi_wq, &bdi->wb.dwork, timeout); + spin_unlock_bh(&bdi->wb_lock); } /* @@ -310,9 +313,6 @@ static void bdi_remove_from_list(struct backing_dev_info *bdi) spin_unlock_bh(&bdi_lock); synchronize_rcu_expedited(); - - /* bdi_list is now unused, clear it to mark @bdi dying */ - INIT_LIST_HEAD(&bdi->bdi_list); } int bdi_register(struct backing_dev_info *bdi, struct device *parent, @@ -363,6 +363,11 @@ static void bdi_wb_shutdown(struct backing_dev_info *bdi) */ bdi_remove_from_list(bdi); + /* Make sure nobody queues further work */ + spin_lock_bh(&bdi->wb_lock); + clear_bit(BDI_registered, &bdi->state); + spin_unlock_bh(&bdi->wb_lock); + /* * Drain work list and shutdown the delayed_work. At this point, * @bdi->bdi_list is empty telling bdi_Writeback_workfn() that @bdi -- 1.8.1.4
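A minimal userspace analogue of the guard being added, with made-up names: new work is only queued after checking a "registered" flag under the same lock the teardown path takes to clear it, so once unregistration finishes nothing new can be queued.

#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>

static pthread_mutex_t wb_lock = PTHREAD_MUTEX_INITIALIZER;
static bool registered = true;

static void queue_flush_work(void)
{
	pthread_mutex_lock(&wb_lock);
	if (registered)
		printf("work queued\n");	/* stand-in for queue_delayed_work() */
	else
		printf("device gone, dropping work\n");
	pthread_mutex_unlock(&wb_lock);
}

static void unregister_device(void)
{
	pthread_mutex_lock(&wb_lock);
	registered = false;		/* after this, no new work can be queued */
	pthread_mutex_unlock(&wb_lock);
}

int main(void)
{
	queue_flush_work();
	unregister_device();
	queue_flush_work();
	return 0;
}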
Jan Kara
2014-Oct-10 14:23 UTC
[Ocfs2-devel] [PATCH 2/2] ext3: Don't check quota format when there are no quota files
The check whether quota format is set even though there are no quota files with journalled quota is pointless and it actually makes it impossible to turn off journalled quotas (as there's no way to unset journalled quota format). Just remove the check. CC: stable at vger.kernel.org Signed-off-by: Jan Kara <jack at suse.cz> --- fs/ext3/super.c | 7 ------- 1 file changed, 7 deletions(-) I'm going to queue this patch unless someone objects. diff --git a/fs/ext3/super.c b/fs/ext3/super.c index 622e88249024..2c42e739e3d1 100644 --- a/fs/ext3/super.c +++ b/fs/ext3/super.c @@ -1354,13 +1354,6 @@ set_qf_format: "not specified."); return 0; } - } else { - if (sbi->s_jquota_fmt) { - ext3_msg(sb, KERN_ERR, "error: journaled quota format " - "specified with no journaling " - "enabled."); - return 0; - } } #endif return 1; -- 1.8.1.4
Jan Kara
2014-Oct-10 14:23 UTC
[Ocfs2-devel] [PATCH 2/2] ext4: Fix hole punching for files with indirect blocks
Hole punching code for files with indirect blocks wrongly computed the number of blocks which need to be cleared when traversing the indirect block tree. That could result in punching more blocks than actually requested and thus effectively cause data loss. For example: fallocate -n -p 10240000 4096 will punch the range 10240000 - 12632064 instead of the range 10240000 - 10244096. Fix the calculation. CC: stable at vger.kernel.org Fixes: 8bad6fc813a3a5300f51369c39d315679fd88c72 Signed-off-by: Jan Kara <jack at suse.cz> --- fs/ext4/indirect.c | 12 ++++++++++-- 1 file changed, 10 insertions(+), 2 deletions(-) diff --git a/fs/ext4/indirect.c b/fs/ext4/indirect.c index 9d381707a6fc..771949c82715 100644 --- a/fs/ext4/indirect.c +++ b/fs/ext4/indirect.c @@ -1310,16 +1310,24 @@ static int free_hole_blocks(handle_t *handle, struct inode *inode, blk = *i_data; if (level > 0) { ext4_lblk_t first2; + ext4_lblk_t count2; + bh = sb_bread(inode->i_sb, le32_to_cpu(blk)); if (!bh) { EXT4_ERROR_INODE_BLOCK(inode, le32_to_cpu(blk), "Read failure"); return -EIO; } - first2 = (first > offset) ? first - offset : 0; + if (first > offset) { + first2 = first - offset; + count2 = count; + } else { + first2 = 0; + count2 = count - (offset - first); + } ret = free_hole_blocks(handle, inode, bh, (__le32 *)bh->b_data, level - 1, - first2, count - offset, + first2, count2, inode->i_sb->s_blocksize >> 2); if (ret) { brelse(bh); -- 1.8.1.4
Jan Kara
2014-Oct-10 14:23 UTC
[Ocfs2-devel] [PATCH 2/2] ext4: Fix mmap data corruption when blocksize < pagesize
Use pagecache_isize_extended() when a hole is being created in a file so that ->page_mkwrite() will get called for the partial tail page if it is mmapped (see the first patch in the series for details). Signed-off-by: Jan Kara <jack at suse.cz> --- fs/ext4/inode.c | 6 +++++- 1 file changed, 5 insertions(+), 1 deletion(-) diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c index 3aa26e9117c4..9e269489e094 100644 --- a/fs/ext4/inode.c +++ b/fs/ext4/inode.c @@ -4536,8 +4536,12 @@ int ext4_setattr(struct dentry *dentry, struct iattr *attr) ext4_orphan_del(NULL, inode); goto err_out; } - } else + } else { + loff_t oldsize = inode->i_size; + i_size_write(inode, attr->ia_size); + pagecache_isize_extended(inode, oldsize, inode->i_size); + } /* * Blocks are going to be removed from the inode. Wait -- 1.8.1.4
Jan Kara
2014-Oct-10 14:23 UTC
[Ocfs2-devel] [PATCH 2/2] jbd2: Simplify calling convention around __jbd2_journal_clean_checkpoint_list
__jbd2_journal_clean_checkpoint_list() returns number of buffers it freed but noone was using the value so just stop doing that. This also allows for simplifying the calling convention for journal_clean_once_cp_list(). Signed-off-by: Jan Kara <jack at suse.cz> --- fs/jbd2/checkpoint.c | 56 ++++++++++++++++++++++------------------------------ include/linux/jbd2.h | 2 +- 2 files changed, 25 insertions(+), 33 deletions(-) diff --git a/fs/jbd2/checkpoint.c b/fs/jbd2/checkpoint.c index e39b2d0e1079..ef23d6f4a6a2 100644 --- a/fs/jbd2/checkpoint.c +++ b/fs/jbd2/checkpoint.c @@ -479,16 +479,15 @@ int jbd2_cleanup_journal_tail(journal_t *journal) * release them. * * Called with j_list_lock held. - * Returns number of buffers reaped (for debug) + * Returns 1 if we freed the transaction, 0 otherwise. */ - -static int journal_clean_one_cp_list(struct journal_head *jh, int *released) +static int journal_clean_one_cp_list(struct journal_head *jh) { struct journal_head *last_jh; struct journal_head *next_jh = jh; - int ret, freed = 0; + int ret; + int freed = 0; - *released = 0; if (!jh) return 0; @@ -499,11 +498,9 @@ static int journal_clean_one_cp_list(struct journal_head *jh, int *released) ret = __try_to_free_cp_buf(jh); if (!ret) return freed; - freed++; - if (ret == 2) { - *released = 1; - return freed; - } + if (ret == 2) + return 1; + freed = 1; /* * This function only frees up some memory * if possible so we dont have an obligation @@ -523,53 +520,48 @@ static int journal_clean_one_cp_list(struct journal_head *jh, int *released) * Find all the written-back checkpoint buffers in the journal and release them. * * Called with j_list_lock held. - * Returns number of buffers reaped (for debug) */ - -int __jbd2_journal_clean_checkpoint_list(journal_t *journal) +void __jbd2_journal_clean_checkpoint_list(journal_t *journal) { transaction_t *transaction, *last_transaction, *next_transaction; int ret; - int freed = 0; - int released; transaction = journal->j_checkpoint_transactions; if (!transaction) - goto out; + return; last_transaction = transaction->t_cpprev; next_transaction = transaction; do { transaction = next_transaction; next_transaction = transaction->t_cpnext; - ret = journal_clean_one_cp_list(transaction-> - t_checkpoint_list, &released); + ret = journal_clean_one_cp_list(transaction->t_checkpoint_list); /* * This function only frees up some memory if possible so we * dont have an obligation to finish processing. Bail out if * preemption requested: */ - if (need_resched()) { - freed += ret; - goto out; - } - if (released) { - freed += ret; + if (need_resched()) + return; + if (ret) continue; - } /* * It is essential that we are as careful as in the case of * t_checkpoint_list with removing the buffer from the list as * we can possibly see not yet submitted buffers on io_list */ - ret += journal_clean_one_cp_list(transaction-> - t_checkpoint_io_list, &released); - freed += ret; - if (need_resched() || !ret) - goto out; + ret = journal_clean_one_cp_list(transaction-> + t_checkpoint_io_list); + if (need_resched()) + return; + /* + * Stop scanning if we couldn't free the transaction. This + * avoids pointless scanning of transactions which still + * weren't checkpointed. 
+ */ + if (!ret) + return; } while (transaction != last_transaction); -out: - return freed; } /* diff --git a/include/linux/jbd2.h b/include/linux/jbd2.h index 0dae71e9971c..704b9a599b26 100644 --- a/include/linux/jbd2.h +++ b/include/linux/jbd2.h @@ -1042,7 +1042,7 @@ void jbd2_update_log_tail(journal_t *journal, tid_t tid, unsigned long block); extern void jbd2_journal_commit_transaction(journal_t *); /* Checkpoint list management */ -int __jbd2_journal_clean_checkpoint_list(journal_t *journal); +void __jbd2_journal_clean_checkpoint_list(journal_t *journal); int __jbd2_journal_remove_checkpoint(struct journal_head *); void __jbd2_journal_insert_checkpoint(struct journal_head *, transaction_t *); -- 1.8.1.4
Signed-off-by: Jan Kara <jack at suse.cz> --- kernel/printk/printk.c | 14 ++++++++------ 1 file changed, 8 insertions(+), 6 deletions(-) diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c index a39f4129f848..00a9ad5c2708 100644 --- a/kernel/printk/printk.c +++ b/kernel/printk/printk.c @@ -1718,17 +1718,19 @@ asmlinkage int vprintk_emit(int facility, int level, logbuf_cpu = UINT_MAX; raw_spin_unlock(&logbuf_lock); - local_irq_restore(flags); - - /* If called from the scheduler, we can not call up(). */ - if (in_sched) - return printed_len; - /* * Disable preemption to avoid being preempted while holding * console_sem which would prevent anyone from printing to console */ preempt_disable(); + local_irq_restore(flags); + + /* If called from the scheduler, we can not call up(). */ + if (in_sched) { + preempt_enable(); + return printed_len; + } + /* * Try to acquire and then immediately release the console semaphore. * The release will print out buffers and wake up /dev/kmsg and syslog() -- 1.8.1.4
Dave Jones
2014-Oct-10 15:19 UTC
Re: [PATCH] block: free q->flush_rq in blk_init_allocated_queue error paths
On Fri, Oct 10, 2014 at 04:23:07PM +0200, Jan Kara wrote: > From: Dave Jones <davej@redhat.com> > > Commit 7982e90c3a57 ("block: fix q->flush_rq NULL pointer crash on > dm-mpath flush") moved an allocation to blk_init_allocated_queue(), but > neglected to free that allocation on the error paths that follow. > > Signed-off-by: Dave Jones <davej@fedoraproject.org> > Acked-by: Mike Snitzer <snitzer@redhat.com> > Signed-off-by: Jens Axboe <axboe@fb.com> > Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Um, This got applied months ago. Dave