thr3ads.net - Btrfs devel - wait_block_group_cache_progress() waits forever in case of drive failure [Jun 2013]

Alex Lyakas

2013-Jun-04 16:23 UTC

wait_block_group_cache_progress() waits forever in case of drive failure

Greetings all,
when testing drive failures, I occasionally hit the following hang:

# Block group is being cached-in by caching_thread()
# caching_thread() experiences an error, e.g., in btrfs_search_slot,
because of drive failure:
	ret = btrfs_search_slot(NULL, extent_root, &key, path, 0, 0);
	if (ret < 0)
		goto err;

# caching thread exits:
err:
	btrfs_free_path(path);
	up_read(&fs_info->extent_commit_sem);

	free_excluded_extents(extent_root, block_group);

	mutex_unlock(&caching_ctl->mutex);
out:
	wake_up(&caching_ctl->wait);

	put_caching_control(caching_ctl);
	btrfs_put_block_group(block_group);

However, wait_block_group_cache_progress() is still stuck in a stack like this:
[<ffffffff816ec509>] schedule+0x29/0x70
[<ffffffffa044bd42>] wait_block_group_cache_progress+0xe2/0x110 [btrfs]
[<ffffffff8107fc10>] ? add_wait_queue+0x60/0x60
[<ffffffff8107fc10>] ? add_wait_queue+0x60/0x60
[<ffffffffa04568d6>] find_free_extent+0x306/0xb90 [btrfs]
[<ffffffffa04462ee>] ? btrfs_search_slot+0x2fe/0x820 [btrfs]
[<ffffffffa0457200>] btrfs_reserve_extent+0xa0/0x1b0 [btrfs]
...
because of:
	wait_event(caching_ctl->wait, block_group_cache_done(cache) ||
		   (cache->free_space_ctl->free_space >= num_bytes));

But cache->cached never becomes BTRFS_CACHE_FINISHED, and
cache->free_space_ctl->free_space will also not grow enough, so the
wait never finishes.
At this point, the system totally hangs.

The same problem can happen with wait_block_group_cache_done().
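The hang described above can be reproduced in miniature in user space. The following is only a sketch under stated assumptions: pthread condition variables stand in for the kernel wait queue, and every name in it (model_cache, caching_worker, model_wait_progress) is hypothetical, not btrfs code. A timeout is added purely so the demonstration returns instead of blocking forever as the kernel wait does.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical user-space stand-ins for the kernel structures. */
enum cache_state { CACHE_STARTED, CACHE_FINISHED };

struct model_cache {
	pthread_mutex_t lock;
	pthread_cond_t wait;		/* models caching_ctl->wait */
	enum cache_state cached;	/* models cache->cached */
	uint64_t free_space;		/* models free_space_ctl->free_space */
	bool fail;			/* simulate a drive failure in the worker */
};

/* Same shape as the condition passed to wait_event() in the report. */
static bool wait_condition(const struct model_cache *c, uint64_t num_bytes)
{
	return c->cached == CACHE_FINISHED || c->free_space >= num_bytes;
}

/* Models caching_thread(): on error it only wakes the waiters. */
static void *caching_worker(void *arg)
{
	struct model_cache *c = arg;

	pthread_mutex_lock(&c->lock);
	if (!c->fail)
		c->cached = CACHE_FINISHED;
	/* On failure nothing observable changes before the wake-up. */
	pthread_cond_broadcast(&c->wait);
	pthread_mutex_unlock(&c->lock);
	return NULL;
}

/* Returns true if the condition became true, false if the wait timed out. */
static bool model_wait_progress(uint64_t num_bytes, int timeout_ms, bool fail)
{
	struct model_cache c = { .cached = CACHE_STARTED, .fail = fail };
	struct timespec deadline;
	pthread_t worker;
	bool satisfied;
	int rc;

	pthread_mutex_init(&c.lock, NULL);
	pthread_cond_init(&c.wait, NULL);

	/* Absolute deadline; the kernel wait_event() has no such limit. */
	clock_gettime(CLOCK_REALTIME, &deadline);
	deadline.tv_nsec += (long)(timeout_ms % 1000) * 1000000L;
	deadline.tv_sec += timeout_ms / 1000 + deadline.tv_nsec / 1000000000L;
	deadline.tv_nsec %= 1000000000L;

	rc = pthread_create(&worker, NULL, caching_worker, &c);
	assert(rc == 0);

	pthread_mutex_lock(&c.lock);
	rc = 0;
	/* Woken or not, the waiter goes back to sleep while the condition is false. */
	while (!wait_condition(&c, num_bytes) && rc == 0)
		rc = pthread_cond_timedwait(&c.wait, &c.lock, &deadline);
	satisfied = wait_condition(&c, num_bytes);
	pthread_mutex_unlock(&c.lock);

	pthread_join(worker, NULL);
	pthread_cond_destroy(&c.wait);
	pthread_mutex_destroy(&c.lock);
	return satisfied;
}
```

Calling model_wait_progress(4096, 100, true) models the failure case: the worker's wake-up arrives, but the condition is still false, so the waiter sleeps again until the artificial deadline expires.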

I am thinking: can we add an additional condition, like:
	wait_event(caching_ctl->wait,
		   test_bit(BTRFS_FS_STATE_ERROR, &fs_info->fs_state) ||
		   block_group_cache_done(cache) ||
		   (cache->free_space_ctl->free_space >= num_bytes));

That way, when the transaction aborts and the FS is marked as "bad",
all these waits will complete, so that the user can unmount?
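A user-space sketch of the proposed condition, again with pthread primitives standing in for the kernel wait queue. All names here (model_fs, fs_error, model_wait_with_error_check) are hypothetical, and the -EIO return value is an addition for illustration; the proposal above only extends the condition. Note the assumption marked in the code: the waiter re-evaluates its condition only when woken, so the error flag has to be set before a wake-up reaches the wait queue.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical user-space stand-ins for the kernel structures. */
enum cache_state { CACHE_STARTED, CACHE_FINISHED };

struct model_fs {
	pthread_mutex_t lock;
	pthread_cond_t wait;		/* models caching_ctl->wait */
	enum cache_state cached;	/* models cache->cached */
	uint64_t free_space;		/* models free_space_ctl->free_space */
	bool fs_error;			/* stands in for BTRFS_FS_STATE_ERROR */
};

/* The proposed condition: the error flag is checked first. */
static bool wait_done(const struct model_fs *fs, uint64_t num_bytes)
{
	return fs->fs_error ||
	       fs->cached == CACHE_FINISHED ||
	       fs->free_space >= num_bytes;
}

/* Models a caching thread that hits a drive failure. */
static void *failing_caching_thread(void *arg)
{
	struct model_fs *fs = arg;

	pthread_mutex_lock(&fs->lock);
	/*
	 * Assumption: the error flag is set before the wake-up is issued.
	 * If it were set later with no further wake-up, the waiter would
	 * never re-check the condition and would still hang.
	 */
	fs->fs_error = true;
	pthread_cond_broadcast(&fs->wait);
	pthread_mutex_unlock(&fs->lock);
	return NULL;
}

/* Returns 0 if caching made enough progress, -EIO if the FS was marked bad. */
static int model_wait_with_error_check(uint64_t num_bytes)
{
	struct model_fs fs = { .cached = CACHE_STARTED };
	pthread_t worker;
	int ret;

	pthread_mutex_init(&fs.lock, NULL);
	pthread_cond_init(&fs.wait, NULL);

	ret = pthread_create(&worker, NULL, failing_caching_thread, &fs);
	assert(ret == 0);

	pthread_mutex_lock(&fs.lock);
	/* No timeout is needed: the error flag alone ends the wait. */
	while (!wait_done(&fs, num_bytes))
		pthread_cond_wait(&fs.wait, &fs.lock);
	ret = fs.fs_error ? -EIO : 0;
	pthread_mutex_unlock(&fs.lock);

	pthread_join(worker, NULL);
	pthread_cond_destroy(&fs.wait);
	pthread_mutex_destroy(&fs.lock);
	return ret;
}
```

With the flag in the condition, model_wait_with_error_check(4096) returns -EIO instead of blocking, which lets a caller unwind rather than sleep indefinitely.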

Or some other way to fix this problem?

Thanks,
Alex.

P.S: should I open a bugzilla for this?
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Stefan Behrens

2013-Jun-05 09:17 UTC

Re: wait_block_group_cache_progress() waits forever in case of drive failure

On Tue, 4 Jun 2013 19:23:18 +0300, Alex Lyakas wrote:
[...]
> P.S: should I open a bugzilla for this?

Yes.
Otherwise the bug report gets lost.