fs_info->alloc_start was not protected strictly, it might be changed while we were accessing it. This patch fixes this problem by using two locks to protect it (fs_info->chunk_mutex and sb->s_umount). On the read side, we just need get one of these two locks, and on the write side, we must lock all of them. Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com> Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> --- fs/btrfs/ctree.h | 10 ++++++++++ fs/btrfs/super.c | 4 ++++ 2 files changed, 14 insertions(+) diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h index 3e672916..201be7d 100644 --- a/fs/btrfs/ctree.h +++ b/fs/btrfs/ctree.h @@ -1295,6 +1295,16 @@ struct btrfs_fs_info { * so it is also safe. */ u64 max_inline; + /* + * Protected by ->chunk_mutex and sb->s_umount. + * + * The reason that we use two lock to protect it is because only + * remount and mount operations can change it and these two operations + * are under sb->s_umount, but the read side (chunk allocation) can not + * acquire sb->s_umount or the deadlock would happen. So we use two + * locks to protect it. On the write side, we must acquire two locks, + * and on the read side, we just need acquire one of them. + */ u64 alloc_start; struct btrfs_transaction *running_transaction; wait_queue_head_t transaction_throttle; diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c index 073756a..6f0524d 100644 --- a/fs/btrfs/super.c +++ b/fs/btrfs/super.c @@ -519,7 +519,9 @@ int btrfs_parse_options(struct btrfs_root *root, char *options) case Opt_alloc_start: num = match_strdup(&args[0]); if (num) { + mutex_lock(&info->chunk_mutex); info->alloc_start = memparse(num, NULL); + mutex_unlock(&info->chunk_mutex); kfree(num); printk(KERN_INFO "btrfs: allocations start at %llu\n", @@ -1289,7 +1291,9 @@ restore: fs_info->mount_opt = old_opts; fs_info->compress_type = old_compress_type; fs_info->max_inline = old_max_inline; + mutex_lock(&fs_info->chunk_mutex); fs_info->alloc_start = old_alloc_start; + mutex_unlock(&fs_info->chunk_mutex); btrfs_resize_thread_pool(fs_info, old_thread_pool_size, fs_info->thread_pool_size); fs_info->metadata_ratio = old_metadata_ratio; -- 1.7.11.7 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Thu, Jan 10, 2013 at 08:48:00PM +0800, Miao Xie wrote:> fs_info->alloc_start was not protected strictly, it might be changed while > we were accessing it. This patch fixes this problem by using two locks to > protect it (fs_info->chunk_mutex and sb->s_umount). On the read side, we > just need get one of these two locks, and on the write side, we must lock > all of them.Can you please describe how this can theoretically cause any problems? Out of curiosity I''ve been running this command on a 60G ssd disk for a while for i in `seq 1 10000|shuf`;do mount -o remount,alloc_start=$(($i*1024*1024)) /mnt/sdc;done #sleep 0 / 1 / 5 done together with one of my favourite stresstests (heavy writes, subvolume activity). There are two direct users of fs_info->alloc_start: * statfs via btrfs_calc_avail_data_space * find_free_dev_extent 960 search_start = max(root->fs_info->alloc_start, 1024ull * 1024); as remount calls sync, there is a very tiny window where an allocation could get the old value of alloc_start just between sync and do_remount. Theoretical and without any visible effect. My NAK for this patch. david -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Thu, 10 Jan 2013 18:10:40 +0100, David Sterba wrote:> On Thu, Jan 10, 2013 at 08:48:00PM +0800, Miao Xie wrote: >> fs_info->alloc_start was not protected strictly, it might be changed while >> we were accessing it. This patch fixes this problem by using two locks to >> protect it (fs_info->chunk_mutex and sb->s_umount). On the read side, we >> just need get one of these two locks, and on the write side, we must lock >> all of them. > > Can you please describe how this can theoretically cause any problems? > > Out of curiosity I''ve been running this command on a 60G ssd disk for a while > > for i in `seq 1 10000|shuf`;do > mount -o remount,alloc_start=$(($i*1024*1024)) /mnt/sdc;done > #sleep 0 / 1 / 5 > done > > together with one of my favourite stresstests (heavy writes, subvolume > activity). > > There are two direct users of fs_info->alloc_start: > > * statfs via btrfs_calc_avail_data_space > * find_free_dev_extent > > 960 search_start = max(root->fs_info->alloc_start, 1024ull * 1024); > > as remount calls sync, there is a very tiny window where an allocation could > get the old value of alloc_start just between sync and do_remount. Theoretical > and without any visible effect.->alloc_start is a 64bits variant, on the 32bits machine, we have to load it by two instructions. Assuming -> alloc_start is 0x00010000 at the beginning, then we remount and set ->alloc_start to 0x00000100 Task0 Task1 load low 32 bits set high 32 bits load high 32 bits set low 32 bits Task1 will get 0. Thanks Miao> > My NAK for this patch. > > david >-- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Mon, Jan 14, 2013 at 04:04:59PM +0800, Miao Xie wrote:> On Thu, 10 Jan 2013 18:10:40 +0100, David Sterba wrote: > > On Thu, Jan 10, 2013 at 08:48:00PM +0800, Miao Xie wrote: > >> fs_info->alloc_start was not protected strictly, it might be changed while > >> we were accessing it. This patch fixes this problem by using two locks to > >> protect it (fs_info->chunk_mutex and sb->s_umount). On the read side, we > >> just need get one of these two locks, and on the write side, we must lock > >> all of them. > > > > Can you please describe how this can theoretically cause any problems? > > > > Out of curiosity I''ve been running this command on a 60G ssd disk for a while > > > > for i in `seq 1 10000|shuf`;do > > mount -o remount,alloc_start=$(($i*1024*1024)) /mnt/sdc;done > > #sleep 0 / 1 / 5 > > done > > > > together with one of my favourite stresstests (heavy writes, subvolume > > activity). > > > > There are two direct users of fs_info->alloc_start: > > > > * statfs via btrfs_calc_avail_data_space > > * find_free_dev_extent > > > > 960 search_start = max(root->fs_info->alloc_start, 1024ull * 1024); > > > > as remount calls sync, there is a very tiny window where an allocation could > > get the old value of alloc_start just between sync and do_remount. Theoretical > > and without any visible effect. > > ->alloc_start is a 64bits variant, on the 32bits machine, we have to load it > by two instructions. > > Assuming -> alloc_start is 0x00010000 at the beginning, then we remount and > set ->alloc_start to 0x00000100 > Task0 Task1 > load low 32 bits > set high 32 bits > load high 32 bits > set low 32 bits > > Task1 will get 0.That''s an insufficient description of how a race could actually happen, it''s only a high-level scheme how a 64bit value can get mixed up with 32bit access. How exactly does it cause problems in btrfs code? IMO the bug is not there due to various reasons like instruction scheduling and cache effect that may hide the actual split of the 32bit value. * the value set on one cpu is not immediatelly visible on another cpu * the full 64bit value of alloc_start lives inside a single cacheline and will be updated as a whole in context of btrfs -- there''s only one place where it''s set so multiple places that would store the value and fight over the cacheline exclusivity are out of question; the storing cpu will flush the pending writes at some point in time so they''re visible by other cpus, until then the load side reads a consistent value -- in almost all cases, and this gets even more wild and counts on exact timing of a memory barriers triggered from other cpu and cacheline ownership, so the store finishes only one half, barrier, load finds half new and half old value, store does the other half * this needs to happen within 4-5 instructions for an operation that involves a IO sync -- makes it much harder to provoke right timing Your example with 0x00010000 and 0x00000100 uses only 32bit values, so it cannot be applied here. Also, alloc_start is always compared with the hardcoded 1M limit so it cannot go below to 0. The point is not the one extra mutex lock, but that you''re attempting to fix a bug highly improbable if possible at all without a proper description and analysis. I don''t want to spend the time on this and do the work for instead of you next time, please try harder to convince us. david -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On wed, 16 Jan 2013 13:43:00 +0100, David Sterba wrote:> On Mon, Jan 14, 2013 at 04:04:59PM +0800, Miao Xie wrote: >> On Thu, 10 Jan 2013 18:10:40 +0100, David Sterba wrote: >>> On Thu, Jan 10, 2013 at 08:48:00PM +0800, Miao Xie wrote: >>>> fs_info->alloc_start was not protected strictly, it might be changed while >>>> we were accessing it. This patch fixes this problem by using two locks to >>>> protect it (fs_info->chunk_mutex and sb->s_umount). On the read side, we >>>> just need get one of these two locks, and on the write side, we must lock >>>> all of them. >>> >>> Can you please describe how this can theoretically cause any problems? >>> >>> Out of curiosity I''ve been running this command on a 60G ssd disk for a while >>> >>> for i in `seq 1 10000|shuf`;do >>> mount -o remount,alloc_start=$(($i*1024*1024)) /mnt/sdc;done >>> #sleep 0 / 1 / 5 >>> done >>> >>> together with one of my favourite stresstests (heavy writes, subvolume >>> activity). >>> >>> There are two direct users of fs_info->alloc_start: >>> >>> * statfs via btrfs_calc_avail_data_space >>> * find_free_dev_extent >>> >>> 960 search_start = max(root->fs_info->alloc_start, 1024ull * 1024); >>> >>> as remount calls sync, there is a very tiny window where an allocation could >>> get the old value of alloc_start just between sync and do_remount. Theoretical >>> and without any visible effect. >> >> ->alloc_start is a 64bits variant, on the 32bits machine, we have to load it >> by two instructions. >> >> Assuming -> alloc_start is 0x00010000 at the beginning, then we remount and^ it should be 0x0000 0001 0000 0000>> set ->alloc_start to 0x00000100^ should be 0x0000 0000 0001 0000>> Task0 Task1 >> load low 32 bits >> set high 32 bits >> load high 32 bits >> set low 32 bits >> >> Task1 will get 0. > > That''s an insufficient description of how a race could actually happen, > it''s only a high-level scheme how a 64bit value can get mixed up with > 32bit access. > > How exactly does it cause problems in btrfs code? > > IMO the bug is not there due to various reasons like instruction > scheduling and cache effect that may hide the actual split of the 32bit > value. > > * the value set on one cpu is not immediatelly visible on another cpu > * the full 64bit value of alloc_start lives inside a single cacheline > and will be updated as a whole in context of btrfs -- there''s only oneI know it lives inside a single cacheline, but only the cacheline will be updated as a whole, not the 64bit value. And cacheline is not a lock, it can not prevent that it is acquired by the other CPU when only a half is updated. Right?> place where it''s set so multiple places that would store the value and > fight over the cacheline exclusivity are out of question; > the storing cpu will flush the pending writes at some point in time > so they''re visible by other cpus, until then the load side reads a > consistent value -- in almost all cases, and this gets even more wild > and counts on exact timing of a memory barriers triggered from other > cpu and cacheline ownership, so the store finishes only one half, > barrier, load finds half new and half old value, store does the other > half > * this needs to happen within 4-5 instructions for an operation that > involves a IO sync -- makes it much harder to provoke right timing > > Your example with 0x00010000 and 0x00000100 uses only 32bit values, soMy mistake, sorry. The number we want to use is 64bit(see above).> it cannot be applied here. Also, alloc_start is always compared with the > hardcoded 1M limit so it cannot go below to 0.0 is just a sample. The point is we should not allocate a extent from a unexpected address.> The point is not the one extra mutex lock, but that you''re attempting to > fix a bug highly improbable if possible at all without a properOn 32bit machines, it is not highly improbable, we can trigger this bug easily (The attached code which simulates the process of the btrfs can reproduce the bug). This problem don''t happen on btrfs is because: 1. Most of the users use btrfs on 64bit machine 2. Changing alloc_start is infrequent,and chunk allocation is not frequent. So the race is hard to be triggered on btrfs. But it doesn''t means it can not happen.> description and analysis. I don''t want to spend the time on this and do > the work for instead of you next time, please try harder to convince us.Miao
Hi David, do you think you can share this favorite stress-test that you have mentioned? Thanks, Alex. On Thu, Jan 10, 2013 at 7:10 PM, David Sterba <dsterba@suse.cz> wrote:> On Thu, Jan 10, 2013 at 08:48:00PM +0800, Miao Xie wrote: >> fs_info->alloc_start was not protected strictly, it might be changed while >> we were accessing it. This patch fixes this problem by using two locks to >> protect it (fs_info->chunk_mutex and sb->s_umount). On the read side, we >> just need get one of these two locks, and on the write side, we must lock >> all of them. > > Can you please describe how this can theoretically cause any problems? > > Out of curiosity I''ve been running this command on a 60G ssd disk for a while > > for i in `seq 1 10000|shuf`;do > mount -o remount,alloc_start=$(($i*1024*1024)) /mnt/sdc;done > #sleep 0 / 1 / 5 > done > > together with one of my favourite stresstests (heavy writes, subvolume > activity). > > There are two direct users of fs_info->alloc_start: > > * statfs via btrfs_calc_avail_data_space > * find_free_dev_extent > > 960 search_start = max(root->fs_info->alloc_start, 1024ull * 1024); > > as remount calls sync, there is a very tiny window where an allocation could > get the old value of alloc_start just between sync and do_remount. Theoretical > and without any visible effect. > > My NAK for this patch. > > david > -- > To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html-- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html