thr3ads.net - Btrfs devel - [PATCH] Btrfs: fix race between removing a dev and writing sbs [Aug 2013]

Filipe David Borba Manana

2013-Aug-08 20:00 UTC

[PATCH] Btrfs: fix race between removing a dev and writing sbs

Since all code paths that update the number of devices in the
super copy (fs_info->super_copy) first lock the device list
(fs_info->fs_devices->device_list_mutex), and write_all_supers()
also needs to lock the devices list mutex, make write_all_supers()
read the number of devices from the super copy after it locks
the device list mutex (and before unlocking it of course).

The only code path that doesn't lock the device list mutex
before updating the number of devices in the super copy is
disk-io.c:next_root_backup(), called by open_ctree() during
mount time where concurrency issues can't happen.

Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
---
 fs/btrfs/disk-io.c |    2 +-
 fs/btrfs/volumes.c |   11 ++++-------
 2 files changed, 5 insertions(+), 8 deletions(-)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 254cdc8..c4b24c7 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -3313,7 +3313,6 @@ static int write_all_supers(struct btrfs_root *root, int max_mirrors)
 	int total_errors = 0;
 	u64 flags;
 
-	max_errors = btrfs_super_num_devices(root->fs_info->super_copy) - 1;
 	do_barriers = !btrfs_test_opt(root, NOBARRIER);
 	backup_super_roots(root->fs_info);
 
@@ -3322,6 +3321,7 @@ static int write_all_supers(struct btrfs_root *root, int max_mirrors)
 
 	mutex_lock(&root->fs_info->fs_devices->device_list_mutex);
 	head = &root->fs_info->fs_devices->devices;
+	max_errors = btrfs_super_num_devices(root->fs_info->super_copy) - 1;
 
 	if (do_barriers) {
 		ret = barrier_all_devices(root->fs_info);
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 090f57c..eddf386 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -1568,11 +1568,6 @@ int btrfs_rm_device(struct btrfs_root *root, char *device_path)
 	if (ret)
 		goto error_undo;
 
-	/*
-	 * TODO: the superblock still includes this device in its num_devices
-	 * counter although write_all_supers() is not locked out. This
-	 * could give a filesystem state which requires a degraded mount.
-	 */
 	ret = btrfs_rm_dev_item(root->fs_info->chunk_root, device);
 	if (ret)
 		goto error_undo;
@@ -1588,7 +1583,9 @@ int btrfs_rm_device(struct btrfs_root *root, char *device_path)
 	/*
 	 * the device list mutex makes sure that we don't change
 	 * the device list while someone else is writing out all
-	 * the device supers.
+	 * the device supers. Whoever is writing all supers, should
+	 * lock the device list mutex before getting the number of
+	 * devices in the super block (super_copy).
 	 */
 
 	cur_devices = device->fs_devices;
@@ -1612,10 +1609,10 @@ int btrfs_rm_device(struct btrfs_root *root, char *device_path)
 		device->fs_devices->open_devices--;
 
 	call_rcu(&device->rcu, free_device);
-	mutex_unlock(&root->fs_info->fs_devices->device_list_mutex);
 
 	num_devices = btrfs_super_num_devices(root->fs_info->super_copy) - 1;
 	btrfs_set_super_num_devices(root->fs_info->super_copy, num_devices);
+	mutex_unlock(&root->fs_info->fs_devices->device_list_mutex);
 
 	if (cur_devices->open_devices == 0) {
 		struct btrfs_fs_devices *fs_devices;
-- 
1.7.9.5

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Stefan Behrens

2013-Aug-09 13:07 UTC

Re: [PATCH] Btrfs: fix race between removing a dev and writing sbs

On Thu,  8 Aug 2013 21:00:52 +0100, Filipe David Borba Manana wrote:
> Since all code paths that update the number of devices in the
> super copy (fs_info->super_copy) first lock the device list
> (fs_info->fs_devices->device_list_mutex), and write_all_supers()
> also needs to lock the devices list mutex, make write_all_supers()
> read the number of devices from the super copy after it locks
> the device list mutex (and before unlocking it of course).
> 
> The only code path that doesn't lock the device list mutex
> before updating the number of devices in the super copy is
> disk-io.c:next_root_backup(), called by open_ctree() during
> mount time where concurrency issues can't happen.
> 
> Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
> ---
>  fs/btrfs/disk-io.c |    2 +-
>  fs/btrfs/volumes.c |   11 ++++-------
>  2 files changed, 5 insertions(+), 8 deletions(-)
> 
> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
> index 254cdc8..c4b24c7 100644
> --- a/fs/btrfs/disk-io.c
> +++ b/fs/btrfs/disk-io.c
> @@ -3313,7 +3313,6 @@ static int write_all_supers(struct btrfs_root *root, int max_mirrors)
>  	int total_errors = 0;
>  	u64 flags;
>  
> -	max_errors = btrfs_super_num_devices(root->fs_info->super_copy) - 1;
>  	do_barriers = !btrfs_test_opt(root, NOBARRIER);
>  	backup_super_roots(root->fs_info);
>  
> @@ -3322,6 +3321,7 @@ static int write_all_supers(struct btrfs_root *root, int max_mirrors)
>  
>  	mutex_lock(&root->fs_info->fs_devices->device_list_mutex);
>  	head = &root->fs_info->fs_devices->devices;
> +	max_errors = btrfs_super_num_devices(root->fs_info->super_copy) - 1;
>  
>  	if (do_barriers) {
>  		ret = barrier_all_devices(root->fs_info);
> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
> index 090f57c..eddf386 100644
> --- a/fs/btrfs/volumes.c
> +++ b/fs/btrfs/volumes.c
> @@ -1568,11 +1568,6 @@ int btrfs_rm_device(struct btrfs_root *root, char *device_path)
>  	if (ret)
>  		goto error_undo;
>  
> -	/*
> -	 * TODO: the superblock still includes this device in its num_devices
> -	 * counter although write_all_supers() is not locked out. This
> -	 * could give a filesystem state which requires a degraded mount.
> -	 */
>  	ret = btrfs_rm_dev_item(root->fs_info->chunk_root, device);

The problem that I had seen when I added that comment is something
different from what you are addressing.

The call to btrfs_rm_dev_item() is the place where the device is removed
from the filesystem device tree. The transaction is committed.

root->fs_info->super_copy is not updated and still includes the device
that is not part of the device tree anymore.

19 lines later, the device_list_mutex is acquired. Until then, nothing
prevents write_all_supers() from writing the superblock to disk. This
means that, until then, you can create a state on disk with an updated
device tree and a num_devices value which is too high by one.

If you now crash or the power drops, the on-disk state is not
consistent. However, this is not a severe problem. btrfs_rm_device()
relocates all chunks that are located on the removed device. On the next
mount, the device items are read first, and they no longer include the
deleted device. Afterwards the chunks are checked for references to a
device that is not present, and there are none.
Therefore this situation is not a severe problem, and my comment saying
that it "could require a degraded mount" was wrong.

But the field num_devices in the superblock will stay wrong for the
lifetime of the filesystem, causing malfunction of the ioctl
BTRFS_IOC_DEVICES_READY, and potentially causing trouble in the future
when somebody adds code that relies on fs_devices->total_devices being
correct.

It's simply not correct like it is now. And your patch doesn't fix the
issue that the TODO comment describes.
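
To make the window described above concrete, here is a minimal userspace
sketch. It is not kernel code: struct fs_model, struct disk_model and the
model_*() helpers are hypothetical stand-ins for the device tree, for
super_copy and for the functions named in this mail. It only shows that a
super block write landing between the dev item removal and the super_copy
update leaves num_devices one too high on disk.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical in-memory model of the two counters discussed above. */
struct fs_model {
	int dev_items;         /* device items in the committed device tree */
	int super_num_devices; /* num_devices held in super_copy */
};

/* What would be on disk if the supers were written right now. */
struct disk_model {
	int dev_items;
	int super_num_devices;
};

/* Stand-in for btrfs_rm_dev_item() plus the transaction commit. */
static void model_rm_dev_item(struct fs_model *fs)
{
	fs->dev_items--;
}

/* Stand-in for the later btrfs_set_super_num_devices() call. */
static void model_update_super_copy(struct fs_model *fs)
{
	fs->super_num_devices--;
}

/* Stand-in for write_all_supers(): snapshot the current state. */
static struct disk_model model_write_all_supers(const struct fs_model *fs)
{
	struct disk_model disk = { fs->dev_items, fs->super_num_devices };

	return disk;
}

/*
 * Returns by how much num_devices on disk exceeds the number of device
 * items, for a write inside (1) or after (0) the unprotected window.
 */
static int num_devices_excess(int n, int write_inside_window)
{
	struct fs_model fs = { n, n };
	struct disk_model disk;

	model_rm_dev_item(&fs);
	if (!write_inside_window)
		model_update_super_copy(&fs);
	disk = model_write_all_supers(&fs);

	printf("dev items: %d, num_devices: %d\n",
	       disk.dev_items, disk.super_num_devices);
	return disk.super_num_devices - disk.dev_items;
}
```

Calling num_devices_excess(3, 1) models a write inside the window, and
num_devices_excess(3, 0) models a write after super_copy was updated.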

>  	if (ret)
>  		goto error_undo;
> @@ -1588,7 +1583,9 @@ int btrfs_rm_device(struct btrfs_root *root, char *device_path)
>  	/*
>  	 * the device list mutex makes sure that we don't change
>  	 * the device list while someone else is writing out all
> -	 * the device supers.
> +	 * the device supers. Whoever is writing all supers, should
> +	 * lock the device list mutex before getting the number of
> +	 * devices in the super block (super_copy).
>  	 */
>  
>  	cur_devices = device->fs_devices;
> @@ -1612,10 +1609,10 @@ int btrfs_rm_device(struct btrfs_root *root, char *device_path)
>  		device->fs_devices->open_devices--;
>  
>  	call_rcu(&device->rcu, free_device);
> -	mutex_unlock(&root->fs_info->fs_devices->device_list_mutex);
>  
>  	num_devices = btrfs_super_num_devices(root->fs_info->super_copy) - 1;
>  	btrfs_set_super_num_devices(root->fs_info->super_copy, num_devices);
> +	mutex_unlock(&root->fs_info->fs_devices->device_list_mutex);
>  
>  	if (cur_devices->open_devices == 0) {
>  		struct btrfs_fs_devices *fs_devices;
> 

Filipe David Manana

2013-Aug-09 13:50 UTC

Re: [PATCH] Btrfs: fix race between removing a dev and writing sbs

On Fri, Aug 9, 2013 at 2:07 PM, Stefan Behrens
<sbehrens@giantdisaster.de> wrote:
> On Thu,  8 Aug 2013 21:00:52 +0100, Filipe David Borba Manana wrote:
>> Since all code paths that update the number of devices in the
>> super copy (fs_info->super_copy) first lock the device list
>> (fs_info->fs_devices->device_list_mutex), and write_all_supers()
>> also needs to lock the devices list mutex, make write_all_supers()
>> read the number of devices from the super copy after it locks
>> the device list mutex (and before unlocking it of course).
>>
>> The only code path that doesn't lock the device list mutex
>> before updating the number of devices in the super copy is
>> disk-io.c:next_root_backup(), called by open_ctree() during
>> mount time where concurrency issues can't happen.
>>
>> Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
>> ---
>>  fs/btrfs/disk-io.c |    2 +-
>>  fs/btrfs/volumes.c |   11 ++++-------
>>  2 files changed, 5 insertions(+), 8 deletions(-)
>>
>> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
>> index 254cdc8..c4b24c7 100644
>> --- a/fs/btrfs/disk-io.c
>> +++ b/fs/btrfs/disk-io.c
>> @@ -3313,7 +3313,6 @@ static int write_all_supers(struct btrfs_root *root, int max_mirrors)
>>       int total_errors = 0;
>>       u64 flags;
>>
>> -     max_errors = btrfs_super_num_devices(root->fs_info->super_copy) - 1;
>>       do_barriers = !btrfs_test_opt(root, NOBARRIER);
>>       backup_super_roots(root->fs_info);
>>
>> @@ -3322,6 +3321,7 @@ static int write_all_supers(struct btrfs_root *root, int max_mirrors)
>>
>>       mutex_lock(&root->fs_info->fs_devices->device_list_mutex);
>>       head = &root->fs_info->fs_devices->devices;
>> +     max_errors = btrfs_super_num_devices(root->fs_info->super_copy) - 1;
>>
>>       if (do_barriers) {
>>               ret = barrier_all_devices(root->fs_info);
>> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
>> index 090f57c..eddf386 100644
>> --- a/fs/btrfs/volumes.c
>> +++ b/fs/btrfs/volumes.c
>> @@ -1568,11 +1568,6 @@ int btrfs_rm_device(struct btrfs_root *root, char *device_path)
>>       if (ret)
>>               goto error_undo;
>>
>> -     /*
>> -      * TODO: the superblock still includes this device in its num_devices
>> -      * counter although write_all_supers() is not locked out. This
>> -      * could give a filesystem state which requires a degraded mount.
>> -      */
>>       ret = btrfs_rm_dev_item(root->fs_info->chunk_root, device);
>
> The problem that I had seen when I added that comment is something
> different than what you are addressing.
>
> The call to btrfs_rm_dev_item() is the place where the device is removed
> in the filesystem device tree. The transaction is committed.

So, it would only be fully correct if the call to btrfs_rm_dev_item()
(and the code that follows it) ran inside the critical section delimited
by the device list mutex (with the super_copy num devices updated
inside that section too, like I did).

Other than a potentially much longer critical section, or a mutex
deadlock (because btrfs_scrub_cancel locks scrub_lock), is there any
reason not to do it?

>
> root->fs_info->super_copy is not updated and still includes the device
> that is not part of the device tree anymore.
>
> 19 lines later, the device_list_mutex is acquired. Until then, nobody
> prevents write_all_supers() to write the superblock to disk. This means,
> until then, you can create a state on disk with an updated device tree
> and a num_devices value which is too high by one.
>
> If you now crash or the power drops, the on-disk state is not
> consistent. However, this is not a severe problem. btrfs_rm_device()
> relocates all chunks that are located on the removed device. On next
> mount, at first the device items are read which do not include the
> deleted device anymore, afterwards the chunks are checked, whether they
> reference a device that is not present. And this is not the case.
> Therefore this situation is not a severe problem and my comment was
> wrong that says "could require a degraded mount".
>
> But the field num_devices in the superblock will stay wrong for the
> lifetime of the filesystem, causing malfunction of the ioctl
> BTRFS_IOC_DEVICES_READY, and potentially causing trouble in the future
> when somebody adds code that relies on fs_devices->total_devices being
> correct.
>
> It's simply not correct like it is now. And your patch doesn't fix the
> issue that the TODO comment describes.

Thanks for the explanation, very helpful.

Indeed, it doesn't fix the issue you described. I was thinking more about
fixing the following issue:

1) write_all_supers() gets a number of N devices from super_copy, so it
will not panic if it fails to write super blocks for N - 1 devices;

2) Then it tries to acquire device_list_mutex, but blocks because
btrfs_rm_device() got it first;

3) btrfs_rm_device() removes the device from the list, does everything
else it needs to do, and then unlocks the device list mutex;

4) write_all_supers() acquires the mutex, iterates over all devices in
the list and gets N - 1 errors (it failed to write super blocks to all
devices);

5) Because N - 1 is less than N, it thinks all is ok, when it's not,
because there are actually only N - 1 devices now. Therefore the
BUG_ON() won't get executed.

This is more likely to matter for a small number of devices only (e.g.
2 -> 1). I will re-add your comment, as this patch fixes
something different.
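
As a rough illustration of steps 1) to 5), the arithmetic can be sketched
in plain userspace C. This is not the kernel code: too_many_errors() and
failure_detected() are hypothetical helpers, and the
"total_errors > max_errors" comparison is an assumption about how the
check at the end of write_all_supers() decides to hit the BUG_ON().

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/*
 * Hypothetical stand-in for the final check in write_all_supers():
 * max_errors is derived from the device count the writer sampled.
 */
static bool too_many_errors(int num_devices_seen, int total_errors)
{
	int max_errors = num_devices_seen - 1;

	return total_errors > max_errors;
}

/*
 * Model one device removal racing with a super block write where every
 * remaining device fails. If the count is sampled before taking the
 * mutex, the writer still sees N; if it is sampled under the mutex
 * (and the remover updated super_copy under it), the writer sees N - 1.
 * Returns true when the total failure is detected.
 */
static bool failure_detected(int n, bool sampled_under_mutex)
{
	int devices_in_list = n - 1;            /* after the removal */
	int num_devices_seen = sampled_under_mutex ? n - 1 : n;
	int total_errors = devices_in_list;     /* every write failed */
	bool detected = too_many_errors(num_devices_seen, total_errors);

	printf("seen=%d errors=%d detected=%d\n",
	       num_devices_seen, total_errors, detected);
	return detected;
}
```

With n = 2, failure_detected(2, false) models the stale read from step 1),
and failure_detected(2, true) models reading the count under the mutex.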
>
>
>>       if (ret)
>>               goto error_undo;
>> @@ -1588,7 +1583,9 @@ int btrfs_rm_device(struct btrfs_root *root, char *device_path)
>>       /*
>>        * the device list mutex makes sure that we don't change
>>        * the device list while someone else is writing out all
>> -      * the device supers.
>> +      * the device supers. Whoever is writing all supers, should
>> +      * lock the device list mutex before getting the number of
>> +      * devices in the super block (super_copy).
>>        */
>>
>>       cur_devices = device->fs_devices;
>> @@ -1612,10 +1609,10 @@ int btrfs_rm_device(struct btrfs_root *root, char *device_path)
>>               device->fs_devices->open_devices--;
>>
>>       call_rcu(&device->rcu, free_device);
>> -     mutex_unlock(&root->fs_info->fs_devices->device_list_mutex);
>>
>>       num_devices = btrfs_super_num_devices(root->fs_info->super_copy) - 1;
>>       btrfs_set_super_num_devices(root->fs_info->super_copy, num_devices);
>> +     mutex_unlock(&root->fs_info->fs_devices->device_list_mutex);
>>
>>       if (cur_devices->open_devices == 0) {
>>               struct btrfs_fs_devices *fs_devices;
>>
>


-- 
Filipe David Manana,

"Reasonable men adapt themselves to the world.
 Unreasonable men adapt the world to themselves.
 That's why all progress depends on unreasonable men."

Filipe David Borba Manana

2013-Aug-09 14:41 UTC

[PATCH v2] Btrfs: fix race between removing a dev and writing sbs

This change fixes an issue when removing a device and writing
all super blocks run simultaneously. Here are the steps necessary
for the issue to happen:

1) disk-io.c:write_all_supers() gets a number of N devices from the
   super_copy, so it will not panic if it fails to write super blocks
   for N - 1 devices;

2) Then it tries to acquire the device_list_mutex, but blocks because
   volumes.c:btrfs_rm_device() got it first;

3) btrfs_rm_device() removes the device from the list, then unlocks the
   mutex and after the unlock it updates the number of devices in
   super_copy to N - 1.

4) write_all_supers() finally acquires the mutex, iterates over all the
   devices in the list and gets N - 1 errors, that is, it failed to write
   super blocks to all the devices;

5) Because write_all_supers() thinks there are a total of N devices, it
   considers N - 1 errors to be ok, and therefore won't panic.

So this change just makes sure that write_all_supers() reads the number
of devices from super_copy after it acquires the device_list_mutex.
Conversely, it changes btrfs_rm_device() to update the number of devices
in super_copy before it releases the device list mutex.

The code path to add a new device (volumes.c:btrfs_init_new_device),
already has the right behaviour: it updates the number of devices in
super_copy while holding the device_list_mutex.

The only code path that doesn't lock the device list mutex
before updating the number of devices in the super copy is
disk-io.c:next_root_backup(), called by open_ctree() during
mount time where concurrency issues can't happen.

Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
---

V2: Restored Stefan Behrens's comment as it's about a different issue,
    and updated the commit message to be more detailed.
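
The locking rule from the commit message can be sketched in userspace
with a pthread mutex. This is only an illustration under stated
assumptions, not the kernel implementation: struct fs_model and the
model_*() functions are hypothetical, and the error threshold
(total_errors > max_errors) is assumed to match the check in
write_all_supers().

```c
#include <assert.h>
#include <pthread.h>

/* Hypothetical model of the state guarded by device_list_mutex. */
struct fs_model {
	pthread_mutex_t device_list_mutex;
	int devices_in_list;   /* length of the device list */
	int super_num_devices; /* num_devices in super_copy */
};

static void model_init(struct fs_model *fs, int n)
{
	pthread_mutex_init(&fs->device_list_mutex, NULL);
	fs->devices_in_list = n;
	fs->super_num_devices = n;
}

/* Remover: list and counter change together, before the unlock. */
static void model_rm_device(struct fs_model *fs)
{
	pthread_mutex_lock(&fs->device_list_mutex);
	fs->devices_in_list--;
	fs->super_num_devices--;
	pthread_mutex_unlock(&fs->device_list_mutex);
}

/*
 * Writer: samples the counter only after taking the mutex.
 * failing_writes is how many device writes fail in this model.
 * Returns 0 on success, -1 if more than max_errors writes failed.
 */
static int model_write_all_supers(struct fs_model *fs, int failing_writes)
{
	int max_errors, total_errors, ret;

	pthread_mutex_lock(&fs->device_list_mutex);
	max_errors = fs->super_num_devices - 1;
	/* Under the mutex the counter always matches the list. */
	assert(fs->super_num_devices == fs->devices_in_list);
	total_errors = failing_writes < fs->devices_in_list ?
		       failing_writes : fs->devices_in_list;
	ret = total_errors > max_errors ? -1 : 0;
	pthread_mutex_unlock(&fs->device_list_mutex);

	return ret;
}
```

Because both sides touch the counter only while holding the mutex, a
writer can never pair a stale device count with a shorter device list.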

 fs/btrfs/disk-io.c |    2 +-
 fs/btrfs/volumes.c |    8 ++++++--
 2 files changed, 7 insertions(+), 3 deletions(-)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 5de9ad7..e70f9c5 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -3319,7 +3319,6 @@ static int write_all_supers(struct btrfs_root *root, int max_mirrors)
 	int total_errors = 0;
 	u64 flags;
 
-	max_errors = btrfs_super_num_devices(root->fs_info->super_copy) - 1;
 	do_barriers = !btrfs_test_opt(root, NOBARRIER);
 	backup_super_roots(root->fs_info);
 
@@ -3328,6 +3327,7 @@ static int write_all_supers(struct btrfs_root *root, int max_mirrors)
 
 	mutex_lock(&root->fs_info->fs_devices->device_list_mutex);
 	head = &root->fs_info->fs_devices->devices;
+	max_errors = btrfs_super_num_devices(root->fs_info->super_copy) - 1;
 
 	if (do_barriers) {
 		ret = barrier_all_devices(root->fs_info);
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index c74d702..f62f535 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -1588,7 +1588,11 @@ int btrfs_rm_device(struct btrfs_root *root, char *device_path)
 	/*
 	 * the device list mutex makes sure that we don't change
 	 * the device list while someone else is writing out all
-	 * the device supers.
+	 * the device supers. Whoever is writing all supers, should
+	 * lock the device list mutex before getting the number of
+	 * devices in the super block (super_copy). Conversely,
+	 * whoever updates the number of devices in the super block
+	 * (super_copy) should hold the device list mutex.
 	 */
 
 	cur_devices = device->fs_devices;
@@ -1612,10 +1616,10 @@ int btrfs_rm_device(struct btrfs_root *root, char *device_path)
 		device->fs_devices->open_devices--;
 
 	call_rcu(&device->rcu, free_device);
-	mutex_unlock(&root->fs_info->fs_devices->device_list_mutex);
 
 	num_devices = btrfs_super_num_devices(root->fs_info->super_copy) - 1;
 	btrfs_set_super_num_devices(root->fs_info->super_copy, num_devices);
+	mutex_unlock(&root->fs_info->fs_devices->device_list_mutex);
 
 	if (cur_devices->open_devices == 0) {
 		struct btrfs_fs_devices *fs_devices;
-- 
1.7.9.5


Stefan Behrens

2013-Aug-09 14:58 UTC

Re: [PATCH] Btrfs: fix race between removing a dev and writing sbs

On Fri, 9 Aug 2013 14:50:35 +0100, Filipe David Manana wrote:
> On Fri, Aug 9, 2013 at 2:07 PM, Stefan Behrens
> <sbehrens@giantdisaster.de> wrote:
>> On Thu,  8 Aug 2013 21:00:52 +0100, Filipe David Borba Manana wrote:
>>> Since all code paths that update the number of devices in the
>>> super copy (fs_info->super_copy) first lock the device list
>>> (fs_info->fs_devices->device_list_mutex), and write_all_supers()
>>> also needs to lock the devices list mutex, make write_all_supers()
>>> read the number of devices from the super copy after it locks
>>> the device list mutex (and before unlocking it of course).
>>>
>>> The only code path that doesn't lock the device list mutex
>>> before updating the number of devices in the super copy is
>>> disk-io.c:next_root_backup(), called by open_ctree() during
>>> mount time where concurrency issues can't happen.
>>>
>>> Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
>>> ---
>>>  fs/btrfs/disk-io.c |    2 +-
>>>  fs/btrfs/volumes.c |   11 ++++-------
>>>  2 files changed, 5 insertions(+), 8 deletions(-)
>>>
>>> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
>>> index 254cdc8..c4b24c7 100644
>>> --- a/fs/btrfs/disk-io.c
>>> +++ b/fs/btrfs/disk-io.c
>>> @@ -3313,7 +3313,6 @@ static int write_all_supers(struct btrfs_root *root, int max_mirrors)
>>>       int total_errors = 0;
>>>       u64 flags;
>>>
>>> -     max_errors = btrfs_super_num_devices(root->fs_info->super_copy) - 1;
>>>       do_barriers = !btrfs_test_opt(root, NOBARRIER);
>>>       backup_super_roots(root->fs_info);
>>>
>>> @@ -3322,6 +3321,7 @@ static int write_all_supers(struct btrfs_root *root, int max_mirrors)
>>>
>>>       mutex_lock(&root->fs_info->fs_devices->device_list_mutex);
>>>       head = &root->fs_info->fs_devices->devices;
>>> +     max_errors = btrfs_super_num_devices(root->fs_info->super_copy) - 1;
>>>
>>>       if (do_barriers) {
>>>               ret = barrier_all_devices(root->fs_info);
>>> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
>>> index 090f57c..eddf386 100644
>>> --- a/fs/btrfs/volumes.c
>>> +++ b/fs/btrfs/volumes.c
>>> @@ -1568,11 +1568,6 @@ int btrfs_rm_device(struct btrfs_root *root, char *device_path)
>>>       if (ret)
>>>               goto error_undo;
>>>
>>> -     /*
>>> -      * TODO: the superblock still includes this device in its num_devices
>>> -      * counter although write_all_supers() is not locked out. This
>>> -      * could give a filesystem state which requires a degraded mount.
>>> -      */
>>>       ret = btrfs_rm_dev_item(root->fs_info->chunk_root, device);
>>
>> The problem that I had seen when I added that comment is something
>> different than what you are addressing.
>>
>> The call to btrfs_rm_dev_item() is the place where the device is removed
>> in the filesystem device tree. The transaction is committed.
> 
> So, it would only be super correct if the call to btrfs_rm_dev_item()
> (and the following code) is run inside the critical section delimited
> by the device list mutex (and have the super_copy num devices updated
> inside that section too, like I did).
> 
> Other than a potentially much longer critical section, or mutex
> deadlock (because btrfs_scrub_cancel locks scrub_lock), any reason to
> not do it?

Yes, committing a transaction while holding such a mutex usually causes
a deadlock.

> 
>>
>> root->fs_info->super_copy is not updated and still includes the device
>> that is not part of the device tree anymore.
>>
>> 19 lines later, the device_list_mutex is acquired. Until then, nobody
>> prevents write_all_supers() to write the superblock to disk. This means,
>> until then, you can create a state on disk with an updated device tree
>> and a num_devices value which is too high by one.
>>
>> If you now crash or the power drops, the on-disk state is not
>> consistent. However, this is not a severe problem. btrfs_rm_device()
>> relocates all chunks that are located on the removed device. On next
>> mount, at first the device items are read which do not include the
>> deleted device anymore, afterwards the chunks are checked, whether they
>> reference a device that is not present. And this is not the case.
>> Therefore this situation is not a severe problem and my comment was
>> wrong that says "could require a degraded mount".
>>
>> But the field num_devices in the superblock will stay wrong for the
>> lifetime of the filesystem, causing malfunction of the ioctl
>> BTRFS_IOC_DEVICES_READY, and potentially causing trouble in the future
>> when somebody adds code that relies on fs_devices->total_devices being
>> correct.
>>
>> It's simply not correct like it is now. And your patch doesn't fix the
>> issue that the TODO comment describes.
> 
> Thanks for the explanation, very helpful.
> 
> Indeed, it doesn't fix the issue you described. I thought more about
> fixing the following issue:
> 
> 1) Write super gets a number of N devices from super_copy, so it will
> not panic if it fails to write dbs for N - 1 devices;
> 
> 2) Then tries to acquire device_list_mutex, but blocks because
> btrfs_rm_device() got it first
> 
> 3) btrfs_rm_device() removes the device from the list, and does all
> those things it does and then unlocks the dev list mutex;
> 
> 4) write_all_supers() acquires the mutex, iterates over all devices in
> the list and gets N - 1 errors (failed to write db to all devices)
> 
> 5) Because N - 1 is less than N, it thinks all is ok, when it's not
> because there's actually only N - 1 devices now. Therefore the
> BUG_ON() won't get executed.

Right, this is also a problem.

> 
> This is more likely to happen for a small number of devices only (2 ->
> 1 for e.g.). I will revert re-add your comment, as this fixes
> something different.
> 
>>
>>
>>>       if (ret)
>>>               goto error_undo;
>>> @@ -1588,7 +1583,9 @@ int btrfs_rm_device(struct btrfs_root *root, char *device_path)
>>>       /*
>>>        * the device list mutex makes sure that we don't change
>>>        * the device list while someone else is writing out all
>>> -      * the device supers.
>>> +      * the device supers. Whoever is writing all supers, should
>>> +      * lock the device list mutex before getting the number of
>>> +      * devices in the super block (super_copy).
>>>        */
>>>
>>>       cur_devices = device->fs_devices;
>>> @@ -1612,10 +1609,10 @@ int btrfs_rm_device(struct btrfs_root *root, char *device_path)
>>>               device->fs_devices->open_devices--;
>>>
>>>       call_rcu(&device->rcu, free_device);
>>> -     mutex_unlock(&root->fs_info->fs_devices->device_list_mutex);
>>>
>>>       num_devices = btrfs_super_num_devices(root->fs_info->super_copy) - 1;
>>>       btrfs_set_super_num_devices(root->fs_info->super_copy, num_devices);
>>> +     mutex_unlock(&root->fs_info->fs_devices->device_list_mutex);
>>>
>>>       if (cur_devices->open_devices == 0) {
>>>               struct btrfs_fs_devices *fs_devices;
