thr3ads.net - Btrfs devel - [PATCH] Btrfs: hold the tree mod lock in __tree_mod_log

If this information is useful, please help other people find it:
Share via:

Josef Bacik

2013-Jun-30 03:17 UTC

[PATCH] Btrfs: hold the tree mod lock in __tree_mod_log_rewind

We need to hold the tree mod log lock in __tree_mod_log_rewind since we walk
forward in the tree mod entries, otherwise we''ll end up with random
entries and
trip the BUG_ON() at the front of __tree_mod_log_rewind.  This fixes the panics
people were seeing when running

find /whatever -type f -exec btrfs fi defrag {} \;

Thansk,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
---
 fs/btrfs/ctree.c |   10 ++++++----
 1 files changed, 6 insertions(+), 4 deletions(-)

diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c
index c32d03d..7921e1d 100644
--- a/fs/btrfs/ctree.c
+++ b/fs/btrfs/ctree.c
@@ -1161,8 +1161,8 @@ __tree_mod_log_oldest_root(struct btrfs_fs_info *fs_info,
  * time_seq).
  */
 static void
-__tree_mod_log_rewind(struct extent_buffer *eb, u64 time_seq,
-		      struct tree_mod_elem *first_tm)
+__tree_mod_log_rewind(struct btrfs_fs_info *fs_info, struct extent_buffer *eb,
+		      u64 time_seq, struct tree_mod_elem *first_tm)
 {
 	u32 n;
 	struct rb_node *next;
@@ -1172,6 +1172,7 @@ __tree_mod_log_rewind(struct extent_buffer *eb, u64
time_seq,
 	unsigned long p_size = sizeof(struct btrfs_key_ptr);
 
 	n = btrfs_header_nritems(eb);
+	tree_mod_log_read_lock(fs_info);
 	while (tm && tm->seq >= time_seq) {
 		/*
 		 * all the operations are recorded with the operator used for
@@ -1226,6 +1227,7 @@ __tree_mod_log_rewind(struct extent_buffer *eb, u64
time_seq,
 		if (tm->index != first_tm->index)
 			break;
 	}
+	tree_mod_log_read_unlock(fs_info);
 	btrfs_set_header_nritems(eb, n);
 }
 
@@ -1274,7 +1276,7 @@ tree_mod_log_rewind(struct btrfs_fs_info *fs_info, struct
extent_buffer *eb,
 
 	extent_buffer_get(eb_rewin);
 	btrfs_tree_read_lock(eb_rewin);
-	__tree_mod_log_rewind(eb_rewin, time_seq, tm);
+	__tree_mod_log_rewind(fs_info, eb_rewin, time_seq, tm);
 	WARN_ON(btrfs_header_nritems(eb_rewin) >
 		BTRFS_NODEPTRS_PER_BLOCK(fs_info->tree_root));
 
@@ -1350,7 +1352,7 @@ get_old_root(struct btrfs_root *root, u64 time_seq)
 		btrfs_set_header_generation(eb, old_generation);
 	}
 	if (tm)
-		__tree_mod_log_rewind(eb, time_seq, tm);
+		__tree_mod_log_rewind(root->fs_info, eb, time_seq, tm);
 	else
 		WARN_ON(btrfs_header_level(eb) != 0);
 	WARN_ON(btrfs_header_nritems(eb) > BTRFS_NODEPTRS_PER_BLOCK(root));
-- 
1.7.7.6

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs"
in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Jan Schmidt

2013-Jun-30 08:25 UTC

head link

Re: [PATCH] Btrfs: hold the tree mod lock in __tree_mod_log_rewind

On 30.06.2013 05:17, Josef Bacik wrote:> We need to hold the tree mod log lock in __tree_mod_log_rewind since we
walk
> forward in the tree mod entries, otherwise we''ll end up with
random entries and
> trip the BUG_ON() at the front of __tree_mod_log_rewind.  This fixes the
panics
> people were seeing when running
> 
> find /whatever -type f -exec btrfs fi defrag {} \;
This patch cannot help to solve the problem, as far as I''ve understood
what is going on. It does change timing, though, which presumably makes
it pass the current reproducer we''re having.

On rewinding, iteration through the tree mod log rb-tree goes backwards
in time, which means that once we''ve found our staring point we cannot
be trapped by later additions. The old items we''re rewinding towards
cannot be freed, because we''ve allocated a blocker element within the
tree and rewinding never goes beyond the allocated blocker. The blocker
element is allocated by btrfs_get_tree_mod_seq and mostly referred to as
time_seq within the other tree mod log functions in ctree.c. To sum up,
the added lock is not required.

The debug output I''ve analyzed so far shows that after we''ve
rewinded
all REMOVE_WHILE_FREEING operations on a buffer, ordered consecutively
as expected, there comes another REMOVE_WHILE_FREEING with a sequence
number much further in the past for the same buffer (but that sequence
number is still higher than out time_seq rewind barrier at that point).
This must be a logical problem I''ve not completely understood so far,
but locking doesn''t seem to be the right track.

Thanks,
-Jan

> Thansk,
> 
> Signed-off-by: Josef Bacik <jbacik@fusionio.com>
> ---
>  fs/btrfs/ctree.c |   10 ++++++----
>  1 files changed, 6 insertions(+), 4 deletions(-)
> 
> diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c
> index c32d03d..7921e1d 100644
> --- a/fs/btrfs/ctree.c
> +++ b/fs/btrfs/ctree.c
> @@ -1161,8 +1161,8 @@ __tree_mod_log_oldest_root(struct btrfs_fs_info
*fs_info,
>   * time_seq).
>   */
>  static void
> -__tree_mod_log_rewind(struct extent_buffer *eb, u64 time_seq,
> -		      struct tree_mod_elem *first_tm)
> +__tree_mod_log_rewind(struct btrfs_fs_info *fs_info, struct extent_buffer
*eb,
> +		      u64 time_seq, struct tree_mod_elem *first_tm)
>  {
>  	u32 n;
>  	struct rb_node *next;
> @@ -1172,6 +1172,7 @@ __tree_mod_log_rewind(struct extent_buffer *eb, u64
time_seq,
>  	unsigned long p_size = sizeof(struct btrfs_key_ptr);
>  
>  	n = btrfs_header_nritems(eb);
> +	tree_mod_log_read_lock(fs_info);
>  	while (tm && tm->seq >= time_seq) {
>  		/*
>  		 * all the operations are recorded with the operator used for
> @@ -1226,6 +1227,7 @@ __tree_mod_log_rewind(struct extent_buffer *eb, u64
time_seq,
>  		if (tm->index != first_tm->index)
>  			break;
>  	}
> +	tree_mod_log_read_unlock(fs_info);
>  	btrfs_set_header_nritems(eb, n);
>  }
>  
> @@ -1274,7 +1276,7 @@ tree_mod_log_rewind(struct btrfs_fs_info *fs_info,
struct extent_buffer *eb,
>  
>  	extent_buffer_get(eb_rewin);
>  	btrfs_tree_read_lock(eb_rewin);
> -	__tree_mod_log_rewind(eb_rewin, time_seq, tm);
> +	__tree_mod_log_rewind(fs_info, eb_rewin, time_seq, tm);
>  	WARN_ON(btrfs_header_nritems(eb_rewin) >
>  		BTRFS_NODEPTRS_PER_BLOCK(fs_info->tree_root));
>  
> @@ -1350,7 +1352,7 @@ get_old_root(struct btrfs_root *root, u64 time_seq)
>  		btrfs_set_header_generation(eb, old_generation);
>  	}
>  	if (tm)
> -		__tree_mod_log_rewind(eb, time_seq, tm);
> +		__tree_mod_log_rewind(root->fs_info, eb, time_seq, tm);
>  	else
>  		WARN_ON(btrfs_header_level(eb) != 0);
>  	WARN_ON(btrfs_header_nritems(eb) > BTRFS_NODEPTRS_PER_BLOCK(root));
> --
To unsubscribe from this list: send the line "unsubscribe linux-btrfs"
in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Josef Bacik

2013-Jun-30 11:22 UTC

head link

Re: [PATCH] Btrfs: hold the tree mod lock in __tree_mod_log_rewind

On Sun, Jun 30, 2013 at 10:25:05AM +0200, Jan Schmidt
wrote:> On 30.06.2013 05:17, Josef Bacik wrote:
> > We need to hold the tree mod log lock in __tree_mod_log_rewind since
we walk
> > forward in the tree mod entries, otherwise we''ll end up with
random entries and
> > trip the BUG_ON() at the front of __tree_mod_log_rewind.  This fixes
the panics
> > people were seeing when running
> > 
> > find /whatever -type f -exec btrfs fi defrag {} \;
> 
> This patch cannot help to solve the problem, as far as I''ve
understood
> what is going on. It does change timing, though, which presumably makes
> it pass the current reproducer we''re having.
> 
> On rewinding, iteration through the tree mod log rb-tree goes backwards
> in time, which means that once we''ve found our staring point we
cannot
> be trapped by later additions. The old items we''re rewinding
towards
> cannot be freed, because we''ve allocated a blocker element within
the
> tree and rewinding never goes beyond the allocated blocker. The blocker
> element is allocated by btrfs_get_tree_mod_seq and mostly referred to as
> time_seq within the other tree mod log functions in ctree.c. To sum up,
> the added lock is not required.
> 
> The debug output I''ve analyzed so far shows that after
we''ve rewinded
> all REMOVE_WHILE_FREEING operations on a buffer, ordered consecutively
> as expected, there comes another REMOVE_WHILE_FREEING with a sequence
> number much further in the past for the same buffer (but that sequence
> number is still higher than out time_seq rewind barrier at that point).
> This must be a logical problem I''ve not completely understood so
far,
> but locking doesn''t seem to be the right track.
> 
So I''ve seen 2 scenarios with my debug output, once with my printk()
which
seemed to be unreliable and then again with trace_printk() which was much more
reliable.

With the printk() output I saw what you described, however I was also printing
out every mod we added and I never saw the sequence number allocated that we had
suddenly tripped over, so I assumed that the printk() just got lost between the
boxes (I was using netconsole to capture the output).  However if it was never
actually allocated then it would make sense that we just pointed into some
random peice of memory.

The second bit I saw which is when I was convinced this was the problem had us
processing the keys, and suddenly we''d try to process the key again. 
So I saw
something like this (I lost the output because I was capturing the output into
the same file again when I tried to reproduce with this patch so I
don''t have
the exact output)

processing op 3, seq 10 slot 0
processing op 3, seq 9 slot 1
processing op 3, seq 8 slot 2
processing op 3, seq 7 slot 3
processing op 3, seq 6 slot 4
processing op 3, seq 5 slot 5
processing op 3, seq 8 slot 2

so we got an entry we had already processed, which is completely in line with
the rb_tree pointing to something random when we do rb_next.

Now I had another theory early on, but I haven''t seen output to confirm
this.
We can re-allocate a block if we''ve allocated it in the same
transaction, and it
appears this would completely screw up the tree mod log.  So say we allocate
block 5, we add keys to it, and then we free it, so we get this
MOVE_WHILE_FREEING thing done on the entire block, but then we re-allocate the
same block and start doing normal operations to it.  Then we go to replay back
and we will replay all the normal operations, but then we will walk into the
REMOVE_WHILE_FREEING entries from when the block was free''d last and
this will
trigger the BUG_ON().  Now I hadn''t actually had this happen so either
it can''t
happen some how and I''ve missed the reasoning for it, or what I saw was
just the
rb_tree messup because we aren''t locking it and what you are seeing is
this
particular case.  Either way I''m going to back my patch out and re-run
a few
more times with my trace_printk()''s so I can have several runs to look
at and
verify what I''m seeing.  Thanks,

Josef
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs"
in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Josef Bacik

2013-Jun-30 13:55 UTC

head link

Re: [PATCH] Btrfs: hold the tree mod lock in __tree_mod_log_rewind

On Sun, Jun 30, 2013 at 10:25:05AM +0200, Jan Schmidt
wrote:> On 30.06.2013 05:17, Josef Bacik wrote:
> > We need to hold the tree mod log lock in __tree_mod_log_rewind since
we walk
> > forward in the tree mod entries, otherwise we''ll end up with
random entries and
> > trip the BUG_ON() at the front of __tree_mod_log_rewind.  This fixes
the panics
> > people were seeing when running
> > 
> > find /whatever -type f -exec btrfs fi defrag {} \;
> 
> This patch cannot help to solve the problem, as far as I''ve
understood
> what is going on. It does change timing, though, which presumably makes
> it pass the current reproducer we''re having.
> 
> On rewinding, iteration through the tree mod log rb-tree goes backwards
> in time, which means that once we''ve found our staring point we
cannot
> be trapped by later additions. The old items we''re rewinding
towards
> cannot be freed, because we''ve allocated a blocker element within
the
> tree and rewinding never goes beyond the allocated blocker. The blocker
> element is allocated by btrfs_get_tree_mod_seq and mostly referred to as
> time_seq within the other tree mod log functions in ctree.c. To sum up,
> the added lock is not required.
> 
> The debug output I''ve analyzed so far shows that after
we''ve rewinded
> all REMOVE_WHILE_FREEING operations on a buffer, ordered consecutively
> as expected, there comes another REMOVE_WHILE_FREEING with a sequence
> number much further in the past for the same buffer (but that sequence
> number is still higher than out time_seq rewind barrier at that point).
> This must be a logical problem I''ve not completely understood so
far,
> but locking doesn''t seem to be the right track.
> 
Finally reproduced it, this is my output

 btrfs-endio-wri-23110 [000] ...2  9556.882103: __tree_mod_log_rewind: rewinding
15450537984
 btrfs-endio-wri-23110 [000] ...2  9556.882104: __tree_mod_log_rewind:
15450537984: processing ffff880246590a40, op 3, seq 68719476829, slot 0
 btrfs-endio-wri-23110 [000] ...2  9556.882106: __tree_mod_log_rewind:
15450537984: processing ffff880246590ac0, op 3, seq 68719476828, slot 1
 btrfs-endio-wri-23110 [000] ...2  9556.882108: __tree_mod_log_rewind:
15450537984: processing ffff880246590a40, op 3, seq 68719476829, slot 0
 btrfs-endio-wri-23110 [000] ...2  9556.882110: __tree_mod_log_rewind:
15450537984: this tm is failing, ffff880246590a40, seq 68719476829, slot 0

so I''m inclined to beleive I''ve got it right.  Thanks,

Josef
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs"
in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Liu Bo

2013-Jun-30 15:02 UTC

head link

Re: [PATCH] Btrfs: hold the tree mod lock in __tree_mod_log_rewind

On Sun, Jun 30, 2013 at 07:22:00AM -0400, Josef Bacik
wrote:> On Sun, Jun 30, 2013 at 10:25:05AM +0200, Jan Schmidt wrote:
> > On 30.06.2013 05:17, Josef Bacik wrote:
> > > We need to hold the tree mod log lock in __tree_mod_log_rewind
since we walk
> > > forward in the tree mod entries, otherwise we''ll end up
with random entries and
> > > trip the BUG_ON() at the front of __tree_mod_log_rewind.  This
fixes the panics
> > > people were seeing when running
> > > 
> > > find /whatever -type f -exec btrfs fi defrag {} \;
> > 
> > This patch cannot help to solve the problem, as far as I''ve
understood
> > what is going on. It does change timing, though, which presumably
makes
> > it pass the current reproducer we''re having.
> > 
> > On rewinding, iteration through the tree mod log rb-tree goes
backwards
> > in time, which means that once we''ve found our staring point
we cannot
> > be trapped by later additions. The old items we''re rewinding
towards
> > cannot be freed, because we''ve allocated a blocker element
within the
> > tree and rewinding never goes beyond the allocated blocker. The
blocker
> > element is allocated by btrfs_get_tree_mod_seq and mostly referred to
as
> > time_seq within the other tree mod log functions in ctree.c. To sum
up,
> > the added lock is not required.
> > 
> > The debug output I''ve analyzed so far shows that after
we''ve rewinded
> > all REMOVE_WHILE_FREEING operations on a buffer, ordered consecutively
> > as expected, there comes another REMOVE_WHILE_FREEING with a sequence
> > number much further in the past for the same buffer (but that sequence
> > number is still higher than out time_seq rewind barrier at that
point).
> > This must be a logical problem I''ve not completely understood
so far,
> > but locking doesn''t seem to be the right track.
> > 
> 
> So I''ve seen 2 scenarios with my debug output, once with my
printk() which
> seemed to be unreliable and then again with trace_printk() which was much
more
> reliable.
> 
> With the printk() output I saw what you described, however I was also
printing
> out every mod we added and I never saw the sequence number allocated that
we had
> suddenly tripped over, so I assumed that the printk() just got lost between
the
> boxes (I was using netconsole to capture the output).  However if it was
never
> actually allocated then it would make sense that we just pointed into some
> random peice of memory.
> 
> The second bit I saw which is when I was convinced this was the problem had
us
> processing the keys, and suddenly we''d try to process the key
again.  So I saw
> something like this (I lost the output because I was capturing the output
into
> the same file again when I tried to reproduce with this patch so I
don''t have
> the exact output)
> 
> processing op 3, seq 10 slot 0
> processing op 3, seq 9 slot 1
> processing op 3, seq 8 slot 2
> processing op 3, seq 7 slot 3
> processing op 3, seq 6 slot 4
> processing op 3, seq 5 slot 5
> processing op 3, seq 8 slot 2
> 
> so we got an entry we had already processed, which is completely in line
with
> the rb_tree pointing to something random when we do rb_next.
> 
> Now I had another theory early on, but I haven''t seen output to
confirm this.
> We can re-allocate a block if we''ve allocated it in the same
transaction, and it
> appears this would completely screw up the tree mod log.  So say we
allocate
> block 5, we add keys to it, and then we free it, so we get this
> MOVE_WHILE_FREEING thing done on the entire block, but then we re-allocate
the
> same block and start doing normal operations to it.  Then we go to replay
back
> and we will replay all the normal operations, but then we will walk into
the
> REMOVE_WHILE_FREEING entries from when the block was free''d last
and this will
> trigger the BUG_ON().  Now I hadn''t actually had this happen so
either it can''t
> happen some how and I''ve missed the reasoning for it, or what I
saw was just the
> rb_tree messup because we aren''t locking it and what you are
seeing is this
> particular case.  Either way I''m going to back my patch out and
re-run a few
> more times with my trace_printk()''s so I can have several runs to
look at and
> verify what I''m seeing.  Thanks,
I doubt the theory here, if we free a block, it''s marked as pinned and
will then be marked as free only when we commit the transaction, isn''t
it?

- liubo
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs"
in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Josef Bacik

2013-Jun-30 18:01 UTC

head link

Re: [PATCH] Btrfs: hold the tree mod lock in __tree_mod_log_rewind

On Sun, Jun 30, 2013 at 11:02:10PM +0800, Liu Bo wrote:> On Sun, Jun 30, 2013 at 07:22:00AM -0400, Josef Bacik wrote:
> > On Sun, Jun 30, 2013 at 10:25:05AM +0200, Jan Schmidt wrote:
> > > On 30.06.2013 05:17, Josef Bacik wrote:
> > > > We need to hold the tree mod log lock in
__tree_mod_log_rewind since we walk
> > > > forward in the tree mod entries, otherwise we''ll
end up with random entries and
> > > > trip the BUG_ON() at the front of __tree_mod_log_rewind. 
This fixes the panics
> > > > people were seeing when running
> > > > 
> > > > find /whatever -type f -exec btrfs fi defrag {} \;
> > > 
> > > This patch cannot help to solve the problem, as far as
I''ve understood
> > > what is going on. It does change timing, though, which presumably
makes
> > > it pass the current reproducer we''re having.
> > > 
> > > On rewinding, iteration through the tree mod log rb-tree goes
backwards
> > > in time, which means that once we''ve found our staring
point we cannot
> > > be trapped by later additions. The old items we''re
rewinding towards
> > > cannot be freed, because we''ve allocated a blocker
element within the
> > > tree and rewinding never goes beyond the allocated blocker. The
blocker
> > > element is allocated by btrfs_get_tree_mod_seq and mostly
referred to as
> > > time_seq within the other tree mod log functions in ctree.c. To
sum up,
> > > the added lock is not required.
> > > 
> > > The debug output I''ve analyzed so far shows that after
we''ve rewinded
> > > all REMOVE_WHILE_FREEING operations on a buffer, ordered
consecutively
> > > as expected, there comes another REMOVE_WHILE_FREEING with a
sequence
> > > number much further in the past for the same buffer (but that
sequence
> > > number is still higher than out time_seq rewind barrier at that
point).
> > > This must be a logical problem I''ve not completely
understood so far,
> > > but locking doesn''t seem to be the right track.
> > > 
> > 
> > So I''ve seen 2 scenarios with my debug output, once with my
printk() which
> > seemed to be unreliable and then again with trace_printk() which was
much more
> > reliable.
> > 
> > With the printk() output I saw what you described, however I was also
printing
> > out every mod we added and I never saw the sequence number allocated
that we had
> > suddenly tripped over, so I assumed that the printk() just got lost
between the
> > boxes (I was using netconsole to capture the output).  However if it
was never
> > actually allocated then it would make sense that we just pointed into
some
> > random peice of memory.
> > 
> > The second bit I saw which is when I was convinced this was the
problem had us
> > processing the keys, and suddenly we''d try to process the key
again.  So I saw
> > something like this (I lost the output because I was capturing the
output into
> > the same file again when I tried to reproduce with this patch so I
don''t have
> > the exact output)
> > 
> > processing op 3, seq 10 slot 0
> > processing op 3, seq 9 slot 1
> > processing op 3, seq 8 slot 2
> > processing op 3, seq 7 slot 3
> > processing op 3, seq 6 slot 4
> > processing op 3, seq 5 slot 5
> > processing op 3, seq 8 slot 2
> > 
> > so we got an entry we had already processed, which is completely in
line with
> > the rb_tree pointing to something random when we do rb_next.
> > 
> > Now I had another theory early on, but I haven''t seen output
to confirm this.
> > We can re-allocate a block if we''ve allocated it in the same
transaction, and it
> > appears this would completely screw up the tree mod log.  So say we
allocate
> > block 5, we add keys to it, and then we free it, so we get this
> > MOVE_WHILE_FREEING thing done on the entire block, but then we
re-allocate the
> > same block and start doing normal operations to it.  Then we go to
replay back
> > and we will replay all the normal operations, but then we will walk
into the
> > REMOVE_WHILE_FREEING entries from when the block was free''d
last and this will
> > trigger the BUG_ON().  Now I hadn''t actually had this happen
so either it can''t
> > happen some how and I''ve missed the reasoning for it, or what
I saw was just the
> > rb_tree messup because we aren''t locking it and what you are
seeing is this
> > particular case.  Either way I''m going to back my patch out
and re-run a few
> > more times with my trace_printk()''s so I can have several
runs to look at and
> > verify what I''m seeing.  Thanks,
> 
> I doubt the theory here, if we free a block, it''s marked as pinned
and
> will then be marked as free only when we commit the transaction,
isn''t it?
> 
Not sure if my other reply made it through, my phone only responds with HTML
which is kind of a pain.

If you look at btrfs_free_tree_block we will just add the space back if the
buffer was allocated in this transaction and hasn''t been written yet,
so this
case can happen.  I think we should probably add another mod log operation that
gets added when this happens so that if we do re-allocate it and try to rewind
back to when it belonged to another tree we can do that without panicing.
Thanks,

Josef
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs"
in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Liu Bo

2013-Jul-01 09:18 UTC

head link

Re: [PATCH] Btrfs: hold the tree mod lock in __tree_mod_log_rewind

On Sun, Jun 30, 2013 at 02:01:17PM -0400, Josef Bacik
wrote:> On Sun, Jun 30, 2013 at 11:02:10PM +0800, Liu Bo wrote:
> > On Sun, Jun 30, 2013 at 07:22:00AM -0400, Josef Bacik wrote:
> > > On Sun, Jun 30, 2013 at 10:25:05AM +0200, Jan Schmidt wrote:
> > > > On 30.06.2013 05:17, Josef Bacik wrote:
> > > > > We need to hold the tree mod log lock in
__tree_mod_log_rewind since we walk
> > > > > forward in the tree mod entries, otherwise
we''ll end up with random entries and
> > > > > trip the BUG_ON() at the front of
__tree_mod_log_rewind.  This fixes the panics
> > > > > people were seeing when running
> > > > > 
> > > > > find /whatever -type f -exec btrfs fi defrag {} \;
> > > > 
> > > > This patch cannot help to solve the problem, as far as
I''ve understood
> > > > what is going on. It does change timing, though, which
presumably makes
> > > > it pass the current reproducer we''re having.
> > > > 
> > > > On rewinding, iteration through the tree mod log rb-tree
goes backwards
> > > > in time, which means that once we''ve found our
staring point we cannot
> > > > be trapped by later additions. The old items we''re
rewinding towards
> > > > cannot be freed, because we''ve allocated a blocker
element within the
> > > > tree and rewinding never goes beyond the allocated blocker.
The blocker
> > > > element is allocated by btrfs_get_tree_mod_seq and mostly
referred to as
> > > > time_seq within the other tree mod log functions in ctree.c.
To sum up,
> > > > the added lock is not required.
> > > > 
> > > > The debug output I''ve analyzed so far shows that
after we''ve rewinded
> > > > all REMOVE_WHILE_FREEING operations on a buffer, ordered
consecutively
> > > > as expected, there comes another REMOVE_WHILE_FREEING with a
sequence
> > > > number much further in the past for the same buffer (but
that sequence
> > > > number is still higher than out time_seq rewind barrier at
that point).
> > > > This must be a logical problem I''ve not completely
understood so far,
> > > > but locking doesn''t seem to be the right track.
> > > > 
> > > 
> > > So I''ve seen 2 scenarios with my debug output, once with
my printk() which
> > > seemed to be unreliable and then again with trace_printk() which
was much more
> > > reliable.
> > > 
> > > With the printk() output I saw what you described, however I was
also printing
> > > out every mod we added and I never saw the sequence number
allocated that we had
> > > suddenly tripped over, so I assumed that the printk() just got
lost between the
> > > boxes (I was using netconsole to capture the output).  However if
it was never
> > > actually allocated then it would make sense that we just pointed
into some
> > > random peice of memory.
> > > 
> > > The second bit I saw which is when I was convinced this was the
problem had us
> > > processing the keys, and suddenly we''d try to process
the key again.  So I saw
> > > something like this (I lost the output because I was capturing
the output into
> > > the same file again when I tried to reproduce with this patch so
I don''t have
> > > the exact output)
> > > 
> > > processing op 3, seq 10 slot 0
> > > processing op 3, seq 9 slot 1
> > > processing op 3, seq 8 slot 2
> > > processing op 3, seq 7 slot 3
> > > processing op 3, seq 6 slot 4
> > > processing op 3, seq 5 slot 5
> > > processing op 3, seq 8 slot 2
> > > 
> > > so we got an entry we had already processed, which is completely
in line with
> > > the rb_tree pointing to something random when we do rb_next.
> > > 
> > > Now I had another theory early on, but I haven''t seen
output to confirm this.
> > > We can re-allocate a block if we''ve allocated it in the
same transaction, and it
> > > appears this would completely screw up the tree mod log.  So say
we allocate
> > > block 5, we add keys to it, and then we free it, so we get this
> > > MOVE_WHILE_FREEING thing done on the entire block, but then we
re-allocate the
> > > same block and start doing normal operations to it.  Then we go
to replay back
> > > and we will replay all the normal operations, but then we will
walk into the
> > > REMOVE_WHILE_FREEING entries from when the block was
free''d last and this will
> > > trigger the BUG_ON().  Now I hadn''t actually had this
happen so either it can''t
> > > happen some how and I''ve missed the reasoning for it, or
what I saw was just the
> > > rb_tree messup because we aren''t locking it and what you
are seeing is this
> > > particular case.  Either way I''m going to back my patch
out and re-run a few
> > > more times with my trace_printk()''s so I can have
several runs to look at and
> > > verify what I''m seeing.  Thanks,
> > 
> > I doubt the theory here, if we free a block, it''s marked as
pinned and
> > will then be marked as free only when we commit the transaction,
isn''t it?
> > 
> 
> Not sure if my other reply made it through, my phone only responds with
HTML
> which is kind of a pain.
> 
> If you look at btrfs_free_tree_block we will just add the space back if the
> buffer was allocated in this transaction and hasn''t been written
yet, so this
> case can happen.  I think we should probably add another mod log operation
that
> gets added when this happens so that if we do re-allocate it and try to
rewind
> back to when it belonged to another tree we can do that without panicing.
> Thanks,
Yeah it''s right, thanks for the explanation.

Another option is to do a bisect if time is available, I was testing
tree mod log intensively while working on snapshot-aware defrag
and never hit this bug, so it might be introduced by some later commits.

- liubo
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs"
in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Jan Schmidt

2013-Jul-02 07:24 UTC

head link

Re: [PATCH] Btrfs: hold the tree mod lock in __tree_mod_log_rewind

On Sun, June 30, 2013 at 15:55 (+0200), Josef Bacik
wrote:> On Sun, Jun 30, 2013 at 10:25:05AM +0200, Jan Schmidt wrote:
>> On 30.06.2013 05:17, Josef Bacik wrote:
>>> We need to hold the tree mod log lock in __tree_mod_log_rewind
since we walk
>>> forward in the tree mod entries, otherwise we''ll end up
with random entries and
>>> trip the BUG_ON() at the front of __tree_mod_log_rewind.  This
fixes the panics
>>> people were seeing when running
>>>
>>> find /whatever -type f -exec btrfs fi defrag {} \;
>>
>> This patch cannot help to solve the problem, as far as I''ve
understood
>> what is going on. It does change timing, though, which presumably makes
>> it pass the current reproducer we''re having.
>>
>> On rewinding, iteration through the tree mod log rb-tree goes backwards
>> in time, which means that once we''ve found our staring point
we cannot
>> be trapped by later additions. The old items we''re rewinding
towards
>> cannot be freed, because we''ve allocated a blocker element
within the
>> tree and rewinding never goes beyond the allocated blocker. The blocker
>> element is allocated by btrfs_get_tree_mod_seq and mostly referred to
as
>> time_seq within the other tree mod log functions in ctree.c. To sum up,
>> the added lock is not required.
>>
>> The debug output I''ve analyzed so far shows that after
we''ve rewinded
>> all REMOVE_WHILE_FREEING operations on a buffer, ordered consecutively
>> as expected, there comes another REMOVE_WHILE_FREEING with a sequence
>> number much further in the past for the same buffer (but that sequence
>> number is still higher than out time_seq rewind barrier at that point).
>> This must be a logical problem I''ve not completely understood
so far,
>> but locking doesn''t seem to be the right track.
>>
> 
> Finally reproduced it, this is my output
> 
>  btrfs-endio-wri-23110 [000] ...2  9556.882103: __tree_mod_log_rewind:
rewinding 15450537984
>  btrfs-endio-wri-23110 [000] ...2  9556.882104: __tree_mod_log_rewind:
15450537984: processing ffff880246590a40, op 3, seq 68719476829, slot 0
>  btrfs-endio-wri-23110 [000] ...2  9556.882106: __tree_mod_log_rewind:
15450537984: processing ffff880246590ac0, op 3, seq 68719476828, slot 1
>  btrfs-endio-wri-23110 [000] ...2  9556.882108: __tree_mod_log_rewind:
15450537984: processing ffff880246590a40, op 3, seq 68719476829, slot 0
>  btrfs-endio-wri-23110 [000] ...2  9556.882110: __tree_mod_log_rewind:
15450537984: this tm is failing, ffff880246590a40, seq 68719476829, slot 0
> 
> so I''m inclined to beleive I''ve got it right.  Thanks,
Looking at the code I agree we should have a read lock around rb_next,
protecting it against reorganization during insertions. Fits to that kind of
debug output.

How about just getting the lock for the rb_next call? There can be quite a lot
of operations to rewind and I''d rather not have every other fs tree
modification
block on that.

Thanks,
-Jan
> Josef
> --
> To unsubscribe from this list: send the line "unsubscribe
linux-btrfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> --
To unsubscribe from this list: send the line "unsubscribe linux-btrfs"
in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Btrfs devel - Jun 2013 - [PATCH] Btrfs: hold the tree mod lock in __tree_mod_log_rewind

[PATCH] Btrfs: hold the tree mod lock in __tree_mod_log_rewind

Re: [PATCH] Btrfs: hold the tree mod lock in __tree_mod_log_rewind

Re: [PATCH] Btrfs: hold the tree mod lock in __tree_mod_log_rewind

Re: [PATCH] Btrfs: hold the tree mod lock in __tree_mod_log_rewind

Re: [PATCH] Btrfs: hold the tree mod lock in __tree_mod_log_rewind

Re: [PATCH] Btrfs: hold the tree mod lock in __tree_mod_log_rewind

Re: [PATCH] Btrfs: hold the tree mod lock in __tree_mod_log_rewind

Re: [PATCH] Btrfs: hold the tree mod lock in __tree_mod_log_rewind