Hi: Since the RAID 5/6 code was merged into Btrfs in February 2013, have there been no updates or bug reports on the mailing list? Is there a development plan for btrfs raid5, such as adjusting the stripe width or reconstruction? Compared with md raid5, what are the advantages of btrfs raid5?
BTRFS_INODE_NODATACOW -- what does this macro mean?
On Mon, Oct 21, 2013 at 11:53:42PM +0800, lilofile wrote:
> BTRFS_INODE_NODATACOW -- what does this macro mean?

A file with that attribute set will be modified in place, rather than having its extents CoWed when they're modified. This is the implementation of the +C attribute.

   Hugo.

-- 
=== Hugo Mills: hugo@... carfax.org.uk | darksatanic.net | lug.org.uk ==
  PGP key: 65E74AC0 from wwwkeys.eu.pgp.net or http://www.carfax.org.uk
      --- Standardising Unix is like pasteurising Camembert. ---
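For reference, the +C attribute can be set from userspace with "chattr +C" on an empty file, or directly via the FS_IOC_SETFLAGS ioctl with FS_NOCOW_FL; btrfs records the result as BTRFS_INODE_NODATACOW on the inode. A minimal userspace sketch (error paths trimmed to the essentials; note that the flag only takes effect on files that do not yet have any data):

/* nocow.c - create an empty file and mark it NOCOW (the same bit that
 * "chattr +C" sets); btrfs stores this as BTRFS_INODE_NODATACOW. */
#include <fcntl.h>
#include <linux/fs.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>

int main(int argc, char **argv)
{
        int fd, flags;

        if (argc != 2) {
                fprintf(stderr, "usage: %s <new-file>\n", argv[0]);
                return 1;
        }

        /* The attribute only sticks while the file is still empty, so
         * create it fresh rather than opening an existing file. */
        fd = open(argv[1], O_CREAT | O_EXCL | O_RDWR, 0644);
        if (fd < 0) {
                perror("open");
                return 1;
        }

        if (ioctl(fd, FS_IOC_GETFLAGS, &flags) < 0) {
                perror("FS_IOC_GETFLAGS");
                return 1;
        }

        flags |= FS_NOCOW_FL;   /* disable data CoW for this file */

        if (ioctl(fd, FS_IOC_SETFLAGS, &flags) < 0) {
                perror("FS_IOC_SETFLAGS");
                return 1;
        }

        close(fd);
        return 0;
}

After this, lsattr on the file should show the C bit set.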
transaction_kthread periodically commits data and metadata to disk; similarly, btrfs_writepages can write data pages to disk. In which situations is the btrfs_writepages function called? I cannot find any caller of btrfs_writepages in the btrfs code. Can anyone tell me?
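For what it's worth, btrfs_writepages has no direct caller inside fs/btrfs/: it is registered as the .writepages hook in the inode's address_space_operations, and the generic writeback code reaches it through that table (do_writepages(), driven by the flusher threads, sync, and memory pressure). A rough sketch of how the hook is wired, from memory of the 3.x-era sources, so treat the exact set of fields shown as approximate:

/* fs/btrfs/inode.c (abridged): data inodes get this aops table, so the
 * VM only ever reaches btrfs_writepages via mapping->a_ops->writepages. */
static const struct address_space_operations btrfs_aops = {
        .readpage       = btrfs_readpage,
        .writepage      = btrfs_writepage,
        .writepages     = btrfs_writepages,
        .readpages      = btrfs_readpages,
        /* (other hooks omitted) */
};

/* mm/page-writeback.c (simplified): the generic writeback entry point. */
int do_writepages(struct address_space *mapping, struct writeback_control *wbc)
{
        if (wbc->nr_to_write <= 0)
                return 0;
        if (mapping->a_ops->writepages)
                return mapping->a_ops->writepages(mapping, wbc);
        return generic_writepages(mapping, wbc);
}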
Hi: Since the RAID 5/6 code was merged into Btrfs in February 2013, have there been no updates or bug reports on the mailing list? Is there a development plan for btrfs raid5, such as adjusting the stripe width or reconstruction? Compared with md raid5, what are the advantages of btrfs raid5?
lilofile posted on Mon, 21 Oct 2013 23:45:58 +0800 as excerpted:

> Hi: Since the RAID 5/6 code was merged into Btrfs in February 2013,
> have there been no updates or bug reports on the mailing list? Is
> there a development plan for btrfs raid5, such as adjusting the
> stripe width or reconstruction? Compared with md raid5, what are the
> advantages of btrfs raid5?

AFAIK, btrfs raid5/6 modes are still not considered ready for deployed use, only for testing (tho with each new kernel cycle I wonder if that has changed, but no word on it changing yet). This is because there's a hole in the recovery process in case of a lost device, making it dangerous to use except for the pure test-case.

Yes, fleshing out the features a bit is planned, tho I've not tracked specifics. (My primary interest and use-case is the N-way-mirroring raid1 case, which is roadmapped for merging after raid5/6 stabilize; the current "raid1" case is limited to 2-way mirroring. So mostly I'm simply tracking raid5/6 progress in relation to that, not for its own merits, and thus I'm not personally tracking the specifics too closely.)

The advantage of btrfs raid5/6 is that, unlike md/raid, btrfs knows which blocks are actually used by data/metadata, and can use that information in a rebuild/recovery situation to sync/rebuild only the actually used blocks on a re-added or replacement device, skipping blocks that were entirely unused/empty in the first place.

md/raid can't do that, because it tries to be a filesystem-agnostic layer that doesn't know or care which blocks in the layers above it are actually used or empty. For it to try to track that would be a layering violation and would seriously complicate the code and/or limit usage to only those filesystems or other upper layers that it supported/understood/could properly track.

A comparable relationship exists between a ramdisk (comparable to md/raid) and tmpfs (comparable to btrfs) -- the first is transparent and allows the flexibility of putting whatever filesystem or other upper layer on top, while the latter is the filesystem layer itself, allowing nothing else above it. But the ramdisk/tmpfs case deals with memory emulating block-device storage, while the mdraid/btrfs case deals with multiple block devices emulating a single device. In both cases each has its purpose, with the strengths of one being the limitations of the other, and you choose the one that best matches your use case.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman
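To make that rebuild difference concrete, here is a toy sketch (plain userspace C, not btrfs code; the block count and the ~30% fill level are invented for illustration) contrasting a filesystem-agnostic resync, which must touch every block of the replacement device, with an allocation-aware one, which can skip unused blocks:

/* Toy illustration (not btrfs code): why allocation-aware rebuild
 * touches far fewer blocks than a filesystem-agnostic one. */
#include <stdbool.h>
#include <stdio.h>

#define NBLOCKS 1000000UL       /* blocks per device in this toy model */

/* md-style: the layer below the filesystem cannot know what is in use,
 * so it must resync every block of the replacement device. */
static unsigned long blind_rebuild(void)
{
        return NBLOCKS;
}

/* btrfs-style: the filesystem owns the device layer, so it can walk its
 * own allocation data and resync only blocks holding data or metadata. */
static unsigned long aware_rebuild(const bool *allocated)
{
        unsigned long used = 0;

        for (unsigned long i = 0; i < NBLOCKS; i++)
                if (allocated[i])
                        used++;
        return used;
}

int main(void)
{
        static bool allocated[NBLOCKS];

        /* Pretend the filesystem is about 30% full. */
        for (unsigned long i = 0; i < NBLOCKS; i++)
                allocated[i] = (i % 10) < 3;

        printf("blind rebuild : %lu blocks\n", blind_rebuild());
        printf("aware rebuild : %lu blocks\n", aware_rebuild(allocated));
        return 0;
}

In btrfs the bookkeeping is of course per chunk/extent rather than a flat per-block bitmap; the point is only the proportion of resync work that allocation awareness can save.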
shuo lv posted on Tue, 22 Oct 2013 10:30:06 +0800 as excerpted:

> Hi: Since the RAID 5/6 code was merged into Btrfs in February 2013,
> have there been no updates or bug reports on the mailing list? Is
> there a development plan for btrfs raid5, such as adjusting the
> stripe width or reconstruction? Compared with md raid5, what are the
> advantages of btrfs raid5?

See my reply to your other post asking the same question from a different email address...

-- 
Duncan
On Tue, Oct 22, 2013 at 01:27:44PM +0000, Duncan wrote:
> > Since the RAID 5/6 code was merged into Btrfs in February 2013,
> > have there been no updates or bug reports on the mailing list? Is
> > there a development plan for btrfs raid5, such as adjusting the
> > stripe width or reconstruction? Compared with md raid5, what are
> > the advantages of btrfs raid5?

Thank you for explaining the differences in detail. I copied the last three paragraphs verbatim to the wiki (modulo formatting).

https://btrfs.wiki.kernel.org/index.php/FAQ#Case_study:_btrfs-raid_5.2F6_versus_MD-RAID_5.2F6

david
On Oct 22, 2013, Duncan <1i5t5.duncan@cox.net> wrote:
> This is because there's a hole in the recovery process in case of a
> lost device, making it dangerous to use except for the pure test-case.

It's not just that; any I/O error in raid56 chunks will trigger a BUG and make the filesystem unusable until the next reboot, because the mirror number is zero.

I wrote this patch last week, just before leaving on a trip, and I was happy to find that it enabled a frequently-failing disk to hold a filesystem that turned out to be surprisingly reliable!

btrfs: some progress in raid56 recovery

From: Alexandre Oliva <oliva@gnu.org>

This patch is WIP, but it has enabled a raid6 filesystem on a bad disk (frequent read failures at random blocks) to work flawlessly for a couple of weeks, instead of hanging the entire filesystem upon the first read error.

One of the problems is that we have the mirror number set to zero on most raid56 reads. That's unexpected, since mirror numbers start at one. I couldn't quite figure out where to fix the mirror number in the bio construction, but by simply refraining from failing when the mirror number is zero, I found that we end up retrying the read with the next mirror, which becomes a read retry that, on my bad disk, often succeeds. So that was the first win.

After that, I had to make a few further tweaks so that other BUG_ONs wouldn't hit and we'd instead fail the read altogether; i.e., in the extent_io layer we still don't repair/rewrite the raid56 blocks, nor do we attempt to rebuild bad blocks out of the other blocks in the stripe.

In a few cases in which the read retry didn't succeed, I'd get an extent cksum verify failure, which I regarded as ok. What did surprise me was that, for some of these failures, but not all, the raid56 recovery code would kick in and rebuild the bad block, so that we'd get the correct data back in spite of the cksum failure and the bad block. I'm still puzzled by that; I can't explain what I'm observing, but surely the correct data is coming out of somewhere ;-)

Another oddity I noticed is that sometimes the mirror numbers appear to be totally out of range; I suspect there might be some type mismatch or out-of-range memory access that causes some other information to be read as a mirror number from bios or some such. I haven't been able to track that down yet.

As it stands, although I know this still doesn't kick in the recovery or repair code at the right place, the patch is usable on its own, and it is surely an improvement over the current state of raid56 in btrfs, so it might be a good idea to put it in. So far, I've put more than 1TB of data on that failing disk with 16 partitions on raid6, and somehow I got all the data back successfully: every file passed an md5sum check, in spite of tons of I/O errors in the process.
Signed-off-by: Alexandre Oliva <oliva@gnu.org>
---
 fs/btrfs/extent_io.c |   17 ++++++++++++-----
 fs/btrfs/raid56.c    |   18 ++++++++++++++----
 2 files changed, 26 insertions(+), 9 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index fe443fe..4a592a3 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -2061,11 +2061,11 @@ int repair_io_failure(struct btrfs_fs_info *fs_info, u64 start,
 	struct btrfs_mapping_tree *map_tree = &fs_info->mapping_tree;
 	int ret;
 
-	BUG_ON(!mirror_num);
-
 	/* we can't repair anything in raid56 yet */
 	if (btrfs_is_parity_mirror(map_tree, logical, length, mirror_num))
-		return 0;
+		return -EIO;
+
+	BUG_ON(!mirror_num);
 
 	bio = btrfs_io_bio_alloc(GFP_NOFS, 1);
 	if (!bio)
@@ -2157,7 +2157,6 @@ static int clean_io_failure(u64 start, struct page *page)
 		return 0;
 
 	failrec = (struct io_failure_record *)(unsigned long) private_failure;
-	BUG_ON(!failrec->this_mirror);
 
 	if (failrec->in_validation) {
 		/* there was no real error, just free the record */
@@ -2167,6 +2166,12 @@ static int clean_io_failure(u64 start, struct page *page)
 		goto out;
 	}
 
+	if (!failrec->this_mirror) {
+		pr_debug("clean_io_failure: failrec->this_mirror not set, assuming %llu not repaired\n",
+			 failrec->start);
+		goto out;
+	}
+
 	spin_lock(&BTRFS_I(inode)->io_tree.lock);
 	state = find_first_extent_bit_state(&BTRFS_I(inode)->io_tree,
 					    failrec->start,
@@ -2338,7 +2343,9 @@ static int bio_readpage_error(struct bio *failed_bio, struct page *page,
 	 * everything for repair_io_failure to do the rest for us.
 	 */
 	if (failrec->in_validation) {
-		BUG_ON(failrec->this_mirror != failed_mirror);
+		if (failrec->this_mirror != failed_mirror)
+			pr_debug("bio_readpage_error: this_mirror does not equal failed_mirror: %i\n",
+				 failed_mirror);
 		failrec->in_validation = 0;
 		failrec->this_mirror = 0;
 	}
diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c
index 0525e13..2d1a960 100644
--- a/fs/btrfs/raid56.c
+++ b/fs/btrfs/raid56.c
@@ -1732,6 +1732,8 @@ static void __raid_recover_end_io(struct btrfs_raid_bio *rbio)
 	int err;
 	int i;
 
+	pr_debug("__raid_recover_end_io: attempting error recovery\n");
+
 	pointers = kzalloc(rbio->bbio->num_stripes * sizeof(void *),
 			   GFP_NOFS);
 	if (!pointers) {
@@ -1886,17 +1888,22 @@ cleanup:
 
 cleanup_io:
 	if (rbio->read_rebuild) {
-		if (err == 0)
+		if (err == 0) {
+			pr_debug("__raid_recover_end_io: successful read_rebuild\n");
 			cache_rbio_pages(rbio);
-		else
+		} else {
+			pr_debug("__raid_recover_end_io: failed read_rebuild\n");
 			clear_bit(RBIO_CACHE_READY_BIT, &rbio->flags);
+		}
 
 		rbio_orig_end_io(rbio, err, err == 0);
 	} else if (err == 0) {
+		pr_debug("__raid_recover_end_io: successful recovery, on to finish_rmw\n");
 		rbio->faila = -1;
 		rbio->failb = -1;
 		finish_rmw(rbio);
 	} else {
+		pr_debug("__raid_recover_end_io: failed recovery\n");
 		rbio_orig_end_io(rbio, err, 0);
 	}
 }
@@ -1922,10 +1929,13 @@ static void raid_recover_end_io(struct bio *bio, int err)
 	if (!atomic_dec_and_test(&rbio->bbio->stripes_pending))
 		return;
 
-	if (atomic_read(&rbio->bbio->error) > rbio->bbio->max_errors)
+	if (atomic_read(&rbio->bbio->error) > rbio->bbio->max_errors) {
+		pr_debug("raid_recover_end_io: unrecoverable error\n");
 		rbio_orig_end_io(rbio, -EIO, 0);
-	else
+	} else {
+		pr_debug("raid_recover_end_io: attempting error recovery\n");
 		__raid_recover_end_io(rbio);
+	}
 }
 
 /*

-- 
Alexandre Oliva, freedom fighter    http://FSFLA.org/~lxoliva/
You must be the change you wish to see in the world. -- Gandhi
Be Free! -- http://FSFLA.org/   FSF Latin America board member
Free Software Evangelist      Red Hat Brazil Compiler Engineer
On 2013/10/22 07:18 PM, Alexandre Oliva wrote:
> ... and it is surely an improvement over the current state of raid56
> in btrfs, so it might be a good idea to put it in.

I suspect the issue is that, while it sort of works, we don't really want to push people to use it half-baked. This is reassuring work, however. Maybe it would be nice to have some half-baked code *anyway*, even if Chris doesn't put it in his pull requests juuust yet. ;)

> So far, I've put more than 1TB of data on that failing disk with 16
> partitions on raid6, and somehow I got all the data back successfully:
> every file passed an md5sum check, in spite of tons of I/O errors in
> the process.

Is this all on a single disk? If so, it must be seeking like mad! haha

-- 
__________
Brendan Hide
http://swiftspirit.co.za/
http://www.webafrica.co.za/?AFF1E97
On Oct 22, 2013, Brendan Hide <brendan@swiftspirit.co.za> wrote:
> On 2013/10/22 07:18 PM, Alexandre Oliva wrote:
>> ... and it is surely an improvement over the current state of raid56
>> in btrfs, so it might be a good idea to put it in.
> I suspect the issue is that, while it sort of works, we don't really
> want to push people to use it half-baked.

I don't think the current state of the implementation upstream is compatible with that statement ;-)

One can create and run a glorified raid0 that computes and updates parity blocks it won't use for anything, while the name gives the illusion of a more reliable filesystem than it actually is, and it will freeze when encountering any of the failures the name suggests it would protect from.

If we didn't have any raid56 support at all, or if it was configured separately and disabled by default, I'd concur with your statement. But as things stand, any improvement to the raid56 implementation that brings at least some of the safety net raid56 are meant to provide makes things better, without giving users the idea that the implementation is any more full-featured than it currently is.

> Maybe it would be nice to have some half-baked code *anyway*, even if
> Chris doesn't put it in his pull requests juuust yet. ;)

Why, sure, that's why I posted the patch; even if it didn't make it to the repository, others might find it useful ;-)

>> So far, I've put more than 1TB of data on that failing disk with 16
>> partitions on raid6, and somehow I got all the data back successfully:
>> every file passed an md5sum check, in spite of tons of I/O errors in
>> the process.

> Is this all on a single disk? If so, it must be seeking like mad! haha

Yeah. It probably is, but the access pattern most of the time is mostly random access to smallish files, so that won't be a problem. I considered doing raid1 on the data, to get some more reliability out of the broken disk, but then I recalled there was this raid56 implementation that, in raid6, would theoretically bring additional reliability and be far more space efficient, so I decided to give it a try. Only after I'd put in most of the data did the errors start popping up. Then I decided to try to fix them instead of moving the data out. It was some happy hacking ;-)

-- 
Alexandre Oliva, freedom fighter    http://FSFLA.org/~lxoliva/
Alexandre Oliva posted on Tue, 22 Oct 2013 17:24:37 -0200 as excerpted:

> On Oct 22, 2013, Brendan Hide <brendan@swiftspirit.co.za> wrote:
>
>> On 2013/10/22 07:18 PM, Alexandre Oliva wrote:
>>> ... and it is surely an improvement over the current state of raid56
>>> in btrfs, so it might be a good idea to put it in.
>> I suspect the issue is that, while it sort of works, we don't really
>> want to push people to use it half-baked.
>
> I don't think the current state of the implementation upstream is
> compatible with that statement ;-)
>
> One can create and run a glorified raid0 that computes and updates
> parity blocks it won't use for anything, while the name gives the
> illusion of a more reliable filesystem than it actually is, and it will
> freeze when encountering any of the failures the name suggests it would
> protect from.
>
> If we didn't have any raid56 support at all, or if it was configured
> separately and disabled by default, I'd concur with your statement. But
> as things stand, any improvement to the raid56 implementation that
> brings at least some of the safety net raid56 are meant to provide
> makes things better, without giving users the idea that the
> implementation is any more full-featured than it currently is.

The thing is, btrfs /doesn't/ have any raid56 support at all, in the practical sense of the word. There is a preliminary, partial implementation, exactly as announced/warned when the feature went in, on a filesystem that is itself still of experimental/testing status, so even for the features that are generally working: make and test your backups and keep 'em handy!

Anyone running btrfs at this point should know its status and be keeping up with upstream /because/ of that status, or they shouldn't be testing/using it at all, as it's not yet considered a stable filesystem.

If they're already aware of upstream status and are deliberately testing, by definition they'll already know the preliminary/partial nature of the current raid56 implementation and there won't be an issue. If they aren't already keeping up with developments on a declared experimental filesystem, that's the base problem right there, and the quick failure should they try raid56 in its current state simply alerts them to the problem they already had.

-- 
Duncan
On Oct 22, 2013, Duncan <1i5t5.duncan@cox.net> wrote:
> the quick failure should they try raid56 in its current state simply
> alerts them to the problem they already had.

What quick failure? There's no such thing in place AFAIK. It seems to do all the work properly; the limitations in the current implementation will only show up when an I/O error kicks in.

I can't see any indication, in existing announcements, that recovery from I/O errors in raid56 is missing, let alone that it's so utterly and completely broken that it will freeze the entire filesystem and require a forced reboot to unmount the filesystem and make any other data in it accessible again. That's far, far worse than the general state of btrfs, and it's not a documented limitation of raid56, so how would someone be expected to know about it? It certainly isn't obvious from a cursory look at the code either.

-- 
Alexandre Oliva, freedom fighter    http://FSFLA.org/~lxoliva/