thr3ads.net - Btrfs devel - csum failure messages [Nov 2013]

Russell Coker

2013-Nov-05 01:24 UTC

csum failure messages

The messages below are from dmesg on a system where "btrfs balance" just
aborted.  It's running kernel 3.11.6 (the latest Debian package).

This seems to be telling me that Inode 388 is involved, but there are over 300 
subvols on that system which could contain such an Inode.

I think that more information is needed for such log messages.  We need to at 
least be able to identify the subvol (is it possible to extract this from the 
numbers in the log messages?).  Ideally we would be able to identify the file 
name as well.


[10751.637517] BTRFS info (device sda3): csum failed ino 388 off 23191552 csum 2566472073 private 3193692311
[10751.646390] BTRFS info (device sda3): csum failed ino 388 off 24104960 csum 5219137 private 2264608335
[10751.654472] BTRFS info (device sda3): csum failed ino 388 off 24154112 csum 4084831521 private 1792217768
[10751.731830] BTRFS info (device sda3): csum failed ino 388 off 23191552 csum 2566472073 private 3193692311
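
The log lines above carry the device, inode number, file offset and two
checksum values, but no subvolume (root) ID, so at best they can be grouped
per inode. A minimal sketch of that grouping, assuming Python 3 and assuming
the message layout is exactly the one shown above; parse_csum_failures and
SAMPLE are hypothetical names introduced here, not part of any btrfs tool:

```python
import re
from collections import defaultdict

# Matches the "csum failed" lines shown above. \s+ is used between fields
# so that lines wrapped by a mail client still match.
CSUM_RE = re.compile(
    r"BTRFS\s+info\s+\(device\s+(?P<dev>[^)]+)\):\s+csum\s+failed\s+"
    r"ino\s+(?P<ino>\d+)\s+off\s+(?P<off>\d+)\s+"
    r"csum\s+(?P<csum>\d+)\s+private\s+(?P<private>\d+)"
)


def parse_csum_failures(text):
    """Return {(device, inode): sorted list of unique failing offsets}."""
    found = defaultdict(set)
    for m in CSUM_RE.finditer(text):
        key = (m.group("dev"), int(m.group("ino")))
        found[key].add(int(m.group("off")))
    return {key: sorted(offsets) for key, offsets in found.items()}


# The four messages from this report.
SAMPLE = """\
[10751.637517] BTRFS info (device sda3): csum failed ino 388 off 23191552 csum 2566472073 private 3193692311
[10751.646390] BTRFS info (device sda3): csum failed ino 388 off 24104960 csum 5219137 private 2264608335
[10751.654472] BTRFS info (device sda3): csum failed ino 388 off 24154112 csum 4084831521 private 1792217768
[10751.731830] BTRFS info (device sda3): csum failed ino 388 off 23191552 csum 2566472073 private 3193692311
"""

if __name__ == "__main__":
    for (dev, ino), offsets in parse_csum_failures(SAMPLE).items():
        print(dev, ino, offsets)
```

For the four lines above this gives a single entry, inode 388 on sda3, with
three distinct offsets (the first offset is reported twice).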

-- 
My Main Blog         http://etbe.coker.com.au/
My Documents Blog    http://doc.coker.com.au/
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Hans-Kristian Bakke

2013-Nov-05 06:15 UTC

Re: csum failure messages

As you were in the process of a rebalance, these errors may actually be
caused by the serious bug fixed by "Btrfs: relocate csums properly with
prealloc extents".

I hit that myself with several preallocated files made by rtorrent
during a rebalance, and I lost several huge files as a consequence. The
only way I could rebalance without large-scale corruption was to
manually patch the 3.11.6 kernel with the small patch that fixes the
issue.

For some reason this patch has not been pushed upstream yet. I think that
is strange, as the bug leads to corruption and actual data loss and is 100%
reproducible with preallocated files. Only systemd logs are mentioned
in the bug reports, but in my case it hit several terabytes of files
created by rtorrent.

Kind regards

Hans-Kristian Bakke

David Sterba

2013-Nov-05 11:30 UTC

Re: csum failure messages

On Tue, Nov 05, 2013 at 07:15:57AM +0100, Hans-Kristian Bakke wrote:
> As you were in the process of a rebalance, these errors may actually be
> caused by the serious bug fixed by "Btrfs: relocate csums properly with
> prealloc extents".
>
> I hit that myself with several preallocated files made by rtorrent
> during a rebalance, and I lost several huge files as a consequence. The
> only way I could rebalance without large-scale corruption was to
> manually patch the 3.11.6 kernel with the small patch that fixes the
> issue.
>
> For some reason this patch has not been pushed upstream yet. I think that
> is strange, as the bug leads to corruption and actual data loss and is 100%
> reproducible with preallocated files. Only systemd logs are mentioned
> in the bug reports, but in my case it hit several terabytes of files
> created by rtorrent.

Thanks for the summary. There's no doubt that this is serious.

Chris, can you please somehow get the patch into stable sooner than it
gets to the 3.13 queue? The merge window will start in 1 week and, based
on the previous pull request schedule, the patch will be merged in ~2 weeks
from now. That's a rather long time for an unfixed corruption bug that
reportedly affects common installations (with systemd) or use cases
(torrent) in combination with balance.

david

Russell Coker

2013-Nov-05 12:16 UTC

Re: csum failure messages

On Tue, 5 Nov 2013, "Hans-Kristian Bakke" <hkbakke@gmail.com> wrote:
> As you were in the process of a rebalance, these errors may actually be
> caused by the serious bug fixed by "Btrfs: relocate csums properly with
> prealloc extents".
>
> I hit that myself with several preallocated files made by rtorrent
> during a rebalance, and I lost several huge files as a consequence. The
> only way I could rebalance without large-scale corruption was to
> manually patch the 3.11.6 kernel with the small patch that fixes the
> issue.
>
> For some reason this patch has not been pushed upstream yet. I think that
> is strange, as the bug leads to corruption and actual data loss and is 100%
> reproducible with preallocated files. Only systemd logs are mentioned
> in the bug reports, but in my case it hit several terabytes of files
> created by rtorrent.

I run systemd, so I guess it's the systemd logs.  That's fortunate, as such
logs aren't important to me.  Thanks for providing this information.

I've just run a scrub and I saw the following output.  There was nothing
useful or apparently relevant in the kernel message log either.  So scrub is
just telling me that there are 57 errors without giving me a clue as to which
files might need to be restored from backup.

# btrfs scrub start -B /
scrub done for c55218a6-abb5-4e35-9a20-33fb1fa05879
        scrub started at Tue Nov  5 11:32:03 2013 and finished after 6762 seconds
        total bytes scrubbed: 140.06GB with 57 errors
        error details: csum=57
        corrected errors: 0, uncorrectable errors: 57, unverified errors: 0
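
The scrub summary above contains only counters. A minimal sketch, assuming
Python 3 and the output layout shown above, that pulls those counters into a
dictionary (for example for a monitoring script); parse_scrub_summary and
SAMPLE are hypothetical names introduced here, not a btrfs-progs interface:

```python
import re


def parse_scrub_summary(text):
    """Extract the error counters from `btrfs scrub start -B` output."""
    summary = {}

    # "total bytes scrubbed: 140.06GB with 57 errors"
    m = re.search(r"with\s+(\d+)\s+errors", text)
    if m:
        summary["total_errors"] = int(m.group(1))

    # "error details: csum=57" (possibly several kind=count pairs)
    m = re.search(r"error details:\s*(.*)", text)
    if m:
        for kind, count in re.findall(r"(\w+)=(\d+)", m.group(1)):
            summary[kind] = int(count)

    # "corrected errors: 0, uncorrectable errors: 57, unverified errors: 0"
    for label in ("corrected", "uncorrectable", "unverified"):
        m = re.search(r"\b" + label + r" errors:\s*(\d+)", text)
        if m:
            summary[label] = int(m.group(1))

    return summary


# The scrub output from this message.
SAMPLE = """\
scrub done for c55218a6-abb5-4e35-9a20-33fb1fa05879
        scrub started at Tue Nov  5 11:32:03 2013 and finished after 6762 seconds
        total bytes scrubbed: 140.06GB with 57 errors
        error details: csum=57
        corrected errors: 0, uncorrectable errors: 57, unverified errors: 0
"""

if __name__ == "__main__":
    print(parse_scrub_summary(SAMPLE))
```

The result only restates the counts (57 csum errors, all uncorrectable); the
summary itself gives no inode or path to act on.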

I can imagine a balance operation being unable to conveniently display all the 
data that one might desire.  But a scrub really should go through everything 
and should know where the inconsistencies are.  In this case the scrub gave me 
less information than the balance.

I presume that my filesystem is still corrupt.

-- 
My Main Blog         http://etbe.coker.com.au/
My Documents Blog    http://doc.coker.com.au/

Hans-Kristian Bakke

2013-Nov-05 12:37 UTC

Re: csum failure messages

I gave up on getting the filesystem to a consistent state, but my
corruption was much more severe than yours, numbering in the hundreds of
thousands. As the fs was still usable and mountable, I just moved all the
files to another filesystem, patched the kernel, recreated the original
btrfs fs and ran a rebalance, this time without issues because of the
patch. As the corrupt files were rtorrent files in my case, I could just
rehash the torrents and make rtorrent redownload the corrupt blocks. Very
lucky indeed. The other files I could verify against backup.

Luckily the reason for the rebalance in the first place was to add
another 16TB of disk to the RAID10 array, so I just happened to have
enough temporary storage lying around. After patching the kernel and
rebalancing I now have a 32TB btrfs RAID10 volume.

Kind regards

Hans-Kristian Bakke


Chris Murphy

2013-Nov-05 14:26 UTC

Re: csum failure messages

On Nov 5, 2013, at 5:16 AM, Russell Coker <russell@coker.com.au> wrote:
>
> I presume that my filesystem is still corrupt.

I'm the original reporter of the bug. The file system itself isn't
corrupt, but the affected files probably are. In my case, systemd journal
files were reported as corrupt by systemd following a balance, as well as
btrfs scrub. Upon scrub, dmesg contains a path for each affected file.
Upon deleting those files, subsequent scrubs come up clean.
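
A minimal sketch of collecting those paths from dmesg output, assuming
Python 3. The thread does not show one of these scrub messages, so the
layout used here ("root <id>, inode <n>, ... (path: <path>)") is an
assumption about kernels of that period, and the sample line is made up for
illustration; affected_files and SAMPLE are hypothetical names:

```python
import re

# ASSUMED layout of a scrub warning; adjust the pattern if the kernel in
# use prints something different.
PATH_RE = re.compile(
    r"root (?P<root>\d+), inode (?P<ino>\d+),.*\(path: (?P<path>.*)\)\s*$"
)


def affected_files(dmesg_text):
    """Return a sorted list of unique (root, inode, path) tuples."""
    hits = set()
    for line in dmesg_text.splitlines():
        m = PATH_RE.search(line)
        if m:
            hits.add((int(m.group("root")), int(m.group("ino")), m.group("path")))
    return sorted(hits)


# Illustrative line only: invented values, NOT taken from this thread.
SAMPLE = (
    "[ 1234.567890] btrfs: checksum error at logical 1103101952 on dev "
    "/dev/sdX, sector 2154496, root 5, inode 257, offset 0, length 4096, "
    "links 1 (path: var/log/journal/example/system.journal)\n"
)

if __name__ == "__main__":
    for root, ino, path in affected_files(SAMPLE):
        print(root, ino, path)
```

If the messages do include a root ID as assumed, that number would also
identify the subvolume, which the balance-time messages do not.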

Fedora merged the fix for this bug with: 3.11.5-302.fc20, 3.11.6-200.fc19,
3.11.6-101.fc18. I thought for sure it was marked for stable to go into
mainline 3.11.6, so somehow it's been missed.

Chris Murphy

Hugo Mills

2013-Nov-05 14:34 UTC

Re: csum failure messages

On Tue, Nov 05, 2013 at 07:26:54AM -0700, Chris Murphy wrote:
>
> On Nov 5, 2013, at 5:16 AM, Russell Coker <russell@coker.com.au> wrote:
> >
> > I presume that my filesystem is still corrupt.
>
> I'm the original reporter of the bug. The file system itself isn't
> corrupt, but the affected files probably are. In my case, systemd journal
> files were reported as corrupt by systemd following a balance, as well as
> btrfs scrub. Upon scrub, dmesg contains a path for each affected file.
> Upon deleting those files, subsequent scrubs come up clean.
>
> Fedora merged the fix for this bug with: 3.11.5-302.fc20, 3.11.6-200.fc19,
> 3.11.6-101.fc18. I thought for sure it was marked for stable to go into
> mainline 3.11.6, so somehow it's been missed.

   Someone else tripped over it on IRC last night, and I was surprised
to discover it hadn't made it upstream yet. :(

   Hugo.

-- 
=== Hugo Mills: hugo@... carfax.org.uk | darksatanic.net | lug.org.uk ==
PGP key: 65E74AC0 from wwwkeys.eu.pgp.net or http://www.carfax.org.uk
   --- Two things came out of Berkeley in the 1960s: LSD and Unix. ---   
                       This is not a coincidence.

John Williams

2013-Nov-06 00:20 UTC

Re: csum failure messages

On Tue, Nov 5, 2013 at 6:34 AM, Hugo Mills <hugo@carfax.org.uk> wrote:
> On Tue, Nov 05, 2013 at 07:26:54AM -0700, Chris Murphy wrote:
>>
>> On Nov 5, 2013, at 5:16 AM, Russell Coker <russell@coker.com.au> wrote:
>> >
>> > I presume that my filesystem is still corrupt.
>>
>> I'm the original reporter of the bug. The file system itself isn't
>> corrupt, but the affected files probably are. In my case, systemd journal
>> files were reported as corrupt by systemd following a balance, as well as
>> btrfs scrub. Upon scrub, dmesg contains a path for each affected file.
>> Upon deleting those files, subsequent scrubs come up clean.
>>
>> Fedora merged the fix for this bug with: 3.11.5-302.fc20,
>> 3.11.6-200.fc19, 3.11.6-101.fc18. I thought for sure it was marked for
>> stable to go into mainline 3.11.6, so somehow it's been missed.
>
>    Someone else tripped over it on IRC last night, and I was surprised
> to discover it hadn't made it upstream yet. :(

Is there now a verification test that could detect an issue like this?
It seems like the sort of thing that needs to be added to automated
testing.

Duncan

2013-Nov-06 12:19 UTC

Re: csum failure messages

John Williams posted on Tue, 05 Nov 2013 16:20:58 -0800 as excerpted:
> Is there now a verification test that could detect an issue like this?
> It seems like the sort of thing that needs to be added to automated
> testing.

[Your question is general enough, not mentioning xfstests and simply asking
about automated testing in general, that I'm assuming a general answer is
appropriate.]

I haven't tracked this specific issue, but in general, the btrfs devs are
pretty strict about adding an xfstests test (the package is NOT used for
just xfs; at least btrfs and ext4 use it too) for any regressions they find,
and people ARE regularly running those tests on new code, so that past
issues don't happen again.

If you watch the list you'll see occasional patch rejections due to
failed xfstests, as well as regular new xfstests patches adding new
tests, as well as review discussion requesting that a test be added as
appropriate.

So I'd be /very/ surprised if this bugfix didn't already have a
corresponding new xfstests test.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman


David Sterba

2013-Nov-06 16:54 UTC

Re: csum failure messages

On Tue, Nov 05, 2013 at 04:20:58PM -0800, John Williams wrote:
> Is there now a verification test that could detect an issue like this?
> It seems like the sort of thing that needs to be added to automated
> testing.

Yes there is:

xfstests/btrfs/013
https://bugzilla.kernel.org/show_bug.cgi?id=63411

Chris Murphy

2013-Nov-14 18:37 UTC

Re: csum failure messages

On Nov 5, 2013, at 7:34 AM, Hugo Mills <hugo@carfax.org.uk> wrote:
> On Tue, Nov 05, 2013 at 07:26:54AM -0700, Chris Murphy wrote:
>>
>> On Nov 5, 2013, at 5:16 AM, Russell Coker <russell@coker.com.au> wrote:
>>>
>>> I presume that my filesystem is still corrupt.
>>
>> I'm the original reporter of the bug. The file system itself isn't
>> corrupt, but the affected files probably are. In my case, systemd journal
>> files were reported as corrupt by systemd following a balance, as well as
>> btrfs scrub. Upon scrub, dmesg contains a path for each affected file.
>> Upon deleting those files, subsequent scrubs come up clean.
>>
>> Fedora merged the fix for this bug with: 3.11.5-302.fc20,
>> 3.11.6-200.fc19, 3.11.6-101.fc18. I thought for sure it was marked for
>> stable to go into mainline 3.11.6, so somehow it's been missed.
>
>   Someone else tripped over it on IRC last night, and I was surprised
> to discover it hadn't made it upstream yet. :(

I just checked kernel.org changelogs and I'm not seeing this fixed in
either 3.11.7 or 3.11.8. It is listed in today's git pull by Chris for
3.12 as

Btrfs: relocate csums properly with prealloc extents


Chris Murphy

David Sterba

2013-Nov-14 18:44 UTC

Re: csum failure messages

On Thu, Nov 14, 2013 at 11:37:39AM -0700, Chris Murphy wrote:
>
> On Nov 5, 2013, at 7:34 AM, Hugo Mills <hugo@carfax.org.uk> wrote:
>
> > On Tue, Nov 05, 2013 at 07:26:54AM -0700, Chris Murphy wrote:
> >>
> >> On Nov 5, 2013, at 5:16 AM, Russell Coker <russell@coker.com.au> wrote:
> >>>
> >>> I presume that my filesystem is still corrupt.
> >>
> >> I'm the original reporter of the bug. The file system itself isn't
> >> corrupt, but the affected files probably are. In my case, systemd
> >> journal files were reported as corrupt by systemd following a balance,
> >> as well as btrfs scrub. Upon scrub, dmesg contains a path for each
> >> affected file. Upon deleting those files, subsequent scrubs come up
> >> clean.
> >>
> >> Fedora merged the fix for this bug with: 3.11.5-302.fc20,
> >> 3.11.6-200.fc19, 3.11.6-101.fc18. I thought for sure it was marked for
> >> stable to go into mainline 3.11.6, so somehow it's been missed.
> >
> >   Someone else tripped over it on IRC last night, and I was surprised
> > to discover it hadn't made it upstream yet. :(
>
> I just checked kernel.org changelogs and I'm not seeing this fixed in
> either 3.11.7 or 3.11.8. It is listed in today's git pull by Chris for
> 3.12 as
>
> Btrfs: relocate csums properly with prealloc extents

The stable tree process does not normally accept patches that are not in
Linus' tree first. The patch still has to be submitted to stable, right
after today's pull is merged.

david