The below messages are from dmesg on a system where "btrfs balance" just aborted. It's running kernel 3.11.6 (the latest Debian package).

This seems to be telling me that Inode 388 is involved, but there are over 300 subvols on that system which could contain such an Inode.

I think that more information is needed for such log messages. We need to at least be able to identify the subvol (is it possible to extract this from the numbers in the log messages?). Ideally we would be able to identify the file name as well.

[10751.637517] BTRFS info (device sda3): csum failed ino 388 off 23191552 csum 2566472073 private 3193692311
[10751.646390] BTRFS info (device sda3): csum failed ino 388 off 24104960 csum 5219137 private 2264608335
[10751.654472] BTRFS info (device sda3): csum failed ino 388 off 24154112 csum 4084831521 private 1792217768
[10751.731830] BTRFS info (device sda3): csum failed ino 388 off 23191552 csum 2566472073 private 3193692311

--
My Main Blog http://etbe.coker.com.au/
My Documents Blog http://doc.coker.com.au/
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
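For reference, the four log lines in the message above can be split into their fields with a short script. This is only a sketch: the regular expression is written against the exact line format quoted in this thread, and the function names are made up for illustration. It shows that the lines carry a device, an inode number, a file offset and two checksum values, and no field that names a subvolume.

```python
import re
from collections import Counter

# Pattern written against the message format quoted in this thread:
#   [time] BTRFS info (device X): csum failed ino N off N csum N private N
CSUM_FAILED = re.compile(
    r"\[\s*(?P<time>\d+\.\d+)\]\s+BTRFS info \(device (?P<device>[^)]+)\): "
    r"csum failed ino (?P<ino>\d+) off (?P<off>\d+) "
    r"csum (?P<csum>\d+) private (?P<private>\d+)"
)

# The four lines from the original report.
SAMPLE = """\
[10751.637517] BTRFS info (device sda3): csum failed ino 388 off 23191552 csum 2566472073 private 3193692311
[10751.646390] BTRFS info (device sda3): csum failed ino 388 off 24104960 csum 5219137 private 2264608335
[10751.654472] BTRFS info (device sda3): csum failed ino 388 off 24154112 csum 4084831521 private 1792217768
[10751.731830] BTRFS info (device sda3): csum failed ino 388 off 23191552 csum 2566472073 private 3193692311
"""


def parse_csum_failures(log_text):
    """Return one dict per 'csum failed' message found in log_text."""
    failures = []
    for match in CSUM_FAILED.finditer(log_text):
        record = match.groupdict()
        for key in ("ino", "off", "csum", "private"):
            record[key] = int(record[key])
        record["time"] = float(record["time"])
        failures.append(record)
    return failures


def distinct_failures(failures):
    """Count how often each (device, inode, offset) triple was reported."""
    return Counter((f["device"], f["ino"], f["off"]) for f in failures)


if __name__ == "__main__":
    counts = distinct_failures(parse_csum_failures(SAMPLE))
    for (device, ino, off), count in sorted(counts.items()):
        print(f"{device} ino={ino} off={off} reported {count}x")
```

Grouping by (device, inode, offset) makes it easier to see that the first and last lines report the same location twice, so the four messages cover three distinct offsets.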
As you were in the process of a rebalance these errors may actually be caused by this serious bug "Btrfs: relocate csums properly with prealloc extents".

I hit that myself with several preallocated files made by rtorrent during a rebalance and I lost several huge files as a consequence. The only way I could rebalance without large scale corruptions was to manually patch the 3.11.6 kernel with the small patch that fixes the issue.

For some reason this patch is not pushed upstream yet. I think that is strange as it leads to corruption and actual data loss and it is 100% reproducible with preallocated files. Only systemd logs are mentioned in the bug reports, but in my case it was actually hitting several terabytes of files created by rtorrent.

Kind regards,
Hans-Kristian Bakke

On 5 November 2013 02:24, Russell Coker <russell@coker.com.au> wrote:
> The below messages are from dmesg on a system where "btrfs balance" just
> aborted. It's running kernel 3.11.6 (the latest Debian package).
>
> This seems to be telling me that Inode 388 is involved, but there are over 300
> subvols on that system which could contain such an Inode.
>
> I think that more information is needed for such log messages. We need to at
> least be able to identify the subvol (is it possible to extract this from the
> numbers in the log messages?). Ideally we would be able to identify the file
> name as well.
>
> [10751.637517] BTRFS info (device sda3): csum failed ino 388 off 23191552 csum
> 2566472073 private 3193692311
> [10751.646390] BTRFS info (device sda3): csum failed ino 388 off 24104960 csum
> 5219137 private 2264608335
> [10751.654472] BTRFS info (device sda3): csum failed ino 388 off 24154112 csum
> 4084831521 private 1792217768
> [10751.731830] BTRFS info (device sda3): csum failed ino 388 off 23191552 csum
> 2566472073 private 3193692311
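As background on what "preallocated files" means in the reply above, here is a minimal sketch of a program reserving space for a file before writing pieces into it. It assumes the reservation is made with a call like `posix_fallocate`; the thread does not say which call rtorrent uses, and the function names and file name below are hypothetical. This is not a reproducer for the bug, only an illustration of the file layout being discussed.

```python
import os
import tempfile


def preallocate(path, size):
    """Create path and reserve size bytes for it before any data is written."""
    fd = os.open(path, os.O_CREAT | os.O_WRONLY, 0o644)
    try:
        # Ask the filesystem to allocate the whole range up front.
        os.posix_fallocate(fd, 0, size)
    finally:
        os.close(fd)


def write_piece(path, offset, data):
    """Write one piece into the already reserved range, as a downloader might."""
    fd = os.open(path, os.O_WRONLY)
    try:
        os.pwrite(fd, data, offset)
    finally:
        os.close(fd)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as workdir:
        target = os.path.join(workdir, "download.bin")  # hypothetical name
        preallocate(target, 1024 * 1024)
        write_piece(target, 4096, b"piece data")
        print(os.path.getsize(target))
```

After these two steps the file has its full size, with real data in only part of the reserved range; the rest reads back as zeros.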
On Tue, Nov 05, 2013 at 07:15:57AM +0100, Hans-Kristian Bakke wrote:
> As you were in the process of a rebalance these errors may actually be
> caused by this serious bug "Btrfs: relocate csums properly with
> prealloc extents".
>
> I hit that myself with several preallocated files made by rtorrent
> during a rebalance and I lost several huge files as a consequence. The
> only way I could rebalance without large scale corruptions was to
> manually patch the 3.11.6 kernel with the small patch that fixes the
> issue.
> For some reason this patch is not pushed upstream yet. I think that is
> strange as it leads to corruption and actual data loss and it is 100%
> reproducible with preallocated files. Only systemd logs are mentioned
> in the bug reports, but in my case it was actually hitting several
> terabytes of files created by rtorrent.

Thanks for the summary. There's no doubt that this is serious.

Chris, please can you somehow get the patch into stable sooner than it gets to the 3.13 queue? The merge window will start in 1 week and, based on the previous pull request schedule, the patch will be merged in ~2 weeks from now. That's kind of a long time for an unfixed corruption bug that reportedly affects common installations (with systemd) or use cases (torrent) in combination with balance.

david
On Tue, 5 Nov 2013, "Hans-Kristian Bakke" <hkbakke@gmail.com> wrote:
> As you were in the process of a rebalance these errors may actually be
> caused by this serious bug "Btrfs: relocate csums properly with
> prealloc extents".
>
> I hit that myself with several preallocated files made by rtorrent
> during a rebalance and I lost several huge files as a consequence. The
> only way I could rebalance without large scale corruptions was to
> manually patch the 3.11.6 kernel with the small patch that fixes the
> issue.
> For some reason this patch is not pushed upstream yet. I think that is
> strange as it leads to corruption and actual data loss and it is 100%
> reproducible with preallocated files. Only systemd logs are mentioned
> in the bug reports, but in my case it was actually hitting several
> terabytes of files created by rtorrent.

I run systemd so I guess it's the systemd logs. That's fortunate as such logs aren't important to me. Thanks for providing this information.

I've just run a scrub and I saw the following output. There was nothing useful or apparently relevant in the kernel message log either. So scrub is just telling me that there are 57 errors without giving me a clue as to which files might need to be restored from backup.

# btrfs scrub start -B /
scrub done for c55218a6-abb5-4e35-9a20-33fb1fa05879
scrub started at Tue Nov 5 11:32:03 2013 and finished after 6762 seconds
total bytes scrubbed: 140.06GB with 57 errors
error details: csum=57
corrected errors: 0, uncorrectable errors: 57, unverified errors: 0

I can imagine a balance operation being unable to conveniently display all the data that one might desire. But a scrub really should go through everything and should know where the inconsistencies are. In this case the scrub gave me less information than the balance.

I presume that my filesystem is still corrupt.

--
My Main Blog http://etbe.coker.com.au/
My Documents Blog http://doc.coker.com.au/
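The scrub summary quoted above can be reduced to a few numbers for logging or comparison between runs. The sketch below is written only against the output text shown in this message; other btrfs-progs versions may print the summary differently, and the function name and dictionary keys are made up for illustration.

```python
import re

# Scrub summary exactly as quoted in the message above.
SCRUB_OUTPUT = """\
scrub done for c55218a6-abb5-4e35-9a20-33fb1fa05879
scrub started at Tue Nov 5 11:32:03 2013 and finished after 6762 seconds
total bytes scrubbed: 140.06GB with 57 errors
error details: csum=57
corrected errors: 0, uncorrectable errors: 57, unverified errors: 0
"""


def parse_scrub_summary(text):
    """Extract the counters from a scrub summary in the format shown above."""
    summary = {}

    match = re.search(r"scrub done for (\S+)", text)
    if match:
        summary["uuid"] = match.group(1)

    match = re.search(r"finished after (\d+)\s+seconds", text)
    if match:
        summary["seconds"] = int(match.group(1))

    match = re.search(r"total bytes scrubbed: (\S+) with (\d+) errors", text)
    if match:
        summary["scrubbed"] = match.group(1)
        summary["errors"] = int(match.group(2))

    # "error details" is a list of name=count pairs, e.g. "csum=57".
    match = re.search(r"error details: (.*)", text)
    if match:
        pairs = re.findall(r"(\w+)=(\d+)", match.group(1))
        summary["details"] = {name: int(count) for name, count in pairs}

    match = re.search(
        r"corrected errors: (\d+), uncorrectable errors: (\d+), "
        r"unverified errors: (\d+)",
        text,
    )
    if match:
        summary["corrected"] = int(match.group(1))
        summary["uncorrectable"] = int(match.group(2))
        summary["unverified"] = int(match.group(3))

    return summary


if __name__ == "__main__":
    print(parse_scrub_summary(SCRUB_OUTPUT))
```

As the message points out, none of these fields identifies a file, so the summary alone cannot say what needs restoring from backup.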
I gave up on getting the filesystem to a consistent state, but my corruption was much more severe than yours: several hundred thousand errors. As the fs was still usable and mountable I just moved all the files to another filesystem, patched the kernel, recreated the original btrfs fs and ran a rebalance. This time there were no issues, because of the patch.

As the corrupt files were rtorrent files in my case I could just rehash the torrents and make rtorrent redownload the corrupt blocks. Very lucky indeed. The other files I could verify against backup.

Luckily the reason for the rebalance in the first place was to add another 16TB of disk to the RAID10 array, so I just happened to have enough temporary storage lying around. After patching the kernel and rebalancing I now have a 32TB btrfs RAID10 volume.

Kind regards,
Hans-Kristian Bakke

On 5 November 2013 13:16, Russell Coker <russell@coker.com.au> wrote:
> On Tue, 5 Nov 2013, "Hans-Kristian Bakke" <hkbakke@gmail.com> wrote:
>> As you were in the process of a rebalance these errors may actually be
>> caused by this serious bug "Btrfs: relocate csums properly with
>> prealloc extents".
>>
>> I hit that myself with several preallocated files made by rtorrent
>> during a rebalance and I lost several huge files as a consequence. The
>> only way I could rebalance without large scale corruptions was to
>> manually patch the 3.11.6 kernel with the small patch that fixes the
>> issue.
>> For some reason this patch is not pushed upstream yet. I think that is
>> strange as it leads to corruption and actual data loss and it is 100%
>> reproducible with preallocated files. Only systemd logs are mentioned
>> in the bug reports, but in my case it was actually hitting several
>> terabytes of files created by rtorrent.
>
> I run systemd so I guess it's the systemd logs. That's fortunate as such logs
> aren't important to me. Thanks for providing this information.
>
> I've just run a scrub and I saw the following output. There was nothing
> useful or apparently relevant in the kernel message log either. So scrub is
> just telling me that there are 57 errors without giving me a clue as to which
> files might need to be restored from backup.
>
> # btrfs scrub start -B /
> scrub done for c55218a6-abb5-4e35-9a20-33fb1fa05879
> scrub started at Tue Nov 5 11:32:03 2013 and finished after 6762
> seconds
> total bytes scrubbed: 140.06GB with 57 errors
> error details: csum=57
> corrected errors: 0, uncorrectable errors: 57, unverified errors: 0
>
> I can imagine a balance operation being unable to conveniently display all the
> data that one might desire. But a scrub really should go through everything
> and should know where the inconsistencies are. In this case the scrub gave me
> less information than the balance.
>
> I presume that my filesystem is still corrupt.
On Nov 5, 2013, at 5:16 AM, Russell Coker <russell@coker.com.au> wrote:
>
> I presume that my filesystem is still corrupt.

I'm the original reporter of the bug. The file system itself isn't corrupt, but the affected files probably are. In my case, systemd journal files were reported as corrupt by systemd following a balance, as well as btrfs scrub. Upon scrub, dmesg contains a path for each affected file. Upon deleting those files, subsequent scrubs come up clean.

Fedora merged the fix for this bug with: 3.11.5-302.fc20, 3.11.6-200.fc19, 3.11.6-101.fc18. I thought for sure it was marked for stable to go into mainline 3.11.6, so somehow it's been missed.

Chris Murphy
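When only an inode number is available, as in the balance-time messages earlier in the thread, one rough way to list candidate files is to walk the mounted tree and compare inode numbers. This is a sketch under two assumptions: that all subvolumes of interest are reachable under one directory, and that the inode number in the log refers to a file visible in some subvolume. Because btrfs inode numbers are unique only within a subvolume, several matches across subvolumes are possible. The function name and the file names in the demo are hypothetical.

```python
import os
import tempfile


def find_paths_by_inode(root, inode_number):
    """Yield every path under root whose inode number equals inode_number."""
    # os.walk does not follow symlinks by default, but it does descend into
    # subvolumes and other mount points that appear as directories.
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            try:
                if os.lstat(path).st_ino == inode_number:
                    yield path
            except OSError:
                # The entry vanished or is unreadable; skip it.
                continue


if __name__ == "__main__":
    # Small self-contained demo on a temporary directory.
    with tempfile.TemporaryDirectory() as workdir:
        sample = os.path.join(workdir, "system.journal")
        open(sample, "w").close()
        wanted = os.lstat(sample).st_ino
        for path in find_paths_by_inode(workdir, wanted):
            print(path)
```

On a real system the call would look like `find_paths_by_inode("/", 388)`, and each result is only a candidate that still needs checking, for example by reading the file and watching for new checksum errors.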
On Tue, Nov 05, 2013 at 07:26:54AM -0700, Chris Murphy wrote:
>
> On Nov 5, 2013, at 5:16 AM, Russell Coker <russell@coker.com.au> wrote:
> >
> > I presume that my filesystem is still corrupt.
>
> I'm the original reporter of the bug. The file system itself isn't corrupt, but the affected files probably are. In my case, systemd journal files were reported as corrupt by systemd following a balance, as well as btrfs scrub. Upon scrub, dmesg contains a path for each affected file. Upon deleting those files, subsequent scrubs come up clean.
>
> Fedora merged the fix for this bug with: 3.11.5-302.fc20, 3.11.6-200.fc19, 3.11.6-101.fc18. I thought for sure it was marked for stable to go into mainline 3.11.6, so somehow it's been missed.

Someone else tripped over it on IRC last night, and I was surprised to discover it hadn't made it upstream yet. :(

Hugo.

--
=== Hugo Mills: hugo@... carfax.org.uk | darksatanic.net | lug.org.uk ==
PGP key: 65E74AC0 from wwwkeys.eu.pgp.net or http://www.carfax.org.uk
--- Two things came out of Berkeley in the 1960s: LSD and Unix. ---
This is not a coincidence.
On Tue, Nov 5, 2013 at 6:34 AM, Hugo Mills <hugo@carfax.org.uk> wrote:
> On Tue, Nov 05, 2013 at 07:26:54AM -0700, Chris Murphy wrote:
>>
>> On Nov 5, 2013, at 5:16 AM, Russell Coker <russell@coker.com.au> wrote:
>> >
>> > I presume that my filesystem is still corrupt.
>>
>> I'm the original reporter of the bug. The file system itself isn't corrupt, but the affected files probably are. In my case, systemd journal files were reported as corrupt by systemd following a balance, as well as btrfs scrub. Upon scrub, dmesg contains a path for each affected file. Upon deleting those files, subsequent scrubs come up clean.
>>
>> Fedora merged the fix for this bug with: 3.11.5-302.fc20, 3.11.6-200.fc19, 3.11.6-101.fc18. I thought for sure it was marked for stable to go into mainline 3.11.6, so somehow it's been missed.
>
> Someone else tripped over it on IRC last night, and I was surprised
> to discover it hadn't made it upstream yet. :(

Is there now a verification test that could detect an issue like this? It seems like the sort of thing that needs to be added to automated testing.
John Williams posted on Tue, 05 Nov 2013 16:20:58 -0800 as excerpted:
> Is there now a verification test that could detect an issue like this?
> It seems like the sort of thing that needs to be added to automated
> testing.

[Your question is general enough, not mentioning xfstests and simply asking about automated testing, that I'm assuming a general answer is appropriate.]

I haven't tracked this specific issue, but in general, the btrfs devs are pretty strict about adding a test to the xfstests package (NOT used for just xfs; at least btrfs and ext4 use it too) for any regressions they find, and people ARE regularly running those tests on new code, so past issues don't happen again.

If you watch the list you'll see occasional patch rejections due to failed xfstests, regular xfstests patches adding new tests, and review discussion requesting that an xfstests test be added as appropriate.

So I'd be /very/ surprised if this bugfix didn't already have a corresponding new xfstests test.

--
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman
On Tue, Nov 05, 2013 at 04:20:58PM -0800, John Williams wrote:
> Is there now a verification test that could detect an issue like this?
> It seems like the sort of thing that needs to be added to automated
> testing.

Yes there is: xfstests/btrfs/013

https://bugzilla.kernel.org/show_bug.cgi?id=63411
On Nov 5, 2013, at 7:34 AM, Hugo Mills <hugo@carfax.org.uk> wrote:
> On Tue, Nov 05, 2013 at 07:26:54AM -0700, Chris Murphy wrote:
>>
>> On Nov 5, 2013, at 5:16 AM, Russell Coker <russell@coker.com.au> wrote:
>>>
>>> I presume that my filesystem is still corrupt.
>>
>> I'm the original reporter of the bug. The file system itself isn't corrupt, but the affected files probably are. In my case, systemd journal files were reported as corrupt by systemd following a balance, as well as btrfs scrub. Upon scrub, dmesg contains a path for each affected file. Upon deleting those files, subsequent scrubs come up clean.
>>
>> Fedora merged the fix for this bug with: 3.11.5-302.fc20, 3.11.6-200.fc19, 3.11.6-101.fc18. I thought for sure it was marked for stable to go into mainline 3.11.6, so somehow it's been missed.
>
> Someone else tripped over it on IRC last night, and I was surprised
> to discover it hadn't made it upstream yet. :(

I just checked kernel.org changelogs and I'm not seeing this fixed in either 3.11.7 or 3.11.8. It is listed in today's git pull by Chris for 3.12 as

Btrfs: relocate csums properly with prealloc extents

Chris Murphy
On Thu, Nov 14, 2013 at 11:37:39AM -0700, Chris Murphy wrote:
>
> On Nov 5, 2013, at 7:34 AM, Hugo Mills <hugo@carfax.org.uk> wrote:
>
> > On Tue, Nov 05, 2013 at 07:26:54AM -0700, Chris Murphy wrote:
> >>
> >> On Nov 5, 2013, at 5:16 AM, Russell Coker <russell@coker.com.au> wrote:
> >>>
> >>> I presume that my filesystem is still corrupt.
> >>
> >> I'm the original reporter of the bug. The file system itself isn't corrupt, but the affected files probably are. In my case, systemd journal files were reported as corrupt by systemd following a balance, as well as btrfs scrub. Upon scrub, dmesg contains a path for each affected file. Upon deleting those files, subsequent scrubs come up clean.
> >>
> >> Fedora merged the fix for this bug with: 3.11.5-302.fc20, 3.11.6-200.fc19, 3.11.6-101.fc18. I thought for sure it was marked for stable to go into mainline 3.11.6, so somehow it's been missed.
> >
> > Someone else tripped over it on IRC last night, and I was surprised
> > to discover it hadn't made it upstream yet. :(
>
> I just checked kernel.org changelogs and I'm not seeing this fixed in
> either 3.11.7 or 3.11.8. It is listed in today's git pull by Chris for
> 3.12 as
>
> Btrfs: relocate csums properly with prealloc extents

The stable tree process does not normally accept patches that are not in Linus' tree first. The patch has not been submitted to stable yet; it is to be submitted right after today's pull is merged.

david