thr3ads.net - Btrfs devel - kernel BUG at fs/btrfs/volumes.c:3753! These btrfs crashes at mount time on log replay are really a problem [Feb 2013]

If this information is useful, please help other people find it:
Share via:

Marc MERLIN

2013-Feb-26 06:51 UTC

kernel BUG at fs/btrfs/volumes.c:3753! These btrfs crashes at mount time on log replay are really a problem

TL;DR;
WARNING: at fs/btrfs/tree-log.c:1984 walk_down_log_tree+0x51/0x307()
WARNING: at fs/btrfs/tree-log.c:1988 walk_down_log_tree+0x6c/0x307()
kernel BUG at fs/btrfs/volumes.c:3753!

It''s way time for btrfs to stop crashing your system with no recovery
flag
that works to clear the log if the log can''t be replayed. Hell, on non
development systems, it should just auto discard the log if it can''t be
replayed without user input.


Details:
It''s been almost a year that I''m doing my best to test btrfs
and report
bugs, but how quickly it crashes on mount if anything is off, is a huge
usability problem.

I just again, lost use of my machine today after an unrelated problem caused
a crash/reboot, and incomplete btrfs writes to my device.
That happens, it''s life.

But after that, I get to roll a dice of whether btrfs will recover, or just
crash on mount.
It''s slightly more liveable if it''s a scratch filesystem on a
developer box,
you just don''t mount it.
It''s really really sucky if it''s your root filesystem and you
need to boot
from a rescue partition/media to recover each time.

Then, I spent 3 hours reproducing the crash again, with netconsole working
so that I can get a useful bugreport, which I send here.

I also get an btrfs-image when this happens, but so far I haven''t
had any interest from anyone for doing so.
So, again, I have an image if someone wants it
787M	fs_image

I''m really losing faith in btrfs here. I see 2 problems.
1) btrfs doesn''t provide encryption. ecryptfs is so slow it''s
unusable (also it
doesn''t encrypt long file names anyway), so that leaves us with
dmcrypt. Therefore
I have to run btrfs on top of dmcrypt.

Either btrfs is buggy over dmcrypt in that its writes aren''t atomic,
and it very often
gets into a state of an unreplayable log.
I have no idea if that''s fixable, or not.

Or, btrfs truly has a problem on unclean shutdown during writes even without
dmcrypt, and more specifically pulling a sata device off a running bus.
I''m not sure if developers are testing for pulling a drive from the
sata bus while writing
to it, and trying this at least 10 times for each kernel version.


2) btrfs is a development filesystem, I understand that, but if it wants more
acceptance,
it just can''t keep crashing each time it can''t replay a log.
At this point BTRFS really must have a ''please clear my log and
continue mounting'' if the
log is corrupted in some way.
Corruption happens through multiple means, clearly getting an inconsistent log
on unclean
shutdown is not only possible, it''s very frequent.

btrfs can''t just panic and refuse to mount the filesystem without any
override option.

Will you please consider allowing sending the BUG statements that can happen
during mount
to a function that just cleans the replay log and tries again?
If ext4 has a log it can''t use, it doesn''t crash my system, it
simply tells me to run fsck
and possibly mounts read-only.

Below is the crash info, although I suppose it''s not that different
from my previous reports.

Again, I''m happy to give the fs_image to someone if you want it.


device label btrfs_pool1 devid 1 transid 89465 /dev/mapper/root
btrfs: disk space caching is enabled
Btrfs detected SSD devices, enabling SSD mode
btrfs bad tree block start 11014467582957347470 344223137792
btrfs bad tree block start 7450179628261340409 344223137792
------------[ cut here ]------------
WARNING: at fs/btrfs/tree-log.c:1984 walk_down_log_tree+0x51/0x307()
Hardware name: 2429A78
Modules linked in: iwldvm mac80211 iwlwifi cfg80211 e1000e tun cpufreq_userspace
cpufreq_stats cpufreq_powersave cpufreq_conservative ppdev rfcomm bnep autofs4
pci_stub vboxpci(O) vboxnetadp(O) vboxnetflt(O) vboxdrv(O) binfmt_misc uinput
nfsd auth_rpcgss nfs_acl nfs lockd fscache sunrpc fuse configs parport_pc lp
parport input_polldev loop firewire_sbp2 firewire_core crc_itu_t ecryptfs
uvcvideo videobuf2_vmalloc videobuf2_memops videobuf2_core videodev media
hid_generic btusb usbhid hid coretemp bluetooth kvm_intel kvm
snd_hda_codec_realtek ghash_clmulni_intel snd_hda_intel snd_hda_codec snd_hwdep
snd_pcm_oss snd_mixer_oss arc4 snd_pcm snd_page_alloc snd_seq_midi
snd_seq_midi_event snd_rawmidi snd_seq thinkpad_acpi sdhci_pci i915 iTCO_wdt
iTCO_vendor_support drm_kms_helper drm tpm_tis acpi
 _cpufreq tpm mperf psmouse evdev tpm_bios sdhci serio_raw i2c_algo_bit pcspkr
mmc_core xhci_hcd ehci_hcd ac nvram battery wmi lpc_ich usbcore video processor
button usb_common mei microcode snd_seq_device snd_timer i2c_i801 snd soundcore
i2c_core rfkill raid456 multipath dm_snapshot dm_mirror dm_region_hash dm_log
dm_crypt dm_mod async_raid6_recov async_pq async_xor raid6_pq async_memcpy
async_tx xor blowfish_x86_64 blowfish_common ecb thermal thermal_sys
crc32c_intel aesni_intel xts aes_x86_64 lrw gf128mul ablk_helper cryptd [last
unloaded: cfg80211]
Pid: 11214, comm: mount Tainted: G           O 3.7.8-amd64-preempt-20130222 #1
Call Trace:
 [<ffffffff81040ed4>] warn_slowpath_common+0x7e/0x96
 [<ffffffff81040f01>] warn_slowpath_null+0x15/0x17
 [<ffffffff8120dff1>] walk_down_log_tree+0x51/0x307
 [<ffffffff8120e321>] walk_log_tree+0x7a/0x1bc
 [<ffffffff81210a3d>] btrfs_recover_log_trees+0x9f/0x2ff
 [<ffffffff8120eafa>] ? replay_one_buffer+0x26d/0x26d
 [<ffffffff811e2c7c>] open_ctree+0x1443/0x1823
 [<ffffffff81299bde>] ? string.isra.3+0x3d/0xa4
 [<ffffffff811c4f97>] btrfs_mount+0x36d/0x4cd
 [<ffffffff810ea49e>] ? pcpu_next_pop+0x38/0x45
 [<ffffffff810eb51b>] ? pcpu_alloc+0x7ee/0x82a
 [<ffffffff8112c9fe>] ? alloc_vfsmnt+0xa6/0x192
 [<ffffffff81118d2c>] mount_fs+0x64/0x150
 [<ffffffff810eb562>] ? __alloc_percpu+0xb/0xd
 [<ffffffff8112cdb7>] vfs_kern_mount+0x64/0xde
 [<ffffffff8112d19a>] do_kern_mount+0x48/0xda
 [<ffffffff8112ec68>] do_mount+0x6b1/0x714
 [<ffffffff810e74de>] ? memdup_user+0x37/0x5f
 [<ffffffff810e753d>] ? strndup_user+0x37/0x4c
 [<ffffffff8112ed4e>] sys_mount+0x83/0xbd
 [<ffffffff814c6bed>] system_call_fastpath+0x1a/0x1f
---[ end trace 29a8d40b4129ace6 ]---
------------[ cut here ]------------
WARNING: at fs/btrfs/tree-log.c:1988 walk_down_log_tree+0x6c/0x307()
Hardware name: 2429A78
Modules linked in: iwldvm mac80211 iwlwifi cfg80211 e1000e tun cpufreq_userspace
cpufreq_stats cpufreq_powersave cpufreq_conservative ppdev rfcomm bnep autofs4
pci_stub vboxpci(O) vboxnetadp(O) vboxnetflt(O) vboxdrv(O) binfmt_misc uinput
nfsd auth_rpcgss nfs_acl nfs lockd fscache sunrpc fuse configs parport_pc lp
parport input_polldev loop firewire_sbp2 firewire_core crc_itu_t ecryptfs
uvcvideo videobuf2_vmalloc videobuf2_memops videobuf2_core videodev media
hid_generic btusb usbhid hid coretemp bluetooth kvm_intel kvm
snd_hda_codec_realtek ghash_clmulni_intel snd_hda_intel snd_hda_codec snd_hwdep
snd_pcm_oss snd_mixer_oss arc4 snd_pcm snd_page_alloc snd_seq_midi
snd_seq_midi_event snd_rawmidi snd_seq thinkpad_acpi sdhci_pci i915 iTCO_wdt
iTCO_vendor_support drm_kms_helper drm tpm_tis acpi
 _cpufreq tpm mperf psmouse evdev tpm_bios sdhci serio_raw i2c_algo_bit pcspkr
mmc_core xhci_hcd ehci_hcd ac nvram battery wmi lpc_ich usbcore video processor
button usb_common mei microcode snd_seq_device snd_timer i2c_i801 snd soundcore
i2c_core rfkill raid456 multipath dm_snapshot dm_mirror dm_region_hash dm_log
dm_crypt dm_mod async_raid6_recov async_pq async_xor raid6_pq async_memcpy
async_tx xor blowfish_x86_64 blowfish_common ecb thermal thermal_sys
crc32c_intel aesni_intel xts aes_x86_64 lrw gf128mul ablk_helper cryptd [last
unloaded: cfg80211]
Pid: 11214, comm: mount Tainted: G        W  O 3.7.8-amd64-preempt-20130222 #1
Call Trace:
 [<ffffffff81040ed4>] warn_slowpath_common+0x7e/0x96
 [<ffffffff81040f01>] warn_slowpath_null+0x15/0x17
 [<ffffffff8120e00c>] walk_down_log_tree+0x6c/0x307
 [<ffffffff8120e321>] walk_log_tree+0x7a/0x1bc
 [<ffffffff81210a3d>] btrfs_recover_log_trees+0x9f/0x2ff
 [<ffffffff8120eafa>] ? replay_one_buffer+0x26d/0x26d
 [<ffffffff811e2c7c>] open_ctree+0x1443/0x1823
 [<ffffffff81299bde>] ? string.isra.3+0x3d/0xa4
 [<ffffffff811c4f97>] btrfs_mount+0x36d/0x4cd
 [<ffffffff810ea49e>] ? pcpu_next_pop+0x38/0x45
 [<ffffffff810eb51b>] ? pcpu_alloc+0x7ee/0x82a
 [<ffffffff8112c9fe>] ? alloc_vfsmnt+0xa6/0x192
 [<ffffffff81118d2c>] mount_fs+0x64/0x150
 [<ffffffff810eb562>] ? __alloc_percpu+0xb/0xd
 [<ffffffff8112cdb7>] vfs_kern_mount+0x64/0xde
 [<ffffffff8112d19a>] do_kern_mount+0x48/0xda
 [<ffffffff8112ec68>] do_mount+0x6b1/0x714
 [<ffffffff810e74de>] ? memdup_user+0x37/0x5f
 [<ffffffff810e753d>] ? strndup_user+0x37/0x4c
 [<ffffffff8112ed4e>] sys_mount+0x83/0xbd
 [<ffffffff814c6bed>] system_call_fastpath+0x1a/0x1f
---[ end trace 29a8d40b4129ace7 ]---
parent transid verify failed on 11731997996535126505 wanted 2669605077570711694
found 0
------------[ cut here ]------------
kernel BUG at fs/btrfs/volumes.c:3753!
invalid opcode: 0000 [#1] PREEMPT SMP 
Modules linked in: iwldvm mac80211 iwlwifi cfg80211 e1000e tun cpufreq_userspace
cpufreq_stats cpufreq_powersave cpufreq_conservative ppdev rfcomm bnep autofs4
pci_stub vboxpci(O) vboxnetadp(O) vboxnetflt(O) vboxdrv(O) binfmt_misc uinput
nfsd auth_rpcgss nfs_acl nfs lockd fscache sunrpc fuse configs parport_pc lp
parport input_polldev loop firewire_sbp2 firewire_core crc_itu_t ecryptfs
uvcvideo videobuf2_vmalloc videobuf2_memops videobuf2_core videodev media
hid_generic btusb usbhid hid coretemp bluetooth kvm_intel kvm
snd_hda_codec_realtek ghash_clmulni_intel snd_hda_intel snd_hda_codec snd_hwdep
snd_pcm_oss snd_mixer_oss arc4 snd_pcm snd_page_alloc snd_seq_midi
snd_seq_midi_event snd_rawmidi snd_seq thinkpad_acpi sdhci_pci i915 iTCO_wdt
iTCO_vendor_support drm_kms_helper drm tpm_tis acpi
 _cpufreq tpm mperf psmouse evdev tpm_bios sdhci serio_raw i2c_algo_bit pcspkr
mmc_core xhci_hcd ehci_hcd ac nvram battery wmi lpc_ich usbcore video processor
button usb_common mei microcode snd_seq_device snd_timer i2c_i801 snd soundcore
i2c_core rfkill raid456 multipath dm_snapshot dm_mirror dm_region_hash dm_log
dm_crypt dm_mod async_raid6_recov async_pq async_xor raid6_pq async_memcpy
async_tx xor blowfish_x86_64 blowfish_common ecb thermal thermal_sys
crc32c_intel aesni_intel xts aes_x86_64 lrw gf128mul ablk_helper cryptd [last
unloaded: cfg80211]
CPU 0 
Pid: 11214, comm: mount Tainted: G        W  O 3.7.8-amd64-preempt-20130222 #1
LENOVO 2429A78/2429A78
RIP: 0010:[<ffffffff81202160>]  [<ffffffff81202160>]
btrfs_num_copies+0x42/0x8b
RSP: 0018:ffff88013329b948  EFLAGS: 00010246
RAX: 0000000000000000 RBX: a2d06e04e11219e9 RCX: 0000000000000001
RDX: ffffffffffffffff RSI: a2d06e04e11219e9 RDI: ffff88013329a000
RBP: ffff88013329b978 R08: 0000000000000000 R09: 00000000ffe7d802
R10: 00000000ffe7d802 R11: 00000000000000c0 R12: ffff88010267e138
R13: 0000000000000000 R14: 00000000fffffffb R15: 0000000000000000
FS:  00007f3b6578d7e0(0000) GS:ffff88021e200000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000604130 CR3: 00000001419c5000 CR4: 00000000001407f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process mount (pid: 11214, threadinfo ffff88013329a000, task ffff880132d38180)
Stack:
 ffff88013329b968 0000000000001000 ffff8801df36a800 ffff8801ef7e25c0
 ffff8801df36a800 0000000000000000 ffff88013329b9d8 ffffffff811df295
 ffff88013043cc58 250c57ac832e708e 0000000000001000 ffff88013043cc50
Call Trace:
 [<ffffffff811df295>]
btree_read_extent_buffer_pages.constprop.116+0xa8/0x105
 [<ffffffff811e0fbf>] btrfs_read_buffer+0x2a/0x2c
 [<ffffffff8120e1d0>] walk_down_log_tree+0x230/0x307
 [<ffffffff8120e321>] walk_log_tree+0x7a/0x1bc
 [<ffffffff81210a3d>] btrfs_recover_log_trees+0x9f/0x2ff
 [<ffffffff8120eafa>] ? replay_one_buffer+0x26d/0x26d
 [<ffffffff811e2c7c>] open_ctree+0x1443/0x1823
 [<ffffffff81299bde>] ? string.isra.3+0x3d/0xa4
 [<ffffffff811c4f97>] btrfs_mount+0x36d/0x4cd
 [<ffffffff810ea49e>] ? pcpu_next_pop+0x38/0x45
 [<ffffffff810eb51b>] ? pcpu_alloc+0x7ee/0x82a
 [<ffffffff8112c9fe>] ? alloc_vfsmnt+0xa6/0x192
 [<ffffffff81118d2c>] mount_fs+0x64/0x150
 [<ffffffff810eb562>] ? __alloc_percpu+0xb/0xd
 [<ffffffff8112cdb7>] vfs_kern_mount+0x64/0xde
 [<ffffffff8112d19a>] do_kern_mount+0x48/0xda
 [<ffffffff8112ec68>] do_mount+0x6b1/0x714
 [<ffffffff810e74de>] ? memdup_user+0x37/0x5f
 [<ffffffff810e753d>] ? strndup_user+0x37/0x4c
 [<ffffffff8112ed4e>] sys_mount+0x83/0xbd
 [<ffffffff814c6bed>] system_call_fastpath+0x1a/0x1f
Code: 83 ec 18 48 89 55 d8 e8 cf fc 2b 00 48 8b 55 d8 4c 89 ef 48 89 de e8 22 26
ff ff 4c 89 e7 49 89 c5 e8 7d ff 2b 00 4d 85 ed 75 02 <0f> 0b 49 8b 45 18
48 39 d8 77 09 49 03 45 20 48 39 d8 73 02 0f
RIP  [<ffffffff81202160>] btrfs_num_copies+0x42/0x8b
 RSP <ffff88013329b948>
---[ end trace 29a8d40b4129ace8 ]---
Kernel panic - not syncing: Fatal exception


-- 
"A mouse is a device used to point at the xterm you want to type in" -
A.S.R.
Microsoft is to operating systems ....
                                      .... what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/  
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs"
in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Josef Bacik

2013-Feb-26 14:23 UTC

head link

Re: kernel BUG at fs/btrfs/volumes.c:3753! These btrfs crashes at mount time on log replay are really a problem

On Mon, Feb 25, 2013 at 11:51:02PM -0700, Marc MERLIN
wrote:> TL;DR;
> WARNING: at fs/btrfs/tree-log.c:1984 walk_down_log_tree+0x51/0x307()
> WARNING: at fs/btrfs/tree-log.c:1988 walk_down_log_tree+0x6c/0x307()
> kernel BUG at fs/btrfs/volumes.c:3753!
> 
> It''s way time for btrfs to stop crashing your system with no
recovery flag
> that works to clear the log if the log can''t be replayed. Hell, on
non
> development systems, it should just auto discard the log if it
can''t be
> replayed without user input.
> 
> 
> Details:
> It''s been almost a year that I''m doing my best to test
btrfs and report
> bugs, but how quickly it crashes on mount if anything is off, is a huge
> usability problem.
> 
> I just again, lost use of my machine today after an unrelated problem
caused
> a crash/reboot, and incomplete btrfs writes to my device.
> That happens, it''s life.
> 
> But after that, I get to roll a dice of whether btrfs will recover, or just
> crash on mount.
> It''s slightly more liveable if it''s a scratch filesystem
on a developer box,
> you just don''t mount it.
> It''s really really sucky if it''s your root filesystem and
you need to boot
> from a rescue partition/media to recover each time.
> 
> Then, I spent 3 hours reproducing the crash again, with netconsole working
> so that I can get a useful bugreport, which I send here.
So how did you reproduce it?  I''ll take a fs_image, but being able to
reproduce
the problem is more valuable.  Thanks,
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs"
in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Marc MERLIN

2013-Feb-26 16:20 UTC

head link

Re: kernel BUG at fs/btrfs/volumes.c:3753! These btrfs crashes at mount time on log replay are really a problem

On Tue, Feb 26, 2013 at 09:23:00AM -0500, Josef Bacik
wrote:> So how did you reproduce it?  I''ll take a fs_image, but being able
to reproduce
> the problem is more valuable.  Thanks,
Here''s the image: http://marc.merlins.org/tmp/fs_image

I just wrote details in the other message I just Cced you on.
Please let me know if I''m missing more.

Ah yes, I do use discard passthrough in dm:
dmsetup table /dev/mapper/cryptroot --showkeys | grep discard
0 926304944 crypt aes-xts-plain
fdf87a0d79b33008185258e61f890a5a92b39490e0ff62d3690e8cc4591a6e8a 0 8:4 8192 1
allow_discards

as well as in btrfs mount of course:
LABEL=btrfs_pool1 / btrfs
subvol=root,defaults,compress=lzo,discard,nossd,space_cache,noatime

But the thing that worries me is: 
I understand how it would be beneficial for you to to reproduce my crashes,
but from what I heard, the workload you code and optimize for is pretty
different from mine :)
I would think that if you put btrfs on top of dmcrypt (that part may or may
not be part of the equation) and pull the sata drive from the bus during
writes, you are bound to reproduce this within 10 times (it happens maybe
one time out of 3 for me).

That said, forgive me if I''m going to say something stupid,
I''m not a
filesystem developer :)
If somehow, you can pin down the cause of what I''m seeing, that would
be
great (assuming it''s a bug between dmcrypt and btrfs), but would you
agree
that in the end as a filesystem developer, you can''t always control
random
ordering of writes or incomplete writes to a drive before it gets shut down?
(drives do their own reordering separately from linux anyway, right?)
I''m also not talking about actual hardware caused corruption and bad
RAM
chips.

Am I wrong when saying that ending up with replay journals that have
unexpected data and that can''t be replayed is just inevitable and
something
any journalling filesystem must deal with?

If I''m not wrong, it looks like btrfs is very good at detecting
unexpected
conditions so that it doesn''t replay bad logs and cause corruption as a
result, but instead of discarding the log, it just crashes the kernel with
BUG().

My previous message explains how vexing and time consuming it is to recover
when it''s your root filesystem on a laptop :)

Given that, would it indeed be feasible to do a sweep of the log replay code
and have it discard bad logs (unless it''s in development mode for folks
like
you) and continue the mount with just a few transactions lost but a
consistent filesystem, instead of crashing the kernel and leaving the user
with an unbootable system?

Thanks,
Marc
-- 
"A mouse is a device used to point at the xterm you want to type in" -
A.S.R.
Microsoft is to operating systems ....
                                      .... what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/  
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs"
in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Zach Brown

2013-Feb-26 18:24 UTC

head link

Re: kernel BUG at fs/btrfs/volumes.c:3753! These btrfs crashes at mount time on log replay are really a problem

> Am I wrong when saying that ending up with replay journals that have
> unexpected data and that can''t be replayed is just inevitable and
something
> any journalling filesystem must deal with?
If by journal you mean the btrfs log then yes, strictly speaking,
you''re
wrong.  btrfs does deal with the kind of incomplete and reordered writes
that you''re talking about and it should not result in corruption of
what
it calls the log.

But it''s a reasonable thing to be confused by.  I''m guessing
that you''re
being tripped up by what ext3 means by a journal and by what btrfs means
by a log.

The journal in ext3 can be partially written during a crash.  The
journal replay on mount notices this because the commit block isn''t
present and just throws it away.  No worries.

The equivalent consistent update mechanism in btrfs is cow tree updates.
The superblock that references new tree blocks written to free space is
itself only written once all those blocks are stable on disk.  If the
tree block writes are interrupted then the superblock isn''t updated and
btrfs won''t see the partially written blocks.  No worries.

The btrfs "log" is itself just a logical btree *inside these
consistent
tree updates* that records logical operations that will need to be
replayed.  For the log to be corrupted, if the btrfs code is perfect,
the storage had to have lied to btrfs and told it that tree update
blocks were stable which caused the superblock write that referenced
them prematurely.

The equivalent problem in the ext3 journal would be a transaction that
has blocks missing but which has a valid commit block.  ext3 couldn''t
just throw this transaction away because after the commit block write it
could have been in the process of replaying the transaction blocks at
their final location on disk.  And it''s now missing some of those
blocks
to replay.  This kind of corruption Shouldn''t Happen and the fs
can''t
just silently ignore it.

I absolutely agree that the error messages should be greatly improved in
this case, yes, and that it shouldn''t BUG_ON (it should *never*
BUG_ON).

But btrfs is right to refuse to silently revert previously stable
changes by just ignoring the corrupt log.

- z
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs"
in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Marc MERLIN

2013-Feb-26 19:09 UTC

head link

Re: kernel BUG at fs/btrfs/volumes.c:3753! These btrfs crashes at mount time on log replay are really a problem

On Tue, Feb 26, 2013 at 10:24:42AM -0800, Zach Brown
wrote:> > Am I wrong when saying that ending up with replay journals that have
> > unexpected data and that can''t be replayed is just inevitable
and something
> > any journalling filesystem must deal with?
> 
> If by journal you mean the btrfs log then yes, strictly speaking,
you''re
I did and used the wrong term, sorry.
> wrong.  btrfs does deal with the kind of incomplete and reordered writes
> that you''re talking about and it should not result in corruption
of what
> it calls the log.
Got it. So in theory what I''m seeing really shouldn''t happen,
unless there
is a corruption bug, or hardware fault.
> But it''s a reasonable thing to be confused by.  I''m
guessing that you''re
> being tripped up by what ext3 means by a journal and by what btrfs means
> by a log.
yes :)
> The journal in ext3 can be partially written during a crash.  The
> journal replay on mount notices this because the commit block
isn''t
> present and just throws it away.  No worries.
That''s indeed what I meant.
> The equivalent consistent update mechanism in btrfs is cow tree updates.
> The superblock that references new tree blocks written to free space is
> itself only written once all those blocks are stable on disk.  If the
> tree block writes are interrupted then the superblock isn''t
updated and
> btrfs won''t see the partially written blocks.  No worries.
Ok. So basically what I seem to be hitting, are seemingly complete tree
blocks in the log, that aren''t complete or consistent afterall,
triggering
BUG() in the end.
> replayed.  For the log to be corrupted, if the btrfs code is perfect,
> the storage had to have lied to btrfs and told it that tree update
> blocks were stable which caused the superblock write that referenced
> them prematurely. 
That makes sense. Is dmcrypt with discard passthrough potentially able to 
tell btrfs that writes did make it to disk when in fact someonly some of
them have?
> The equivalent problem in the ext3 journal would be a transaction that
> has blocks missing but which has a valid commit block.  ext3
couldn''t
> just throw this transaction away because after the commit block write it
> could have been in the process of replaying the transaction blocks at
> their final location on disk.  And it''s now missing some of those
blocks
> to replay.  This kind of corruption Shouldn''t Happen and the fs
can''t
> just silently ignore it. 
I understand that, and I''m not advocating silent ignores either.
> I absolutely agree that the error messages should be greatly improved in
> this case, yes, and that it shouldn''t BUG_ON (it should *never*
BUG_ON).
> 
> But btrfs is right to refuse to silently revert previously stable
> changes by just ignoring the corrupt log.
That''s the part I''m a bit confused about.
It could complain, it could require a ''recovery'' option on the
kernel
command line (which does exist but doesn''t work for this purpose), but
effectively it forces me to indeed ignore seemingly stable changes which
are indeed unusable, and the code knows they are, except that I have to go
back to boot media and go through hoops to get a netconsole dump/crash
before that, and then use btrfs-zero-log.

Once btrfs knows that the last log entry(ies) are corrupted in some way, it
could dump a bunch of diag data in the kernel, then delete the unusable log and
continue with the mount, which is effectively what I do myself right now.

Unless I''m mistaken, this really only causes me to lose some amount of
data
that was written just before the crash/reboot. This is usually ok for most
users and typically expected anyway (although not as desirable in the case
of a database server of course).

I understand that this might not want to be a default, but it should be
something that could be done from the kernel command line, and allow the
mount to continue and the boot to succeed (including the kernel debug log to
make it to local disk, where it can then be Emailed since serial console or
netconsole is not something you can do easily)

Does that sound reasonable?

Marc
-- 
"A mouse is a device used to point at the xterm you want to type in" -
A.S.R.
Microsoft is to operating systems ....
                                      .... what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/  
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs"
in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Zach Brown

2013-Feb-26 19:28 UTC

head link

Re: kernel BUG at fs/btrfs/volumes.c:3753! These btrfs crashes at mount time on log replay are really a problem

> > The equivalent consistent update mechanism in btrfs is cow tree
updates.
> > The superblock that references new tree blocks written to free space
is
> > itself only written once all those blocks are stable on disk.  If the
> > tree block writes are interrupted then the superblock isn''t
updated and
> > btrfs won''t see the partially written blocks.  No worries.
> 
> Ok. So basically what I seem to be hitting, are seemingly complete tree
> blocks in the log, that aren''t complete or consistent afterall,
triggering
> BUG() in the end.
That''s my understanding, yes.  I think Josef is digging in.
> I understand that this might not want to be a default, but it should be
> something that could be done from the kernel command line, and allow the
> mount to continue and the boot to succeed (including the kernel debug log
to
> make it to local disk, where it can then be Emailed since serial console or
> netconsole is not something you can do easily)
> 
> Does that sound reasonable?
To me?  Absolutely.  Having to recover with btrfs-zero-log is annoying.
I''d be surprised if anyone didn''t agree that the current
process should
be improved.

- z
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs"
in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Btrfs devel - Feb 2013 - kernel BUG at fs/btrfs/volumes.c:3753! These btrfs crashes at mount time on log replay are really a problem

kernel BUG at fs/btrfs/volumes.c:3753! These btrfs crashes at mount time on log replay are really a problem

Re: kernel BUG at fs/btrfs/volumes.c:3753! These btrfs crashes at mount time on log replay are really a problem

Re: kernel BUG at fs/btrfs/volumes.c:3753! These btrfs crashes at mount time on log replay are really a problem

Re: kernel BUG at fs/btrfs/volumes.c:3753! These btrfs crashes at mount time on log replay are really a problem

Re: kernel BUG at fs/btrfs/volumes.c:3753! These btrfs crashes at mount time on log replay are really a problem

Re: kernel BUG at fs/btrfs/volumes.c:3753! These btrfs crashes at mount time on log replay are really a problem