Marc MERLIN
2013-Jan-08 16:49 UTC
kernel BUG at fs/btrfs/volumes.c:3707 still not fixed in 3.7.1 (btrfs-zero-log required) but shown as "RIP btrfs_num_copies"
Unfortunately my laptop deadlocks from time to time, and too often it triggers this bug in btrfs which is quite hard to recover from.

The bigger problem is that all the user sees (if anything) is seemingly unrelated info, namely, "RIP: btrfs_num_copies+0x42/0x0b" or somesuch
http://marc.merlins.org/tmp/btrfs_num_copies.jpg

It's only if you have serial console, or netconsole, which we can't really assume the average users to have, that you can get the correct oops and bug info.
I lost another 3 hours with many reboots and a recovery drive to recover my root drive.

Question #1:
I have hourly snapshots of my root filesystem, and I wasn't able to mount any of them. I got the BUG at fs/btrfs/volumes.c:3707 each time.
gandalfthegreat:~# mount -o ro,recovery /dev/mapper/root -o 'subvol=root_daily_20130108_00:01:02,defaults,compress=lzo,discard,nossd,space_cache,noatime'

If my log is damaged, why are all other snapshots also broken?

Question #2:
This btrfs-zero-log business, which in the end fixed my problem, should not be a routine recovery method, especially because the oops you get on your screen doesn't have the proper info that tells you that it's actually the right bug as described on
https://btrfs.wiki.kernel.org/index.php/Problem_FAQ#I_can.27t_mount_my_filesystem.2C_and_I_get_a_kernel_oops.21

Could mainline kernels be fixed not to oops so badly and in a hard to debug way when this problem, which happens too often (at least for me), is hit?

If that helps, here's what I got after the fact when trying to mount the broken filesystem before zero'ing logs
[ 3964.728509] btrfs bad tree block start 2075122916315869932 4268204032
[ 3964.728714] btrfs bad tree block start 12746175583536274708 4268204032
[ 3964.728748] ------------[ cut here ]------------
[ 3964.728771] WARNING: at fs/btrfs/tree-log.c:1728 walk_down_log_tree+0x51/0x307()
[ 3964.730058] WARNING: at fs/btrfs/tree-log.c:1732 walk_down_log_tree+0x6c/0x307()
[ 3964.731287] kernel BUG at fs/btrfs/volumes.c:3707!
(full log below)

gandalfthegreat:~# btrfs-calc-size /dev/mapper/root
Check tree block failed, want=4268204032, have=2075122916315869932
Check tree block failed, want=4268204032, have=2075122916315869932
Check tree block failed, want=4268204032, have=12746175583536274708
Check tree block failed, want=4268204032, have=2075122916315869932
Check tree block failed, want=4268204032, have=2075122916315869932
read block failed check_tree_block
Calculating size of root tree
    212.00KB total size, 0.00 inline data, 1 nodes, 52 leaves, 2 levels
Calculating size of extent tree
    396.43MB total size, 0.00 inline data, 1518 nodes, 99968 leaves, 4 levels
Calculating size of csum tree
    430.36MB total size, 0.00 inline data, 1434 nodes, 108737 leaves, 4 levels
Calculatin' size of fs tree
    16.00KB total size, 0.00 inline data, 1 nodes, 3 leaves, 2 levels

gandalfthegreat:~# btrfs-find-root /dev/mapper/root
Super think's the tree root is at 4203188224, chunk root 20979712
Found tree root at 4203188224
gandalfthegreat:~#

gandalfthegreat:~# btrfs filesystem show
Label: 'btrfs_pool1' uuid: 92584fa9-85cd-4df6-b182-d32198b76a0b
    Total devices 1 FS bytes used 315.26GB
    devid 1 size 441.70GB used 441.70GB path /dev/dm-1

Label: 'btrfs_pool2' uuid: 04071703-df6b-4022-9632-6c3aeabff206
    Total devices 1 FS bytes used 770.67GB
    devid 1 size 872.51GB used 872.51GB path /dev/dm-0

gandalfthegreat:~# btrfs-zero-log /dev/mapper/root
Check tree block failed, want=4268204032,
have=2075122916315869932 Check tree block failed, want=4268204032, have=2075122916315869932 Check tree block failed, want=4268204032, have=12746175583536274708 Check tree block failed, want=4268204032, have=2075122916315869932 Check tree block failed, want=4268204032, have=2075122916315869932 read block failed check_tree_block gandalfthegreat:~# btrfs-zero-log /dev/mapper/root gandalfthegreat:~# Full kernel debug info: [ 3829.908692] device label btrfs_pool2 devid 1 transid 66367 /dev/mapper/cryptroot [ 3964.563483] device label btrfs_pool1 devid 1 transid 305644 /dev/mapper/root [ 3964.564164] btrfs: enabling auto recovery [ 3964.564172] btrfs: disk space caching is enabled [ 3964.728275] Btrfs detected SSD devices, enabling SSD mode [ 3964.728509] btrfs bad tree block start 2075122916315869932 4268204032 [ 3964.728714] btrfs bad tree block start 12746175583536274708 4268204032 [ 3964.728748] ------------[ cut here ]------------ [ 3964.728771] WARNING: at fs/btrfs/tree-log.c:1728 walk_down_log_tree+0x51/0x307() [ 3964.728796] Hardware name: 2429A78 [ 3964.728804] Modules linked in: tun cpufreq_userspace cpufreq_stats cpufreq_powersave cpufreq_conservative ppdev rfcomm bnep autofs4 binfmt_misc uinput nfsd auth_rpcgss nfs_acl nfs lockd fscache sunrpc fuse configs parport_pc lp parport input_polldev loop firewire_sbp2 firewire_core crc_itu_t ecryptfs uvcvideo videobuf2_vmalloc videobuf2_memops videobuf2_core videodev btusb hid_generic usbhid bluetooth media hid snd_hda_codec_realtek thinkpad_acpi snd_hda_intel snd_hda_codec nvram i915 snd_hwdep snd_pcm_oss snd_mixer_oss snd_pcm snd_page_alloc snd_seq_midi snd_seq_midi_event snd_rawmidi snd_seq arc4 button battery iwldvm mac80211 iwlwifi iTCO_wdt coretemp iTCO_vendor_support psmouse kvm_intel snd_seq_device acpi_cpufreq drm_kms_helper drm video mperf xhci_hcd ehci_hcd i2c_i801 i2c_algo_b it kvm e1000e serio_raw evdev snd_timer processor i2c_core tpm_tis wmi tpm sdhci_pci mei sdhci pcspkr ac snd cfg80211 mmc_core 
lpc_ich soundcore rfkill usbcore tpm_bios ghash_clmulni_intel usb_common microcode raid456 multipath dm_snapshot dm_mirror dm_region_hash dm_log dm_crypt dm_mod async_raid6_recov async_pq async_xor raid6_pq async_memcpy async_tx xor blowfish_x86_64 blowfish_common ecb xts gf128mul crc32c_intel aesni_intel aes_x86_64 ablk_helper cryptd thermal thermal_sys [ 3964.729795] Pid: 10083, comm: mount Not tainted 3.6.3-amd64-preempt-20120903 #1 [ 3964.729801] Call Trace: [ 3964.729816] [<ffffffff81040854>] warn_slowpath_common+0x7e/0x96 [ 3964.729836] [<ffffffff81040881>] warn_slowpath_null+0x15/0x17 [ 3964.729845] [<ffffffff81205f64>] walk_down_log_tree+0x51/0x307 [ 3964.729853] [<ffffffff81206294>] walk_log_tree+0x7a/0x1bc [ 3964.729862] [<ffffffff81207ab8>] btrfs_recover_log_trees+0x9f/0x2ff [ 3964.729881] [<ffffffff81206a32>] ? replay_one_buffer+0x235/0x235 [ 3964.729891] [<ffffffff811dcbe5>] open_ctree+0x143d/0x1820 [ 3964.729899] [<ffffffff8128ffcc>] ? string.isra.3+0x3d/0xa4 [ 3964.729911] [<ffffffff811bf3a2>] btrfs_mount+0x36d/0x4cd [ 3964.729920] [<ffffffff810e8412>] ? pcpu_next_pop+0x38/0x45 [ 3964.729942] [<ffffffff8112a102>] ? alloc_vfsmnt+0xa6/0x192 [ 3964.729953] [<ffffffff81116207>] mount_fs+0x64/0x14d [ 3964.729961] [<ffffffff810e94d8>] ? __alloc_percpu+0xb/0xd [ 3964.729971] [<ffffffff8112a4bb>] vfs_kern_mount+0x64/0xde [ 3964.729990] [<ffffffff8112a89e>] do_kern_mount+0x48/0xda [ 3964.729998] [<ffffffff8112c366>] do_mount+0x6b1/0x714 [ 3964.730010] [<ffffffff810d2fb7>] ? 
__get_free_pages+0x9/0x45 [ 3964.730018] [<ffffffff8112c44c>] sys_mount+0x83/0xbd [ 3964.730038] [<ffffffff814b60fd>] system_call_fastpath+0x1a/0x1f [ 3964.730045] ---[ end trace cb9b09d3eae7696a ]--- [ 3964.730050] ------------[ cut here ]------------ [ 3964.730058] WARNING: at fs/btrfs/tree-log.c:1732 walk_down_log_tree+0x6c/0x307() [ 3964.730063] Hardware name: 2429A78 [ 3964.730069] Modules linked in: tun cpufreq_userspace cpufreq_stats cpufreq_powersave cpufreq_conservative ppdev rfcomm bnep autofs4 binfmt_misc uinput nfsd auth_rpcgss nfs_acl nfs lockd fscache sunrpc fuse configs parport_pc lp parport input_polldev loop firewire_sbp2 firewire_core crc_itu_t ecryptfs uvcvideo videobuf2_vmalloc videobuf2_memops videobuf2_core videodev btusb hid_generic usbhid bluetooth media hid snd_hda_codec_realtek thinkpad_acpi snd_hda_intel snd_hda_codec nvram i915 snd_hwdep snd_pcm_oss snd_mixer_oss snd_pcm snd_page_alloc snd_seq_midi snd_seq_midi_event snd_rawmidi snd_seq arc4 button battery iwldvm mac80211 iwlwifi iTCO_wdt coretemp iTCO_vendor_support psmouse kvm_intel snd_seq_device acpi_cpufreq drm_kms_helper drm video mperf xhci_hcd ehci_hcd i2c_i801 i2c_algo_b it kvm e1000e serio_raw evdev snd_timer processor i2c_core tpm_tis wmi tpm sdhci_pci mei sdhci pcspkr ac snd cfg80211 mmc_core lpc_ich soundcore rfkill usbcore tpm_bios ghash_clmulni_intel usb_common microcode raid456 multipath dm_snapshot dm_mirror dm_region_hash dm_log dm_crypt dm_mod async_raid6_recov async_pq async_xor raid6_pq async_memcpy async_tx xor blowfish_x86_64 blowfish_common ecb xts gf128mul crc32c_intel aesni_intel aes_x86_64 ablk_helper cryptd thermal thermal_sys [ 3964.730924] Pid: 10083, comm: mount Tainted: G W 3.6.3-amd64-preempt-20120903 #1 [ 3964.730929] Call Trace: [ 3964.730951] [<ffffffff81040854>] warn_slowpath_common+0x7e/0x96 [ 3964.730961] [<ffffffff81040881>] warn_slowpath_null+0x15/0x17 [ 3964.730969] [<ffffffff81205f7f>] walk_down_log_tree+0x6c/0x307 [ 3964.730977] 
[<ffffffff81206294>] walk_log_tree+0x7a/0x1bc [ 3964.730986] [<ffffffff81207ab8>] btrfs_recover_log_trees+0x9f/0x2ff [ 3964.731006] [<ffffffff81206a32>] ? replay_one_buffer+0x235/0x235 [ 3964.731015] [<ffffffff811dcbe5>] open_ctree+0x143d/0x1820 [ 3964.731023] [<ffffffff8128ffcc>] ? string.isra.3+0x3d/0xa4 [ 3964.731034] [<ffffffff811bf3a2>] btrfs_mount+0x36d/0x4cd [ 3964.731051] [<ffffffff810e8412>] ? pcpu_next_pop+0x38/0x45 [ 3964.731061] [<ffffffff8112a102>] ? alloc_vfsmnt+0xa6/0x192 [ 3964.731071] [<ffffffff81116207>] mount_fs+0x64/0x14d [ 3964.731079] [<ffffffff810e94d8>] ? __alloc_percpu+0xb/0xd [ 3964.731098] [<ffffffff8112a4bb>] vfs_kern_mount+0x64/0xde [ 3964.731107] [<ffffffff8112a89e>] do_kern_mount+0x48/0xda [ 3964.731115] [<ffffffff8112c366>] do_mount+0x6b1/0x714 [ 3964.731124] [<ffffffff810d2fb7>] ? __get_free_pages+0x9/0x45 [ 3964.731131] [<ffffffff8112c44c>] sys_mount+0x83/0xbd [ 3964.731150] [<ffffffff814b60fd>] system_call_fastpath+0x1a/0x1f [ 3964.731157] ---[ end trace cb9b09d3eae7696b ]--- [ 3964.731201] parent transid verify failed on 10056379431157275594 wanted 6327746181234947034 found 0 [ 3964.731227] ------------[ cut here ]------------ [ 3964.731287] kernel BUG at fs/btrfs/volumes.c:3707! 
[ 3964.731342] invalid opcode: 0000 [#1] PREEMPT SMP [ 3964.731415] Modules linked in: tun cpufreq_userspace cpufreq_stats cpufreq_powersave cpufreq_conservative ppdev rfcomm bnep autofs4 binfmt_misc uinput nfsd auth_rpcgss nfs_acl nfs lockd fscache sunrpc fuse configs parport_pc lp parport input_polldev loop firewire_sbp2 firewire_core crc_itu_t ecryptfs uvcvideo videobuf2_vmalloc videobuf2_memops videobuf2_core videodev btusb hid_generic usbhid bluetooth media hid snd_hda_codec_realtek thinkpad_acpi snd_hda_intel snd_hda_codec nvram i915 snd_hwdep snd_pcm_oss snd_mixer_oss snd_pcm snd_page_alloc snd_seq_midi snd_seq_midi_event snd_rawmidi snd_seq arc4 button battery iwldvm mac80211 iwlwifi iTCO_wdt coretemp iTCO_vendor_support psmouse kvm_intel snd_seq_device acpi_cpufreq drm_kms_helper drm video mperf xhci_hcd ehci_hcd i2c_i801 i2c_algo_b it kvm e1000e serio_raw evdev snd_timer processor i2c_core tpm_tis wmi tpm sdhci_pci mei sdhci pcspkr ac snd cfg80211 mmc_core lpc_ich soundcore rfkill usbcore tpm_bios ghash_clmulni_intel usb_common microcode raid456 multipath dm_snapshot dm_mirror dm_region_hash dm_log dm_crypt dm_mod async_raid6_recov async_pq async_xor raid6_pq async_memcpy async_tx xor blowfish_x86_64 blowfish_common ecb xts gf128mul crc32c_intel aesni_intel aes_x86_64 ablk_helper cryptd thermal thermal_sys [ 3964.733325] CPU 0 [ 3964.733354] Pid: 10083, comm: mount Tainted: G W 3.6.3-amd64-preempt-20120903 #1 LENOVO 2429A78/2429A78 [ 3964.733464] RIP: 0010:[<ffffffff811fabfa>] [<ffffffff811fabfa>] btrfs_num_copies+0x42/0x8b [ 3964.733563] RSP: 0018:ffff88020f3c1948 EFLAGS: 00010246 [ 3964.733622] RAX: 0000000000000000 RBX: 8b8f6fcfc8ac9bca RCX: 0000000000000001 [ 3964.733698] RDX: ffffffffffffffff RSI: 8b8f6fcfc8ac9bca RDI: ffff88020f3c0000 [ 3964.733775] RBP: ffff88020f3c1978 R08: 00000000ffffffff R09: 00000000ffec3402 [ 3964.733850] R10: 00000000ffec3402 R11: 00000000000000c0 R12: ffff88021e0fc128 [ 3964.733925] R13: 0000000000000000 R14: 
00000000fffffffb R15: 0000000000000000 [ 3964.734001] FS: 00007f9d21e957e0(0000) GS:ffff88021e200000(0000) knlGS:0000000000000000 [ 3964.734086] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b [ 3964.734148] CR2: 00007fff9d573b18 CR3: 0000000129052000 CR4: 00000000001407f0 [ 3964.734222] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 3964.734297] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 [ 3964.734374] Process mount (pid: 10083, threadinfo ffff88020f3c0000, task ffff88021e0d4580) [ 3964.734458] Stack: [ 3964.734485] ffff88020f3c1968 0000000000001000 ffff88021e16e000 ffff8801e946a1b8 [ 3964.734586] ffff88021e16e000 0000000000000000 ffff88020f3c19d8 ffffffff811d93d2 [ 3964.734689] ffff88017ee38828 57d0aba04157f3da 0000000000001000 ffff88017ee38820 [ 3964.734792] Call Trace: [ 3964.734830] [<ffffffff811d93d2>] btree_read_extent_buffer_pages.constprop.118+0xa8/0x105 [ 3964.734920] [<ffffffff811daf17>] btrfs_read_buffer+0x2a/0x2c [ 3964.734984] [<ffffffff81206143>] walk_down_log_tree+0x230/0x307 [ 3964.735052] [<ffffffff81206294>] walk_log_tree+0x7a/0x1bc [ 3964.735116] [<ffffffff81207ab8>] btrfs_recover_log_trees+0x9f/0x2ff [ 3964.735186] [<ffffffff81206a32>] ? replay_one_buffer+0x235/0x235 [ 3964.735257] [<ffffffff811dcbe5>] open_ctree+0x143d/0x1820 [ 3964.735319] [<ffffffff8128ffcc>] ? string.isra.3+0x3d/0xa4 [ 3964.735385] [<ffffffff811bf3a2>] btrfs_mount+0x36d/0x4cd [ 3964.735446] [<ffffffff810e8412>] ? pcpu_next_pop+0x38/0x45 [ 3964.735512] [<ffffffff8112a102>] ? alloc_vfsmnt+0xa6/0x192 [ 3964.735580] [<ffffffff81116207>] mount_fs+0x64/0x14d [ 3964.735638] [<ffffffff810e94d8>] ? __alloc_percpu+0xb/0xd [ 3964.735703] [<ffffffff8112a4bb>] vfs_kern_mount+0x64/0xde [ 3964.735768] [<ffffffff8112a89e>] do_kern_mount+0x48/0xda [ 3964.735829] [<ffffffff8112c366>] do_mount+0x6b1/0x714 [ 3964.735889] [<ffffffff810d2fb7>] ? 
__get_free_pages+0x9/0x45
[ 3964.735954] [<ffffffff8112c44c>] sys_mount+0x83/0xbd
[ 3964.736014] [<ffffffff814b60fd>] system_call_fastpath+0x1a/0x1f
[ 3964.736080] Code: 83 ec 18 48 89 55 d8 e8 05 69 2b 00 48 8b 55 d8 4c 89 ef 48 89 de e8 ad 28 ff ff 4c 89 e7 49 89 c5 e8 b3 6b 2b 00 4d 85 ed 75 02 <0f> 0b 49 8b 45 18 48 39 d8 77 09 49 03 45 20 48 39 d8 73 02 0f
[ 3964.737062] RIP [<ffffffff811fabfa>] btrfs_num_copies+0x42/0x8b
[ 3964.737139] RSP <ffff88020f3c1948>
[ 3964.779792] ---[ end trace cb9b09d3eae7696c ]---
[ 3964.779807] Kernel panic - not syncing: Fatal exception

--
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems ....
                                      .... what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
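[Editorial sketch] The report above turns on one point: the line that identifies the bug ("kernel BUG at fs/btrfs/volumes.c:3707!") is printed before the register dump, so a screen that only shows the tail of the oops shows the RIP line instead. Assuming the console output has been captured as text (for example over netconsole or a serial console), the following minimal Python sketch pulls out just the BUG and WARNING headlines. The function name `find_headlines` and the regular expression are illustrative assumptions, not part of any btrfs or kernel tool.

```python
import re

# Matches the two kinds of headline the kernel printed in the log above:
#   "kernel BUG at fs/btrfs/volumes.c:3707!"
#   "WARNING: at fs/btrfs/tree-log.c:1728 walk_down_log_tree+0x51/0x307()"
HEADLINE = re.compile(r"(kernel BUG at|WARNING: at) (?P<file>\S+?):(?P<line>\d+)")


def find_headlines(log_text):
    """Return (kind, source_file, line_number) for each BUG/WARNING headline."""
    results = []
    for match in HEADLINE.finditer(log_text):
        kind = "BUG" if match.group(1).startswith("kernel BUG") else "WARNING"
        results.append((kind, match.group("file"), int(match.group("line"))))
    return results


if __name__ == "__main__":
    # A few lines copied from the log in this message.
    sample = (
        "[ 3964.728771] WARNING: at fs/btrfs/tree-log.c:1728 walk_down_log_tree+0x51/0x307()\n"
        "[ 3964.730058] WARNING: at fs/btrfs/tree-log.c:1732 walk_down_log_tree+0x6c/0x307()\n"
        "[ 3964.731287] kernel BUG at fs/btrfs/volumes.c:3707!\n"
        "[ 3964.737062] RIP [<ffffffff811fabfa>] btrfs_num_copies+0x42/0x8b\n"
    )
    for kind, path, line in find_headlines(sample):
        print(f"{kind}: {path}:{line}")
```

The RIP line is deliberately not matched: it names the function that hit the BUG, which is the "seemingly unrelated info" the report complains about.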
Hugo Mills
2013-Jan-08 17:10 UTC
Re: kernel BUG at fs/btrfs/volumes.c:3707 still not fixed in 3.7.1 (btrfs-zero-log required) but shown as "RIP btrfs_num_copies"
On Tue, Jan 08, 2013 at 08:49:58AM -0800, Marc MERLIN wrote:
> Unfortunately my laptop deadlocks from time to time, and too often
> it triggers this bug in btrfs which is quite hard to recover from.
>
> The bigger problem is that all the user sees (if anything) is seemingly
> unrelated info, namely, "RIP: btrfs_num_copies+0x42/0x0b" or somesuch
> http://marc.merlins.org/tmp/btrfs_num_copies.jpg
>
> It's only if you have serial console, or netconsole, which we can't really
> assume the average users to have, that you can get the correct oops and bug
> info.
> I lost another 3 hours with many reboots and a recovery drive to recover my
> root drive.
>
> Question #1:
> I have hourly snapshots of my root filesystem, and I wasn't able to mount
> any of them. I got the BUG at fs/btrfs/volumes.c:3707 each time.
> gandalfthegreat:~# mount -o ro,recovery /dev/mapper/root -o 'subvol=root_daily_20130108_00:01:02,defaults,compress=lzo,discard,nossd,space_cache,noatime'
>
> If my log is damaged, why are all other snapshots also broken?

   Snapshots are not independent of each other. The filesystem as a
whole is damaged -- if you can't mount it, it won't make a difference
which subvolume you try to mount. A snapshot is not a backup; it won't
save you from a broken filesystem or dead hardware. At best, it'll
save you from accidental deletion of files.

> Question #2:
> This btrfs-zero-log business, which in the end fixed my problem, should
> not be a routine recovery method, especially because the oops you get on
> your screen doesn't have the proper info that tells you that it's actually
> the right bug as described on
> https://btrfs.wiki.kernel.org/index.php/Problem_FAQ#I_can.27t_mount_my_filesystem.2C_and_I_get_a_kernel_oops.21
>
> Could mainline kernels be fixed not to oops so badly and in a hard to debug
> way when this problem which happens too often (at least for me), is hit?

   Oopses in log playback are a bug. The last time we had such a bug
which was identifiable and traceable (back in the 3.1-3.2 era, IIRC),
it got fixed, eventually. So yes, this is a bug, it should be fixed,
and you're not the only person to have seen log tree replays fail in
3.6 and 3.7 kernels.

   Since you seem to be hitting the problem frequently and repeatably,
could you help? Josef has said he'd like a copy of the filesystem
image that btrfs-image produces when run against the broken FS (i.e.
while the FS can't mount) -- that would help track down the corruption
problem, and make the kernel more robust in this area. Just as a
warning, the output may be quite large: it contains all of your FS's
metadata.

> If that helps, here's what I got after the fact when trying to mount the
> broken filesystem before zero'ing logs
[snip]

   That information may also be helpful in conjunction with the
btrfs-image dump of a broken FS. I'm not sure how much help it is on
its own (but thanks for providing it anyway).

   Hugo.

--
=== Hugo Mills: hugo@... carfax.org.uk | darksatanic.net | lug.org.uk ==
PGP key: 515C238D from wwwkeys.eu.pgp.net or http://www.carfax.org.uk
--- Great oxymorons of the world, no. 4: Future Perfect ---
Marc MERLIN
2013-Jan-08 18:09 UTC
Re: kernel BUG at fs/btrfs/volumes.c:3707 still not fixed in 3.7.1 (btrfs-zero-log required) but shown as "RIP btrfs_num_copies"
See bottom: my filesystem mounts, but apparently it's still corrupted as per btrfs-image output.

On Tue, Jan 08, 2013 at 05:10:12PM +0000, Hugo Mills wrote:
> > Question #1:
> > I have hourly snapshots of my root filesystem, and I wasn't able to mount
> > any of them. I got the BUG at fs/btrfs/volumes.c:3707 each time.
> > gandalfthegreat:~# mount -o ro,recovery /dev/mapper/root -o 'subvol=root_daily_20130108_00:01:02,defaults,compress=lzo,discard,nossd,space_cache,noatime'
> >
> > If my log is damaged, why are all other snapshots also broken?
>
> Snapshots are not independent of each other. The filesystem as a
> whole is damaged -- if you can't mount it, it won't make a difference
> which subvolume you try to mount. A snapshot is not a backup; it won't
> save you from a broken filesystem or dead hardware. At best, it'll
> save you from accidental deletion of files.

Thanks for explaining. I guess it makes sense that the log is not a per-subvolume thing, but a filesystem-wide thing.
Last time I posted this problem, someone replied and suggested that I try mounting an older snapshot, but now I understand that it won't help.

> Oopses in log playback are a bug. The last time we had such a bug
> which was identifiable and traceable (back in the 3.1-3.2 era, IIRC),
> it got fixed, eventually. So yes, this is a bug, it should be fixed,
> and you're not the only person to have seen log tree replays fail in
> 3.6 and 3.7 kernels.
> Since you seem to be hitting the problem frequently and repeatably,
> could you help? Josef has said he'd like a copy of the filesystem
> image that btrfs-image produces when run against the broken FS (i.e.
> while the FS can't mount) -- that would help track down the corruption
> problem, and make the kernel more robust in this area. Just as a
> warning, the output may be quite large: it contains all of your FS's
> metadata.

Argh, I reported this here with 3.6.3 three months ago and waited an entire week for someone to tell me what they wanted off my FS before I removed all evidence of the bug, but never got a reply asking for anything.
Since I didn't read any updates since then or find anything new on
https://btrfs.wiki.kernel.org/index.php/Problem_FAQ#I_can.27t_mount_my_filesystem.2C_and_I_get_a_kernel_oops.21
I just went ahead and fixed my laptop right away this morning, sorry about that.

Because it's my main work laptop, I don't want to break it on purpose, or risk corruption that I may not notice and that would creep into my incremental backups, so I'd rather not try and reproduce this on purpose. But if you want a way to reproduce this, I think pulling the SATA cable off a drive while writing a few times should reproduce this pretty easily (or maybe even just pulling power, although I know pulling power still lets some drives write some things before they shut down).

I'm however likely to hit the problem again sooner or later, whether I want it or not. I'll make sure to run btrfs-image next time.

Ok, how about this.
1) I updated https://btrfs.wiki.kernel.org/index.php/Problem_FAQ to tell people to run btrfs-image. It'll be easier for you to get what you need if it's documented somewhere :)
Please update it further to say what people should do, since posting on the list does not always yield timely replies for people who need to recover soon-ish from backup if necessary.

2) You may still be in luck, maybe? And me not so much.
I thought my filesystem was recovered, I'm running it right now, but:
gandalfthegreat:~# btrfs-image -c 9 -t 8 /dev/mapper/cryptroot /var/tmp/fs_image
Check tree block failed, want=5212229632, have=12481778023482407252
Check tree block failed, want=5212229632, have=12481778023482407252
Check tree block failed, want=5212229632, have=14440972074482314957
Check tree block failed, want=5212229632, have=12481778023482407252
Check tree block failed, want=5212229632, have=12481778023482407252
read block failed check_tree_block
btrfs-image: btrfs-image.c:518: create_metadump: Assertion `!(ret < 0)' failed.
Aborted
gandalfthegreat:~# l /var/tmp/fs_image
-rw-r--r-- 1 root root 234413056 Jan 8 09:58 /var/tmp/fs_image

No idea what version I have because it won't say:
gandalfthegreat:~# btrfs-image --version
btrfs-image: invalid option -- '-'
usage: btrfs-image [options] source target
    -r restore metadump image
    -c value compression level (0 ~ 9)
    -t value number of threads (1 ~ 32)
gandalfthegreat:~# btrfs-image -v
btrfs-image: invalid option -- 'v'
(...)

Is my incomplete /var/tmp/fs_image useful, or is there anything else you want out of my maybe still corrupted filesystem?

Marc
Josef Bacik
2013-Jan-08 18:25 UTC
Re: kernel BUG at fs/btrfs/volumes.c:3707 still not fixed in 3.7.1 (btrfs-zero-log required) but shown as "RIP btrfs_num_copies"
On Tue, Jan 08, 2013 at 09:49:58AM -0700, Marc MERLIN wrote:
> Unfortunately my laptop deadlocks from time to time, and too often
> it triggers this bug in btrfs which is quite hard to recover from.
>

You are getting bad tree blocks which really isn't the tree log's fault. Can you
scrub your file system and make sure there's not some sort of latent issue going
on? And if you have problems again please try btrfs-next as I've fixed a few
log replay bugs recently. Thanks,

Josef
Marc MERLIN
2013-Jan-08 18:46 UTC
Re: kernel BUG at fs/btrfs/volumes.c:3707 still not fixed in 3.7.1 (btrfs-zero-log required) but shown as "RIP btrfs_num_copies"
On Tue, Jan 08, 2013 at 01:25:41PM -0500, Josef Bacik wrote:
> On Tue, Jan 08, 2013 at 09:49:58AM -0700, Marc MERLIN wrote:
> > Unfortunately my laptop deadlocks from time to time, and too often
> > it triggers this bug in btrfs which is quite hard to recover from.
> >
>
> You are getting bad tree blocks which really isn't the tree log's fault. Can you
> scrub your file system and make sure there's not some sort of latent issue going
> on? And if you have problems again please try btrfs-next as I've fixed a few
> log replay bugs recently. Thanks,

Thanks for your answer.

I've only read about scrub in a mirror situation, as per
https://blogs.oracle.com/wim/entry/btrfs_scrub_go_fix_corruptions

It's a single device here, so if there are problems, I'm not sure how scrub will be able to fix them.
Would you like me to go ahead and do it anyway?

And before I further remove potential debug state:
1) what about this problem?
btrfs-image: btrfs-image.c:518: create_metadump: Assertion `!(ret < 0)' failed.
Aborted

2) anything else you want me to get off my filesystem before I scrub it?

Marc
Marc MERLIN
2013-Jan-10 16:20 UTC
Re: kernel BUG at fs/btrfs/volumes.c:3707 still not fixed in 3.7.1 (btrfs-zero-log required) but shown as "RIP btrfs_num_copies"
On Tue, Jan 08, 2013 at 10:46:03AM -0800, Marc MERLIN wrote:
> On Tue, Jan 08, 2013 at 01:25:41PM -0500, Josef Bacik wrote:
> > On Tue, Jan 08, 2013 at 09:49:58AM -0700, Marc MERLIN wrote:
> > > Unfortunately my laptop deadlocks from time to time, and too often
> > > it triggers this bug in btrfs which is quite hard to recover from.
> > >
> >
> > You are getting bad tree blocks which really isn't the tree log's fault. Can you
> > scrub your file system and make sure there's not some sort of latent issue going
> > on? And if you have problems again please try btrfs-next as I've fixed a few
> > log replay bugs recently. Thanks,
>
> Thanks for your answer.
>
> I've only read about scrub in a mirror situation, as per
> https://blogs.oracle.com/wim/entry/btrfs_scrub_go_fix_corruptions
>
> It's a single device here, so if there are problems, I'm not sure how scrub
> will be able to fix them.
> Would you like me to go ahead and do it anyway?
>
> And before I further remove potential debug state
> 1) what about this problem?
> btrfs-image: btrfs-image.c:518: create_metadump: Assertion `!(ret < 0)'
> failed.
> Aborted
>
> 2) anything else you want me to get off my filesystem before I scrub it?

Friendly ping :)

Marc
Josef Bacik
2013-Jan-11 14:49 UTC
Re: kernel BUG at fs/btrfs/volumes.c:3707 still not fixed in 3.7.1 (btrfs-zero-log required) but shown as "RIP btrfs_num_copies"
On Thu, Jan 10, 2013 at 09:20:09AM -0700, Marc MERLIN wrote:
> On Tue, Jan 08, 2013 at 10:46:03AM -0800, Marc MERLIN wrote:
> > On Tue, Jan 08, 2013 at 01:25:41PM -0500, Josef Bacik wrote:
> > > On Tue, Jan 08, 2013 at 09:49:58AM -0700, Marc MERLIN wrote:
> > > > Unfortunately my laptop deadlocks from time to time, and too often
> > > > it triggers this bug in btrfs which is quite hard to recover from.
> > > >
> > >
> > > You are getting bad tree blocks which really isn't the tree log's fault. Can you
> > > scrub your file system and make sure there's not some sort of latent issue going
> > > on? And if you have problems again please try btrfs-next as I've fixed a few
> > > log replay bugs recently. Thanks,
> >
> > Thanks for your answer.
> >
> > I've only read about scrub in a mirror situation, as per
> > https://blogs.oracle.com/wim/entry/btrfs_scrub_go_fix_corruptions
> >
> > It's a single device here, so if there are problems, I'm not sure how scrub
> > will be able to fix them.
> > Would you like me to go ahead and do it anyway?
> >

Well, it's mostly to verify whether you have some sort of latent corruption sitting
around. If you have DUP it will be able to fix it, but if you don't we'll at
least know something else is wrong and we can try and work out if fsck will fix
it.

> > And before I further remove potential debug state
> > 1) what about this problem?
> > btrfs-image: btrfs-image.c:518: create_metadump: Assertion `!(ret < 0)'
> > failed.
> > Aborted

Probably just related to whatever corruption it is you are seeing.

> >
> > 2) anything else you want me to get off my filesystem before I scrub it?

Nope, hopefully it should be non-destructive. Thanks,

Josef
Marc MERLIN
2013-Jan-13 03:12 UTC
Re: kernel BUG at fs/btrfs/volumes.c:3707 still not fixed in 3.7.1 (btrfs-zero-log required) but shown as "RIP btrfs_num_copies"
On Fri, Jan 11, 2013 at 09:49:52AM -0500, Josef Bacik wrote:
> On Thu, Jan 10, 2013 at 09:20:09AM -0700, Marc MERLIN wrote:
> > On Tue, Jan 08, 2013 at 10:46:03AM -0800, Marc MERLIN wrote:
> > > On Tue, Jan 08, 2013 at 01:25:41PM -0500, Josef Bacik wrote:
> > > > On Tue, Jan 08, 2013 at 09:49:58AM -0700, Marc MERLIN wrote:
> > > > > Unfortunately my laptop deadlocks from time to time, and too often
> > > > > it triggers this bug in btrfs which is quite hard to recover from.
> > > > >
> > > >
> > > > You are getting bad tree blocks which really isn't the tree log's fault. Can you
> > > > scrub your file system and make sure there's not some sort of latent issue going
> > > > on? And if you have problems again please try btrfs-next as I've fixed a few
> > > > log replay bugs recently. Thanks,
> > >
> > > Thanks for your answer.
> > >
> > > I've only read about scrub in a mirror situation, as per
> > > https://blogs.oracle.com/wim/entry/btrfs_scrub_go_fix_corruptions
> > >
> > > It's a single device here, so if there are problems, I'm not sure how scrub
> > > will be able to fix them.
> > > Would you like me to go ahead and do it anyway?
> > >
>
> Well its mostly to verify you have some sort of latent corruption sitting
> around. If you have DUP it will be able to fix it, but if you don't we'll at
> least know something else is wrong and we can try and work out if fsck will fix
> it.

Looks like I had no problems, if I'm reading this right:
scrub status:1
92584fa9-85cd-4df6-b182-d32198b76a0b:1|data_extents_scrubbed:7739406|tree_extents_scrubbed:2351353|data_bytes_scrubbed:347948277760|tree_bytes_scrubbed:9631141888|read_errors:0|csum_errors:0|verify_errors:0|no_csum:52600|csum_discards:0|super_errors:0|malloc_errors:0|uncorrectable_errors:0|corrected_errors:0|last_physical:474268106752|t_start:1357918734|t_resumed:0|duration:984|canceled:0|finished:1

> > > And before I further remove potential debug state
> > > 1) what about this problem?
> > > btrfs-image: btrfs-image.c:518: create_metadump: Assertion `!(ret < 0)'
> > > failed.
> > > Aborted
>
> Probably just related to whatever corruption it is you are seeing.

So I have no corruption after all, correct?

That's good news, but then it does mean that unclean sudden shutdowns in the wrong place.

I still have a truncated fs_image if someone wants it, and with an apparently uncorrupted FS, btrfs-image is still dying for me:
gandalfthegreat:~# btrfs-image -c 9 /dev/mapper/cryptroot /var/tmp/fs_image2
Check tree block failed, want=4261896192, have=10797364022063960087
Check tree block failed, want=4261896192, have=10797364022063960087
Check tree block failed, want=4261896192, have=13996544474027288730
Check tree block failed, want=4261896192, have=10797364022063960087
Check tree block failed, want=4261896192, have=10797364022063960087
read block failed check_tree_block
btrfs-image: btrfs-image.c:518: create_metadump: Assertion `!(ret < 0)' failed.
Aborted
gandalfthegreat:~#

Marc
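[Editorial sketch] The scrub status pasted in this message is a single pipe-separated record of `key:value` counters. "No problems" here means every error counter is zero and `finished` is 1. Here is a minimal Python sketch that checks this mechanically rather than by eye. It assumes the record layout is exactly what the pasted line shows (a leading `<uuid>:<devid>` field followed by integer counters); the function names and the choice of which counters count as errors are illustrative assumptions, not a btrfs-progs interface.

```python
# The device record exactly as pasted in the message above.
STATUS_LINE = (
    "92584fa9-85cd-4df6-b182-d32198b76a0b:1|data_extents_scrubbed:7739406"
    "|tree_extents_scrubbed:2351353|data_bytes_scrubbed:347948277760"
    "|tree_bytes_scrubbed:9631141888|read_errors:0|csum_errors:0"
    "|verify_errors:0|no_csum:52600|csum_discards:0|super_errors:0"
    "|malloc_errors:0|uncorrectable_errors:0|corrected_errors:0"
    "|last_physical:474268106752|t_start:1357918734|t_resumed:0"
    "|duration:984|canceled:0|finished:1"
)

# Counters treated as errors in this sketch (an assumption based on their names).
ERROR_COUNTERS = (
    "read_errors", "csum_errors", "verify_errors",
    "super_errors", "malloc_errors", "uncorrectable_errors",
)


def parse_scrub_line(line):
    """Parse '<uuid>:<devid>|key:value|...' into a dict of integer counters."""
    fields = line.strip().split("|")
    counters = {}
    for field in fields[1:]:  # fields[0] is '<uuid>:<devid>', not a counter
        key, _, value = field.partition(":")
        counters[key] = int(value)
    return counters


def error_summary(counters):
    """Return only the error counters that are non-zero."""
    return {name: counters[name] for name in ERROR_COUNTERS if counters.get(name, 0)}


if __name__ == "__main__":
    counters = parse_scrub_line(STATUS_LINE)
    print("finished:", counters["finished"] == 1)
    print("non-zero error counters:", error_summary(counters))
```

A clean result only says that the blocks scrub read passed their checksums. It does not explain the `Check tree block failed` messages from btrfs-image, which the next messages in the thread address.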
Marc MERLIN
2013-Jan-14 06:28 UTC
Re: kernel BUG at fs/btrfs/volumes.c:3707 still not fixed in 3.7.1 (btrfs-zero-log required) but shown as "RIP btrfs_num_copies"
On Tue, Jan 08, 2013 at 05:10:12PM +0000, Hugo Mills wrote:
> > If that helps, here's what I got after the fact when trying to mount the
> > broken filesystem before zero'ing logs
> [snip]
>
> That information may also be helpful in conjunction with the
> btrfs-image dump of a broken FS. I'm not sure how much help it is on
> its own (but thanks for providing it anyway).
>
> Hugo.

Ok, that didn't take long. I had the bug again, and I was able to take a
btrfs-image that completed. (It was failing earlier because I was taking
the image of my mounted root filesystem, which apparently isn't ok ;) ).

gandalfthegreat:~# btrfs-image -c9 /dev/mapper/root /var/tmp/image
Check tree block failed, want=73949184, have=14419929907352990360
Check tree block failed, want=73949184, have=14419929907352990360
Check tree block failed, want=73949184, have=12288424950725929455
Check tree block failed, want=73949184, have=14419929907352990360
Check tree block failed, want=73949184, have=14419929907352990360
read block failed check_tree_block

The image is here: http://marc.merlins.org/tmp/image (803M)

Hopefully this helps.

Marc
David Sterba
2013-Jan-17 11:31 UTC
Re: kernel BUG at fs/btrfs/volumes.c:3707 still not fixed in 3.7.1 (btrfs-zero-log required) but shown as "RIP btrfs_num_copies"
On Sat, Jan 12, 2013 at 07:12:12PM -0800, Marc MERLIN wrote:
> On Fri, Jan 11, 2013 at 09:49:52AM -0500, Josef Bacik wrote:
> > Probably just related to whatever corruption it is you are seeing.
>
> So I have no corruption after all, correct?
>
> That's good news, but then it does mean that unclean sudden shutdowns in the wrong place.
>
> I still have a truncated fs_image if someone wants it, and with an
> apparently uncorrupted FS, btrfs-image is still dying for me:
> gandalfthegreat:~# btrfs-image -c 9 /dev/mapper/cryptroot /var/tmp/fs_image2
> Check tree block failed, want=4261896192, have=10797364022063960087
> Check tree block failed, want=4261896192, have=10797364022063960087
> Check tree block failed, want=4261896192, have=13996544474027288730
> Check tree block failed, want=4261896192, have=10797364022063960087
> Check tree block failed, want=4261896192, have=10797364022063960087

The want= values are around 4G; the have= numbers are way off, very likely
a corruption. The numbers do not translate into a meaningful pattern.

david
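To make David's point concrete, here is a small sketch that puts one want/have pair from the output above on a comparable scale. It assumes, from the wording of the error only, that want= is the address the tool asked for and have= is the address recorded in the block it read back; the function name and the returned keys are made up for this illustration.

```python
def describe_mismatch(want, have):
    """Summarise one 'Check tree block failed' pair.

    Assumption: 'want' is the requested block address and 'have' is the
    address the block claims for itself, so a healthy block has want == have.
    """
    return {
        "want_gib": want / 2**30,         # requested address in GiB
        "want_hex": hex(want),
        "have_hex": hex(have),
        "have_bits": have.bit_length(),   # bits needed to store 'have'
        "match": want == have,
    }


if __name__ == "__main__":
    # Values copied from the btrfs-image output quoted above.
    for have in (10797364022063960087, 13996544474027288730):
        info = describe_mismatch(4261896192, have)
        print(f"want = {info['want_gib']:.2f} GiB ({info['want_hex']})")
        print(f"have = {info['have_hex']} ({info['have_bits']} bits)")
```

The requested address sits just under 4 GiB, while both returned values need the full 64 bits, so they look like arbitrary data rather than a block address that is slightly off.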
Marc MERLIN
2013-Jan-17 14:16 UTC
Re: kernel BUG at fs/btrfs/volumes.c:3707 still not fixed in 3.7.1 (btrfs-zero-log required) but shown as "RIP btrfs_num_copies"
On Thu, Jan 17, 2013 at 12:31:22PM +0100, David Sterba wrote:
> On Sat, Jan 12, 2013 at 07:12:12PM -0800, Marc MERLIN wrote:
> > On Fri, Jan 11, 2013 at 09:49:52AM -0500, Josef Bacik wrote:
> > > Probably just related to whatever corruption it is you are seeing.
> >
> > So I have no corruption after all, correct?
> >
> > That's good news, but then it does mean that unclean sudden shutdowns in the wrong place.
> >
> > I still have a truncated fs_image if someone wants it, and with an
> > apparently uncorrupted FS, btrfs-image is still dying for me:
> > gandalfthegreat:~# btrfs-image -c 9 /dev/mapper/cryptroot /var/tmp/fs_image2
> > Check tree block failed, want=4261896192, have=10797364022063960087
> > Check tree block failed, want=4261896192, have=10797364022063960087
> > Check tree block failed, want=4261896192, have=13996544474027288730
> > Check tree block failed, want=4261896192, have=10797364022063960087
> > Check tree block failed, want=4261896192, have=10797364022063960087
>
> want= are around 4G, have= numbers are way off, very likely a
> corruption. the number does not translate into a meaningful pattern.

Thanks for your answer, I appreciate it.

So this is the corruption that ends up in my log, and if I remove the log
that was written before the laptop died, then the rest of the filesystem
seems ok and scrub finds nothing wrong.

My very uneducated guess is that when my SSD craps out, just before my
laptop locks up, either it is responsible for causing that corrupted log
to be written, or there is a bug in btrfs where truncated writes happen
before a device drops off the SATA bus.

Either way, I got this bug multiple times with the same device (which I
just replaced yesterday, just in case).

Here's what I've seen so far in case there is a useful pattern. It is a
mix of what I got from btrfs-zero-log and btrfs-image. I did get the
problem more often than just those, but those are the ones I recorded
before zero'ing the log.

Check tree block failed, want=259264512, have=12301165138967429629
Check tree block failed, want=259264512, have=12301165138967429629
Check tree block failed, want=259264512, have=7949122546735189447
Check tree block failed, want=259264512, have=12301165138967429629
Check tree block failed, want=259264512, have=12301165138967429629
Check tree block failed, want=4268204032, have=2075122916315869932
Check tree block failed, want=4268204032, have=2075122916315869932
Check tree block failed, want=4268204032, have=12746175583536274708
Check tree block failed, want=4268204032, have=2075122916315869932
Check tree block failed, want=4268204032, have=2075122916315869932
Check tree block failed, want=5212229632, have=12481778023482407252
Check tree block failed, want=5212229632, have=12481778023482407252
Check tree block failed, want=5212229632, have=14440972074482314957
Check tree block failed, want=5212229632, have=12481778023482407252
Check tree block failed, want=5212229632, have=12481778023482407252
Check tree block failed, want=73949184, have=14419929907352990360
Check tree block failed, want=73949184, have=14419929907352990360
Check tree block failed, want=73949184, have=12288424950725929455
Check tree block failed, want=73949184, have=14419929907352990360
Check tree block failed, want=73949184, have=14419929907352990360
Check tree block failed, want=4261896192, have=10797364022063960087
Check tree block failed, want=4261896192, have=10797364022063960087
Check tree block failed, want=4261896192, have=13996544474027288730
Check tree block failed, want=4261896192, have=10797364022063960087
Check tree block failed, want=4261896192, have=10797364022063960087

I'm hopeful this is useful somehow, and hopefully btrfs will stop oops'ing
at boot time, requiring a recovery disk and specialized knowledge to
recover, when this happens.

Thanks,
Marc
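For anyone looking for a pattern in the list above, this is a rough sketch that groups the "Check tree block failed" lines by their want= value and reports which have= values each block returned. Only two of the five recorded groups are included to keep it short; the regular expression and the function name are made up for this illustration and are not part of any btrfs tool.

```python
import re

# Excerpt of the lines recorded in the message above (two of five groups).
LOG = """\
Check tree block failed, want=259264512, have=12301165138967429629
Check tree block failed, want=259264512, have=12301165138967429629
Check tree block failed, want=259264512, have=7949122546735189447
Check tree block failed, want=259264512, have=12301165138967429629
Check tree block failed, want=259264512, have=12301165138967429629
Check tree block failed, want=73949184, have=14419929907352990360
Check tree block failed, want=73949184, have=14419929907352990360
Check tree block failed, want=73949184, have=12288424950725929455
Check tree block failed, want=73949184, have=14419929907352990360
Check tree block failed, want=73949184, have=14419929907352990360
"""

LINE_RE = re.compile(r"want=(\d+), have=(\d+)")


def group_by_want(text):
    """Map each want= value to the have= values seen for it, in order."""
    groups = {}
    for match in LINE_RE.finditer(text):
        want, have = int(match.group(1)), int(match.group(2))
        groups.setdefault(want, []).append(have)
    return groups


if __name__ == "__main__":
    for want, haves in group_by_want(LOG).items():
        distinct = sorted(set(haves))
        print(f"want={want}: {len(haves)} reads, {len(distinct)} distinct values")
        for value in distinct:
            positions = [i + 1 for i, h in enumerate(haves) if h == value]
            print(f"  have={value:#018x} at reads {positions}")
```

In the recorded data every block shows five reads, with the third returning a different value from the other four. The sketch only makes that regularity visible; it does not explain it.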